From 1e16af8ab5d7f307b66e496eff6ec573d9fd8eb0 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Thu, 18 May 2023 10:37:55 -0700 Subject: [PATCH] Doc improvements for language tags and custom ICU collations. Separate the documentation for language tags themselves from the available collation settings which can be included in a language tag. Include tables of the available options, more details about the effects of each option, and additional examples. Also include an explanation of the "levels" of textual features and how they relate to collation. Discussion: https://postgr.es/m/25787ec7-4c04-9a8a-d241-4dc9be0b1ba3@postgresql.org Reviewed-by: Jonathan S. Katz --- doc/src/sgml/charset.sgml | 689 +++++++++++++++++++++++++++++++------- 1 file changed, 565 insertions(+), 124 deletions(-) diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index 6dd95b8966..be06f746a5 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en variants and customization options. + + ICU Locales + + ICU Locale Names + + The ICU format for the locale name is a Language Tag. + +CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP'); +CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr'); + + + + + Locale Canonicalization and Validation + + When defining a new ICU collation object or database with ICU as the + provider, the given locale name is transformed ("canonicalized") into a + language tag if not already in that form. For instance, + + +CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true'); +NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true" +CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8'); +NOTICE: using standard form "de-DE" for locale "de_DE.utf8" + + + If you see this notice, ensure that the PROVIDER and + LOCALE are the expected result. 
For consistent results + when using the ICU provider, specify the canonical language tag instead of relying on the + transformation. + + + A locale with no language name, or the special language name + root, is transformed to have the language + und ("undefined"). + + + ICU can transform most libc locale names, as well as some other formats, + into language tags for easier transition to ICU. If a libc locale name is + used in ICU, it may not have precisely the same behavior as in libc. + + + If there is a problem interpreting the locale name, or if the locale name + represents a language or region that ICU does not recognize, you will see + the following warning: + + +CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense'); +WARNING: ICU locale "nonsense" has unknown language "nonsense" +HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. +CREATE COLLATION + + + controls how the message is + reported. Unless set to ERROR, the collation will + still be created, but the behavior may not be what the user intended. + + + + Language Tag + + A language tag, defined in BCP 47, is a standardized identifier used to + identify languages, regions, and other information about a locale. + + + Basic language tags are simply + language-region; + or even just language. The + language is a language code + (e.g. fr for French), and + region is a region code + (e.g. CA for Canada). Examples: + ja-JP, de, or + fr-CA. + + + Collation settings may be included in the language tag to customize + collation behavior. ICU allows extensive customization, such as + sensitivity (or insensitivity) to accents, case, and punctuation; + treatment of digits within text; and many other options to satisfy a + variety of uses. + + + To include this additional collation information in a language tag, + append -u, which indicates there are additional + collation settings, followed by one or more + -key-value + pairs. 
The key is the key for a collation setting and + value is a valid value for that setting. For + boolean settings, the -key + may be specified without a corresponding + -value, which implies a + value of true. + + + For example, the language tag en-US-u-kn-ks-level2 + means the locale with the English language in the US region, with + collation settings kn set to true + and ks set to level2. Those + settings mean the collation will be case-insensitive and treat a sequence + of digits as a single number: + + +CREATE COLLATION mycollation5 (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'en-US-u-kn-ks-level2'); +SELECT 'aB' = 'Ab' COLLATE mycollation5 as result; + result +-------- + t +(1 row) + +SELECT 'N-45' < 'N-123' COLLATE mycollation5 as result; + result +-------- + t +(1 row) + + + + See for details and additional + examples of using language tags with custom collation information for the + locale. + + + Problems @@ -658,6 +785,13 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR"; code byte values. + + + The C and POSIX locales may behave + differently depending on the database encoding. + + + Additionally, two SQL standard collation names are available: @@ -869,132 +1003,24 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE'); ICU Collations - - ICU allows collations to be customized beyond the basic language+country - set that is preloaded by initdb. Users are encouraged - to define their own collation objects that make use of these facilities to - suit the sorting behavior to their requirements. - See - and for - information on ICU locale naming. The set of acceptable names and - attributes depends on the particular ICU version. 
- - - - Here are some examples: - - - - CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk'); - CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook'); - - German collation with phone book collation type - - The first example selects the ICU locale using a language - tag per BCP 47. The second example uses the traditional - ICU-specific locale syntax. The first style is preferred going - forward, and is used internally to store locales. - - - Note that you can name the collation objects in the SQL environment - anything you want. In this example, we follow the naming style that - the predefined collations use, which in turn also follow BCP 47, but - that is not required for user-defined collations. - - - - - - CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji'); - CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji'); - - - Root collation with Emoji collation type, per Unicode Technical Standard #51 - - - Observe how in the traditional ICU locale naming system, the root - locale is selected by an empty string. - - - - - - CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn'); - CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn'); - - - Sort Greek letters before Latin ones. (The default is Latin before Greek.) - - - - - - CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper'); - CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper'); - - - Sort upper-case letters before lower-case letters. (The default is - lower-case letters first.) - - - - - - CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn'); - CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn'); - - - Combines both of the above options. 
-
-
-
-
-
- CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');
- CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');
-
-
- Numeric ordering, sorts sequences of digits by their numeric value,
- for example: A-21 < A-123
- (also known as natural sort).
-
-
-
-
-
- See Unicode
- Technical Standard #35
- and BCP 47 for
- details. The list of possible collation types (co
- subtag) can be found in
- the CLDR
- repository.
-
-
-
- Note that while this system allows creating collations that ignore
- case or ignore accents or similar (using the
- ks key), in order for such collations to act in a
- truly case- or accent-insensitive manner, they also need to be declared as not
- deterministic in CREATE COLLATION;
- see .
- Otherwise, any strings that compare equal according to the collation but
- are not byte-wise equal will be sorted according to their byte values.
-
-
-
- By design, ICU will accept almost any string as a locale name and match
- it to the closest locale it can provide, using the fallback procedure
- described in its documentation. Thus, there will be no direct feedback
- if a collation specification is composed using features that the given
- ICU installation does not actually support. It is therefore recommended
- to create application-level test cases to check that the collation
- definitions satisfy one's requirements.
-
-
-
+ ICU collations can be created like:
+
+CREATE COLLATION german (provider = icu, locale = 'de-DE');
+
+
+ ICU locales are specified as a BCP 47 Language Tag, but can also accept most
+ libc-style locale names. If possible, libc-style locale names are
+ transformed into language tags.
+
+
+ New ICU collations can customize collation behavior extensively by
+ including collation attributes in the language tag. See for details and examples.
+
+
 Copying Collations
@@ -1072,6 +1098,421 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
+
+ ICU Custom Collations
+
+
+ ICU allows extensive control over collation behavior by defining new
+ collations with collation settings as a part of the language tag. These
+ settings can modify the collation order to suit a variety of needs. For
+ instance:
+
+
+-- ignore differences in accents and case
+CREATE COLLATION ignore_accent_case (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1');
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
+
+-- upper case letters sort before lower case.
+CREATE COLLATION upper_first (PROVIDER=icu, LOCALE = 'und-u-kf-upper');
+SELECT 'B' < 'b' COLLATE upper_first; -- true
+
+-- treat digits numerically and ignore punctuation
+CREATE COLLATION num_ignore_punct (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kn');
+SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true
+SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
+
+
+ Many of the available options are described in , or see for more details.
+
+
+ ICU Comparison Levels
+
+ Comparison of two strings (collation) in ICU is determined by a
+ multi-level process, where textual features are grouped into
+ "levels". Treatment of each level is controlled by the collation settings. Higher
+ levels correspond to finer textual features.
+
+
+
+ ICU Collation Levels
+
+ Level    Description     'f' = 'f'  'ab' = U&'a\2063b'  'x-y' = 'x_y'  'g' = 'G'  'n' = 'ñ'  'y' = 'z'
+ level1   Base Character  true       true                true           true       true       false
+ level2   Accents         true       true                true           true       false      false
+ level3   Case/Variants   true       true                true           false      false      false
+ level4   Punctuation     true       true                false          false      false      false
+ identic  All             true       false               false          false      false      false
+
+
+
+
+ The above table shows which textual feature differences are
+ considered significant when determining equality at the given level. The
+ Unicode character U+2063 is an invisible separator,
+ and as seen in the table, is ignored at all levels of comparison less
+ than identic.
+
+
+ At every level, even with full normalization off, basic normalization is
+ performed. For example, 'á' may be composed of the
+ code points U&'\0061\0301' or the single code
+ point U&'\00E1', and those sequences will be
+ considered equal even at the identic level. To treat
+ any difference in code point representation as distinct, use a collation
+ created with DETERMINISTIC set to
+ true.
+
+
+ Collation Level Examples
+
+
+
+CREATE COLLATION level3 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level3');
+CREATE COLLATION level4 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level4');
+CREATE COLLATION identic (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-identic');
+
+-- invisible separator ignored at all levels except identic
+SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true
+SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false
+
+-- punctuation ignored at level 3 but not at level 4
+SELECT 'x-y' = 'x_y' COLLATE level3; -- true
+SELECT 'x-y' = 'x_y' COLLATE level4; -- false
+
+
+
+
+
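The basic normalization described in this section can be checked directly. The following is a minimal sketch, assuming a UTF8-encoded database on a server built with ICU support. The collation names norm_identic and norm_deterministic are hypothetical and chosen only for illustration.

```sql
-- Hypothetical collation names; assumes a UTF8 database with ICU support.
CREATE COLLATION norm_identic (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-identic');
CREATE COLLATION norm_deterministic (PROVIDER = icu, LOCALE = 'und');

-- 'a' followed by a combining acute accent, versus the precomposed 'á'.
-- Canonically equivalent sequences should compare equal even at identic:
SELECT U&'\0061\0301' = U&'\00E1' COLLATE norm_identic;

-- A deterministic collation treats strings that differ byte-wise as distinct,
-- so the same comparison should not report equality here:
SELECT U&'\0061\0301' = U&'\00E1' COLLATE norm_deterministic;
```

If the text above holds for the installed ICU version, the first query should return true and the second false.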
+ + Collation Settings for an ICU Locale + + + ICU Collation Settings + + + + Key + Values + Default + Description + + + + + ks + level1, level2, level3, level4, identic + level3 + + Sensitivity (or "strength") when determining equality, with + level1 the least sensitive to differences and + identic the most sensitive to differences. See + for details. + + + + ka + noignore, shifted + noignore + + If set to shifted, causes some characters + (e.g. punctuation or space) to be ignored in comparison. Key + ks must be set to level3 or + lower to take effect. Set key kv to control which + character classes are ignored. + + + + kb + true, false + false + + Backwards comparison for the level 2 differences. For example, + locale und-u-kb sorts 'àe' + before 'aé'. + + + + kk + true, false + false + + + Enable full normalization; may affect performance. Basic + normalization is performed even when set to + false. Locales for languages that require full + normalization typically enable it by default. + + + Full normalization is important in some cases, such as when + multiple accents are applied to a single character. For instance, + 'ệ' can be composed of code points + U&'\0065\0323\0302' or + U&'\0065\0302\0323'. With full normalization + on, these code point sequences are treated as equal; otherwise they + are unequal. + + + + + kc + true, false + false + + + Separates case into a "level 2.5" that falls between accents and + other level 3 features. + + + If set to true and ks is set + to level1, will ignore accents but take case + into account. + + + + + kf + + upper, lower, + false + + false + + If set to upper, upper case sorts before lower + case. If set to lower, lower case sorts before + upper case. If set to false, the sort depends on + the rules of the locale. + + + + kn + true, false + false + + If set to true, numbers within a string are + treated as a single numeric value rather than a sequence of + digits. For example, 'id-45' sorts before + 'id-123'. 
+ + + + kr + + space, punct, + symbol, currency, + digit, script-id + + + + + Set to one or more of the valid values, or any BCP 47 + script-id, e.g. latn + ("Latin") or grek ("Greek"). Multiple values are + separated by "-". + + + Redefines the ordering of classes of characters; those characters + belonging to a class earlier in the list sort before characters + belonging to a class later in the list. For instance, the value + digit-currency-space (as part of a language tag + like und-u-kr-digit-currency-space) sorts + punctuation before digits and spaces. + + + + + kv + + space, punct, + symbol, currency + + punct + + Classes of characters ignored during comparison at level 3. Setting + to a later value includes earlier values; + e.g. symbol also includes + punct and space in the + characters to be ignored. Key ka must be set to + shifted and key ks must be set + to level3 or lower to take effect. + + + + co + emoji, phonebk, standard, ... + standard + + Collation type. See for additional options and details. + + + + +
+ Defaults may depend on locale. The above table is not meant to be + complete. See for additional + options and details. +
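As one illustration of combining settings from the table, the sketch below pairs kc with ks set to level1, which the kc description says ignores accents while still taking case into account. The collation name accent_insensitive is hypothetical, and the example assumes a UTF8 database with ICU support.

```sql
-- Hypothetical collation name. kc without a value implies kc-true.
CREATE COLLATION accent_insensitive (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-kc-ks-level1');

-- Accent differences are ignored at level1:
SELECT 'resume' = 'résumé' COLLATE accent_insensitive;

-- Case differences remain significant because kc is enabled:
SELECT 'resume' = 'Resume' COLLATE accent_insensitive;
```

Under those assumptions the first comparison should be true and the second false. Dropping kc from the tag should make both comparisons true.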
+
+
+ For many collation settings, you must create the collation with
+ DETERMINISTIC set to false for the
+ setting to have the desired effect (see ). Additionally, some settings
+ only take effect when the key ka is set to
+ shifted (see ).
+
+
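The effect of the DETERMINISTIC option on a setting such as ks can be seen by creating the same locale twice. This is a sketch with hypothetical collation names (ci_deterministic, ci_nondeterministic), assuming a server built with ICU support.

```sql
-- Hypothetical collation names; both use the same language tag.
CREATE COLLATION ci_deterministic (PROVIDER = icu, LOCALE = 'und-u-ks-level2');
CREATE COLLATION ci_nondeterministic (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level2');

-- Deterministic (the default): strings that differ byte-wise are never equal,
-- so the case-insensitive setting has no visible effect on equality.
SELECT 'abc' = 'ABC' COLLATE ci_deterministic;     -- false

-- Nondeterministic: the ks-level2 setting decides equality.
SELECT 'abc' = 'ABC' COLLATE ci_nondeterministic;  -- true
```

A deterministic collation still uses the ICU ordering for sorting. The difference shows up in equality tests and in anything built on them, such as unique constraints and grouping.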
+ + Examples + + + + CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk'); + + German collation with phone book collation type + + + + + CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji'); + + + Root collation with Emoji collation type, per Unicode Technical Standard #51 + + + + + + CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn'); + + + Sort Greek letters before Latin ones. (The default is Latin before Greek.) + + + + + + CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper'); + + + Sort upper-case letters before lower-case letters. (The default is + lower-case letters first.) + + + + + + CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn'); + + + Combines both of the above options. + + + + + + + + External References for ICU + + This section () is only a brief + overview of ICU behavior and language tags. Refer to the following + documents for technical details, additional options, and new behavior: + + + + + Unicode + Technical Standard #35 + + + + + BCP 47 + + + + + CLDR + repository + + + + + + + + + + + + + + +