diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index 6dd95b8966..be06f746a5 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en variants and customization options. + + ICU Locales + + ICU Locale Names + + The ICU format for the locale name is a Language Tag. + +CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP'); +CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr'); + + + + + Locale Canonicalization and Validation + + When defining a new ICU collation object or database with ICU as the + provider, the given locale name is transformed ("canonicalized") into a + language tag if not already in that form. For instance, + + +CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true'); +NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true" +CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8'); +NOTICE: using standard form "de-DE" for locale "de_DE.utf8" + + + If you see this notice, ensure that the PROVIDER and + LOCALE are the expected result. For consistent results + when using the ICU provider, specify the canonical language tag instead of relying on the + transformation. + + + A locale with no language name, or the special language name + root, is transformed to have the language + und ("undefined"). + + + ICU can transform most libc locale names, as well as some other formats, + into language tags for easier transition to ICU. If a libc locale name is + used in ICU, it may not have precisely the same behavior as in libc. 
+ + + If there is a problem interpreting the locale name, or if the locale name + represents a language or region that ICU does not recognize, you will see + the following warning: + + +CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense'); +WARNING: ICU locale "nonsense" has unknown language "nonsense" +HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED. +CREATE COLLATION + + + controls how the message is + reported. Unless set to ERROR, the collation will + still be created, but the behavior may not be what the user intended. + + + + Language Tag + + A language tag, defined in BCP 47, is a standardized identifier used to + identify languages, regions, and other information about a locale. + + + Basic language tags are simply + language-region; + or even just language. The + language is a language code + (e.g. fr for French), and + region is a region code + (e.g. CA for Canada). Examples: + ja-JP, de, or + fr-CA. + + + Collation settings may be included in the language tag to customize + collation behavior. ICU allows extensive customization, such as + sensitivity (or insensitivity) to accents, case, and punctuation; + treatment of digits within text; and many other options to satisfy a + variety of uses. + + + To include this additional collation information in a language tag, + append -u, which indicates there are additional + collation settings, followed by one or more + -key-value + pairs. The key is the key for a collation setting and + value is a valid value for that setting. For + boolean settings, the -key + may be specified without a corresponding + -value, which implies a + value of true. + + + For example, the language tag en-US-u-kn-ks-level2 + means the locale with the English language in the US region, with + collation settings kn set to true + and ks set to level2. 
Those + settings mean the collation will be case-insensitive and treat a sequence + of digits as a single number: + + +CREATE COLLATION mycollation5 (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'en-US-u-kn-ks-level2'); +SELECT 'aB' = 'Ab' COLLATE mycollation5 as result; + result +-------- + t +(1 row) + +SELECT 'N-45' < 'N-123' COLLATE mycollation5 as result; + result +-------- + t +(1 row) + + + + See for details and additional + examples of using language tags with custom collation information for the + locale. + + + Problems @@ -658,6 +785,13 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR"; code byte values. + + + The C and POSIX locales may behave + differently depending on the database encoding. + + + Additionally, two SQL standard collation names are available: @@ -869,132 +1003,24 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE'); ICU Collations - - ICU allows collations to be customized beyond the basic language+country - set that is preloaded by initdb. Users are encouraged - to define their own collation objects that make use of these facilities to - suit the sorting behavior to their requirements. - See - and for - information on ICU locale naming. The set of acceptable names and - attributes depends on the particular ICU version. - - - - Here are some examples: - - - - CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk'); - CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook'); - - German collation with phone book collation type - - The first example selects the ICU locale using a language - tag per BCP 47. The second example uses the traditional - ICU-specific locale syntax. The first style is preferred going - forward, and is used internally to store locales. - - - Note that you can name the collation objects in the SQL environment - anything you want. 
In this example, we follow the naming style that - the predefined collations use, which in turn also follow BCP 47, but - that is not required for user-defined collations. - - - - - - CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji'); - CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji'); - - - Root collation with Emoji collation type, per Unicode Technical Standard #51 - - - Observe how in the traditional ICU locale naming system, the root - locale is selected by an empty string. - - - - - - CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn'); - CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn'); - - - Sort Greek letters before Latin ones. (The default is Latin before Greek.) - - - - - - CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper'); - CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper'); - - - Sort upper-case letters before lower-case letters. (The default is - lower-case letters first.) - - - - - - CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn'); - CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn'); - - - Combines both of the above options. - - - - - - CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true'); - CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes'); - - - Numeric ordering, sorts sequences of digits by their numeric value, - for example: A-21 < A-123 - (also known as natural sort). - - - - - - See Unicode - Technical Standard #35 - and BCP 47 for - details. The list of possible collation types (co - subtag) can be found in - the CLDR - repository. 
- - - - Note that while this system allows creating collations that ignore - case or ignore accents or similar (using the - ks key), in order for such collations to act in a - truly case- or accent-insensitive manner, they also need to be declared as not - deterministic in CREATE COLLATION; - see . - Otherwise, any strings that compare equal according to the collation but - are not byte-wise equal will be sorted according to their byte values. - - - - By design, ICU will accept almost any string as a locale name and match - it to the closest locale it can provide, using the fallback procedure - described in its documentation. Thus, there will be no direct feedback - if a collation specification is composed using features that the given - ICU installation does not actually support. It is therefore recommended - to create application-level test cases to check that the collation - definitions satisfy one's requirements. - - - + ICU collations can be created like: + +CREATE COLLATION german (provider = icu, locale = 'de-DE'); + + + ICU locales are specified as BCP 47 language tags, but most + libc-style locale names are also accepted. If possible, libc-style locale names are + transformed into language tags. + + + New ICU collations can customize collation behavior extensively by + including collation attributes in the language tag. See for details and examples. + + Copying Collations @@ -1072,6 +1098,421 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr + + ICU Custom Collations + + + ICU allows extensive control over collation behavior by defining new + collations with collation settings as a part of the language tag. These + settings can modify the collation order to suit a variety of needs. 
For + instance: + + +-- ignore differences in accents and case +CREATE COLLATION ignore_accent_case (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1'); +SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true +SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true + +-- upper case letters sort before lower case. +CREATE COLLATION upper_first (PROVIDER=icu, LOCALE = 'und-u-kf-upper'); +SELECT 'B' < 'b' COLLATE upper_first; -- true + +-- treat digits numerically and ignore punctuation +CREATE COLLATION num_ignore_punct (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kn'); +SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true +SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true + + + Many of the available options are described in , or see for more details. + + + ICU Comparison Levels + + Comparison of two strings (collation) in ICU is determined by a + multi-level process, where textual features are grouped into + "levels". Treatment of each level is controlled by the collation settings. Higher + levels correspond to finer textual features. + + + + ICU Collation Levels + + + + Level + Description + 'f' = 'f' + 'ab' = U&'a\2063b' + 'x-y' = 'x_y' + 'g' = 'G' + 'n' = 'ñ' + 'y' = 'z' + + + + + level1 + Base Character + true + true + true + true + true + false + + + level2 + Accents + true + true + true + true + false + false + + + level3 + Case/Variants + true + true + true + false + false + false + + + level4 + Punctuation + true + true + false + false + false + false + + + identic + All + true + false + false + false + false + false + + + +
+ + The above table shows which textual feature differences are + considered significant when determining equality at the given level. The + Unicode character U+2063 is an invisible separator, + and as seen in the table, is ignored at all levels of comparison less + than identic. +
+ + At every level, even with full normalization off, basic normalization is + performed. For example, 'á' may be composed of the + code points U&'\0061\0301' or the single code + point U&'\00E1', and those sequences will be + considered equal even at the identic level. To treat + any difference in code point representation as distinct, use a collation + created with DETERMINISTIC set to + true. + + + Collation Level Examples + + + +CREATE COLLATION level3 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level3'); +CREATE COLLATION level4 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level4'); +CREATE COLLATION identic (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-identic'); + +-- invisible separator ignored at all levels except identic +SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true +SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false + +-- punctuation ignored at level3 but not at level4 +SELECT 'x-y' = 'x_y' COLLATE level3; -- true +SELECT 'x-y' = 'x_y' COLLATE level4; -- false + + + + + 
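The basic normalization behavior described in this section can be checked directly. The following is a minimal sketch and not part of the patch: the collation names norm_nondet and norm_det are hypothetical, and it assumes a UTF-8 database on a server built with ICU support.

```sql
-- Hypothetical collation names, used only for illustration.
CREATE COLLATION norm_nondet (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-identic');
CREATE COLLATION norm_det (PROVIDER = icu, DETERMINISTIC = true, LOCALE = 'und-u-ks-identic');

-- 'a' followed by a combining acute accent, compared with precomposed U+00E1.
SELECT U&'\0061\0301' = U&'\00E1' COLLATE norm_nondet AS equal_nondeterministic;
SELECT U&'\0061\0301' = U&'\00E1' COLLATE norm_det AS equal_deterministic;
```

Going by the paragraph on basic normalization, the first query should report the two spellings as equal, while the deterministic collation should treat them as distinct because their code points differ.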
+ + Collation Settings for an ICU Locale + + + ICU Collation Settings + + + + Key + Values + Default + Description + + + + + ks + level1, level2, level3, level4, identic + level3 + + Sensitivity (or "strength") when determining equality, with + level1 the least sensitive to differences and + identic the most sensitive to differences. See + for details. + + + + ka + noignore, shifted + noignore + + If set to shifted, causes some characters + (e.g. punctuation or space) to be ignored in comparison. Key + ks must be set to level3 or + lower to take effect. Set key kv to control which + character classes are ignored. + + + + kb + true, false + false + + Backwards comparison for the level 2 differences. For example, + locale und-u-kb sorts 'àe' + before 'aé'. + + + + kk + true, false + false + + + Enable full normalization; may affect performance. Basic + normalization is performed even when set to + false. Locales for languages that require full + normalization typically enable it by default. + + + Full normalization is important in some cases, such as when + multiple accents are applied to a single character. For instance, + 'ệ' can be composed of code points + U&'\0065\0323\0302' or + U&'\0065\0302\0323'. With full normalization + on, these code point sequences are treated as equal; otherwise they + are unequal. + + + + + kc + true, false + false + + + Separates case into a "level 2.5" that falls between accents and + other level 3 features. + + + If set to true and ks is set + to level1, will ignore accents but take case + into account. + + + + + kf + + upper, lower, + false + + false + + If set to upper, upper case sorts before lower + case. If set to lower, lower case sorts before + upper case. If set to false, the sort depends on + the rules of the locale. + + + + kn + true, false + false + + If set to true, numbers within a string are + treated as a single numeric value rather than a sequence of + digits. For example, 'id-45' sorts before + 'id-123'. 
+ + + + kr + + space, punct, + symbol, currency, + digit, script-id + + + + + Set to one or more of the valid values, or any BCP 47 + script-id, e.g. latn + ("Latin") or grek ("Greek"). Multiple values are + separated by "-". + + + Redefines the ordering of classes of characters; those characters + belonging to a class earlier in the list sort before characters + belonging to a class later in the list. For instance, the value + digit-currency-space (as part of a language tag + like und-u-kr-digit-currency-space) sorts + punctuation before digits and spaces. + + + + + kv + + space, punct, + symbol, currency + + punct + + Classes of characters ignored during comparison at level 3. Setting + to a later value includes earlier values; + e.g. symbol also includes + punct and space in the + characters to be ignored. Key ka must be set to + shifted and key ks must be set + to level3 or lower to take effect. + + + + co + emoji, phonebk, standard, ... + standard + + Collation type. See for additional options and details. + + + + +
+ Defaults may depend on locale. The above table is not meant to be + complete. See for additional + options and details. +
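Two of the less familiar keys in the table, kb and kc, may be easier to follow with a small experiment. This is a sketch under assumptions rather than part of the patch: the collation names backwards_accents and accents_off_case_on are hypothetical, and a UTF-8 database with ICU support is assumed.

```sql
-- kb: backwards comparison of level 2 (accent) differences.
-- Hypothetical collation name, used only for illustration.
CREATE COLLATION backwards_accents (PROVIDER = icu, LOCALE = 'und-u-kb');
SELECT s
FROM (VALUES ('aé'), ('àe')) AS t(s)
ORDER BY s COLLATE backwards_accents;

-- kc with ks-level1: ignore accents but still distinguish case.
-- Keys are written in alphabetical order; kc without a value implies true.
CREATE COLLATION accents_off_case_on (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-kc-ks-level1');
SELECT 'resume' = 'résumé' COLLATE accents_off_case_on AS accents_ignored;
SELECT 'resume' = 'Resume' COLLATE accents_off_case_on AS case_ignored;
```

If the settings behave as the table describes, the ordered query lists 'àe' first, and only the accent comparison reports equality. The second collation is declared with DETERMINISTIC = false because equality between strings that differ in their bytes is only possible for nondeterministic collations.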
+ + + For many collation settings, you must create the collation with + set to false for the + setting to have the desired effect (see ). Additionally, some settings + only take effect when the key ka is set to + shifted (see ). + + +
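The two conditions in this note can be observed side by side. The sketch below is illustrative only and not part of the patch: all four collation names are hypothetical, and a server built with ICU support is assumed.

```sql
-- Same locale, differing only in DETERMINISTIC (hypothetical names).
CREATE COLLATION level2_det (PROVIDER = icu, DETERMINISTIC = true, LOCALE = 'und-u-ks-level2');
CREATE COLLATION level2_nondet (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level2');
SELECT 'abc' = 'ABC' COLLATE level2_det AS equal_deterministic;
SELECT 'abc' = 'ABC' COLLATE level2_nondet AS equal_nondeterministic;

-- Same strength, differing only in whether ka is set to shifted.
CREATE COLLATION punct_kept (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level3');
CREATE COLLATION punct_shifted (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-ks-level3');
SELECT 'x-y' = 'x_y' COLLATE punct_kept AS equal_without_shifted;
SELECT 'x-y' = 'x_y' COLLATE punct_shifted AS equal_with_shifted;
```

A deterministic collation breaks ties by byte value, so the case-insensitive setting only yields equality in the nondeterministic variant. Likewise, punctuation differences are expected to be ignored only when ka is set to shifted.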
+ + Examples + + + + CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk'); + + German collation with phone book collation type + + + + + CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji'); + + + Root collation with Emoji collation type, per Unicode Technical Standard #51 + + + + + + CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn'); + + + Sort Greek letters before Latin ones. (The default is Latin before Greek.) + + + + + + CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper'); + + + Sort upper-case letters before lower-case letters. (The default is + lower-case letters first.) + + + + + + CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn'); + + + Combines both of the above options. + + + + + + + + External References for ICU + + This section () is only a brief + overview of ICU behavior and language tags. Refer to the following + documents for technical details, additional options, and new behavior: + + + + + Unicode + Technical Standard #35 + + + + + BCP 47 + + + + + CLDR + repository + + + + + + + + + + + + + + +