diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 6dd95b89664..be06f746a59 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -377,7 +377,134 @@ initdb --locale-provider=icu --icu-locale=en
variants and customization options.
+
+ ICU Locales
+
+ ICU Locale Names
+
+ The ICU format for the locale name is a Language Tag.
+
+CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP');
+CREATE COLLATION mycollation2 (PROVIDER = icu, LOCALE = 'fr');
+
+
+
+
+ Locale Canonicalization and Validation
+
+ When defining a new ICU collation object or database with ICU as the
+ provider, the given locale name is transformed ("canonicalized") into a
+ language tag if not already in that form. For instance,
+
+
+CREATE COLLATION mycollation3 (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
+NOTICE: using standard form "en-US-u-kn" for locale "en-US-u-kn-true"
+CREATE COLLATION mycollation4 (PROVIDER = icu, LOCALE = 'de_DE.utf8');
+NOTICE: using standard form "de-DE" for locale "de_DE.utf8"
+
+
+ If you see this notice, ensure that the PROVIDER and
+ LOCALE are what you intended. For consistent results
+ when using the ICU provider, specify the canonical language tag instead of relying on the
+ transformation.
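+
+ As a minimal sketch, the stored form can be inspected in the
+ pg_collation catalog. The column name
+ colliculocale is an assumption tied to the release this
+ patch targets and may differ in other releases:
+
+-- assumes mycollation4 was created as shown above
+SELECT collname, colliculocale FROM pg_collation WHERE collname = 'mycollation4';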
+
+
+ A locale with no language name, or the special language name
+ root, is transformed to have the language
+ und ("undefined").
+
+
+ ICU can transform most libc locale names, as well as some other formats,
+ into language tags for easier transition to ICU. If a libc locale name is
+ used in ICU, it may not have precisely the same behavior as in libc.
+
+
+ If there is a problem interpreting the locale name, or if the locale name
+ represents a language or region that ICU does not recognize, you will see
+ the following warning:
+
+
+CREATE COLLATION nonsense (PROVIDER = icu, LOCALE = 'nonsense');
+WARNING: ICU locale "nonsense" has unknown language "nonsense"
+HINT: To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+CREATE COLLATION
+
+
+ The icu_validation_level parameter controls how the message is
+ reported. Unless set to ERROR, the collation will
+ still be created, but the behavior may not be what the user intended.
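+
+ For illustration, and assuming a stricter check is wanted than the
+ warning shown above, raising the level for the session should make the
+ same kind of statement fail rather than create the collation (the
+ collation name nonsense2 is hypothetical):
+
+SET icu_validation_level = ERROR;
+-- expected to raise an error instead of a warning, so no collation is created
+CREATE COLLATION nonsense2 (PROVIDER = icu, LOCALE = 'nonsense');
+RESET icu_validation_level;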
+
+
+
+ Language Tag
+
+ A language tag, defined in BCP 47, is a standardized identifier used to
+ identify languages, regions, and other information about a locale.
+
+
+ Basic language tags are simply
+ language-region,
+ or even just language. The
+ language is a language code
+ (e.g. fr for French), and
+ region is a region code
+ (e.g. CA for Canada). Examples:
+ ja-JP, de, or
+ fr-CA.
+
+
+ Collation settings may be included in the language tag to customize
+ collation behavior. ICU allows extensive customization, such as
+ sensitivity (or insensitivity) to accents, case, and punctuation;
+ treatment of digits within text; and many other options to satisfy a
+ variety of uses.
+
+
+ To include this additional collation information in a language tag,
+ append -u, which indicates there are additional
+ collation settings, followed by one or more
+ -key-value
+ pairs. The key is the key for a collation setting and
+ value is a valid value for that setting. For
+ boolean settings, the -key
+ may be specified without a corresponding
+ -value, which implies a
+ value of true.
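+
+ As a small sketch of the boolean shorthand (the collation names are
+ hypothetical), the two definitions below request the same setting; the
+ first is reported in its standard form, as described under locale
+ canonicalization:
+
+CREATE COLLATION numeric_long (PROVIDER = icu, LOCALE = 'en-US-u-kn-true');
+CREATE COLLATION numeric_short (PROVIDER = icu, LOCALE = 'en-US-u-kn');
+-- both comparisons treat 45 and 123 as numbers rather than digit sequences
+SELECT 'N-45' < 'N-123' COLLATE numeric_long AS long_form,
+       'N-45' < 'N-123' COLLATE numeric_short AS short_form;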
+
+
+ For example, the language tag en-US-u-kn-ks-level2
+ means the locale with the English language in the US region, with
+ collation settings kn set to true
+ and ks set to level2. Those
+ settings mean the collation will be case-insensitive and treat a sequence
+ of digits as a single number:
+
+
+CREATE COLLATION mycollation5 (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'en-US-u-kn-ks-level2');
+SELECT 'aB' = 'Ab' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+
+SELECT 'N-45' < 'N-123' COLLATE mycollation5 as result;
+ result
+--------
+ t
+(1 row)
+
+
+
+ See ICU Custom Collations for details and additional
+ examples of using language tags with custom collation information for the
+ locale.
+
+
+ Problems
@@ -658,6 +785,13 @@ SELECT * FROM test1 ORDER BY a || b COLLATE "fr_FR";
code byte values.
+
+
+ The C and POSIX locales may behave
+ differently depending on the database encoding.
+
+
+
Additionally, two SQL standard collation names are available:
@@ -869,132 +1003,24 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
ICU Collations
-
- ICU allows collations to be customized beyond the basic language+country
- set that is preloaded by initdb. Users are encouraged
- to define their own collation objects that make use of these facilities to
- suit the sorting behavior to their requirements.
- See
- and for
- information on ICU locale naming. The set of acceptable names and
- attributes depends on the particular ICU version.
-
-
-
- Here are some examples:
-
-
-
- CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');
- CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de@collation=phonebook');
-
- German collation with phone book collation type
-
- The first example selects the ICU locale using a language
- tag per BCP 47. The second example uses the traditional
- ICU-specific locale syntax. The first style is preferred going
- forward, and is used internally to store locales.
-
-
- Note that you can name the collation objects in the SQL environment
- anything you want. In this example, we follow the naming style that
- the predefined collations use, which in turn also follow BCP 47, but
- that is not required for user-defined collations.
-
-
-
-
-
- CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');
- CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = '@collation=emoji');
-
-
- Root collation with Emoji collation type, per Unicode Technical Standard #51
-
-
- Observe how in the traditional ICU locale naming system, the root
- locale is selected by an empty string.
-
-
-
-
-
- CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');
- CREATE COLLATION latinlast (provider = icu, locale = 'en@colReorder=grek-latn');
-
-
- Sort Greek letters before Latin ones. (The default is Latin before Greek.)
-
-
-
-
-
- CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
- CREATE COLLATION upperfirst (provider = icu, locale = 'en@colCaseFirst=upper');
-
-
- Sort upper-case letters before lower-case letters. (The default is
- lower-case letters first.)
-
-
-
-
-
- CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');
- CREATE COLLATION special (provider = icu, locale = 'en@colCaseFirst=upper;colReorder=grek-latn');
-
-
- Combines both of the above options.
-
-
-
-
-
- CREATE COLLATION numeric (provider = icu, locale = 'en-u-kn-true');
- CREATE COLLATION numeric (provider = icu, locale = 'en@colNumeric=yes');
-
-
- Numeric ordering, sorts sequences of digits by their numeric value,
- for example: A-21 < A-123
- (also known as natural sort).
-
-
-
-
-
- See Unicode
- Technical Standard #35
- and BCP 47 for
- details. The list of possible collation types (co
- subtag) can be found in
- the CLDR
- repository.
-
-
-
- Note that while this system allows creating collations that ignore
- case or ignore accents or similar (using the
- ks key), in order for such collations to act in a
- truly case- or accent-insensitive manner, they also need to be declared as not
- deterministic in CREATE COLLATION;
- see .
- Otherwise, any strings that compare equal according to the collation but
- are not byte-wise equal will be sorted according to their byte values.
-
-
-
- By design, ICU will accept almost any string as a locale name and match
- it to the closest locale it can provide, using the fallback procedure
- described in its documentation. Thus, there will be no direct feedback
- if a collation specification is composed using features that the given
- ICU installation does not actually support. It is therefore recommended
- to create application-level test cases to check that the collation
- definitions satisfy one's requirements.
-
-
-
+ ICU collations can be created like:
+
+CREATE COLLATION german (provider = icu, locale = 'de-DE');
+
+
+ ICU locales are specified as a BCP 47 Language Tag, but can also accept most
+ libc-style locale names. If possible, libc-style locale names are
+ transformed into language tags.
+
+
+ New ICU collations can customize collation behavior extensively by
+ including collation attributes in the language tag. See ICU Custom Collations for details and examples.
+
+
Copying Collations
@@ -1072,6 +1098,421 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
+
+ ICU Custom Collations
+
+
+ ICU allows extensive control over collation behavior by defining new
+ collations with collation settings as a part of the language tag. These
+ settings can modify the collation order to suit a variety of needs. For
+ instance:
+
+
+-- ignore differences in accents and case
+CREATE COLLATION ignore_accent_case (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-level1');
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
+
+-- upper case letters sort before lower case.
+CREATE COLLATION upper_first (PROVIDER=icu, LOCALE = 'und-u-kf-upper');
+SELECT 'B' < 'b' COLLATE upper_first; -- true
+
+-- treat digits numerically and ignore punctuation
+CREATE COLLATION num_ignore_punct (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kn');
+SELECT 'id-45' < 'id-123' COLLATE num_ignore_punct; -- true
+SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
+
+
+ Many of the available options are described in Collation Settings for an ICU Locale, or see External References for ICU for more details.
+
+
+ ICU Comparison Levels
+
+ Comparison of two strings (collation) in ICU is determined by a
+ multi-level process, where textual features are grouped into
+ "levels". Treatment of each level is controlled by the collation settings. Higher
+ levels correspond to finer textual features.
+
+
+
+
+ The above table shows which textual feature differences are
+ considered significant when determining equality at the given level. The
+ Unicode character U+2063 is an invisible separator,
+ and as seen in the table, is ignored at all levels of comparison less
+ than identic.
+
+
+ At every level, even with full normalization off, basic normalization is
+ performed. For example, 'á' may be composed of the
+ code points U&'\0061\0301' or the single code
+ point U&'\00E1', and those sequences will be
+ considered equal even at the identic level. To treat
+ any difference in code point representation as distinct, use a collation
+ created with DETERMINISTIC set to
+ true.
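+
+ A minimal sketch of the difference, assuming a UTF8 database; the
+ collation names are hypothetical:
+
+CREATE COLLATION identic_nondet (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ks-identic');
+CREATE COLLATION identic_det (PROVIDER = icu, DETERMINISTIC = true, LOCALE = 'und-u-ks-identic');
+
+-- decomposed and precomposed forms of the same character
+SELECT U&'\0061\0301' = U&'\00E1' COLLATE identic_nondet; -- true, per the text above
+SELECT U&'\0061\0301' = U&'\00E1' COLLATE identic_det; -- false: the byte sequences differ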
+
+
+ Collation Level Examples
+
+
+
+CREATE COLLATION level3 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level3');
+CREATE COLLATION level4 (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-level4');
+CREATE COLLATION identic (PROVIDER=icu, DETERMINISTIC=false, LOCALE='und-u-ka-shifted-ks-identic');
+
+-- invisible separator ignored at all levels except identic
+SELECT 'ab' = U&'a\2063b' COLLATE level4; -- true
+SELECT 'ab' = U&'a\2063b' COLLATE identic; -- false
+
+-- punctuation ignored at level3 but not at level4
+SELECT 'x-y' = 'x_y' COLLATE level3; -- true
+SELECT 'x-y' = 'x_y' COLLATE level4; -- false
+
+
+
+
+
+
+ Collation Settings for an ICU Locale
+
+
+ ICU Collation Settings
+
+
+
+ Key
+ Values
+ Default
+ Description
+
+
+
+
+ ks
+ level1, level2, level3, level4, identic
+ level3
+
+ Sensitivity (or "strength") when determining equality, with
+ level1 the least sensitive to differences and
+ identic the most sensitive to differences. See
+ ICU Comparison Levels for details.
+
+
+
+ ka
+ noignore, shifted
+ noignore
+
+ If set to shifted, causes some characters
+ (e.g. punctuation or space) to be ignored in comparison. Key
+ ks must be set to level3 or
+ lower to take effect. Set key kv to control which
+ character classes are ignored.
+
+
+
+ kb
+ true, false
+ false
+
+ Backwards comparison for the level 2 differences. For example,
+ locale und-u-kb sorts 'àe'
+ before 'aé'.
+
+
+
+ kk
+ true, false
+ false
+
+
+ Enable full normalization; may affect performance. Basic
+ normalization is performed even when set to
+ false. Locales for languages that require full
+ normalization typically enable it by default.
+
+
+ Full normalization is important in some cases, such as when
+ multiple accents are applied to a single character. For instance,
+ 'ệ' can be composed of code points
+ U&'\0065\0323\0302' or
+ U&'\0065\0302\0323'. With full normalization
+ on, these code point sequences are treated as equal; otherwise they
+ are unequal.
+
+
+
+
+ kc
+ true, false
+ false
+
+
+ Separates case into a "level 2.5" that falls between accents and
+ other level 3 features.
+
+
+ If set to true and ks is set
+ to level1, will ignore accents but take case
+ into account.
+
+
+
+
+ kf
+
+ upper, lower,
+ false
+
+ false
+
+ If set to upper, upper case sorts before lower
+ case. If set to lower, lower case sorts before
+ upper case. If set to false, the sort depends on
+ the rules of the locale.
+
+
+
+ kn
+ true, false
+ false
+
+ If set to true, numbers within a string are
+ treated as a single numeric value rather than a sequence of
+ digits. For example, 'id-45' sorts before
+ 'id-123'.
+
+
+
+ kr
+
+ space, punct,
+ symbol, currency,
+ digit, script-id
+
+
+
+
+ Set to one or more of the valid values, or any BCP 47
+ script-id, e.g. latn
+ ("Latin") or grek ("Greek"). Multiple values are
+ separated by "-".
+
+
+ Redefines the ordering of classes of characters; those characters
+ belonging to a class earlier in the list sort before characters
+ belonging to a class later in the list. For instance, the value
+ digit-currency-space (as part of a language tag
+ like und-u-kr-digit-currency-space) sorts
+ digits before currency symbols, and currency symbols before spaces.
+
+
+
+
+ kv
+
+ space, punct,
+ symbol, currency
+
+ punct
+
+ Classes of characters ignored during comparison at level 3. Setting
+ to a later value includes earlier values;
+ e.g. symbol also includes
+ punct and space in the
+ characters to be ignored. Key ka must be set to
+ shifted and key ks must be set
+ to level3 or lower to take effect.
+
+
+
+ co
+ emoji, phonebk, standard, ...
+ standard
+
+ Collation type. See External References for ICU for additional options and details.
+
+
+
+
+
+ Defaults may depend on locale. The above table is not meant to be
+ complete. See External References for ICU for additional
+ options and details.
+
+
+
+ For many collation settings, you must create the collation with
+ DETERMINISTIC set to false for the
+ setting to have the desired effect (see Nondeterministic Collations). Additionally, some settings
+ only take effect when the key ka is set to
+ shifted (see ICU Collation Settings).
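+
+ As a sketch of two settings from the table (the collation names are
+ hypothetical, a UTF8 database is assumed, and support for a given key
+ may depend on the ICU version in use):
+
+-- kk: full normalization, using the boolean shorthand for true
+CREATE COLLATION full_norm (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-kk');
+-- the two code point sequences described under kk
+SELECT U&'\0065\0323\0302' = U&'\0065\0302\0323' COLLATE full_norm;
+
+-- kv: restrict the ignored character classes to spaces only
+CREATE COLLATION ignore_space (PROVIDER = icu, DETERMINISTIC = false, LOCALE = 'und-u-ka-shifted-kv-space');
+SELECT 'a b' = 'ab' COLLATE ignore_space AS space_case,
+       'a-b' = 'ab' COLLATE ignore_space AS punct_case;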
+
+
+
+
+ Examples
+
+
+
+ CREATE COLLATION "de-u-co-phonebk-x-icu" (provider = icu, locale = 'de-u-co-phonebk');
+
+ German collation with phone book collation type
+
+
+
+
+ CREATE COLLATION "und-u-co-emoji-x-icu" (provider = icu, locale = 'und-u-co-emoji');
+
+
+ Root collation with Emoji collation type, per Unicode Technical Standard #51
+
+
+
+
+
+ CREATE COLLATION latinlast (provider = icu, locale = 'en-u-kr-grek-latn');
+
+
+ Sort Greek letters before Latin ones. (The default is Latin before Greek.)
+
+
+
+
+
+ CREATE COLLATION upperfirst (provider = icu, locale = 'en-u-kf-upper');
+
+
+ Sort upper-case letters before lower-case letters. (The default is
+ lower-case letters first.)
+
+
+
+
+
+ CREATE COLLATION special (provider = icu, locale = 'en-u-kf-upper-kr-grek-latn');
+
+
+ Combines both of the above options.
+
+
+
+
+
+
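+
+ One way to check that a definition behaves as intended is to sort a few
+ sample values with it. This sketch assumes the latinlast
+ and upperfirst collations from the examples above exist,
+ and that the database encoding is UTF8:
+
+SELECT x FROM (VALUES ('a'), ('α'), ('B'), ('b')) AS t(x) ORDER BY x COLLATE latinlast;
+SELECT x FROM (VALUES ('a'), ('A'), ('b'), ('B')) AS t(x) ORDER BY x COLLATE upperfirst;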
+
+ External References for ICU
+
+ This section (ICU Custom Collations) is only a brief
+ overview of ICU behavior and language tags. Refer to the following
+ documents for technical details, additional options, and new behavior:
+
+
+
+
+ Unicode
+ Technical Standard #35
+
+
+
+
+ BCP 47
+
+
+
+
+ CLDR
+ repository
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+