Re: Unicode FFFF Special Codepoint should always collate high.
От | Thomas Munro |
---|---|
Тема | Re: Unicode FFFF Special Codepoint should always collate high. |
Дата | |
Msg-id | CA+hUKGKnYyS01bVmY7p6ozBgB0ZA=ReSkyWXG5bbHUt6fh-aig@mail.gmail.com обсуждение исходный текст |
Ответ на | Unicode FFFF Special Codepoint should always collate high. (Telford Tendys <psql@lnx-bsp.net>) |
Ответы |
Re: Unicode FFFF Special Codepoint should always collate high.
|
Список | pgsql-bugs |
On Tue, Jun 22, 2021 at 9:39 PM Telford Tendys <psql@lnx-bsp.net> wrote: > The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting > after the non-character codepoint 0xffff, which is supposed to always have > the highest possible primary weight in all locales, and it is the only > codepoint available to serve this purpose. The other 4-byte non-character > codepoints also incorrectly sort lower than real characters. Not an expert in this subject (and to make things more interesting, unicode.org has temporarily fallen off the internet, as mentioned in another thread nearby), but definitely curious... I guess this might refer to TR35: U+FFFF: This code point is tailored to have a primary weight higher than all other characters. This allows the reliable specification of a range, such as “Sch” ≤ X ≤ “Sch\uFFFF”, to include all strings starting with "sch" or equivalent. U+FFFE: This code point produces a CE with minimal, unique weights on primary and identical levels. For details see the CLDR Collation Algorithm above. Considering the squirrelly definition of noncharacters and their status as special values for internal use (internal to what?) and not for data interchange, and the specification of that rule with in the document controlling markup of collation rules (is it also specified somewhere else?), is this actually required to work the way you expect when external users of a conforming collation algorithm sort them? That's not a rhetorical question, I don't know the answer. In any case, this is not a PostgreSQL issue, surely. We simply call out to libc (see your manufacturer for more details) or ICU. Just for interest, I tried your query on a FreeBSD 13 system (its libc is based on UCA/CLDR too) and the output matched your RHEL8 example (using COLLATE "en_US.UTF-8"). I tried it on ICU (using COLLATE "en-x-icu") and it gave a third output, matching neither RHEL7 nor RHEL8 but instead putting U&'x\FFFF' last, which seems to match what you expect: tmunro=> SELECT ENCODE(x::bytea, 'hex') FROM ( SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF', U&'x\+01FFFF', 'x '])::text AS x ) z ORDER BY x COLLATE "en-x-icu"; encode ------------ 78 7820 782f 78f09fbfbf 78f0afbfbf 78efbfbf (6 rows)
В списке pgsql-bugs по дате отправления: