Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values
| From | Peter Geoghegan |
|---|---|
| Subject | Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values |
| Date | |
| Msg-id | CAH2-Wzn0idkTAqz5xpSC_AiiyBVaZTKMQfzqsyQPkxh8TSP0yA@mail.gmail.com |
| In reply to | Re: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values (Peter Geoghegan <pg@bowt.ie>) |
| List | pgsql-bugs |
On Thu, Aug 17, 2017 at 6:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> My argument for doing this is very simple: ICU/CLDR/BCP 47 provides
> stability guarantees for locales, not collations [1]. For example, as
> we discussed, de_BE didn't actually go away -- it just stopped being a
> distinct collation within ICU, for reasons that are implementation
> defined.

I have data to back this up. I attach 2 files: one is a listing of locale XML files from within CLDR 1.9's ./common/main/, dating from December 2010, and the other is a similar listing for CLDR 31, dating from April 2017. This roughly covers every ICU version we'll support on day 1. The listings are sorted alphabetically, to ease comparison.

Summary:

$ cat locale_list_cldr-19.txt | wc -l
605
$ cat locale_list_cldr-31.txt | wc -l
722
$ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep "^-[a-zA-Z]" | wc -l
144
$ diff -d -u locale_list_cldr-19.txt locale_list_cldr-31.txt | grep "^+[a-zA-Z]" | wc -l
261

So, there have been 144 locales removed in that time, and 261 added. My proposal to standardize on using all locales ICU makes available, rather than all behaviorally distinct collations, clearly does not ensure perfect stability. It does actually work pretty well in practice, though.

The number 144 is misleadingly high. If you actually look at what went away in detail, it looks like there are a lot of script variants of the same language/country code. Plus, the changes themselves are non-technical in nature. The churn seems to be in part due to geopolitical changes, such as 5 years [1] passing after the dissolution of Serbia and Montenegro. However, it is mostly due to switching from ISO 639-1 to ISO 639-3 codes in cases where a finer distinction about cultural preferences needed to be made (note that they still only list *macro* language/region/script combinations as distinct collations).
For example, Kurdish went from being "ku-" to 3 different macro languages: "ckb-" (Central Kurdish), "kmr-" (Northern Kurdish), and "sdh-" (Southern Kurdish). Wikipedia says of ISO 639-3: "Because it provides comprehensive language coverage, giving equal opportunity for all languages, and because of its wide adoption in information technologies, ISO 639-3 provides an important technology component addressing the digital divide problem". We can hope that it will be the last such revision ever needed, because this digital divide problem is solved once and for all, at least as far as these standards go.

CLDR prefers to use ISO 639-1 language codes for compatibility [2], which is why the language codes are mostly still 2 letters (ISO 639-1). "en" did not change to "eng", because there was no cultural reason to do so, and thus there was a 1:1 mapping between "en" and "eng" anyway. Regions/countries will only change due to rare geopolitical events.

In summary, I think that these changes are fairly low impact in practice, and are entirely explainable by political changes and cultural controversies. They really are minimal, because CLDR/ICU really does take the stability of collation names seriously. We can and should ensure that locales like "de_BE" are available in every ICU version, because that is an inexcusable technical oversight, and is not due to a cultural or political issue.

[1] http://cldr.unicode.org/index/process/cldr-data-retention-policy
[2] http://www.unicode.org/reports/tr35/#unicode_language_subtag_validity

-- 
Peter Geoghegan
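As an aside on the removed/added counts quoted in the summary: the same numbers can be obtained as plain set differences between the two listings, which also makes it easy to inspect which names went away. The following is only a minimal sketch in Python; the function name `locale_churn` and the sample file names are made up for illustration and are not the actual contents of the attached listings.

```python
def locale_churn(old_listing, new_listing):
    """Return (removed, added) names between two locale listings.

    Each argument is an iterable of lines, one locale file name per line.
    Blank lines are ignored. Assuming each listing has no duplicate
    entries, the lengths of the two results correspond to the
    `grep "^-[a-zA-Z]"` and `grep "^+[a-zA-Z]"` counts taken from the
    unified diff.
    """
    old = {line.strip() for line in old_listing if line.strip()}
    new = {line.strip() for line in new_listing if line.strip()}
    return sorted(old - new), sorted(new - old)


# Made-up sample data, NOT the real CLDR listings.
old_sample = ["de.xml", "de_BE.xml", "ku.xml", "sr_Cyrl_CS.xml"]
new_sample = ["ckb.xml", "de.xml", "de_BE.xml", "sr_Cyrl_RS.xml"]

removed, added = locale_churn(old_sample, new_sample)
print(len(removed), "removed:", removed)
print(len(added), "added:", added)
```

With the real files, the listings could be passed in as open file objects, for example `locale_churn(open("locale_list_cldr-19.txt"), open("locale_list_cldr-31.txt"))`.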