Re: pg_collation.collversion for C.UTF-8
From | Jeff Davis |
---|---|
Subject | Re: pg_collation.collversion for C.UTF-8 |
Date | |
Msg-id | 56ef55fc2212334e1f72b3d8128106e9ab37fe5a.camel@j-davis.com |
In response to | Re: pg_collation.collversion for C.UTF-8 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: pg_collation.collversion for C.UTF-8 |
List | pgsql-hackers |
On Thu, 2023-05-25 at 14:48 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > What should we do with locales like C.UTF-8 in both libc and ICU?
>
> I vote for passing those to the existing C-specific code paths,

Great, this would be a big step toward solving the ICU usability issues in this thread:

https://postgr.es/m/000b01d97465%24c34bbd60%2449e33820%24%40pcorp.us

> Probably "C", or "C.anything", or "POSIX", or "POSIX.anything".
> Case-independent might be good, but we haven't accepted such in
> the past, so I don't feel strongly about it. (Arguably, passing
> lower case "c" to the provider would provide an "out" to anybody
> who dislikes our choices here.)

Patch attached with your suggestions. It's based on the first patch in the series I posted here:

https://postgr.es/m/a4388fa3acabf7794ac39fdb471ad97eebdfbe11.camel@j-davis.com

We still need to consider backwards compatibility. If someone has a collation with locale name C.UTF-8 in an earlier version, any change to the interpretation of that locale name after an upgrade carries a corruption risk. The risks are different in ICU vs libc:

For ICU: iculocale=C in an earlier version was a mistake that must have been explicitly requested by the user. However, if such a mistake was made, the indexes would have been created using the ICU root locale, which is very different from the C locale. So reinterpreting iculocale=C as memcmp() would be likely to result in index corruption. Patch 0002 (also based on a patch from the series linked above) solves this with a pg_upgrade check for iculocale=C in versions 15 and earlier. The upgrade check is not likely to affect many users, and those it does affect have a mis-defined collation and would benefit from the check.

For libc: this change may affect any user who happened to have LANG=C.UTF-8 in their environment at initdb time, which is probably a lot of users, and some buildfarm members. However, the average risk seems to be much lower, because we've gone a long time with the assumption that C.UTF-8 has the same behavior as C, and this only recently came up. Also, I'm not sure how obscure the cases are even if there is a difference; perhaps they don't often occur in practice? It's not clear to me how we mitigate this risk further, though.

Regards,
	Jeff Davis
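
For readers following along, the name-matching rule quoted above ("C", "C.anything", "POSIX", "POSIX.anything") could be sketched roughly as follows. This is only an illustration and is not taken from the attached patch: the function names locale_name_is_c and has_base_name are hypothetical, and the comparison is assumed to be case-sensitive, so that a lower case "c" would still be passed through to the provider.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): true if "name" is exactly
 * "base", or "base" followed by '.' and any suffix (e.g. "C.UTF-8").
 */
static bool
has_base_name(const char *name, const char *base)
{
	size_t		len = strlen(base);

	/* Case-sensitive prefix comparison; strncmp stops at a NUL in name. */
	if (strncmp(name, base, len) != 0)
		return false;

	/* Accept only an exact match or a '.' right after the base name. */
	return name[len] == '\0' || name[len] == '.';
}

/*
 * Hypothetical check: should this locale name be routed to the
 * C-specific (memcmp-based) code paths?
 */
bool
locale_name_is_c(const char *name)
{
	assert(name != NULL);

	return has_base_name(name, "C") || has_base_name(name, "POSIX");
}
```

Under these assumptions, "C.UTF-8" and "POSIX" match, while "c" and "en_US.UTF-8" do not. Note that this sketch also accepts a bare "C." with an empty suffix; whether that should count as "C.anything" is not settled by the discussion above.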