Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
| От | Peter Geoghegan |
|---|---|
| Тема | Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5) |
| Дата | |
| Msg-id | CAM3SWZTums8PyGAHUpj=3gpt-xiGbbLqUun3mKK8-aTh-z9R2A@mail.gmail.com обсуждение исходный текст |
| Ответ на | Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5) (Peter Geoghegan <pg@heroku.com>) |
| Ответы |
Re: Missing rows with index scan when collation is not "C"
(PostgreSQL 9.5)
Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5) |
| Список | pgsql-bugs |
On Mon, Mar 21, 2016 at 9:04 PM, Peter Geoghegan <pg@heroku.com> wrote: > Can you look at generating a textual representation of the strxfrm() > blobs in question, using Robert's tool?: > > http://www.postgresql.org/message-id/CA+TgmoaOCyQpo8HK9yr6VTuyknWWvqgo7JeXi2kb=gpNveKR+g@mail.gmail.com I played with this tool myself, on an affected CentOS 6.7 VM: [vagrant@localhost ~]$ ldd --version ldd (GNU libc) 2.12 Copyright (C) 2010 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Written by Roland McGrath and Ulrich Drepper. I now think that we have this backwards: This isn't a bug in glibc's strxfrm(); it's a bug in glibc's strcoll(). Minimal testcase with modified tool, simplified to use ascii-safe strings: [vagrant@localhost ~]$ ./a.out de_DE.UTF-8 'xxx' 'x xx' "xxx" -> 2323230108080801020202 (11 bytes) "x xx" -> 2323230108080801020202010235 (14 bytes) strcmp(arg1, arg2) result: -1 strcoll(arg1, arg2) result: 6 If we assume for the sake of argument that this is a strxfrm() bug and strcoll() is a reliable source of truth, then I find it very curious that Germany's Austrian neighbors differ on this point about how text should be collated: [vagrant@localhost ~]$ ./a.out de_AT.UTF-8 'xxx' 'x xx' "xxx" -> 2323230108080801020202 (11 bytes) "x xx" -> 2323230108080801020202010235 (14 bytes) strcmp(arg1, arg2) result: -1 strcoll(arg1, arg2) result: -1 This surely adds doubt to the idea that strxfrm() in particular is broken. I find something else inconsistent with the strxfrm() theory: even the de_DE collation gives strxfrm()/strcoll() self-consistent answers when we move the rhs argument's space to the far side of its center 'x' char: [vagrant@localhost ~]$ ./a.out de_DE.UTF-8 'xxx' 'xx x' "xxx" -> 2323230108080801020202 (11 bytes) "xx x" -> 2323230108080801020202010335 (14 bytes) strcmp(arg1, arg2) result: -1 strcoll(arg1, arg2) result: -1 It seems very unlikely that this is because of a legitimate consideration that strcoll() makes about how German should be collated (one that strxfrm() fails to make, say). This is probably a worse situation for affected Postgres systems, though, because now they have no scope to turn the faulty part of the system off. I have a hard time believing that it's a good idea to trust strcoll() to be wrong in a consistent way that has collatable type opclasses at least follow "Notes to Operator Class Implementors". I'd like to hear more opinions on that, though, because it's a tricky thing to reason about. -- Peter Geoghegan
В списке pgsql-bugs по дате отправления: