Обсуждение: Yet another LIKE-indexing scheme
I had another thought about fixing our problems with deriving index bounds for LIKE patterns in non-ASCII locales. (If you don't remember the gory details here, please re-read threadSigh, LIKE indexing is *still* broken in foreign locales from pgsql-hackers archives of 7 to 10 June, 2000; there are also many previous go-rounds about this long-standing issue.) The problems that I've been told about seem to center around one- and two-character patterns that have special sort rules in some locales. Could we work around these problems by dropping one or perhaps two characters from the end of the given LIKE prefix? For example, givenWHERE name LIKE 'foobar%' drop the last fixed character ('r') and generate index bounds from what remains, using the same algorithm as in 7.0. So the index bounds would becomeWHERE name >= 'fooba' AND name < 'foobb' (at least in ASCII locale --- to make the upper bound, we'd search for a string considered greater than 'fooba' by the local strcmp()). The truncation would need to be multibyte-aware, of course. This would, for example, fix the example given by Erich Stamberger: > Another interresting feature of Czech collation is: > > H < "CH" < I > > and: > > B < C < C + CARON < D .. < H < "CH" < I > > So what happens with "WHERE name like 'Czec%`" ? Our existing code fails because it generates WHERE name >= 'Czec' AND name < 'Czed'; it will therefore not find names beginning 'Czech' because those are in another part of the index, between 'Czeh' and 'Czei'. But WHERE name >= 'Cze' AND name < 'Czf' would work. Are there examples where this still doesn't work? (Funny sort rules for trigraphs would break it, I'm sure, unless we drop two characters instead of just one.) Obviously we could still keep the last character in ASCII locale. That would be a good thing since it'd reduce the number of tuples scanned. Is there a portable way to determine whether it's safe to do so in other locales? (Some inquiry function about whether the sort ordering has any digraph or two-to-one rules might help, but I don't know if there is one.) regards, tom lane
Hello, On Sat, 2 Sep 2000, Tom Lane wrote: > This would, for example, fix the example given by Erich Stamberger: > > > Another interresting feature of Czech collation is: > > > > H < "CH" < I > > > > and: > > > > B < C < C + CARON < D .. < H < "CH" < I > > > > So what happens with "WHERE name like 'Czec%`" ? > > Our existing code fails because it generates WHERE name >= 'Czec' AND > name < 'Czed'; it will therefore not find names beginning 'Czech' > because those are in another part of the index, between 'Czeh' and > 'Czei'. But WHERE name >= 'Cze' AND name < 'Czf' would work. The Problem is: What tells us, that it is 'f' which sorts after 'e' in that locale? In the "C" locale you can simply add One to the character's code to get the next one, since the numerical ordering of the encoding is identical to the collation: 'e' + 1 = 'f' and 'e' < 'f'. This is *not* true for *every* real-world language-encoding-pair in the world - at least not for characters with codes above 127 and characters with codes below 65. In the example above we are in luck, although there are additional characters between 'e' and 'f' in Czech sorting: Collation => .. < 'e' = 'e + acute' = 'e + caron' < 'f' Encoding => 101 < 233 < 236 > 102 To my knowledge the ISO C API doesn't provide an interface to collation information (IAPITA!). There are no succ() and pred() functions like in PASCAL for example. And even if these functions could be emulated, I'm not sure about possible "side effects". French for example, has even more funny rules ("funny" from a programmer's point of view): Accented characters which appear later in a string are more important than accented characters which appear earlier. IMHO, using the OS's locale support in databases asks for trouble anyway: Who guaratees that the strcolls/localedefs floating around behave the same way? What, if some kind soul of system admistrator updates the OS and fixes a buggy locale definition file (maybe without knowing)? The next UPDATE or INSERT coming along will damage the indices of all databases using the affected locale. Even a simple SELECT may yield strange results. > > Are there examples where this still doesn't work? (Funny sort rules > for trigraphs would break it, I'm sure, unless we drop two characters > instead of just one.) > I don't know if there are any locales, where removing/appending "something" from/to a string can result in a higher/lower collation weight: "xyzab < xyz" or "xyz > xyzab". > Obviously we could still keep the last character in ASCII locale. > That would be a good thing since it'd reduce the number of tuples > scanned. Is there a portable way to determine whether it's safe to > do so in other locales? (Some inquiry function about whether the sort > ordering has any digraph or two-to-one rules might help, but I don't > know if there is one.) > Even ASCII (7-bit) encoded *locales* may be in big trouble here: gewi:~$ cat en.txt 1 2 ? ?2 ?A ?a -A -a + - / a b A B gewi:~$ export LANG="C" gewi:~$ sort en.txt + - -A -a / 1 2 ? ?2 ?A ?a A B a b gewi:~$ export LANG="en_US" gewi:~$ sort en.txt - ? / + 1 ?2 2 -A ?A A -a ?a a B b .. at least strings with punctuation characters will fail in certain cases. -- Erich (still thinking)
Erich Stamberger <eberger@gewi.kfunigraz.ac.at> writes: >> Our existing code fails because it generates WHERE name >= 'Czec' AND >> name < 'Czed'; it will therefore not find names beginning 'Czech' >> because those are in another part of the index, between 'Czeh' and >> 'Czei'. But WHERE name >= 'Cze' AND name < 'Czf' would work. > The Problem is: What tells us, that it is 'f' which sorts > after 'e' in that locale? We keep trying until we find a character that *does* sort after 'e'. I did say I was assuming that people had read the previous discussion and knew what the existing approach was ;-) However I've since thought of a different counterexample: if the LIKE pattern is 'Czech%' and we strip off the 'h', we lose since we'll be looking between 'Czec' and 'Czed' but the desired strings are in the index between 'Czeh' and 'Czei'. Back to the drawing board... regards, tom lane
On Sat, Sep 02, 2000 at 01:39:47PM -0400, Tom Lane wrote: > > So what happens with "WHERE name like 'Czec%`" ? > > Our existing code fails because it generates WHERE name >= 'Czec' AND > name < 'Czed'; it will therefore not find names beginning 'Czech' > because those are in another part of the index, between 'Czeh' and > 'Czei'. But WHERE name >= 'Cze' AND name < 'Czf' would work. (OK, I haven't read the previous discussion. Guilty, m'lud) Why should it? If 'ch' is one letter, then surely 'czech' isn't LIKE 'czec%'. Because 'czec%' has a second c, wheres, 'czech' only has one 'c' and one 'ch'? Jules
Any status on this? > Erich Stamberger <eberger@gewi.kfunigraz.ac.at> writes: > >> Our existing code fails because it generates WHERE name >= 'Czec' AND > >> name < 'Czed'; it will therefore not find names beginning 'Czech' > >> because those are in another part of the index, between 'Czeh' and > >> 'Czei'. But WHERE name >= 'Cze' AND name < 'Czf' would work. > > > The Problem is: What tells us, that it is 'f' which sorts > > after 'e' in that locale? > > We keep trying until we find a character that *does* sort after 'e'. > I did say I was assuming that people had read the previous discussion > and knew what the existing approach was ;-) > > However I've since thought of a different counterexample: if the LIKE > pattern is 'Czech%' and we strip off the 'h', we lose since we'll be > looking between 'Czec' and 'Czed' but the desired strings are in the > index between 'Czeh' and 'Czei'. Back to the drawing board... > > regards, tom lane > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Any status on this? Still broken, no known fix short of disabling LIKE optimization in non-ASCII locales ... regards, tom lane
Can you give me a TODO item? > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Any status on this? > > Still broken, no known fix short of disabling LIKE optimization in > non-ASCII locales ... > > regards, tom lane > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> Can you give me a TODO item? * Fix LIKE indexing optimization for non-ASCII locales regards, tom lane