Обсуждение: Yet another LIKE-indexing scheme

Поиск
Список
Период
Сортировка

Yet another LIKE-indexing scheme

От
Tom Lane
Дата:
I had another thought about fixing our problems with deriving index
bounds for LIKE patterns in non-ASCII locales.  (If you don't remember
the gory details here, please re-read threadSigh, LIKE indexing is *still* broken in foreign locales
from pgsql-hackers archives of 7 to 10 June, 2000; there are also many
previous go-rounds about this long-standing issue.)

The problems that I've been told about seem to center around one- and
two-character patterns that have special sort rules in some locales.
Could we work around these problems by dropping one or perhaps two
characters from the end of the given LIKE prefix?  For example, givenWHERE name LIKE 'foobar%'
drop the last fixed character ('r') and generate index bounds from what
remains, using the same algorithm as in 7.0.  So the index bounds would
becomeWHERE name >= 'fooba' AND name < 'foobb'
(at least in ASCII locale --- to make the upper bound, we'd search for
a string considered greater than 'fooba' by the local strcmp()).

The truncation would need to be multibyte-aware, of course.

This would, for example, fix the example given by Erich Stamberger:

> Another interresting feature of Czech collation is:
> 
> H < "CH" < I
> 
> and:
> 
> B < C < C + CARON < D .. < H < "CH" < I
> 
> So what happens with "WHERE name like 'Czec%`" ?

Our existing code fails because it generates WHERE name >= 'Czec' AND
name < 'Czed'; it will therefore not find names beginning 'Czech'
because those are in another part of the index, between 'Czeh' and
'Czei'.  But WHERE name >= 'Cze' AND name < 'Czf' would work.

Are there examples where this still doesn't work?  (Funny sort rules
for trigraphs would break it, I'm sure, unless we drop two characters
instead of just one.)

Obviously we could still keep the last character in ASCII locale.
That would be a good thing since it'd reduce the number of tuples
scanned.  Is there a portable way to determine whether it's safe to
do so in other locales?  (Some inquiry function about whether the sort
ordering has any digraph or two-to-one rules might help, but I don't
know if there is one.)
        regards, tom lane


Re: Yet another LIKE-indexing scheme

От
Erich Stamberger
Дата:
Hello,

On Sat, 2 Sep 2000, Tom Lane wrote:

> This would, for example, fix the example given by Erich Stamberger:
> 
> > Another interresting feature of Czech collation is:
> > 
> > H < "CH" < I
> > 
> > and:
> > 
> > B < C < C + CARON < D .. < H < "CH" < I
> > 
> > So what happens with "WHERE name like 'Czec%`" ?
> 
> Our existing code fails because it generates WHERE name >= 'Czec' AND
> name < 'Czed'; it will therefore not find names beginning 'Czech'
> because those are in another part of the index, between 'Czeh' and
> 'Czei'.  But WHERE name >= 'Cze' AND name < 'Czf' would work.

The Problem is: What tells us, that it is 'f' which sorts
after 'e' in that locale?  In the "C" locale you can
simply add One to the character's code to get the next one,
since the numerical ordering of the encoding is identical to the
collation: 'e' + 1 = 'f' and 'e' < 'f'.  This is *not* true
for *every* real-world language-encoding-pair in the world -
at least not for characters with codes above 127 and characters
with codes below 65.

In the example above we are in luck, although there are
additional characters between 'e' and 'f' in Czech sorting:

Collation => .. < 'e'   =   'e + acute'   =   'e + caron'   <   'f'
Encoding  =>      101   <   233           <   236           >   102
To my knowledge the ISO C API doesn't provide an interface
to collation information (IAPITA!).  There are no succ()
and pred() functions like in PASCAL for example.

And even if these functions could be emulated, I'm not
sure about possible "side effects".  French for example,
has even more funny rules ("funny" from a programmer's point
of view):  Accented characters which appear later in a string
are more important than accented characters which appear earlier.

IMHO, using the OS's locale support in databases asks
for trouble anyway:

Who guaratees that the strcolls/localedefs floating
around behave the same way?

What, if some kind soul of system admistrator updates the
OS and fixes a buggy locale definition file (maybe without
knowing)? The next UPDATE or INSERT coming along will
damage the indices of all databases using the affected locale.
Even a simple SELECT may yield strange results.

>
> Are there examples where this still doesn't work?  (Funny sort rules
> for trigraphs would break it, I'm sure, unless we drop two characters
> instead of just one.)
>

I don't know if there are any locales, where removing/appending
"something" from/to a string can result in a higher/lower
collation weight: "xyzab < xyz" or "xyz > xyzab". 

> Obviously we could still keep the last character in ASCII locale.
> That would be a good thing since it'd reduce the number of tuples
> scanned.  Is there a portable way to determine whether it's safe to
> do so in other locales?  (Some inquiry function about whether the sort
> ordering has any digraph or two-to-one rules might help, but I don't
> know if there is one.)
>
Even ASCII (7-bit) encoded *locales* may be in big trouble here:

gewi:~$ cat en.txt
1
2
?
?2
?A
?a
-A
-a
+
-
/
a
b
A
B

gewi:~$ export LANG="C"
gewi:~$ sort en.txt 
+
-
-A
-a
/
1
2
?
?2
?A
?a
A
B
a
b

gewi:~$ export LANG="en_US"
gewi:~$ sort en.txt
-
?
/
+
1
?2
2
-A
?A
A
-a
?a
a
B
b


.. at least strings with punctuation characters will fail
in certain cases.


--
Erich (still thinking)



Re: Yet another LIKE-indexing scheme

От
Tom Lane
Дата:
Erich Stamberger <eberger@gewi.kfunigraz.ac.at> writes:
>> Our existing code fails because it generates WHERE name >= 'Czec' AND
>> name < 'Czed'; it will therefore not find names beginning 'Czech'
>> because those are in another part of the index, between 'Czeh' and
>> 'Czei'.  But WHERE name >= 'Cze' AND name < 'Czf' would work.

> The Problem is: What tells us, that it is 'f' which sorts
> after 'e' in that locale?

We keep trying until we find a character that *does* sort after 'e'.
I did say I was assuming that people had read the previous discussion
and knew what the existing approach was ;-)

However I've since thought of a different counterexample: if the LIKE
pattern is 'Czech%' and we strip off the 'h', we lose since we'll be
looking between 'Czec' and 'Czed' but the desired strings are in the
index between 'Czeh' and 'Czei'.  Back to the drawing board...
        regards, tom lane


Re: Yet another LIKE-indexing scheme

От
Jules Bean
Дата:
On Sat, Sep 02, 2000 at 01:39:47PM -0400, Tom Lane wrote:
> > So what happens with "WHERE name like 'Czec%`" ?
> 
> Our existing code fails because it generates WHERE name >= 'Czec' AND
> name < 'Czed'; it will therefore not find names beginning 'Czech'
> because those are in another part of the index, between 'Czeh' and
> 'Czei'.  But WHERE name >= 'Cze' AND name < 'Czf' would work.

(OK, I haven't read the previous discussion.  Guilty, m'lud)

Why should it?  If 'ch' is one letter, then surely 'czech' isn't LIKE
'czec%'.  Because 'czec%' has a second c, wheres, 'czech' only has one
'c' and one 'ch'?

Jules


Re: Yet another LIKE-indexing scheme

От
Bruce Momjian
Дата:
Any status on this?


> Erich Stamberger <eberger@gewi.kfunigraz.ac.at> writes:
> >> Our existing code fails because it generates WHERE name >= 'Czec' AND
> >> name < 'Czed'; it will therefore not find names beginning 'Czech'
> >> because those are in another part of the index, between 'Czeh' and
> >> 'Czei'.  But WHERE name >= 'Cze' AND name < 'Czf' would work.
> 
> > The Problem is: What tells us, that it is 'f' which sorts
> > after 'e' in that locale?
> 
> We keep trying until we find a character that *does* sort after 'e'.
> I did say I was assuming that people had read the previous discussion
> and knew what the existing approach was ;-)
> 
> However I've since thought of a different counterexample: if the LIKE
> pattern is 'Czech%' and we strip off the 'h', we lose since we'll be
> looking between 'Czec' and 'Czed' but the desired strings are in the
> index between 'Czeh' and 'Czei'.  Back to the drawing board...
> 
>             regards, tom lane
> 


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Yet another LIKE-indexing scheme

От
Tom Lane
Дата:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Any status on this?

Still broken, no known fix short of disabling LIKE optimization in
non-ASCII locales ...
        regards, tom lane


Re: Yet another LIKE-indexing scheme

От
Bruce Momjian
Дата:
Can you give me a TODO item?


> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Any status on this?
> 
> Still broken, no known fix short of disabling LIKE optimization in
> non-ASCII locales ...
> 
>             regards, tom lane
> 


--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


Re: Yet another LIKE-indexing scheme

От
Tom Lane
Дата:
> Can you give me a TODO item?

* Fix LIKE indexing optimization for non-ASCII locales
        regards, tom lane