Discussion: indexable and locale
Hello again,

I thought I should start making some small contributions before 7.0.

Attached is a patch for the old problem discussed feverishly before 6.5.

What it does:
  for locale-enabled servers:
    use index if last char before '%' is ascii.
  for non-locale servers:
    do not use locale if last char is non-ascii since it is wrong anyway.

Comments?

regards,
--
-----------------
Göran Thyni
On quiet nights you can hear Windows NT reboot!

diff -c pgsql/src/backend/optimizer/path/indxpath.c work/pgsql/src/backend/optimizer/path/indxpath.c
*** pgsql/src/backend/optimizer/path/indxpath.c Wed Oct 6 18:33:57 1999
--- work/pgsql/src/backend/optimizer/path/indxpath.c Fri Oct 15 19:54:34 1999
***************
*** 1934,1968 ****
      op = makeOper(optup->t_data->t_oid, InvalidOid, BOOLOID, 0, NULL);
      expr = make_opclause(op, leftop, (Var *) con);
      result = lcons(expr, NIL);
-
      /*
!      * In ASCII locale we say "x <= prefix\377". This does not
!      * work for non-ASCII collation orders, and it's not really
!      * right even for ASCII. FIX ME!
!      * Note we assume the passed prefix string is workspace with
!      * an extra byte, as created by the xxx_fixed_prefix routines above.
       */
! #ifndef USE_LOCALE
!     prefixlen = strlen(prefix);
!     prefix[prefixlen] = '\377';
!     prefix[prefixlen+1] = '\0';
!
!     optup = SearchSysCacheTuple(OPRNAME,
!                                 PointerGetDatum("<="),
!                                 ObjectIdGetDatum(datatype),
!                                 ObjectIdGetDatum(datatype),
!                                 CharGetDatum('b'));
!     if (!HeapTupleIsValid(optup))
!         elog(ERROR, "prefix_quals: no <= operator for type %u", datatype);
!     conval = (datatype == NAMEOID) ?
!         (void*) namein(prefix) : (void*) textin(prefix);
!     con = makeConst(datatype, ((datatype == NAMEOID) ? NAMEDATALEN : -1),
!                     PointerGetDatum(conval),
!                     false, false, false, false);
!     op = makeOper(optup->t_data->t_oid, InvalidOid, BOOLOID, 0, NULL);
!     expr = make_opclause(op, leftop, (Var *) con);
!     result = lappend(result, expr);
! #endif
!
      return result;
  }
--- 1934,1970 ----
      op = makeOper(optup->t_data->t_oid, InvalidOid, BOOLOID, 0, NULL);
      expr = make_opclause(op, leftop, (Var *) con);
      result = lcons(expr, NIL);
      /*
!      * If last is in ascii range make it indexable,
!      * else let it be.
!      * FIXME: find way to use locate for this to support
!      * indexing of non-ascii characters.
       */
!     prefixlen = strlen(prefix) - 1;
!     elog(DEBUG, "XXX1 %s", prefix);
!     if ((unsigned) prefix[prefixlen] < 126)
!     {
!         prefix[prefixlen]++;
!         elog(DEBUG, "XXX2 %s", prefix);
!         optup = SearchSysCacheTuple(OPRNAME,
!                                     PointerGetDatum("<="),
!                                     ObjectIdGetDatum(datatype),
!                                     ObjectIdGetDatum(datatype),
!                                     CharGetDatum('b'));
!         if (!HeapTupleIsValid(optup))
!             elog(ERROR, "prefix_quals: no <= operator for type %u", datatype);
!         conval = (datatype == NAMEOID) ?
!             (void*) namein(prefix) : (void*) textin(prefix);
!         con = makeConst(datatype, ((datatype == NAMEOID) ? NAMEDATALEN : -1),
!                         PointerGetDatum(conval),
!                         false, false, false, false);
!         op = makeOper(optup->t_data->t_oid, InvalidOid, BOOLOID, 0, NULL);
!         expr = make_opclause(op, leftop, (Var *) con);
!         result = lappend(result, expr);
!     }
      return result;
  }
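For readers who want to look at the bound computation on its own, here is a minimal standalone sketch of what the new hunk does to the prefix string. It is an illustration only, not code from the patch: make_upper_bound and its parameters are hypothetical names, it copies the prefix instead of modifying it in place, and it rejects an empty prefix, which is an added assumption.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PostgreSQL): copy "prefix" into "upper"
 * and increment its last byte, mirroring the patch's
 * "if ((unsigned) prefix[prefixlen] < 126) prefix[prefixlen]++".
 *
 * Returns 1 if an upper bound was written, 0 if the prefix is empty, does
 * not fit into the buffer, or ends in a byte of 126 or above (which covers
 * every byte with the high bit set).
 */
int
make_upper_bound(const char *prefix, char *upper, size_t upper_size)
{
    size_t        len;
    unsigned char last;

    assert(prefix != NULL && upper != NULL);

    len = strlen(prefix);
    if (len == 0 || len + 1 > upper_size)
        return 0;               /* assumption: no bound for an empty prefix */

    last = (unsigned char) prefix[len - 1];
    if (last >= 126)
        return 0;               /* same cutoff as the patch's "< 126" test */

    memcpy(upper, prefix, len + 1);     /* includes the terminating NUL */
    upper[len - 1] = (char) (last + 1);
    return 1;
}
```

With the prefix "abc" the function leaves "abd" in the output buffer, which is the string the patch would compare against with "<=". For a prefix ending in a byte of 126 or above it returns 0, matching the case where the patch adds no upper-bound clause.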
> Hello again,
> I thought I should start making some small contributions before 7.0.
>
> Attached is a patch for the old problem discussed feverishly before 6.5.
> What it does:
> for locale-enabled servers:
> use index if last char before '%' is ascii.
> for non-locale servers:
> do not use locale if last char is non-ascii since it is wrong anyway.
>
> Comments?

I tried your patch but it seems to be malformed:

patch: **** unexpected end of file in patch

So this is a guess from reading it. I think your patch breaks non-ASCII multi-byte character set data, and should be surrounded by #ifdef LOCALE rather than replacing the current code surrounded by #ifndef LOCALE.
---
Tatsuo Ishii
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Attached is a patch for the old problem discussed feverishly before 6.5.

> ... I think your patch breaks
> non-ASCII multi-byte character set data, and should be surrounded by
> #ifdef LOCALE rather than replacing the current code surrounded by
> #ifndef LOCALE.

I am worried about this patch too. Under MULTIBYTE could it generate invalid characters? Also, do all non-ASCII locales sort codes 0-126 in the same order as ASCII? I didn't think they do, but I'm not an expert.

The approach I was considering for fixing the problem was to use a loop that would repeatedly try to generate a string greater than the prefix string. The basic loop step would increment the rightmost byte as Goran has done (or, if it's already up to the limit, chop it off and increment the next character position). Then test to see whether the '<' operator actually believes the result is greater than the given prefix, and repeat if not. This avoids making any strong assumptions about the sort order of different character codes. However, there are two significant issues that would have to be surmounted to make it work reliably:

1. In MULTIBYTE mode incrementing the rightmost byte might yield an illegal multibyte character. Some way to prevent or detect this would be needed, lest it confuse the comparison operator. I think we have some multibyte routines that could be used to check for a valid result, but I haven't looked into it.

2. I think there are some locales out there that have context-sensitive sorting rules, i.e., a given character string may sort differently than you'd expect from considering the characters in isolation. For example, in German isn't "ss" treated specially? If "pqrss" does not sort between "pqrs" and "pqrt" then the entire premise of *both* sides of the LIKE optimization falls apart, because you can't be sure what will happen when comparing a prefix string like "pqrs" against longer strings from the database. I do not know if this is really a problem, nor what we could do to avoid it if it is.

regards, tom lane
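The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: next_greater_string and compare_fn are hypothetical names, the comparator stands in for the '<' operator (for experiments it could be strcmp or strcoll), and the sketch does nothing about the two open issues, so it may produce invalid multibyte sequences and it cannot detect context-sensitive collation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical comparator type: <0, 0, >0 like strcmp() or strcoll(). */
typedef int (*compare_fn) (const char *, const char *);

/*
 * Hypothetical sketch of the proposed loop: write into "buf" a string that
 * "cmp" reports as greater than "prefix".
 *
 * Each step increments the rightmost byte and asks the comparator whether
 * the result is now greater than the prefix. When a byte reaches 0xFF it is
 * chopped off and the previous position is tried instead.
 *
 * Returns 1 on success, 0 if no greater string was found (or the prefix
 * does not fit into the buffer).
 */
int
next_greater_string(const char *prefix, char *buf, size_t buf_size,
                    compare_fn cmp)
{
    size_t len;

    assert(prefix != NULL && buf != NULL && cmp != NULL);

    len = strlen(prefix);
    if (len + 1 > buf_size)
        return 0;
    memcpy(buf, prefix, len + 1);

    while (len > 0)
    {
        unsigned char *last = (unsigned char *) &buf[len - 1];

        /* Step the rightmost byte upward until the comparator agrees. */
        while (*last < 255)
        {
            (*last)++;
            if (cmp(buf, prefix) > 0)
                return 1;
        }

        /* Byte is at the limit: chop it off, move one position left. */
        buf[--len] = '\0';
    }
    return 0;
}
```

With strcmp as the comparator, the prefix "abc" yields "abd", and a prefix whose last byte is already 0xFF has that byte chopped off before the previous one is incremented. Passing strcoll instead would let the locale's own ordering decide when the candidate is large enough, which is the point of testing rather than assuming.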
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >> Attached is a patch for the old problem discussed feverishly before 6.5.
>
> > ... I think your patch breaks
> > non-ASCII multi-byte character set data, and should be surrounded by
> > #ifdef LOCALE rather than replacing the current code surrounded by
> > #ifndef LOCALE.
>
> I am worried about this patch too. Under MULTIBYTE could it
> generate invalid characters?

I assume you are talking about the following code fragment in the patch:

    prefix[prefixlen]++;

This would not generate invalid characters under MULTIBYTE, since it skips multi-byte characters with:

    if ((unsigned) prefix[prefixlen] < 126)

This would not make non-ASCII multi-byte characters indexable, however.

> Also, do all non-ASCII locales sort
> codes 0-126 in the same order as ASCII? I didn't think they do,
> but I'm not an expert.

As far as I know they do. At least, all encodings that MULTIBYTE mode can handle have the same code points as ASCII in the 0-126 range. They have the following characteristics:

o Code points 0x00-0x7f are compatible with ASCII.

o Code points 0x80 and above belong to variable-length multi-byte characters. For example, in ISO-8859-1 (German, French, etc.) the length is always 1 byte, while in EUC_JP (Japanese) it is 2 to 3 bytes.

> The approach I was considering for fixing the problem was to use a
> loop that would repeatedly try to generate a string greater than the
> prefix string. The basic loop step would increment the rightmost
> byte as Goran has done (or, if it's already up to the limit, chop
> it off and increment the next character position). Then test to
> see whether the '<' operator actually believes the result is
> greater than the given prefix, and repeat if not. This avoids making
> any strong assumptions about the sort order of different character
> codes. However, there are two significant issues that would have
> to be surmounted to make it work reliably:

Sounds like a good idea.

> 1. In MULTIBYTE mode incrementing the rightmost byte might yield
> an illegal multibyte character. Some way to prevent or detect this
> would be needed, lest it confuse the comparison operator. I think
> we have some multibyte routines that could be used to check for
> a valid result, but I haven't looked into it.

I don't think this is an issue as long as locale isn't enabled. For multibyte encodings (Japanese, Chinese, etc.) locale is totally useless and I usually don't enable it.

> 2. I think there are some locales out there that have context-
> sensitive sorting rules, ie, a given character string may sort
> differently than you'd expect from considering the characters in
> isolation. For example, in German isn't "ss" treated specially?
> If "pqrss" does not sort between "pqrs" and "pqrt" then the entire
> premise of *both* sides of the LIKE optimization falls apart,
> because you can't be sure what will happen when comparing a prefix
> string like "pqrs" against longer strings from the database.
> I do not know if this is really a problem, nor what we could do
> to avoid it if it is.

I'm not sure about it, but I am afraid it could be a problem. I think the real solution would be to support the standard CREATE COLLATION.
---
Tatsuo Ishii
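One way to probe whether context-sensitive sorting actually breaks the range assumption in a given locale is to test sample strings against the comparison function directly. The sketch below is a hypothetical helper, not PostgreSQL code: count_range_violations and compare_fn are made-up names, and the caller is assumed to pass strcoll (after calling setlocale) or strcmp as the comparator.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical comparator type: <0, 0, >0 like strcmp() or strcoll(). */
typedef int (*compare_fn) (const char *, const char *);

/*
 * Hypothetical check of the LIKE-prefix premise: every value that starts
 * with "prefix" (and so would match LIKE 'prefix%') should sort at or after
 * "prefix" and strictly before "upper" according to "cmp".
 *
 * Returns the number of values that start with the prefix but fall outside
 * that range. Values that do not start with the prefix are ignored.
 */
size_t
count_range_violations(const char *prefix, const char *upper,
                       const char *const *values, size_t nvalues,
                       compare_fn cmp)
{
    size_t plen;
    size_t violations = 0;
    size_t i;

    assert(prefix != NULL && upper != NULL && values != NULL && cmp != NULL);

    plen = strlen(prefix);
    for (i = 0; i < nvalues; i++)
    {
        /* Bytewise prefix test: only LIKE 'prefix%' matches are of interest. */
        if (strncmp(values[i], prefix, plen) != 0)
            continue;

        if (cmp(values[i], prefix) < 0 || cmp(values[i], upper) >= 0)
            violations++;
    }
    return violations;
}
```

With strcmp, the values "pqrs", "pqrss" and "pqrsz" all fall between "pqrs" and "pqrt", so the count is 0. A nonzero count under strcoll in some locale would indicate that a prefix range built this way misses rows that LIKE would match.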
> > Hello again,
> > I thought I should start making some small contributions before 7.0.
> >
> > Attached is a patch for the old problem discussed feverishly before 6.5.
> > What it does:
> > for locale-enabled servers:
> > use index if last char before '%' is ascii.
> > for non-locale servers:
> > do not use locale if last char is non-ascii since it is wrong anyway.
> >
> > Comments?
>
> I tried your patch but it seems to be malformed:
>
> patch: **** unexpected end of file in patch

Yes, I had to apply it manually.

> So this is a guess from reading it. I think your patch breaks
> non-ASCII multi-byte character set data, and should be surrounded by
> #ifdef LOCALE rather than replacing the current code surrounded by
> #ifndef LOCALE.

Can you supply a patch against the current tree? I don't understand this. Thanks.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Applied.

--
  Bruce Momjian
Sorry, found messages of people objecting to the patch. Patch reversed out.

--
  Bruce Momjian
Here is Tom's comment on the patch.

> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> >> Attached is a patch for the old problem discussed feverishly before 6.5.
>
> > ... I think your patch breaks
> > non-ASCII multi-byte character set data, and should be surrounded by
> > #ifdef LOCALE rather than replacing the current code surrounded by
> > #ifndef LOCALE.
>
> I am worried about this patch too. Under MULTIBYTE could it
> generate invalid characters? Also, do all non-ASCII locales sort
> codes 0-126 in the same order as ASCII? I didn't think they do,
> but I'm not an expert.
>
> The approach I was considering for fixing the problem was to use a
> loop that would repeatedly try to generate a string greater than the
> prefix string. The basic loop step would increment the rightmost
> byte as Goran has done (or, if it's already up to the limit, chop
> it off and increment the next character position). Then test to
> see whether the '<' operator actually believes the result is
> greater than the given prefix, and repeat if not. This avoids making
> any strong assumptions about the sort order of different character
> codes. However, there are two significant issues that would have
> to be surmounted to make it work reliably:
>
> 1. In MULTIBYTE mode incrementing the rightmost byte might yield
> an illegal multibyte character. Some way to prevent or detect this
> would be needed, lest it confuse the comparison operator. I think
> we have some multibyte routines that could be used to check for
> a valid result, but I haven't looked into it.
>
> 2. I think there are some locales out there that have context-
> sensitive sorting rules, i.e., a given character string may sort
> differently than you'd expect from considering the characters in
> isolation. For example, in German isn't "ss" treated specially?
> If "pqrss" does not sort between "pqrs" and "pqrt" then the entire
> premise of *both* sides of the LIKE optimization falls apart,
> because you can't be sure what will happen when comparing a prefix
> string like "pqrs" against longer strings from the database.
> I do not know if this is really a problem, nor what we could do
> to avoid it if it is.
>
> regards, tom lane

--
  Bruce Momjian