Discussion: Re: [PATCHES] char/varchar locale support

Re: [PATCHES] char/varchar locale support

From
"Thomas G. Lockhart"
Date:
(moved to hackers list)

> I am working on extending locale support for char/varchar types.
> Q1. I touched ...src/include/utils/builtins.h to insert the following
> macros:
> -----
> #ifdef USE_LOCALE
>    #define pgstrcmp(s1,s2,l) strcoll(s1,s2)
> #else
>    #define pgstrcmp(s1,s2,l) strncmp(s1,s2,l)
> #endif
> -----
> Is it the right place? I think so; am I wrong?

Probably the right place. Probably the wrong code; see below...

> Q2. Bartunov told me I should read varlena.c. I read it and found
> that for every strcoll() call there are calls to allocate memory for
> both strings (to make them null-terminated). Oleg said I need the same
> for varchar.
> Do I really need to allocate space for varchar? What about char? Is it
> 0-terminated already?

No, neither bpchar nor varchar is guaranteed to be null terminated.
Yes, you will need to allocate (palloc()) local memory for this. Your
pgstrcmp() macros are not equivalent, since strncmp() will stop the
comparison at the specified limit (l) whereas strcoll() requires
null-terminated strings.
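The point about null termination can be sketched as follows. This is only an illustration under stated assumptions: copy_terminated() and collate_counted() are hypothetical helper names that do not exist in the PostgreSQL source, and malloc()/free() stand in for palloc()/pfree() so that the fragment compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a freshly allocated, null-terminated copy
 * of the first len bytes of data. malloc() stands in for palloc().
 */
static char *
copy_terminated(const char *data, size_t len)
{
    char *copy;

    assert(data != NULL);
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, data, len);
    copy[len] = '\0';
    return copy;
}

/*
 * Hypothetical helper: compare two length-counted buffers with strcoll().
 * Each buffer is copied with its own length before the call, because
 * strcoll() reads until it finds a terminating null byte.
 * Sets *ok to 0 (and returns 0) if an allocation fails.
 */
static int
collate_counted(const char *s1, size_t len1,
                const char *s2, size_t len2, int *ok)
{
    char *a = copy_terminated(s1, len1);
    char *b = copy_terminated(s2, len2);
    int   result = 0;

    *ok = (a != NULL && b != NULL);
    if (*ok)
        result = strcoll(a, b);
    free(a);
    free(b);
    return result;
}
```

With strncmp(s1, s2, l) no copy is needed because the limit l bounds the read; with strcoll() the copy is what keeps the read inside the buffer.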

If you look in varlena.c you will find several places with
  #if USE_LOCALE
  ...
  #else
  ...
  #endif

Those blocks will need to be replicated in varchar.c for both bpchar and
varchar support routines.

The first example I looked at in varlena.c looks a bit troublesome :(
In the code snippet below (from text_lt), both input strings are copied
using the same length, even though the input lengths can be different.
Looks wrong to me:

    memcpy(a1p, VARDATA(arg1), len);
    *(a1p + len) = '\0';
    memcpy(a2p, VARDATA(arg2), len);
    *(a2p + len) = '\0';

Instead of "len" in each expression it should probably be
  len1 = VARSIZE(arg1)-VARHDRSZ
  len2 = VARSIZE(arg2)-VARHDRSZ
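A sketch of what the per-argument lengths might look like in a text_lt-style routine. Everything prefixed demo_ or DEMO_ is a simplified stand-in invented for this example (the real varlena layout and the VARSIZE/VARDATA/VARHDRSZ macros live in the PostgreSQL headers), and malloc() again replaces palloc().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a varlena value: size header, then payload. */
typedef struct
{
    int32_t vl_len;     /* total size in bytes, header included */
    char    vl_dat[];   /* payload, NOT null-terminated */
} demo_varlena;

#define DEMO_VARHDRSZ   ((int32_t) sizeof(int32_t))
#define DEMO_VARSIZE(p) ((p)->vl_len)
#define DEMO_VARDATA(p) ((p)->vl_dat)

/* Example-only helper: build a demo value from a C string. */
static demo_varlena *
demo_make(const char *s)
{
    size_t        n = strlen(s);
    demo_varlena *v = malloc(sizeof(demo_varlena) + n);

    assert(v != NULL);
    v->vl_len = DEMO_VARHDRSZ + (int32_t) n;
    memcpy(v->vl_dat, s, n);
    return v;
}

/* "Less than", with each input copied using its own length. */
static int
demo_text_lt(const demo_varlena *arg1, const demo_varlena *arg2)
{
    size_t len1 = (size_t) (DEMO_VARSIZE(arg1) - DEMO_VARHDRSZ);
    size_t len2 = (size_t) (DEMO_VARSIZE(arg2) - DEMO_VARHDRSZ);
    char  *a1p = malloc(len1 + 1);
    char  *a2p = malloc(len2 + 1);
    int    result;

    assert(a1p != NULL && a2p != NULL);
    memcpy(a1p, DEMO_VARDATA(arg1), len1);
    a1p[len1] = '\0';
    memcpy(a2p, DEMO_VARDATA(arg2), len2);
    a2p[len2] = '\0';

    result = strcoll(a1p, a2p) < 0;
    free(a1p);
    free(a2p);
    return result;
}
```

Using one shared len for both copies would either truncate the longer input or read past the end of the shorter one, which is the concern raised above.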

Another possibility for implementation is to write a string comparison
routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
0, or 1 for less than, equals, and greater than. All of the comparison
routines can call that one (which would have the #if USE_LOCALE), rather
than having USE_LOCALE spread through each comparison routine.
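One possible shape for such a shared routine is sketched below. The names counted_str, demo_varlena_cmp() and the demo_lt/demo_eq/demo_gt wrappers are hypothetical, the input type is a plain pointer-plus-length pair rather than a real varlena, and the non-locale branch uses memcmp() with a length tie-break as an assumption about the desired byte-order semantics.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Example-only counted string: data is NOT null-terminated. */
typedef struct
{
    const char *data;
    size_t      len;
} counted_str;

/* Collapse any integer to -1, 0 or 1. */
static int
sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/* The only place that knows about USE_LOCALE. Returns -1, 0 or 1. */
static int
demo_varlena_cmp(const counted_str *a, const counted_str *b)
{
    assert(a != NULL && b != NULL);
#ifdef USE_LOCALE
    {
        /* strcoll() needs null-terminated copies, one length per input. */
        char *ap = malloc(a->len + 1);
        char *bp = malloc(b->len + 1);
        int   r;

        assert(ap != NULL && bp != NULL);
        memcpy(ap, a->data, a->len);
        ap[a->len] = '\0';
        memcpy(bp, b->data, b->len);
        bp[b->len] = '\0';
        r = strcoll(ap, bp);
        free(ap);
        free(bp);
        return sign_of(r);
    }
#else
    {
        /* Byte-wise comparison; the shorter string sorts first on a tie. */
        size_t n = a->len < b->len ? a->len : b->len;
        int    r = memcmp(a->data, b->data, n);

        if (r == 0)
            r = (a->len > b->len) - (a->len < b->len);
        return sign_of(r);
    }
#endif
}

/* The individual comparison routines become one-liners. */
static int demo_lt(const counted_str *a, const counted_str *b) { return demo_varlena_cmp(a, b) < 0; }
static int demo_eq(const counted_str *a, const counted_str *b) { return demo_varlena_cmp(a, b) == 0; }
static int demo_gt(const counted_str *a, const counted_str *b) { return demo_varlena_cmp(a, b) > 0; }
```

Compiling with -DUSE_LOCALE selects the strcoll() branch; the callers do not change either way.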

                       - Tom

Re: [HACKERS] Re: [PATCHES] char/varchar locale support

From
Oleg Broytmann
Date:
Hi!

On Fri, 15 May 1998, Thomas G. Lockhart wrote:
> Another possibility for implementation is to write a string comparison
> routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
> 0, or 1 for less than, equals, and greater than. All of the comparison
> routines can call that one (which would have the #if USE_LOCALE), rather
> than having USE_LOCALE spread through each comparison routine.

   Yes, I thought about this recently. It seems the best solution, perhaps.
   Thank you. I'll continue my work.

Oleg.
----
  Oleg Broytmann     http://members.tripod.com/~phd2/     phd2@earthling.net
           Programmers don't die, they just GOSUB without RETURN.


Re: [HACKERS] Re: [PATCHES] char/varchar locale support

From
Mattias Kregert
Date:
Oleg Broytmann wrote:
>
> Hi!
>
> On Fri, 15 May 1998, Thomas G. Lockhart wrote:
> > Another possibility for implementation is to write a string comparison
> > routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
> > 0, or 1 for less than, equals, and greater than. All of the comparison
> > routines can call that one (which would have the #if USE_LOCALE), rather
> > than having USE_LOCALE spread through each comparison routine.
>
>    Yes, I thought about this recently. It seems the best solution, perhaps.
>    Thank you. I'll continue my work.
>
> Oleg.
> ----
>   Oleg Broytmann     http://members.tripod.com/~phd2/     phd2@earthling.net
>            Programmers don't die, they just GOSUB without RETURN.


Shouldn't this be done only for NATIONAL CHAR?

/* m */

Re: [HACKERS] Re: [PATCHES] char/varchar locale support

From
Oleg Broytmann
Date:
Hi!

On Mon, 18 May 1998, Mattias Kregert wrote:
> > > Another possibility for implementation is to write a string comparison
> > > routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
> > > 0, or 1 for less than, equals, and greater than. All of the comparison
> > > routines can call that one (which would have the #if USE_LOCALE), rather
> > > than having USE_LOCALE spread through each comparison routine.
>
> Shouldn't this be done only for NATIONAL CHAR?

   It is what USE_LOCALE is intended for, isn't it?

Oleg.
----
  Oleg Broytmann     http://members.tripod.com/~phd2/     phd2@earthling.net
           Programmers don't die, they just GOSUB without RETURN.


Re: [HACKERS] Re: [PATCHES] char/varchar locale support

From
"Thomas G. Lockhart"
Date:
> > Shouldn't this be done only for NATIONAL CHAR?
>    It is what USE_LOCALE is intended for, isn't it?

SQL92 defines NATIONAL CHAR/VARCHAR as the data type to support implicit
local character sets. The usual CHAR/VARCHAR would use the default
SQL_TEXT character set. I suppose we could extend it to include NATIONAL
TEXT also...

Additionally, SQL92 allows one to specify an explicit character set and
an explicit collating sequence. The standard is not explicit on how one
actually makes these known to the database, but Postgres should be well
suited to accomplishing this.

Anyway, I'm not certain how common and wide-spread the NATIONAL CHAR
usage is. Would users with installations having non-English data find
using NCHAR/NATIONAL CHAR/NATIONAL CHARACTER an inconvenience? Or would
most non-English installations find this better and more solid??

At the moment we have support for Russian and Japanese character sets,
and these would need the maintainers to agree to changes.

btw, if we do implement NATIONAL CHARACTER I would like to do so by
having it fit in with the full SQL92 character sets and collating
sequences capabilities. Then one could specify what NATIONAL CHAR means
for an installation or perhaps at run time without having to
recompile...

                          - Tom

Re: [HACKERS] Re: [PATCHES] char/varchar locale support

From
Michal Mosiewicz
Date:
Thomas G. Lockhart wrote:

> btw, if we do implement NATIONAL CHARACTER I would like to do so by
> having it fit in with the full SQL92 character sets and collating
> sequences capabilities. Then one could specify what NATIONAL CHAR means
> for an installation or perhaps at run time without having to
> recompile...

I fully agree that there should be a CREATE COLLATION syntax or similar,
with the ability to add a collation keyword in every place that needs a
character comparison, like btree indexes, ordering, or simply comparison
operators.

This means that we should probably start by creating three-parameter
comparison functions, with the third parameter selecting the collation.
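A three-parameter comparison function could look roughly like the sketch below. The demo_collation descriptor, the two sample collations and demo_text_cmp() are all hypothetical; nothing here reflects an actual PostgreSQL or SQL92 interface, and the "ascii_nocase" collation only folds ASCII letters.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* A collation is modelled here as a named comparison callback. */
typedef int (*demo_compare_fn) (const char *, const char *);

typedef struct
{
    const char     *name;
    demo_compare_fn compare;
} demo_collation;

/* Plain byte order. */
static int
cmp_bytes(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Case-insensitive order for ASCII letters only. */
static int
cmp_ascii_nocase(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return ca < cb ? -1 : 1;
        if (ca == 0)
            return 0;
    }
}

static const demo_collation DEMO_BYTE_ORDER = {"byte_order", cmp_bytes};
static const demo_collation DEMO_NOCASE = {"ascii_nocase", cmp_ascii_nocase};

/* Third parameter selects the collation; result is -1, 0 or 1. */
static int
demo_text_cmp(const char *a, const char *b, const demo_collation *coll)
{
    int r;

    assert(coll != NULL && coll->compare != NULL);
    r = coll->compare(a, b);
    return (r > 0) - (r < 0);
}
```

An index, an ORDER BY, or a comparison operator would then pass whichever collation applies, instead of relying on one compile-time setting.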

Additionally, it's worth noting that using strcoll is highly expensive.
I've got some reports from people who used PostgreSQL with national
characters and noticed performance drops of up to 20 times (Linux). So
we need cheap comparison functions that preserve their translation
tables during a session.
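One standard way to avoid repeated strcoll() calls, which is not the cached-translation-table design proposed above but a related technique, is to transform each string once with strxfrm() and then compare the resulting keys with plain strcmp(). The helper name make_sort_key() is hypothetical, and whether this is actually faster depends on the C library and locale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build a locale-dependent sort key for s.
 * strcmp() on two keys orders them the same way strcoll() orders the
 * original strings. Returns NULL if allocation fails; caller frees.
 */
static char *
make_sort_key(const char *s)
{
    size_t need;
    char  *key;

    assert(s != NULL);
    /* With a zero size, strxfrm() only reports the key length. */
    need = strxfrm(NULL, s, 0);
    key = malloc(need + 1);
    if (key == NULL)
        return NULL;
    strxfrm(key, s, need + 1);
    return key;
}
```

The key is computed once per value, so a sort that performs many comparisons pays the locale cost once per string rather than once per comparison, at the price of extra memory for the keys.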

Anyhow, if anybody wants to try the inefficient strcoll, a long time ago
I sent a patch to sort chars/varchars using it. But I don't recommend it.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz  *  Bugaj 66 m.54 *  95-200 Pabianice  *  POLAND

Re: [HACKERS] Re: [PATCHES] char/varchar locale support

From
t-ishii@sra.co.jp
Date:
>> > Shouldn't this be done only for NATIONAL CHAR?
>>    It is what USE_LOCALE is intended for, isn't it?

LOCALE is not very useful for multi-byte speakers.

>SQL92 defines NATIONAL CHAR/VARCHAR as the data type to support implicit
>local character sets. The usual CHAR/VARCHAR would use the default
>SQL_TEXT character set. I suppose we could extend it to include NATIONAL
>TEXT also...
>
>Additionally, SQL92 allows one to specify an explicit character set and
>an explicit collating sequence. The standard is not explicit on how one
>actually makes these known to the database, but Postgres should be well
>suited to accomplishing this.
>
>Anyway, I'm not certain how common and wide-spread the NATIONAL CHAR
>usage is. Would users with installations having non-English data find
>using NCHAR/NATIONAL CHAR/NATIONAL CHARACTER an inconvenience? Or would
>most non-English installations find this better and more solid??

The capability to specify implicit character sets for CHAR (that's
what MB does) looks sufficient for multi-byte speakers, except for
collation sequences.

One question about SQL92's NCHAR is how one can specify several
character sets at one time. As you might know, Japanese, Chinese, and
Korean use multiple character sets. For example, EUC_JP, a widely used
Japanese encoding system on Unix, includes 4 character sets: ASCII,
JISX0201, JISX0208 and JISX0212.

>At the moment we have support for Russian and Japanese character sets,
>and these would need the maintainers to agree to changes.

Additionally we have support for Chinese and Korean. Moreover, if the
mule internal code or Unicode is preferred for the internal encoding
system, one could use almost any language in the world :-)

>btw, if we do implement NATIONAL CHARACTER I would like to do so by
>having it fit in with the full SQL92 character sets and collating
>sequences capabilities. Then one could specify what NATIONAL CHAR means
>for an installation or perhaps at run time without having to
>recompile...

Collating sequences look very useful.
Also it would be nice if we could specify default character sets when
creating a database, table or fields.
--
Tatsuo Ishii
t-ishii@sra.co.jp