Discussion: [WIP] collation support revisited (phase 1)


[WIP] collation support revisited (phase 1)

From:
"Radek Strnad"
Date:
Hi,

after a long discussion with Mr. Kotala, we've decided to redesign our collation
support proposal.
For those of you who aren't familiar with my WIP patch and the comments from
other hackers, here's the original mail:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00019.php

In a few sentences - I'm writing collation support for PostgreSQL that is almost
independent of the collating function used. I will implement POSIX locales, but
a switch to ICU will be quite easy. Collations and character sets defined by the
SQL standard will be hard-coded, so we avoid them being missing on some systems.

The whole project will be divided into two phases:

phase 1
Implement a "sort of framework" so that PostgreSQL will have the basic guts
(pg_collation & pg_charset catalogs, CREATE COLLATION, collation support added
for each type that needs it) and will support collation at the database level.
This phase has been accepted as a Google Summer of Code project.

phase 2
Implement the rest - full collation at the column level. I will continue
working on this after finishing phase one, and it will be my master's degree
thesis.

How will the first part work?

Catalogs
- new catalogs pg_collation and pg_charset will be defined
- pg_collation and pg_charset will contain the SQL standard collations + an
optional default collation (when set to something other than an SQL standard one)
- pg_type, pg_attribute, pg_namespace will be extended with references to the
default records in pg_collation and pg_charset

initdb
- pg_collation & pg_charset will contain the pre-defined records required by the
SQL standard and optionally one non-standard record set at initdb time (the one
using the system locales)
- these two records will be referenced by pg_type, pg_attribute, pg_namespace in
the relevant columns and will be considered the default collation to be inherited

CREATE DATABASE ... COLLATE ...
- after copying, the new database's collation will be the default (same as the
cluster collation) or changed by the COLLATE clause. Then we update the pg_type,
pg_attribute and pg_namespace catalogs
- reindex the database

When changing databases, the database collation will be retrieved from the type
text in pg_type. This part should be the only one that will be deleted when
proceeding with phase 2. But that will take a while :-)

Thanks for all your comments

    Regards

        Radek Strnad


Re: [WIP] collation support revisited (phase 1)

From:
Martijn van Oosterhout
Date:
On Thu, Jul 10, 2008 at 11:24:29PM +0200, Radek Strnad wrote:
> Hi,
>
> after long discussion with Mr. Kotala, we've decided to redesign our
> collation support proposal.
> For those of you who aren't familiar with my WIP patch and comments from
> other hackers here's the original mail:
> http://archives.postgresql.org/pgsql-hackers/2008-07/msg00019.php

<snip>
> phase 1
> Implement "sort of framework" so the PostgreSQL will have basic guts
> (pg_collation & pg_charset catalogs, CREATE COLLATION, add collation support
> for each type needed) and will support collation at database level. This
> phase has been accepted as a Google Summer of Code project.

Why bother with pg_charset? I don't think I've seen anyone even asking
for multiple charsets in a single DB and certainly no actual use case.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] collation support revisited (phase 1)

From:
Zdenek Kotala
Date:
Martijn van Oosterhout wrote:
> On Thu, Jul 10, 2008 at 11:24:29PM +0200, Radek Strnad wrote:
>> Hi,
>>
>> after long discussion with Mr. Kotala, we've decided to redesign our
>> collation support proposal.
>> For those of you who aren't familiar with my WIP patch and comments from
>> other hackers here's the original mail:
>> http://archives.postgresql.org/pgsql-hackers/2008-07/msg00019.php
> 
> <snip>
>> phase 1
>> Implement "sort of framework" so the PostgreSQL will have basic guts
>> (pg_collation & pg_charset catalogs, CREATE COLLATION, add collation support
>> for each type needed) and will support collation at database level. This
>> phase has been accepted as a Google Summer of Code project.
> 
> Why bother with pg_charset? I don't think I've seen anyone even asking
> for multiple charsets in a single DB and certainly no actual use case.

The example is when you have translation data (a vocabulary) in the database.
But the real reason is that the ANSI standard (chapter 4.2) specifies the
charset as part of the string descriptor. See below:

— The length or maximum length in characters of the character string type.
— The catalog name, schema name, and character set name of the character set of
the character string type.
— The catalog name, schema name, and collation name of the collation of the
character string type.

If I think about it, the question is whether we need both pieces of information
in pg_type and so on, because a collation automatically defines its charset.
    any idea?
    Zdenek



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] collation support revisited (phase 1)

From:
Alvaro Herrera
Date:
Zdenek Kotala wrote:

> The example is when you have translation data (vocabulary) in database. 
> But the  reason is that ANSI specify (chapter 4.2) charset as a part of 
> string descriptor. See below:
>
> — The length or maximum length in characters of the character string type.
> — The catalog name, schema name, and character set name of the character 
> set of the character string type.
> — The catalog name, schema name, and collation name of the collation of 
> the character string type.

We already support multiple charsets, and are able to do conversions
between them.  The set of charsets is hardcoded and it's hard to make a
case that a user needs to create new ones.  I concur with Martijn's
suggestion -- there's no need for this to appear in a system catalog.

Perhaps it could be argued that we need to be able to specify the
charset a given string is in -- currently all strings are in the server
encoding (charset) which is fixed at initdb time.  Making the system
support multiple server encodings would be a major undertaking in itself
and I'm not sure that there's a point.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [WIP] collation support revisited (phase 1)

From:
Zdenek Kotala
Date:
Alvaro Herrera wrote:
> Zdenek Kotala wrote:
> 
>> The example is when you have translation data (vocabulary) in database. 
>> But the  reason is that ANSI specify (chapter 4.2) charset as a part of 
>> string descriptor. See below:
>>
>> — The length or maximum length in characters of the character string type.
>> — The catalog name, schema name, and character set name of the character 
>> set of the character string type.
>> — The catalog name, schema name, and collation name of the collation of 
>> the character string type.
> 
> We already support multiple charsets, and are able to do conversions
> between them.  The set of charsets is hardcoded and it's hard to make a
> case that a user needs to create new ones.  I concur with Martijn's
> suggestion -- there's no need for this to appear in a system catalog.
> 
> Perhaps it could be argued that we need to be able to specify the
> charset a given string is in -- currently all strings are in the server
> encoding (charset) which is fixed at initdb time.  Making the system
> support multiple server encodings would be a major undertaking in itself
> and I'm not sure that there's a point.
> 

Background:
We specify the encoding in the initdb phase. ANSI specifies repertoire, charset,
encoding and collation. If I understand it correctly, the charset is a subset of
the repertoire and specifies the list of characters allowed for a
language->collation. The encoding is a mapping of the character set to a binary
format. For example, for the Czech alphabet (charset) we have 6 different 8-bit
ASCII-based encodings, but on the other side, for UTF8 multiple charsets are
specified.


I think that if we support the UTF8 encoding, then it makes sense to create our
own charsets, because the system locales could have a collation defined for
them. We need conversion only in the case when the client encoding is not
compatible with the charset and a conversion is not defined.
    Any comments?
        Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] collation support revisited (phase 1)

From:
Martijn van Oosterhout
Date:
On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:
> Background:
> We specify encoding in initdb phase. ANSI specify repertoire, charset,
> encoding and collation. If I understand it correctly, then charset is
> subset of repertoire and specify list of allowed characters for
> language->collation. Encoding is mapping of character set to binary format.
> For example for Czech alphabet(charset) we have 6 different encoding for
> 8bit ASCII, but on other side for UTF8 there is specified multi charsets.

Oh, so you're thinking of a charset as a sort of check constraint. If
your locale is turkish and you have a column marked charset ASCII then
storing lower('HI') results in an error.

A collation must be defined over all possible characters, it can't
depend on the character set. That doesn't mean sorting in en_US must do
something meaningful with japanese characters, it does mean it can't
throw an error (the usual procedure is to sort on unicode point).

> I think if we support UTF8 encoding, than it make sense to create own
> charsets, because system locales could have defined collation for that. We
> need conversion only in case when client encoding is not compatible with
> charset and conversion is not defined.

The problem is that POSIX locales are defined on an encoding, not a
charset. The locale en_US.UTF-8 doesn't actually sort any differently
than en_US.latin1; it's just that Japanese characters are not
representable in the latter.

locale-gen can create a locale for any (locale code, encoding) pair;
whether the result is meaningful is another question.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] collation support revisited (phase 1)

From:
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> I think if we support UTF8 encoding, than it make sense to create own charsets, 
> because system locales could have defined collation for that.

Say what?  I cannot imagine a scenario in which a user-defined encoding
would be useful. The amount of infrastructure you need for a new
encoding is so large that providing management commands is just silly
--- anyone who can create the infrastructure can do the last little bit
for themselves.  The analogy to index access methods is on point, again.
        regards, tom lane


Re: [WIP] collation support revisited (phase 1)

From:
"Radek Strnad"
Date:
I was trying to sort out the problem of not creating a new catalog for character
sets, and I came up with the following ideas. Correct me if my ideas are wrong.

Since a collation has to have a defined character set, I'm suggesting to use the
already written encoding infrastructure and the list of encodings in
chklocale.c. Currently databases are created not with a specified character set
but with a specified encoding. I think that instead of pointing a record in the
collation catalog to another record in a character set catalog, we might use
only the name (string) of the encoding.

So each collation will be defined over the encodings set in chklocale.c. Each
database will be able to use only those collations that are created over the
same ("compatible") encodings according to encoding_match_list. Each standard
(SQL standard) collation will be defined over all possible encodings
(hard-coded).

Comments?

Regards

     Radek Strnad

On Sat, Jul 12, 2008 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> > I think if we support UTF8 encoding, than it make sense to create own charsets,
> > because system locales could have defined collation for that.
>
> Say what?  I cannot imagine a scenario in which a user-defined encoding
> would be useful. The amount of infrastructure you need for a new
> encoding is so large that providing management commands is just silly
> --- anyone who can create the infrastructure can do the last little bit
> for themselves.  The analogy to index access methods is on point, again.
>
>                        regards, tom lane

Re: [WIP] collation support revisited (phase 1)

From:
Martijn van Oosterhout
Date:
On Mon, Jul 21, 2008 at 03:15:56AM +0200, Radek Strnad wrote:
> I was trying to sort out the problem with not creating new catalog for
> character sets and I came up following ideas. Correct me if my ideas are
> wrong.
>
> Since collation has to have a defined character set.

Not really. AIUI at least glibc and ICU define a collation over all
possible characters (ie unicode). When you create a locale you take a
subset and use that. Think about it: if you want to sort strings and
one of them happens to contain a Chinese character, it can't *fail*.
Note strcoll() has no error return for unknown characters.

> I'm suggesting to use
> already written infrastructure of encodings and to use list of encodings in
> chklocale.c. Currently databases are not created with specified character
> set but with specified encoding. I think instead of pointing a record in
> collation catalog to another record in character set catalog we might use
> only name (string) of the encoding.

That's reasonable. From an abstract point of view collations and
encodings are orthogonal; it's only when you're using POSIX locales
that there are limitations on how you combine them. I think you can
assume a collation can handle any characters that can be produced by
the encoding.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Re: [WIP] collation support revisited (phase 1)

From:
Zdenek Kotala
Date:
Martijn van Oosterhout wrote:
> On Sat, Jul 12, 2008 at 10:02:24AM +0200, Zdenek Kotala wrote:
>> Background:
>> We specify encoding in initdb phase. ANSI specify repertoire, charset, 
>> encoding and collation. If I understand it correctly, then charset is 
>> subset of repertoire and specify list of allowed characters for 
>> language->collation. Encoding is mapping of character set to binary format. 
>> For example for Czech alphabet(charset) we have 6 different encoding for 
>> 8bit ASCII, but on other side for UTF8 there is specified multi charsets.
> 
> Oh, so you're thinking of a charset as a sort of check constraint. If
> your locale is turkish and you have a column marked charset ASCII then
> storing lower('HI') results in an error.

Yeah, if you use the strcoll function, it fails when an illegal character is found.
See
http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

> A collation must be defined over all possible characters, it can't
> depend on the character set. That doesn't mean sorting in en_US must do
> something meaningful with japanese characters, it does mean it can't
> throw an error (the usual procedure is to sort on unicode point).

A collation cannot be defined over arbitrary characters. There is no relation
between Latin and Chinese characters. A collation makes sense when you are able
to specify the < = > operators.

If you need to compare Japanese and Latin characters, then ANSI specifies a
default collation for each repertoire. I think it is usually bitwise comparison.

    Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] collation support revisited (phase 1)

From:
Zdenek Kotala
Date:
Tom Lane wrote:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> I think if we support UTF8 encoding, than it make sense to create own charsets, 
>> because system locales could have defined collation for that.
> 
> Say what?  I cannot imagine a scenario in which a user-defined encoding
> would be useful. 

I did not mean a user-defined encoding but a user-defined charset. For example,
the cs_CZ.UTF-8@euro locale uses the UTF8 encoding, and its collation is defined
on the Czech charset, which specifies the list of allowed characters. If
somebody has e.g. a Turkish locale installed, then he will also want to have a
Turkish charset in postgres. I guess the charset also defines how upper/lower
case works (see i/I in Turkish).

Please, correct me if I'm wrong.
    thanks Zdenek

-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] collation support revisited (phase 1)

From:
Zdenek Kotala
Date:
Martijn van Oosterhout wrote:
> On Mon, Jul 21, 2008 at 03:15:56AM +0200, Radek Strnad wrote:
>> I was trying to sort out the problem with not creating new catalog for
>> character sets and I came up following ideas. Correct me if my ideas are
>> wrong.
>>
>> Since collation has to have a defined character set.
> 
> Not really. AIUI at least glibc and ICU define a collation over all
> possible characters (ie unicode). When you create a locale you take a
> subset and use that. Think about it: if you want to sort strings and
> one of them happens to contain a Chinese character, it can't *fail*.
> Note strcoll() has no error return for unknown characters.

It has.
See http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

The strcoll() function may fail if:
    [EINVAL]  [CX]  The s1 or s2 arguments contain characters outside
    the domain of the collating sequence.


>> I'm suggesting to use
>> already written infrastructure of encodings and to use list of encodings in
>> chklocale.c. Currently databases are not created with specified character
>> set but with specified encoding. I think instead of pointing a record in
>> collation catalog to another record in character set catalog we might use
>> only name (string) of the encoding.
> 
> That's reasonable. From an abstract point of view collations and
> encodings are orthoginal, it's only when you're using POSIX locales
> that there are limitations on how you combine them. I think you can
> assume a collation can handle any characters that can be produced by
> encoding.

I think you are not correct. You cannot use one collation over all of Unicode. See
http://www.unicode.org/reports/tr10/#Common_Misperceptions. The same characters
can be ordered differently in different languages.
    Zdenek



-- 
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql



Re: [WIP] collation support revisited (phase 1)

From:
Martijn van Oosterhout
Date:
On Tue, Jul 22, 2008 at 04:32:39PM +0200, Zdenek Kotala wrote:
> >Oh, so you're thinking of a charset as a sort of check constraint. If
> >your locale is turkish and you have a column marked charset ASCII then
> >storing lower('HI') results in an error.
>
> Yeah, if you use strcoll function it fails when illegal character is found.
> See
> http://www.opengroup.org/onlinepubs/009695399/functions/strcoll.html

Weird; at least in glibc and ICU it can't happen, but perhaps there are
other implementations where it can...

> Collation cannot be defined on any character. There is not any relation
> between
> Latin and Chines characters. Collation has sense when you are able to
> specify < = > operators.

There is no standardised relation. However, if your collation library
decides to sort all Chinese characters after all Latin characters, it
will have defined a collation that works for strings containing any
characters... which is basically the approach glibc/ICU takes.

I think it is kind of pathetic that the standard lets strcoll set
errno but reserves no return value to indicate the error. I wonder how
many platforms actually use that "feature".

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.