Re: Re: LIKE gripes

From: Thomas Lockhart
Subject: Re: Re: LIKE gripes
Date:
Msg-id: 39940A27.EA03F12D@alumni.caltech.edu
In response to: Re: Re: LIKE gripes  (Thomas Lockhart <lockhart@alumni.caltech.edu>)
Responses: Re: Re: LIKE gripes  (Tatsuo Ishii <t-ishii@sra.co.jp>)
List: pgsql-hackers
> > I think I have a solution for the current code; could someone test its
> > behavior with MB enabled? It is now committed to the source tree; I know
> > it compiles, but afaik am not equipped to test it :(
> It passed the MB test, but fails the string test. Yes, I know it fails
> because ILIKE for MB is not implemented (yet). I'm looking forward to
> implementing the missing part. Is it ok for you, Thomas?

Whew! I'm glad "fails the string test" is because of the ILIKE/tolower()
issue; I was afraid you would say "... because Thomas' bad code dumps
core..." :)

Yes, feel free to implement the missing parts. I'm not even sure how to
do it! Do you think it would be best in the meantime to disable the
ILIKE tests, or perhaps to separate that out into a different test?
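To illustrate the tolower() issue (my own Python sketch, not PostgreSQL code): in Shift-JIS the second byte of a two-byte character can fall in the ASCII 'A'-'Z' range, so a C-style byte-by-byte tolower() silently turns one character into a different one.

```python
def bytewise_tolower(data: bytes) -> bytes:
    # Mimic a naive C loop applying tolower() to each byte of a char*.
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in data)

original = "ア"                          # katakana A
sjis = original.encode("shift_jis")      # b'\x83\x41': trail byte is ASCII 'A'
corrupted = bytewise_tolower(sjis)       # b'\x83\x61': a *different* character

assert corrupted.decode("shift_jis") != original

# The MB-aware approach: decode, lower in character space, re-encode.
safe = sjis.decode("shift_jis").lower().encode("shift_jis")
assert safe == sjis                      # katakana has no case, so unchanged
```

So any ILIKE path for MB has to work on whole characters (or on a wide-character intermediate form), never on raw bytes.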

> Please note that the existing MB implementation does not need such an
> extra conversion cost except for some MB-aware functions (text_length
> etc.), regex, LIKE, and the input/output stage. Also, MB stores native
> encodings 'as is' on the disk.

Yes. I am probably getting a skewed view of MB since the LIKE code is an
edge case which illustrates the difficulties in handling character sets
in general no matter what solution is used.

> Anyway, it looks like MB would eventually be merged into/deprecated by
> your new implementation of multiple encodings support.

I've started writing up a description of my plans (based on our previous
discussions), and as I do so I appreciate your current solution more and
more ;) IMHO you have solved several issues, such as storage format,
client/server communication, and mixed-encoding comparison and
manipulation, all of which would need to be solved again by a "new
implementation".

My current thought is to leave MB intact and to start implementing
"character sets" as distinct types (I know you have said that this is a
lot of work, and I agree that is true for the complete set). Once I have
done one or a few character sets (perhaps using a Latin subset of
Unicode, so I can test it by converting between ASCII and Unicode using
character sets I know how to read ;), then we can start implementing a
"complete solution" for those character sets, which includes character
and string comparison building blocks like "<", ">", and "tolower()",
full comparison functions, and conversion routines between different
character sets.
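As a sketch of the conversion-routine building block mentioned above (my illustration, with Python's codecs standing in for per-character-set types), conversions between any two character sets can pivot through Unicode:

```python
def recode(data: bytes, src_charset: str, dst_charset: str) -> bytes:
    """Convert a byte string between character sets, via Unicode as pivot."""
    return data.decode(src_charset).encode(dst_charset)

latin1_bytes = "merci, señor".encode("latin-1")
utf8_bytes = recode(latin1_bytes, "latin-1", "utf-8")

# Conversions round-trip as long as every character exists in both sets.
assert recode(utf8_bytes, "utf-8", "latin-1") == latin1_bytes
```

With one such routine per character-set pair (or per set, to and from the pivot), the per-type code stays small.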

But that by itself does not solve, for example, client/server encoding
issues, so let's think about that again once we have some "type-full"
character sets to play with. The default solution will of course use MB
to handle this.

> BTW, Thomas, do you have a plan to support collation functions?

Yes, that is something that I hope will come out naturally from a
combination of SQL9x language features and use of the type system to
handle character sets. Then, for example (hmm, examples might be better
in Japanese since you have such a rich mix of encodings ;),
 CREATE TABLE t1 (name TEXT COLLATE francais);

will (or might ;) result in using the "francais" data type for the name
column.
 SELECT * FROM t1 WHERE name < _FRANCAIS 'merci';

will use the "francais" data type for the string literal. And
 CREATE TABLE t1 (name VARCHAR(10) CHARACTER SET latin1 COLLATE francais);

will (might?) use, say, the "latin1_francais" data type. Each of these
data types will be a loadable module (which could be installed into
template1 to make them available to every new database), and each can
reuse underlying support routines to avoid as much duplicate code as
possible.
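To show what a "francais" comparison routine would have to do differently from plain byte-wise comparison, here is a rough Python sketch of mine (real French collation has more rules, e.g. accents are compared right-to-left): accented letters are reduced to their base letters before comparing, so they sort with "e" rather than after "z".

```python
import unicodedata

def francais_key(s: str) -> str:
    # Crude primary-strength key: lowercase and strip combining accents,
    # so "é" sorts with "e" instead of after "z".
    decomposed = unicodedata.normalize("NFD", s.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

words = ["zoo", "étage"]
print(sorted(words))                    # codepoint order: ['zoo', 'étage']
print(sorted(words, key=francais_key))  # French-ish order: ['étage', 'zoo']
```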

Maybe a default encoding would be defined for each type, say "latin1"
for "francais", so that the backend or some external scripts can help
set these up. There is a good chance we will need (yet another) system
table to allow us to tie these types into character sets and collations;
otherwise Postgres might not be able to recognize that a type is
implementing these language features and, for example, pg_dump might not
be able to reconstruct the correct table creation syntax.

I notice that SQL99 has *a lot* of new specifics on character set
support, which prescribe things like CREATE COLLATION... and DROP
COLLATION... This means that there is less thinking involved in the
syntax but more work to make those exact commands fit into Postgres.
SQL92 left most of this as an exercise for the reader. I'd be happier if
we knew this stuff *could* be implemented by seeing another DB implement
it. Are you aware of any that do (besides our own of course)?
                    - Thomas


In the pgsql-hackers list, by date sent:

Previous
From: Tom Lane
Date:
Message: Re: Identified a problem in pg_dump with serial data type and mixed case
Next
From: Tom Lane
Date:
Message: Re: CREATE INDEX test_idx ON test (UPPER(varchar_field)) doesn't work...