Re: Re: LIKE gripes
From | Thomas Lockhart |
---|---|
Subject | Re: Re: LIKE gripes |
Date | |
Msg-id | 399024E9.6CB300F6@alumni.caltech.edu |
In reply to | RE: Re: LIKE gripes ("Hiroshi Inoue" <Inoue@tpf.co.jp>) |
Responses | Re: Re: LIKE gripes |
List | pgsql-hackers |
> Where has MULTIBYTE Stuff in like.c gone ?

Uh, I was wondering where it was in the first place! Will fix it asap...

There was some string copying stuff in a middle layer of the like() code, but I had thought that it was there only to get a null-terminated string. When I rewrote the code to eliminate the need for null termination (by using the length attribute of the text data type), the need for copying went away. Or so I thought :(

The other piece of the puzzle is that the lowest-level like() support routine traversed the strings using the increment operator, so I didn't understand that there was any MB support in there. I now see that *all* of these strings get stuffed into unsigned int arrays during copying; I had (sort of) understood some of the encoding schemes (most use a combination of one- to three-byte sequences for each character) and didn't realize that this normalization was being done on the fly.

So, this answers some questions I have related to implementing character sets:

1) For each character set, we would need to provide operators for "next character" and for boolean comparisons. Why don't we have those now? Answer: because everything is getting promoted to a 32-bit internal encoding every time a comparison or traversal is required.

2) For each character set, we would need to provide conversion functions to other "compatible" character sets, or to a character "superset". Why don't we have those conversion functions? Answer: we do! There is an internal 32-bit encoding within which all comparisons are done.

Anyway, I think it will be pretty easy to put the MB stuff back in, by #ifdef'ing some string copying inside each of the routines (such as namelike()). The underlying routine no longer requires a null-terminated string (using explicit lengths instead), so I'll generate those lengths in the same place unless they are already provided by the char->int MB support code.

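To make the widening-plus-explicit-lengths idea concrete, here is a minimal sketch in C. It is not the like.c code: the names wide_char, widen(), wide_like() and mb_like() are hypothetical, the 256-character buffer limit is arbitrary, and the two-byte rule in widen() is a toy encoding assumed purely for illustration, not any real character set. It only shows why a '_' in the pattern can match one multibyte character once both strings sit in unsigned int arrays, and how the matcher can work from lengths with no null terminator.

```c
#include <stddef.h>

/* Hypothetical wide type; the post only says "unsigned int arrays". */
typedef unsigned int wide_char;

enum { SKETCH_MAX = 256 };      /* arbitrary buffer limit for this sketch */

/*
 * Widen a multibyte string of known byte length, one array slot per
 * character. ASSUMPTION: a toy scheme in which a byte with the high bit
 * set is followed by exactly one trail byte. Returns the character count;
 * "out" must have room for "len" entries.
 */
static size_t
widen(const unsigned char *src, size_t len, wide_char *out)
{
    size_t i = 0;
    size_t n = 0;

    while (i < len)
    {
        if ((src[i] & 0x80) && i + 1 < len)
        {
            out[n++] = ((wide_char) src[i] << 8) | src[i + 1];
            i += 2;
        }
        else
            out[n++] = src[i++];
    }
    return n;
}

/*
 * LIKE-style match driven by explicit lengths, so neither input needs a
 * terminator. '%' matches any run of characters, '_' exactly one.
 * Returns 1 on match, 0 otherwise.
 */
static int
wide_like(const wide_char *s, size_t slen, const wide_char *p, size_t plen)
{
    while (plen > 0)
    {
        if (*p == (wide_char) '%')
        {
            size_t k;

            p++;
            plen--;
            if (plen == 0)
                return 1;       /* trailing '%' matches the rest */
            /* try every possible length for the run that '%' absorbs */
            for (k = 0; k <= slen; k++)
                if (wide_like(s + k, slen - k, p, plen))
                    return 1;
            return 0;
        }
        if (slen == 0)
            return 0;           /* pattern wants a character, string is done */
        if (*p != (wide_char) '_' && *p != *s)
            return 0;
        s++;
        slen--;
        p++;
        plen--;
    }
    return slen == 0;
}

/*
 * Byte-level entry point: widen both inputs, then match.
 * Returns 1 on match, 0 on no match, -1 if an input exceeds SKETCH_MAX.
 */
static int
mb_like(const unsigned char *s, size_t slen,
        const unsigned char *p, size_t plen)
{
    wide_char sbuf[SKETCH_MAX];
    wide_char pbuf[SKETCH_MAX];
    size_t sn;
    size_t pn;

    if (slen > SKETCH_MAX || plen > SKETCH_MAX)
        return -1;
    sn = widen(s, slen, sbuf);
    pn = widen(p, plen, pbuf);
    return wide_like(sbuf, sn, pbuf, pn);
}
```

Under the toy encoding, the five bytes 'a', 'b', 0xB0, 0xA1, 'z' widen to four characters, so mb_like() matches them against the pattern "ab_z"; a byte-by-byte matcher would let '_' consume only 0xB0 and fail. Passing a length shorter than the buffer (say, the first 5 bytes of "hello world") works as well, since nothing relies on a terminator.
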
In the future, I'd like to see us use alternate encodings as-is, or as a common set like Unicode (16 bits wide afaik), rather than having to do this widening to 32 bits on the fly. Then each supported character set can be efficiently manipulated internally, and only converted to another encoding when mixing with another character set.

Any and all advice welcome and accepted (though "keep your hands off the MB code!" seems a bit too late ;)

Sorry for the shake-up...

- Thomas