Re: UTF8MatchText
From | Andrew Dunstan |
---|---|
Subject | Re: UTF8MatchText |
Date | |
Msg-id | 464CD63B.7000609@dunslane.net |
In reply to | Re: UTF8MatchText (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: UTF8MatchText |
List | pgsql-patches |
Tom Lane wrote:
>
> * At a pattern backslash, it applies CHAREQ() but then advances
> byte-by-byte over the matched characters (implicitly assuming that none
> of these bytes will look like the magic characters). While that works
> for backend-safe encodings, it seems a bit strange; you've already paid
> the price of determining the character length once, not to mention
> matching the bytes of the characters once, and then throw that knowledge
> away. I think BYTEEQ would make more sense in the backslash path.
>

Probably, although the use of CHAREQ is in the present code.

Is it legal to follow escape by anything other than _ % or escape?

>
> So the actual optimization here is that we do bytewise comparison and
> advancing, but only when we are either at the start of a character
> (on both sides, and the pattern char is not wildcard) or we are in the
> middle of a character (on both sides) and we've already proven that both
> sides matched for the previous byte(s) of the character.
>

I think that's correct.

> On the strength of this closer reading, I would say that the patch isn't
> relying on UTF8's first-byte-vs-not-first-byte property after all.
> All that it's relying on is that no MB character is a prefix of another
> one, which seems like a necessary property for any sane encoding; plus
> that characters are considered equal only if they're bytewise equal.
> So are we sure it doesn't work for non-UTF8 encodings? Maybe that
> earlier conclusion was based on a misunderstanding of what the patch
> really does.
>

Indeed.

One more thing - I'm thinking of rolling up the bytea matching routines as well as the text routines to eliminate all the duplication of logic. I can do it by a little type casting from bytea* to text* and back again, or if that's not acceptable by some preprocessor magic. I think the casting is likely to be safe enough in this case - I don't think a null byte will hurt us anywhere in this code - and presumably the varlena stuff is all the same. Does that sound reasonable?

cheers

andrew
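To make the bytewise argument in the quoted text concrete, here is a minimal sketch of a LIKE-style matcher that compares ordinary pattern bytes one at a time and only steps by whole characters at the `_` and `%` wildcards. It is not the PostgreSQL MatchText code or the patch under review: the names `utf8_char_len`, `like_match`, and `like_cstr` are hypothetical, input is assumed to be valid UTF-8, and a trailing escape simply returns false here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 character whose lead
 * byte is c. Assumes valid UTF-8 input. */
static size_t utf8_char_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    return 4;
}

/* Simplified LIKE: '%' matches any sequence of characters, '_' matches
 * one character, '\' escapes the next pattern byte. */
static bool like_match(const char *t, size_t tlen,
                       const char *p, size_t plen)
{
    while (plen > 0)
    {
        if (*p == '\\')
        {
            /* Escape: skip the backslash, then compare the next pattern
             * byte literally (the BYTEEQ-style path). */
            p++; plen--;
            if (plen == 0) return false;   /* sketch: dangling escape */
            if (tlen == 0 || *t != *p) return false;
            t++; tlen--; p++; plen--;
        }
        else if (*p == '%')
        {
            p++; plen--;
            if (plen == 0) return true;    /* trailing % matches the rest */
            /* Try the rest of the pattern at each character boundary. */
            while (tlen > 0)
            {
                if (like_match(t, tlen, p, plen)) return true;
                size_t n = utf8_char_len((unsigned char) *t);
                if (n > tlen) n = tlen;    /* guard against truncation */
                t += n; tlen -= n;
            }
            return like_match(t, tlen, p, plen);
        }
        else if (*p == '_')
        {
            /* Wildcard: consume one whole character of the text. */
            if (tlen == 0) return false;
            size_t n = utf8_char_len((unsigned char) *t);
            if (n > tlen) n = tlen;
            t += n; tlen -= n;
            p++; plen--;
        }
        else
        {
            /* Ordinary byte: plain bytewise comparison. Both sides are
             * either at a character start, or mid-character with all
             * earlier bytes of that character already matched. */
            if (tlen == 0 || *t != *p) return false;
            t++; tlen--; p++; plen--;
        }
    }
    return tlen == 0;
}

/* Convenience wrapper for NUL-terminated strings. */
static bool like_cstr(const char *text, const char *pattern)
{
    return like_match(text, strlen(text), pattern, strlen(pattern));
}
```

In this sketch the only places that need the character length are the two wildcard branches. Everything else is a byte comparison, which stays in step on both sides as long as no character's encoding is a prefix of another's and no trailing byte of a multibyte character can be mistaken for `%`, `_`, or the escape.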