Re: UTF8MatchText
From | Tom Lane |
---|---|
Subject | Re: UTF8MatchText |
Date | |
Msg-id | 2132.1179680285@sss.pgh.pa.us |
In response to | Re: UTF8MatchText (Andrew Dunstan <andrew@dunslane.net>) |
Responses | Re: UTF8MatchText; Re: UTF8MatchText |
List | pgsql-patches |
Andrew Dunstan <andrew@dunslane.net> writes:
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characters - the pattern might
> contain %BA and the string AB. However, this isn't a danger for UTF8,
> which leads me to think that we do indeed need a special case for
> UTF8, but for a different improvement from that proposed in the
> original patch. I'll post an updated patch shortly.

> Here is a patch that implements this. Please analyse for possible
> breakage.

On the strength of this analysis, shouldn't we drop the separate UTF8
match function and just use SB_MatchText for UTF8?

It strikes me that we may be overcomplicating matters in another way
too. If you believe that the %-scan code is now the bottleneck, that
is, the key loop is where we have pattern '%foo' and we are trying to
match 'f' to each successive data position, then you should be bothered
that SB_MatchTextIC is applying tolower() to 'f' again for each data
character. Worst-case we could have O(N^2) applications of tolower()
during a match. I think there's a fair case to be made that we should
get rid of SB_MatchTextIC and implement *all* the case-insensitive
variants by means of an initial lower() call. This would leave us with
just two match functions and allow considerable unification of the
setup logic.

			regards, tom lane
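
The idea of folding case once up front, instead of calling tolower() inside the %-scan loop, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PostgreSQL code: sketch_match, sketch_lower and sketch_match_ic are hypothetical names, the matcher handles only % and _ (no escape character), and it assumes a single-byte encoding where tolower() can be applied byte by byte.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical, simplified case-sensitive LIKE matcher for single-byte
 * text. Supports '%' (any run of characters) and '_' (any one character).
 * No tolower() calls happen anywhere in the matching loops.
 */
static bool
sketch_match(const char *t, const char *p)
{
    while (*p != '\0')
    {
        if (*p == '%')
        {
            /* Collapse a run of consecutive '%' */
            while (*p == '%')
                p++;
            if (*p == '\0')
                return true;    /* trailing '%' matches the rest */

            /* The %-scan: look for a place in the text to resume matching */
            for (; *t != '\0'; t++)
            {
                /* Cheap first-character test before recursing */
                if ((*p == '_' || *t == *p) && sketch_match(t, p))
                    return true;
            }
            return false;
        }
        if (*t == '\0')
            return false;       /* pattern needs a character, text is done */
        if (*p != '_' && *p != *t)
            return false;
        t++;
        p++;
    }
    return *t == '\0';
}

/* Return a freshly allocated lower-cased copy of s, or NULL on failure. */
static char *
sketch_lower(const char *s)
{
    size_t  n = strlen(s);
    char   *out = malloc(n + 1);

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        out[i] = (char) tolower((unsigned char) s[i]);
    out[n] = '\0';
    return out;
}

/*
 * Case-insensitive match built on the case-sensitive matcher: each input
 * is folded exactly once, so tolower() runs O(N) times in total rather
 * than once per comparison inside the scan. Returns false if allocation
 * fails (a simplification for this sketch).
 */
static bool
sketch_match_ic(const char *t, const char *p)
{
    char   *lt = sketch_lower(t);
    char   *lp = sketch_lower(p);
    bool    result = (lt != NULL && lp != NULL) && sketch_match(lt, lp);

    free(lt);
    free(lp);
    return result;
}
```

With this shape, a call such as sketch_match_ic("Hello FOO", "%foo") pays for two linear lower-casing passes and then runs the same loop as the case-sensitive path, which is the unification the message argues for. The trade-off is the extra allocation and copy per call.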