Re: UTF8MatchText

From	Tom Lane
Subject	Re: UTF8MatchText
Date	May 20, 2007, 13:58:53
Msg-id	2132.1179680285@sss.pgh.pa.us
In reply to	Re: UTF8MatchText (Andrew Dunstan <andrew@dunslane.net>)
Responses	Re: UTF8MatchText, Re: UTF8MatchText
List	pgsql-patches

Andrew Dunstan <andrew@dunslane.net> writes:
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characters - the pattern might
> contain %BA and the string AB. However, this isn't a danger for UTF8,
> which leads me to think that we do indeed need a special case for
> UTF8, but for a different improvement from that proposed in the
> original patch. I'll post an updated patch shortly.

> Here is a patch that implements this. Please analyse for possible
> breakage.

On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?
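The property this relies on can be sketched briefly. In UTF-8 every continuation byte has the form 10xxxxxx, while a valid pattern always begins with an ASCII byte or a lead byte, so a byte-wise search can only hit at a character boundary. The helper names below (utf8_is_continuation, byte_find) are hypothetical and are not taken from the PostgreSQL source; this is an illustration under the assumption that both inputs are valid UTF-8.

```c
#include <stdbool.h>
#include <string.h>

/* True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
static bool
utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Hypothetical helper: plain byte-wise substring search.
 * Returns the byte offset of the first hit, or -1 if there is none.
 * On a hit, *on_boundary reports whether the hit starts on a
 * character boundary, i.e. not on a continuation byte.
 */
static long
byte_find(const char *text, const char *needle, bool *on_boundary)
{
    const char *hit = strstr(text, needle);

    if (hit == NULL)
        return -1;
    *on_boundary = !utf8_is_continuation((unsigned char) *hit);
    return (long) (hit - text);
}
```

In an encoding where the trailing byte of one character can equal the lead byte of another (the AB/BA case above), the same byte-wise search could report a hit in the middle of a character.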

It strikes me that we may be overcomplicating matters in another way
too.  If you believe that the %-scan code is now the bottleneck, that
is, the key loop is where we have pattern '%foo' and we are trying to
match 'f' to each successive data position, then you should be bothered
that SB_MatchTextIC is applying tolower() to 'f' again for each data
character.  Worst-case we could have O(N^2) applications of tolower()
during a match.  I think there's a fair case to be made that we should
get rid of SB_MatchTextIC and implement *all* the case-insensitive
variants by means of an initial lower() call.  This would leave us with
just two match functions and allow considerable unification of the setup
logic.
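
A rough sketch of the "lower once, then match case-sensitively" idea, for illustration only. It assumes a single-byte encoding and handles only '%' and '_' with no escape character. The names sketch_match, lower_copy and sketch_match_ic are hypothetical and are not the actual SB_MatchText code.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical case-sensitive matcher: '%' and '_' only, no escapes. */
static bool
sketch_match(const char *t, const char *p)
{
    while (*p)
    {
        if (*p == '%')
        {
            while (*p == '%')       /* collapse consecutive '%' */
                p++;
            if (*p == '\0')
                return true;        /* trailing '%' matches the rest */
            /* The %-scan loop: try each remaining start position. */
            for (; *t; t++)
            {
                if ((*p == '_' || *p == *t) && sketch_match(t, p))
                    return true;
            }
            return false;
        }
        if (*t == '\0')
            return false;           /* pattern left over, text exhausted */
        if (*p != '_' && *p != *t)
            return false;
        t++;
        p++;
    }
    return *t == '\0';
}

/* Return a malloc'd lower-cased copy of s, or NULL on allocation failure. */
static char *
lower_copy(const char *s)
{
    size_t      n = strlen(s);
    char       *out = malloc(n + 1);

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i <= n; i++)     /* also copies the terminator */
        out[i] = (char) tolower((unsigned char) s[i]);
    return out;
}

/* Case-insensitive match: one tolower() per input byte, done up front. */
static bool
sketch_match_ic(const char *text, const char *pattern)
{
    char       *t = lower_copy(text);
    char       *p = lower_copy(pattern);
    bool        result = (t != NULL && p != NULL) && sketch_match(t, p);

    free(t);
    free(p);
    return result;
}
```

With this shape, the %-scan loop compares raw bytes, and the number of tolower() calls is linear in the combined input length no matter how often the scan restarts.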

            regards, tom lane
