Speeding up text_position_next with multibyte encodings
| From | Heikki Linnakangas |
|---|---|
| Subject | Speeding up text_position_next with multibyte encodings |
| Date | |
| Msg-id | 3173d989-bc1c-fc8a-3b69-f24246f73876@iki.fi |
| Replies | Re: Speeding up text_position_next with multibyte encodings<br>Re: Speeding up text_position_next with multibyte encodings |
| List | pgsql-hackers |
Attached is a patch to speed up text_position_setup/next(), in some common cases with multibyte encodings.

text_position_next() uses the Boyer-Moore-Horspool search algorithm, with a skip table. Currently, with a multi-byte encoding, we first convert both input strings to arrays of wchars, and run the algorithm on those arrays. This patch removes the mb->wchar conversion, and instead runs the single-byte version of the algorithm directly on the inputs, even with multi-byte encodings (see the first sketch below). That avoids the mb->wchar conversion altogether when there is no match. Even when there is a match, we don't need to convert the whole input string to wchars: it's enough to count the characters up to the match, using pg_mblen() (see the third sketch below). And when text_position_setup/next() are used as part of the split_part() or replace() functions, we're not actually interested in the character position at all, so we can skip that too.

Avoiding the large allocation is nice, too. That was actually why I started to look into this: a customer complained that text_position fails with strings larger than 256 MB. (The mb->wchar conversion needs up to four bytes per character, so the converted copy of such a string exceeds the 1 GB allocation limit.)

Using the byte-oriented search on multibyte strings might return false positives, though. To make that work, after finding a match, we verify that the match doesn't fall in the middle of a multi-byte character (see the second sketch below). However, as an important special case, that cannot happen with UTF-8, because it has the property that the byte sequence of one character never appears as part of another character. I think some other encodings might have the same property, but I wasn't sure.

This is a win in most cases. One case is slower: calling position() with a large haystack input, where the match is near the end of the string. Counting the characters up to that point is slower than the mb->wchar conversion. I think we could avoid that regression too, if we had a more optimized function to count characters. Currently, the code calls pg_mblen() in a loop, whereas in pg_mb2wchar_with_len(), the similar loop over all the characters is a tight loop within the function.

Thoughts?

- Heikki
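For illustration, here is a minimal, standalone sketch of a byte-oriented Boyer-Moore-Horspool search with a skip table, i.e. the technique applied directly to the raw bytes. This is not the patch itself; the names are hypothetical and none of PostgreSQL's internal APIs are used.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Byte-oriented Boyer-Moore-Horspool: find the first occurrence of
 * needle in haystack, treating both as plain byte strings. Returns a
 * pointer to the match, or NULL if there is none.
 */
static const char *
bmh_search(const char *haystack, size_t hlen,
           const char *needle, size_t nlen)
{
    size_t skip[256];
    size_t i;

    if (nlen == 0 || nlen > hlen)
        return NULL;

    /* A byte that doesn't occur in the needle lets us skip nlen bytes. */
    for (i = 0; i < 256; i++)
        skip[i] = nlen;
    /* Bytes in the needle (except its last byte) allow a shorter skip. */
    for (i = 0; i + 1 < nlen; i++)
        skip[(unsigned char) needle[i]] = nlen - 1 - i;

    /* Slide the window by the skip distance of its last byte. */
    for (i = 0; i <= hlen - nlen;
         i += skip[(unsigned char) haystack[i + nlen - 1]])
    {
        if (memcmp(haystack + i, needle, nlen) == 0)
            return haystack + i;
    }
    return NULL;
}

int
main(void)
{
    const char *hay = "a\xc3\xb1il a\xc3\xb1ejo";   /* UTF-8: "añil añejo" */
    const char *needle = "a\xc3\xb1ejo";            /* UTF-8: "añejo" */
    const char *match = bmh_search(hay, strlen(hay), needle, strlen(needle));

    printf("match at byte offset %td\n", match ? match - hay : (ptrdiff_t) -1);
    return 0;
}
```

In this UTF-8 example the byte-level match is necessarily a character-level match as well; that is exactly the self-synchronizing property of UTF-8 the mail relies on.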
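The post-match verification for encodings that lack UTF-8's self-synchronizing property could look roughly like this. It is only a sketch: enc_char_len() is a hypothetical callback standing in for pg_mblen(), returning the byte length of the character starting at a given position.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Check whether a byte offset found by the byte-oriented search falls
 * on a character boundary. Walking forward from the start of the
 * string visits exactly the boundaries; if the walk steps over
 * match_offset without landing on it, the match starts in the middle
 * of a multi-byte character and is a false positive.
 */
static bool
match_on_char_boundary(const char *str, size_t match_offset,
                       int (*enc_char_len)(const char *s))
{
    size_t off = 0;

    while (off < match_offset)
        off += enc_char_len(str + off);

    return off == match_offset;
}
```

The walk only runs after a candidate match, so the common no-match case pays nothing for it.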
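Finally, a sketch of the character counting that position() still needs once a byte-level match is found. utf8_char_len() is an illustrative stand-in for pg_mblen() and assumes valid UTF-8 input starting at a character boundary; a tight per-encoding loop of this shape is the kind of "more optimized function to count characters" suggested above.

```c
#include <stddef.h>

/* Byte length of the UTF-8 character starting at s (valid input assumed). */
static int
utf8_char_len(const char *s)
{
    unsigned char b = (unsigned char) *s;

    if (b < 0x80) return 1;     /* 0xxxxxxx: ASCII */
    if (b < 0xE0) return 2;     /* 110xxxxx: 2-byte sequence */
    if (b < 0xF0) return 3;     /* 1110xxxx: 3-byte sequence */
    return 4;                   /* 11110xxx: 4-byte sequence */
}

/*
 * Convert a byte offset into the 1-based character position that
 * position() must report. Only the bytes up to the match are scanned;
 * no wchar copy of the whole string is needed.
 */
static size_t
char_position(const char *str, size_t match_offset)
{
    size_t off = 0;
    size_t nchars = 0;

    while (off < match_offset)
    {
        off += utf8_char_len(str + off);
        nchars++;
    }
    return nchars + 1;
}
```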
Attachments