Re: Speeding up text_position_next with multibyte encodings
From | John Naylor |
---|---|
Subject | Re: Speeding up text_position_next with multibyte encodings |
Date | |
Msg-id | CAJVSVGUVe9G6yUb3F-YrC_myrj_nqhagYrpmP3OT8sUjzvAuaA@mail.gmail.com |
In response to | Speeding up text_position_next with multibyte encodings (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Speeding up text_position_next with multibyte encodings
Re: Speeding up text_position_next with multibyte encodings |
List | pgsql-hackers |
On 11/30/18, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Unfortunately, patch doesn't compile anymore due:
>
> varlena.c: In function ‘text_position_next_internal’:
> varlena.c:1337:13: error: ‘start_ptr’ undeclared (first use in this
> function)
> Assert(start_ptr >= haystack && start_ptr <= haystack_end);
>
> Could you please send an updated version? For now I'm moving it to the next
> CF.

I signed up to be a reviewer, and I will be busy next month, so I went ahead and fixed the typo in the patch that broke assert-enabled builds. While at it, I standardized on the spelling "start_ptr" in a few places to match the rest of the file. It's a bit concerning that it wouldn't compile with asserts, but the patch was written by a committer and seems to work.

On 10/19/18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This is a win in most cases. One case is slower: calling position() with
> a large haystack input, where the match is near the end of the string.

I wanted to test this case in detail, so I ran the attached script, which runs position() in three scenarios:

short: 1 character needle at the end of a ~1000 character haystack, repeated 1 million times

medium: 50 character needle at the end of a ~1 million character haystack, repeated 1 million times

long: 250 character needle at the end of an ~80 million character haystack (~230MB, comfortably below the 256MB limit Heikki reported), done just one time

I took the average of 5 runs using Korean characters in both UTF-8 (3-byte) and EUC-KR (2-byte) encodings.

UTF-8

length | master | patch |
---|---|---|
short | 2.26s | 2.51s |
med | 2.28s | 2.54s |
long | 3.07s | 3.11s |

EUC-KR

length | master | patch |
---|---|---|
short | 2.29s | 2.37s |
med | 2.29s | 2.36s |
long | 1.75s | 1.71s |

With UTF-8, the patch is 11-12% slower on short and medium strings, and about the same on long strings. With EUC-KR, the patch is about 3% slower on short and medium strings, and 2-3% faster on long strings. It seems the worst case is not that bad, and could be optimized, as Heikki said.

-John Naylor
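
As a cross-check of the percentages quoted in the message, the following Python sketch recomputes the relative change of the patch against master from the reported averages. The timings are copied from the message; the names `timings`, `relative_change`, and `summarize` are illustrative only and are not part of the patch or of the attached script.

```python
# Average timings in seconds (5 runs each), copied from the message above.
# Each entry is (master, patch).
timings = {
    "UTF-8": {"short": (2.26, 2.51), "med": (2.28, 2.54), "long": (3.07, 3.11)},
    "EUC-KR": {"short": (2.29, 2.37), "med": (2.29, 2.36), "long": (1.75, 1.71)},
}


def relative_change(master: float, patch: float) -> float:
    """Percent change of patch relative to master; positive means slower."""
    return (patch - master) / master * 100.0


def summarize(data):
    """Return {encoding: {length: percent change rounded to one decimal}}."""
    return {
        encoding: {
            length: round(relative_change(master, patch), 1)
            for length, (master, patch) in rows.items()
        }
        for encoding, rows in data.items()
    }


if __name__ == "__main__":
    for encoding, rows in summarize(timings).items():
        for length, pct in rows.items():
            print(f"{encoding:7s} {length:6s} {pct:+.1f}%")
```

Positive values mean the patched build took longer than master for that scenario; negative values mean it was faster.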