Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING
От | Heikki Linnakangas |
---|---|
Тема | Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING |
Дата | |
Msg-id | 53FB4B0F.9080502@vmware.com обсуждение исходный текст |
Ответ на | Re: Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING (Jeff Davis <pgsql@j-davis.com>) |
Ответы |
Re: Allow multi-byte characters as escape in SIMILAR TO
and SUBSTRING
|
Список | pgsql-hackers |
On 07/12/2014 05:16 AM, Jeff Davis wrote: > On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote: >> Jeff Davis <pgsql@j-davis.com> writes: >>> Attached is a small patch to $SUBJECT. >>> In master, only single-byte characters are allowed as an escape. Of >>> course, with the patch it must still be a single character, but it may >>> be multi-byte. >> >> I'm concerned about the performance cost of this patch. Have you done >> any measurements about what kind of overhead you are putting on the >> inner loop of similar_escape? > > I didn't consider this very performance critical, because this is > looping through the pattern, which I wouldn't expect to be a long > string. On my machine using en_US.UTF-8, the difference is imperceptible > for a SIMILAR TO ... ESCAPE query. > > I was able to see about a 2% increase in runtime when using the > similar_escape function directly. I made a 10M tuple table and did: > > explain analyze > select > similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t; > > which was the worst reasonable case I could think of. Actually, that gets optimized to a constant in the planner: postgres=# explain verbose select similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#') from t; QUERY PLAN -------------------------------------------------------------------------------- ---------- Seq Scan on public.t (cost=0.00..144247.85 rows=9999985 width=0) Output: '^(?:ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ )$'::text Planning time: 0.033 ms (3 rows) With a working test case: create table t (pattern text); insert into t select 'ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ' from generate_series(1, 1000000); vacuum t; explain (analyze) select similar_escape(pattern,'#') from t; your patch seems to be about 2x-3x as slow as unpatched master. So this needs some optimization. A couple of ideas: 1. If the escape string is in fact a single-byte character, you can proceed with the loop just as it is today, without the pg_mblen calls. 2. Since pg_mblen() will always return an integer between 1-6, it would probably be faster to replace the memcpy() and memcmp() calls with simple for-loops iterating byte-by-byte. In very brief testing, with the 1. change above, the performance with this patch is back to what it's without the patch. See attached. - Heikki
Вложения
В списке pgsql-hackers по дате отправления: