Re: Fuzzy substring searching with the pg_trgm extension

Поиск
Список
Период
Сортировка
От Artur Zakirov
Тема Re: Fuzzy substring searching with the pg_trgm extension
Дата
Msg-id 569CB46F.3060505@postgrespro.ru
обсуждение исходный текст
Ответ на Re: Fuzzy substring searching with the pg_trgm extension  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Ответы Re: Fuzzy substring searching with the pg_trgm extension  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Список pgsql-hackers
I have did some changes to the patch:
- changed version from 1.3 to 1.2
- added check_only parameter to the function
calc_substring_similarity(). If check_only == true then a
calc_substring_similarity() calculates similarity which greater than
pg_trgm.substring_limit variable. This behavior is used in the
substring_similarity_op() and it should increase performance of the
operator. In the substring_similarity() function check_only == false.

On 12.01.2016 02:31, Alvaro Herrera wrote:
> I gave a quick look through the patch and noticed a few minor things
> while trying to understand it.
>
> I think the test corpus isn't particularly interesting for how big it
> is.  I'd rather have (a) a small corpus (say 100 words) with which to do
> detailed regression testing, and (b) some larger document for more
> extensive testing.  I'm not sure (b) is actually necessary.
>
> Overall I think the new functions could stand a lot more commentary.

I added some comments to new functions iterate_substring_similarity()
and calc_substring_similarity().

>>
>> + #ifdef DIVUNION
>> + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) ((len1) + (len2) - (count)))
>> + #else
>> + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) (((len1) > (len2)) ? (len1) : (len2)))
>> + #endif
>
> This macro has a multiple evaluation gotcha.  Maybe we don't really care
> because of the limited applicability but I'd put a comment on top.

I added a comment to explain various similarity formula.

>
> BTW you don't need a ternary op just to say "? true : false" -- just
> make it
>     res = foo == bar;
> which immediately assigns the correct boolean value.  (This pattern
> occurs in many places both in the original and in your submitted code.
> I see no reason to perpetuate it.)
>

Fixed all ternary operators.

>
>> + Datum
>> + set_substring_limit(PG_FUNCTION_ARGS)
>> + {
>> +     float4        nlimit = PG_GETARG_FLOAT4(0);
>> +
>> +     if (nlimit < 0 || nlimit > 1.0)
>> +         elog(ERROR, "wrong limit, should be between 0 and 1");
>
> Should be an ereport().

Fixed. But an elog() function occurs in trgm_gin.c and trgm_gist.c also.
I'm not sure that it should be replaced with an ereport().

>
>
>> ! static void
>> ! protect_out_of_mem(int slen)
>> ! {
>> !     /*
>> !      * Guard against possible overflow in the palloc requests below.  (We
>> !      * don't worry about the additive constants, since palloc can detect
>> !      * requests that are a little above MaxAllocSize --- we just need to
>> !      * prevent integer overflow in the multiplications.)
>> !      */
>> !     if ((Size) (slen / 2) >= (MaxAllocSize / (sizeof(trgm) * 3)) ||
>> !         (Size) slen >= (MaxAllocSize / pg_database_encoding_max_length()))
>> !         ereport(ERROR,
>> !                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
>> !                  errmsg("out of memory")));
>> ! }
>
> I think the comment should now be above the function, not inside it.

Fixed.

>>
>> + /* Trigram with position */
>> + typedef struct
>> + {
>> +     trgm    t;
>> +     int        i;
>> + } ptrgm;
>
> This struct definition could stand a better name, and better member
> names.  Also, our policy (not always enforced) is to have these near the
> top of the file rather than in the middle.  If the struct is used in
> many places then IMO it's better that it go at the top; if it's used
> only in one function I'd say it's okay to have it just after that
> function.

Renamed this structure to pos_trgm and move it near the top since this
structure is used in a couple of functions. I renamed member names to
trg and index accordingly.

>
>> + /*
>> +  * Make array of positional trigrams from two trigram arrays. The first array
>> +  * is required substing which positions don't matter and replaced with -1.
>> +  * The second array is haystack where we search and have to store its
>> +  * positions.
>> +  */
>> + static ptrgm *
>> + make_positional_trgm(trgm *trg1, int len1, trgm *trg2, int len2)
>> + {
>> +     ptrgm *result;
>> +     int i, len = len1 + len2;
>> +
>> +     result = (ptrgm *) palloc(sizeof(ptrgm) * len);
>> +
>> +     for (i = 0; i < len1; i++)
>> +     {
>> +         memcpy(&result[i].t, &trg1[i], sizeof(trgm));
>> +         result[i].i = -1;
>> +     }
>> +
>> +     for (i = 0; i < len2; i++)
>> +     {
>> +         memcpy(&result[i + len1].t, &trg2[i], sizeof(trgm));
>> +         result[i + len1].i = i;
>> +     }
>> +
>> +     return result;
>> + }
>
> So the output trigram array is the concatenation of both input trigrams?
> That's not quite evident from the comment above the function ... I think
> some rephrasing there is in order.

Rephrased the comment.

>>
>> !     /* if inexact then len2 is equal to count */
>> !     if (inexact)
>> !         return CALCSML(count, len1, count);
>> !     else
>> !         return CALCSML(count, len1, len2);
>
> This is where a ternary op would make better code.
>     CALCSML(count, len1, inexact ? count : len2);
> but a better comment would actually explain why we do this, rather than
> just stating we do.

Fixed a code and a comment.

>
>
>> +      <row>
>> +
<entry><function>show_substring_limit()</function><indexterm><primary>show_substring_limit</primary></indexterm></entry>
>> +       <entry><type>real</type></entry>
>> +       <entry>
>> +        Returns the current similarity threshold used by the <literal><%</>
>> +        operator.  This sets the minimum substring similarity between
>> +        two phrases.
>> +       </entry>
>> +      </row>
>> +      <row>
>> +
<entry><function>set_substring_limit(real)</function><indexterm><primary>set_substring_limit</primary></indexterm></entry>
>> +       <entry><type>real</type></entry>
>> +       <entry>
>> +        Sets the current substring similarity threshold that is used by
>> +        the <literal><%</> operator.  The threshold must be between
>> +        0 and 1 (default is 0.6).  Returns the same value passed in.
>> +       </entry>
>
> I don't quite understand why aren't we using a custom GUC variable here.
> These already have SHOW and SET support ...
>

Added GUC variables:
- pg_trgm.limit
- pg_trgm.substring_limit
I added this variables to the documentation.
show_limit() and set_limit() functions work correctly and they are
marked as deprecated.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: exposing pg_controldata and pg_config as functions
Следующее
От: "Shulgin, Oleksandr"
Дата:
Сообщение: Re: More stable query plans via more predictable column statistics