Обсуждение: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

Поиск
Список
Период
Сортировка

pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Heikki Linnakangas
Дата:
Allow opclasses to provide tri-valued GIN consistent functions.

With the GIN "fast scan" feature, GIN can skip items without fetching all
the keys for them, if it can prove that they don't match regardless of
those keys. So far, it has done the proving by calling the boolean
consistent function with all combinations of TRUE/FALSE for the unfetched
keys, but since that's O(n^2), it becomes unfeasible with more than a few
keys. We can avoid calling consistent with all the combinations, if we can
tell the operator class implementation directly which keys are unknown.

This commit includes a triConsistent function for the built-in array and
tsvector opclasses.

Alexander Korotkov, with some changes by me.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/c5608ea26a1f51998ad3cf987c3f0bda643c87a8

Modified Files
--------------
doc/src/sgml/gin.sgml                    |   54 ++++++++++++++--
doc/src/sgml/xindex.sgml                 |   13 +++-
src/backend/access/gin/ginarrayproc.c    |   84 ++++++++++++++++++++++++
src/backend/access/gin/ginlogic.c        |   68 +++++++++++++++++--
src/backend/access/gin/ginutil.c         |   28 +++++++-
src/backend/utils/adt/tsginidx.c         |  104 +++++++++++++++++++++++++++++-
src/include/access/gin.h                 |   18 +++++-
src/include/access/gin_private.h         |   15 +----
src/include/catalog/catversion.h         |    2 +-
src/include/catalog/pg_am.h              |    2 +-
src/include/catalog/pg_amproc.h          |   31 +++++++++
src/include/catalog/pg_proc.h            |    4 ++
src/include/tsearch/ts_utils.h           |    1 +
src/include/utils/builtins.h             |    1 +
src/test/regress/expected/opr_sanity.out |   77 +++++++++++-----------
src/test/regress/sql/opr_sanity.sql      |   70 +++++++++++---------
16 files changed, 469 insertions(+), 103 deletions(-)


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Thom Brown
Дата:
On 12 March 2014 15:52, Heikki Linnakangas <heikki.linnakangas@iki.fi> wrote:
> Allow opclasses to provide tri-valued GIN consistent functions.
>
> With the GIN "fast scan" feature, GIN can skip items without fetching all
> the keys for them, if it can prove that they don't match regardless of
> those keys. So far, it has done the proving by calling the boolean
> consistent function with all combinations of TRUE/FALSE for the unfetched
> keys, but since that's O(n^2), it becomes unfeasible with more than a few
> keys. We can avoid calling consistent with all the combinations, if we can
> tell the operator class implementation directly which keys are unknown.
>
> This commit includes a triConsistent function for the built-in array and
> tsvector opclasses.
>
> Alexander Korotkov, with some changes by me.

Patch to fix a couple typos attached.

--
Thom

Вложения

Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Heikki Linnakangas
Дата:
On 03/13/2014 02:55 PM, Thom Brown wrote:
> Patch to fix a couple typos attached.

Thanks, fixed.

- Heikki


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Andres Freund
Дата:
Hi,

On 2014-03-12 15:52:58 +0000, Heikki Linnakangas wrote:
> Allow opclasses to provide tri-valued GIN consistent functions.

clang warns me about a mistake here:
In file included from /home/andres/src/postgresql/src/backend/access/gin/ginxlog.c:16:
In file included from /home/andres/src/postgresql/src/include/access/gin_private.h:14:
/home/andres/src/postgresql/src/include/access/gin.h:57:3: warning: no previous extern declaration for
      non-static variable 'GinLogicValueEnum'
      [-Wmissing-variable-declarations]
} GinLogicValueEnum;

I think the GinLogicValueEnum is supposed to be an enum's name, not a
variable name, right?

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Tom Lane
Дата:
Andres Freund <andres@2ndquadrant.com> writes:
> I think the GinLogicValueEnum is supposed to be an enum's name, not a
> variable name, right?

I think the whole thing is too cute by half.  Why isn't it just

typedef enum GinLogicValue
{
    GIN_FALSE = 0,           /* item is present / matches */
    GIN_TRUE = 1,            /* item is not present / does not match */
    GIN_MAYBE = 2            /* don't know if item is present / don't know if
                              * matches */
} GinLogicValue;

instead of thinking that we are smarter than the compiler about how
to store the enum?  For that matter, those explicit specifications of
the enum tag numeric values seem unnecessary and bad style IMV.

            regards, tom lane


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Andres Freund
Дата:
On 2014-03-21 17:37:35 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I think the GinLogicValueEnum is supposed to be an enum's name, not a
> > variable name, right?
>
> I think the whole thing is too cute by half.  Why isn't it just
>
> typedef enum GinLogicValue
> {
>     GIN_FALSE = 0,           /* item is present / matches */
>     GIN_TRUE = 1,            /* item is not present / does not match */
>     GIN_MAYBE = 2            /* don't know if item is present / don't know if
>                               * matches */
> } GinLogicValue;
>
> instead of thinking that we are smarter than the compiler about how
> to store the enum?

It seems to be a memory only type, so using anything but the raw enum
type seems odd. If it were ondisk alignment stuff could make it
advantageous, but this way...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Tom Lane
Дата:
I wrote:
> I think the whole thing is too cute by half.  Why isn't it just

> typedef enum GinLogicValue
> {
>     GIN_FALSE = 0,           /* item is present / matches */
>     GIN_TRUE = 1,            /* item is not present / does not match */
>     GIN_MAYBE = 2            /* don't know if item is present / don't know if
>                               * matches */
> } GinLogicValue;

... and now that I look at it a second time: *please* tell me the comments
on the first two values should be swapped.  The API isn't really "false
means a match", is it?

            regards, tom lane


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Heikki Linnakangas
Дата:
On 03/21/2014 10:58 PM, Tom Lane wrote:
> I wrote:
>> I think the whole thing is too cute by half.  Why isn't it just
>
>> typedef enum GinLogicValue
>> {
>>      GIN_FALSE = 0,           /* item is present / matches */
>>      GIN_TRUE = 1,            /* item is not present / does not match */
>>      GIN_MAYBE = 2            /* don't know if item is present / don't know if
>>                                * matches */
>> } GinLogicValue;
>
> ... and now that I look at it a second time: *please* tell me the comments
> on the first two values should be swapped.  The API isn't really "false
> means a match", is it?

*facepalm*. Yes, the comments are backwards...

- Heikki


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Heikki Linnakangas
Дата:
On 03/21/2014 10:47 PM, Andres Freund wrote:
> On 2014-03-21 17:37:35 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> I think the GinLogicValueEnum is supposed to be an enum's name, not a
>>> variable name, right?
>>
>> I think the whole thing is too cute by half.  Why isn't it just
>>
>> typedef enum GinLogicValue
>> {
>>      GIN_FALSE = 0,           /* item is present / matches */
>>      GIN_TRUE = 1,            /* item is not present / does not match */
>>      GIN_MAYBE = 2            /* don't know if item is present / don't know if
>>                                * matches */
>> } GinLogicValue;
>>
>> instead of thinking that we are smarter than the compiler about how
>> to store the enum?
>
> It seems to be a memory only type, so using anything but the raw enum
> type seems odd. If it were ondisk alignment stuff could make it
> advantageous, but this way...

That enum is used in the "check" arrays that are passed around GIN, with
one element per index term being searched. I'd really like to keep it
small, because the can be hundreds of elements long, and if the compiler
decides to store it as a 4- or 8-byte value instead of one byte, it
starts to add up.

Besides memory usage, it's convenient that an array of GinLogicValues is
compatible with an array of booleans, as long as there are no GIN_MAYBE
values in it.

I committed your original fix to make it an enum type, like it was
supposed to be. Thanks!

- Heikki


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Andrew Dunstan
Дата:
On 03/21/2014 06:43 PM, Heikki Linnakangas wrote:
> On 03/21/2014 10:47 PM, Andres Freund wrote:
>> On 2014-03-21 17:37:35 -0400, Tom Lane wrote:
>>> Andres Freund <andres@2ndquadrant.com> writes:
>>>> I think the GinLogicValueEnum is supposed to be an enum's name, not a
>>>> variable name, right?
>>>
>>> I think the whole thing is too cute by half.  Why isn't it just
>>>
>>> typedef enum GinLogicValue
>>> {
>>>      GIN_FALSE = 0,           /* item is present / matches */
>>>      GIN_TRUE = 1,            /* item is not present / does not
>>> match */
>>>      GIN_MAYBE = 2            /* don't know if item is present /
>>> don't know if
>>>                                * matches */
>>> } GinLogicValue;
>>>
>>> instead of thinking that we are smarter than the compiler about how
>>> to store the enum?
>>
>> It seems to be a memory only type, so using anything but the raw enum
>> type seems odd. If it were ondisk alignment stuff could make it
>> advantageous, but this way...
>
> That enum is used in the "check" arrays that are passed around GIN,
> with one element per index term being searched. I'd really like to
> keep it small, because the can be hundreds of elements long, and if
> the compiler decides to store it as a 4- or 8-byte value instead of
> one byte, it starts to add up.
>
> Besides memory usage, it's convenient that an array of GinLogicValues
> is compatible with an array of booleans, as long as there are no
> GIN_MAYBE values in it.
>
> I committed your original fix to make it an enum type, like it was
> supposed to be. Thanks!
>
>


I don't think there's any guarantee it's only going to be one byte. The
ANSI C standard says:

    Each enumerated type shall be compatible with char, a signed integer
    type, or an unsigned integer type. The choice of type is
    implementation-defined. (6.7.2.2 Enumerationspecifiers)



cheers

andrew



Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Heikki Linnakangas
Дата:
On 03/22/2014 03:33 PM, Andrew Dunstan wrote:
>
> On 03/21/2014 06:43 PM, Heikki Linnakangas wrote:
>> On 03/21/2014 10:47 PM, Andres Freund wrote:
>>> On 2014-03-21 17:37:35 -0400, Tom Lane wrote:
>>>> Andres Freund <andres@2ndquadrant.com> writes:
>>>>> I think the GinLogicValueEnum is supposed to be an enum's name, not a
>>>>> variable name, right?
>>>>
>>>> I think the whole thing is too cute by half.  Why isn't it just
>>>>
>>>> typedef enum GinLogicValue
>>>> {
>>>>       GIN_FALSE = 0,           /* item is present / matches */
>>>>       GIN_TRUE = 1,            /* item is not present / does not
>>>> match */
>>>>       GIN_MAYBE = 2            /* don't know if item is present /
>>>> don't know if
>>>>                                 * matches */
>>>> } GinLogicValue;
>>>>
>>>> instead of thinking that we are smarter than the compiler about how
>>>> to store the enum?
>>>
>>> It seems to be a memory only type, so using anything but the raw enum
>>> type seems odd. If it were ondisk alignment stuff could make it
>>> advantageous, but this way...
>>
>> That enum is used in the "check" arrays that are passed around GIN,
>> with one element per index term being searched. I'd really like to
>> keep it small, because the can be hundreds of elements long, and if
>> the compiler decides to store it as a 4- or 8-byte value instead of
>> one byte, it starts to add up.
>>
>> Besides memory usage, it's convenient that an array of GinLogicValues
>> is compatible with an array of booleans, as long as there are no
>> GIN_MAYBE values in it.
>>
>> I committed your original fix to make it an enum type, like it was
>> supposed to be. Thanks!
>
> I don't think there's any guarantee it's only going to be one byte. The
> ANSI C standard says:
>
>      Each enumerated type shall be compatible with char, a signed integer
>      type, or an unsigned integer type. The choice of type is
>      implementation-defined. (6.7.2.2 Enumerationspecifiers)

Here's the code from gin.h (after fixing the issues Andres and Tom
pointed out):

> enum GinLogicValueEnum
> {
>         GIN_FALSE = 0,                  /* item is not present / does not match */
>         GIN_TRUE = 1,                   /* item is present / matches */
>         GIN_MAYBE = 2                   /* don't know if item is present / don't know if
>                                                          * matches */
> };
>
> typedef char GinLogicValue;

The reason for the typedef is precisely that an enum is not guaranteed
to be one byte. Tom suggested getting rid of the typedef, but it's
needed to make sure it's stored as one byte.

I'll go add a comment to it, explaining why it's needed.

- Heikki


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Andrew Dunstan
Дата:
On 03/23/2014 09:22 AM, Heikki Linnakangas wrote:
>
> Here's the code from gin.h (after fixing the issues Andres and Tom
> pointed out):
>
>> enum GinLogicValueEnum
>> {
>>         GIN_FALSE = 0,                  /* item is not present / does
>> not match */
>>         GIN_TRUE = 1,                   /* item is present / matches */
>>         GIN_MAYBE = 2                   /* don't know if item is
>> present / don't know if
>>                                                          * matches */
>> };
>>
>> typedef char GinLogicValue;
>
> The reason for the typedef is precisely that an enum is not guaranteed
> to be one byte. Tom suggested getting rid of the typedef, but it's
> needed to make sure it's stored as one byte.
>
> I'll go add a comment to it, explaining why it's needed.
>
>

Ak, OK, I probably misread your previous email.

cheers

andrew



Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Greg Stark
Дата:

On Sun, Mar 23, 2014 at 1:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
The reason for the typedef is precisely that an enum is not guaranteed to be one byte. Tom suggested getting rid of the typedef, but it's needed to make sure it's stored as one byte.

I'll go add a comment to it, explaining why it's needed.

If we're not using the enum type perhaps it would be better to make it a bunch of #defines? The main advantage of enum types is that the debugger knows what values are legal and can decode them for you. 

That said, I guess there's not much reason to go and change things now either. 



--
greg

Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Heikki Linnakangas
Дата:
On 03/23/2014 06:07 PM, Greg Stark wrote:
> On Sun, Mar 23, 2014 at 1:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com
>> wrote:
>
>> The reason for the typedef is precisely that an enum is not guaranteed to
>> be one byte. Tom suggested getting rid of the typedef, but it's needed to
>> make sure it's stored as one byte.
>>
>> I'll go add a comment to it, explaining why it's needed.
>
> If we're not using the enum type perhaps it would be better to make it a
> bunch of #defines? The main advantage of enum types is that the debugger
> knows what values are legal and can decode them for you.

Yeah, I think you're right. Changed it to use #defines, renamed the
typedef to GinTernaryValue as that's more descriptive, and added a brief
comment mentioning that a GinTernaryValue is compatible with a boolean.

- Heikki


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Andres Freund
Дата:
On 2014-03-31 10:31:00 +0300, Heikki Linnakangas wrote:
> On 03/23/2014 06:07 PM, Greg Stark wrote:
> >On Sun, Mar 23, 2014 at 1:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com
> >>wrote:
> >
> >>The reason for the typedef is precisely that an enum is not guaranteed to
> >>be one byte. Tom suggested getting rid of the typedef, but it's needed to
> >>make sure it's stored as one byte.
> >>
> >>I'll go add a comment to it, explaining why it's needed.
> >
> >If we're not using the enum type perhaps it would be better to make it a
> >bunch of #defines? The main advantage of enum types is that the debugger
> >knows what values are legal and can decode them for you.
>
> Yeah, I think you're right. Changed it to use #defines, renamed the typedef
> to GinTernaryValue as that's more descriptive, and added a brief comment
> mentioning that a GinTernaryValue is compatible with a boolean.

It doesn't matter much in this case, but in general I find the reasoning
here less than convincing. For one, debuggers have a much easier time
providing the symbolic name (e.g. p WHATEVER_ENUM_VAL) than purely
defines. Also, it's perfectly possible to switch to the enum in
switch()es, and benefit from compiler warnings. Using defines prohibits
that...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: pgsql: Allow opclasses to provide tri-valued GIN consistent functions.

От
Heikki Linnakangas
Дата:
On 03/31/2014 12:25 PM, Andres Freund wrote:
> On 2014-03-31 10:31:00 +0300, Heikki Linnakangas wrote:
>> On 03/23/2014 06:07 PM, Greg Stark wrote:
>>> On Sun, Mar 23, 2014 at 1:22 PM, Heikki Linnakangas <hlinnakangas@vmware.com
>>>> wrote:
>>>
>>>> The reason for the typedef is precisely that an enum is not guaranteed to
>>>> be one byte. Tom suggested getting rid of the typedef, but it's needed to
>>>> make sure it's stored as one byte.
>>>>
>>>> I'll go add a comment to it, explaining why it's needed.
>>>
>>> If we're not using the enum type perhaps it would be better to make it a
>>> bunch of #defines? The main advantage of enum types is that the debugger
>>> knows what values are legal and can decode them for you.
>>
>> Yeah, I think you're right. Changed it to use #defines, renamed the typedef
>> to GinTernaryValue as that's more descriptive, and added a brief comment
>> mentioning that a GinTernaryValue is compatible with a boolean.
>
> It doesn't matter much in this case, but in general I find the reasoning
> here less than convincing. For one, debuggers have a much easier time
> providing the symbolic name (e.g. p WHATEVER_ENUM_VAL) than purely
> defines. Also, it's perfectly possible to switch to the enum in
> switch()es, and benefit from compiler warnings. Using defines prohibits
> that...

Yeah, I could've gone either way on this case. The enum wasn't actually
used anywhere, the typedef was, so debugger wouldn't have been able use
the enum values. Same for switch warnings, unless you explicitly cast to
the enum type. It seems pretty unlikely that you would remember to do
the cast, but forget to handle one of the three values. If there was any
chance that we would add more values to the enum in the future, a cast
in a switch would be useful to catch places where we forgot to handle
the new value, but we're not going to add more values to this enum.

- Heikki