Обсуждение: intarray GiST index gets wrong answers for '{}' <@ anything

Поиск

Список

Период

Сортировка

intarray GiST index gets wrong answers for '{}' <@ anything

От

Tom Lane

Дата:

06 августа 2019 г., 17:55:41

While looking at the pending patch for faster GIN index searches
on no-key queries, I was motivated to improve contrib/intarray's
regression test to exercise the GIN_SEARCH_MODE_ALL case, because
it didn't.  And then I thought well, let's try to bring the code
coverage of _int_gin.c up to something respectable, which led me
to the regression test additions shown in the attached.  And I
was astonished to observe that the GiST index cases mostly got
the wrong answer for the <@ query.  Sometimes they got the right
answer, but mostly not.  After some digging I saw that the problem
was that there are a number of empty arrays ('{}') in the data,
and those should surely all match the WHERE a <@ '{73,23,20}'
condition, but the GiST opclasses were not reliably finding them.

The reason appears to be that the condition for descending through a
non-leaf index key for the RTContainedBy case is incorrectly optimistic:
it supposes that we only need to descend into subtrees whose union key
overlaps the query array.  But this does not guarantee to find subtrees
that contain empty-array entries.  Worse, such entries could be anywhere
in the tree, and because of the way that the insertion penalty is
calculated, they probably are.  (We will compute a zero penalty to add
an empty array item to any subtree.)  The reason it sometimes works
seems to be that GiST randomizes its insertion decisions when there are
equal penalties (cf gistchoose()), and sometimes by luck it puts all
of the empty-array entries into subtrees that the existing rule will
search.

So as far as I can see, we have little choice but to lobotomize the
RTContainedBy case and force a whole-index search.  This applies to
both the gist__int_ops and gist__intbig_ops opclasses.  This is
pretty awful for any applications that are depending on such queries
to be fast, but it's hard to argue with "it gets the wrong answer,
and not even reproducibly so".

In the future we might think about removing <@ from these opclasses,
or making a non-backward-compatible change to segregate empty arrays
from everything else in the index.  But neither answer seems very
back-patchable, and I'm not really sure I want to put so much work
into a second-class-citizen contrib module anyway.

Comments?

            regards, tom lane

diff --git a/contrib/intarray/_int_gist.c b/contrib/intarray/_int_gist.c
index 13dd7ac..e5a8011 100644
--- a/contrib/intarray/_int_gist.c
+++ b/contrib/intarray/_int_gist.c
@@ -96,8 +96,13 @@ g_int_consistent(PG_FUNCTION_ARGS)
                 retval = inner_int_contains(query,
                                             (ArrayType *) DatumGetPointer(entry->key));
             else
-                retval = inner_int_overlap((ArrayType *) DatumGetPointer(entry->key),
-                                           query);
+            {
+                /*
+                 * Unfortunately, because empty arrays could be anywhere in
+                 * the index, we must search the whole tree.
+                 */
+                retval = true;
+            }
             break;
         default:
             retval = false;
diff --git a/contrib/intarray/_intbig_gist.c b/contrib/intarray/_intbig_gist.c
index 8d3b3f5..2a20abe 100644
--- a/contrib/intarray/_intbig_gist.c
+++ b/contrib/intarray/_intbig_gist.c
@@ -567,7 +567,13 @@ g_intbig_consistent(PG_FUNCTION_ARGS)
                 }
             }
             else
-                retval = _intbig_overlap((GISTTYPE *) DatumGetPointer(entry->key), query);
+            {
+                /*
+                 * Unfortunately, because empty arrays could be anywhere in
+                 * the index, we must search the whole tree.
+                 */
+                retval = true;
+            }
             break;
         default:
             retval = false;
diff --git a/contrib/intarray/expected/_int.out b/contrib/intarray/expected/_int.out
index 105c951..c92a565 100644
--- a/contrib/intarray/expected/_int.out
+++ b/contrib/intarray/expected/_int.out
@@ -431,6 +431,18 @@ SELECT count(*) from test__int WHERE a @> '{20,23}';
     12
 (1 row)

+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+ count
+-------
+    10
+(1 row)
+
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
+ count
+-------
+     1
+(1 row)
+
 SELECT count(*) from test__int WHERE a @@ '50&68';
  count
 -------
@@ -449,6 +461,19 @@ SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
     21
 (1 row)

+SELECT count(*) from test__int WHERE a @@ '20 | !21';
+ count
+-------
+  6566
+(1 row)
+
+SELECT count(*) from test__int WHERE a @@ '!20 & !21';
+ count
+-------
+  6343
+(1 row)
+
+SET enable_seqscan = off;  -- not all of these would use index by default
 CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );
 SELECT count(*) from test__int WHERE a && '{23,50}';
  count
@@ -480,6 +505,18 @@ SELECT count(*) from test__int WHERE a @> '{20,23}';
     12
 (1 row)

+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+ count
+-------
+    10
+(1 row)
+
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
+ count
+-------
+     1
+(1 row)
+
 SELECT count(*) from test__int WHERE a @@ '50&68';
  count
 -------
@@ -498,6 +535,18 @@ SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
     21
 (1 row)

+SELECT count(*) from test__int WHERE a @@ '20 | !21';
+ count
+-------
+  6566
+(1 row)
+
+SELECT count(*) from test__int WHERE a @@ '!20 & !21';
+ count
+-------
+  6343
+(1 row)
+
 DROP INDEX text_idx;
 CREATE INDEX text_idx on test__int using gist ( a gist__intbig_ops );
 SELECT count(*) from test__int WHERE a && '{23,50}';
@@ -530,6 +579,18 @@ SELECT count(*) from test__int WHERE a @> '{20,23}';
     12
 (1 row)

+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+ count
+-------
+    10
+(1 row)
+
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
+ count
+-------
+     1
+(1 row)
+
 SELECT count(*) from test__int WHERE a @@ '50&68';
  count
 -------
@@ -548,6 +609,18 @@ SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
     21
 (1 row)

+SELECT count(*) from test__int WHERE a @@ '20 | !21';
+ count
+-------
+  6566
+(1 row)
+
+SELECT count(*) from test__int WHERE a @@ '!20 & !21';
+ count
+-------
+  6343
+(1 row)
+
 DROP INDEX text_idx;
 CREATE INDEX text_idx on test__int using gin ( a gin__int_ops );
 SELECT count(*) from test__int WHERE a && '{23,50}';
@@ -580,6 +653,18 @@ SELECT count(*) from test__int WHERE a @> '{20,23}';
     12
 (1 row)

+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+ count
+-------
+    10
+(1 row)
+
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
+ count
+-------
+     1
+(1 row)
+
 SELECT count(*) from test__int WHERE a @@ '50&68';
  count
 -------
@@ -598,3 +683,16 @@ SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
     21
 (1 row)

+SELECT count(*) from test__int WHERE a @@ '20 | !21';
+ count
+-------
+  6566
+(1 row)
+
+SELECT count(*) from test__int WHERE a @@ '!20 & !21';
+ count
+-------
+  6343
+(1 row)
+
+RESET enable_seqscan;
diff --git a/contrib/intarray/sql/_int.sql b/contrib/intarray/sql/_int.sql
index 40225c6..6ca7e3c 100644
--- a/contrib/intarray/sql/_int.sql
+++ b/contrib/intarray/sql/_int.sql
@@ -85,9 +85,15 @@ SELECT count(*) from test__int WHERE a @@ '23|50';
 SELECT count(*) from test__int WHERE a @> '{23,50}';
 SELECT count(*) from test__int WHERE a @@ '23&50';
 SELECT count(*) from test__int WHERE a @> '{20,23}';
+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
 SELECT count(*) from test__int WHERE a @@ '50&68';
 SELECT count(*) from test__int WHERE a @> '{20,23}' or a @> '{50,68}';
 SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
+SELECT count(*) from test__int WHERE a @@ '20 | !21';
+SELECT count(*) from test__int WHERE a @@ '!20 & !21';
+
+SET enable_seqscan = off;  -- not all of these would use index by default

 CREATE INDEX text_idx on test__int using gist ( a gist__int_ops );

@@ -96,9 +102,13 @@ SELECT count(*) from test__int WHERE a @@ '23|50';
 SELECT count(*) from test__int WHERE a @> '{23,50}';
 SELECT count(*) from test__int WHERE a @@ '23&50';
 SELECT count(*) from test__int WHERE a @> '{20,23}';
+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
 SELECT count(*) from test__int WHERE a @@ '50&68';
 SELECT count(*) from test__int WHERE a @> '{20,23}' or a @> '{50,68}';
 SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
+SELECT count(*) from test__int WHERE a @@ '20 | !21';
+SELECT count(*) from test__int WHERE a @@ '!20 & !21';

 DROP INDEX text_idx;
 CREATE INDEX text_idx on test__int using gist ( a gist__intbig_ops );
@@ -108,9 +118,13 @@ SELECT count(*) from test__int WHERE a @@ '23|50';
 SELECT count(*) from test__int WHERE a @> '{23,50}';
 SELECT count(*) from test__int WHERE a @@ '23&50';
 SELECT count(*) from test__int WHERE a @> '{20,23}';
+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
 SELECT count(*) from test__int WHERE a @@ '50&68';
 SELECT count(*) from test__int WHERE a @> '{20,23}' or a @> '{50,68}';
 SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
+SELECT count(*) from test__int WHERE a @@ '20 | !21';
+SELECT count(*) from test__int WHERE a @@ '!20 & !21';

 DROP INDEX text_idx;
 CREATE INDEX text_idx on test__int using gin ( a gin__int_ops );
@@ -120,6 +134,12 @@ SELECT count(*) from test__int WHERE a @@ '23|50';
 SELECT count(*) from test__int WHERE a @> '{23,50}';
 SELECT count(*) from test__int WHERE a @@ '23&50';
 SELECT count(*) from test__int WHERE a @> '{20,23}';
+SELECT count(*) from test__int WHERE a <@ '{73,23,20}';
+SELECT count(*) from test__int WHERE a = '{73,23,20}';
 SELECT count(*) from test__int WHERE a @@ '50&68';
 SELECT count(*) from test__int WHERE a @> '{20,23}' or a @> '{50,68}';
 SELECT count(*) from test__int WHERE a @@ '(20&23)|(50&68)';
+SELECT count(*) from test__int WHERE a @@ '20 | !21';
+SELECT count(*) from test__int WHERE a @@ '!20 & !21';
+
+RESET enable_seqscan;

Re: intarray GiST index gets wrong answers for '{}' <@ anything

От

Alexander Korotkov

Дата:

06 августа 2019 г., 18:18:41

Hi!

On Tue, Aug 6, 2019 at 8:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The reason appears to be that the condition for descending through a
> non-leaf index key for the RTContainedBy case is incorrectly optimistic:
> it supposes that we only need to descend into subtrees whose union key
> overlaps the query array.  But this does not guarantee to find subtrees
> that contain empty-array entries.  Worse, such entries could be anywhere
> in the tree, and because of the way that the insertion penalty is
> calculated, they probably are.  (We will compute a zero penalty to add
> an empty array item to any subtree.)  The reason it sometimes works
> seems to be that GiST randomizes its insertion decisions when there are
> equal penalties (cf gistchoose()), and sometimes by luck it puts all
> of the empty-array entries into subtrees that the existing rule will
> search.

Right, existing logic could work correctly, when dataset contains no
empty arrays.  But it clearly doesn't handle empty arrays.

> So as far as I can see, we have little choice but to lobotomize the
> RTContainedBy case and force a whole-index search.  This applies to
> both the gist__int_ops and gist__intbig_ops opclasses.  This is
> pretty awful for any applications that are depending on such queries
> to be fast, but it's hard to argue with "it gets the wrong answer,
> and not even reproducibly so".

+1 for pushing this

> In the future we might think about removing <@ from these opclasses,
> or making a non-backward-compatible change to segregate empty arrays
> from everything else in the index.  But neither answer seems very
> back-patchable, and I'm not really sure I want to put so much work
> into a second-class-citizen contrib module anyway.

+1 for removing <@ from opclasses.  Trying to segregate empty arrays
looks like invention of new opclass rather than bugfix for current
one.  One, who is interested in this piece of work, can implement this
new opclass.

Users, who likes existing behavior of handling <@ operator in intarray
opclasses, may be advised to rewrite their queries as following.

"col <@ const" => "col <@ const AND col && const"

New queries would have opclass support and handle non-empty arrays in
the same way.  It will be slightly slower because of evaluation of two
operators instead of one.  But this doesn't seem critical.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: intarray GiST index gets wrong answers for '{}' <@ anything

От

Tom Lane

Дата:

06 августа 2019 г., 18:42:46

Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> Users, who likes existing behavior of handling <@ operator in intarray
> opclasses, may be advised to rewrite their queries as following.

> "col <@ const" => "col <@ const AND col && const"

Oh, that's a good suggestion --- it will work, and work reasonably
well, with either unpatched or patched intarray code; and also with
some future version that doesn't consider <@ indexable at all.

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: intarray GiST index gets wrong answers for '{}' <@ anything

intarray GiST index gets wrong answers for '{}' <@ anything

Re: intarray GiST index gets wrong answers for '{}' <@ anything

Re: intarray GiST index gets wrong answers for '{}' <@ anything