[HACKERS] <> join selectivity estimate question

Поиск

Список

Период

Сортировка

От	Thomas Munro
Тема	[HACKERS] <> join selectivity estimate question
Дата	17 марта 2017 г. 08:54:46
Msg-id	CAEepm=270ze2hVxWkJw-5eKzc3AB4C9KpH3L2kih75R5pdSogg@mail.gmail.com обсуждение исходный текст
Ответы	Re: [HACKERS] <> join selectivity estimate question
Список	pgsql-hackers

Дерево обсуждения

Hi hackers,

While studying a regression reported[1] against my parallel hash join
patch, I noticed that we can also reach a good and a bad plan in
unpatched master.  One of the causes seems to be the estimated
selectivity of a semi-join with an extra <> filter qual.

Here are some times I measured for TPCH Q21 at scale 10 and work_mem
of 1GB.  That is a query with a large anti-join and a large semi-join.

  8 workers = 8.3s
  7 workers = 8.2s
  6 workers = 8.5s
  5 workers = 8.9s
  4 workers = 9.5s
  3 workers = 39.7s
  2 workers = 36.9s
  1 worker  = 38.2s
  0 workers = 47.9s

Please see the attached query plans showing the change in plan from
Hash Semi Join to Nested Loop Semi Join that happens only once we
reach 4 workers and the (partial) base relation size becomes smaller.
The interesting thing is that row estimate for the semi-join and
anti-join come out as 1 (I think this is 0 clamped to 1).

The same thing can be seen with a simple semi-join, if you happen to
have TPCH loaded.  Compare these two queries:

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
                  FROM lineitem l2
                 WHERE l1.l_orderkey = l2.l_orderkey);

 -> estimates 59986012 rows, actual rows 59,986,052 (scale 10 TPCH)

 SELECT *
   FROM lineitem l1
  WHERE EXISTS (SELECT *
                  FROM lineitem l2
                 WHERE l1.l_orderkey = l2.l_orderkey
                   AND l1.l_suppkey <> l2.l_suppkey);

 -> estimates 1 row, actual rows 57,842,090 (scale 10 TPCH)

Or for a standalone example:

  CREATE TABLE foo AS
  SELECT (generate_series(1, 1000000) / 4)::int AS a,
         (generate_series(1, 1000000) % 100)::int AS b;

  ANALYZE foo;

  SELECT *
    FROM foo f1
   WHERE EXISTS (SELECT *
                   FROM foo f2
                  WHERE f1.a = f2.a);

 -> estimates 1,000,000 rows

  SELECT *
    FROM foo f1
   WHERE EXISTS (SELECT *
                   FROM foo f2
                  WHERE f1.a = f2.a
                    AND f1.b <> f2.b);

 -> estimates 1 row

I'm trying to wrap my brain around the selectivity code, but am too
green to grok how this part of the planner that I haven't previously
focused on works so far, and I'd like to understand whether this is
expected behaviour so that I can figure out how to tackle the reported
regression with my patch.  What is happening here?

Thanks for reading.

[1] https://www.postgresql.org/message-id/CAEepm%3D3Og-7-b3WOkiT%3Dc%2B6y3eZ0VVSyb1K%2BSOvF17BO5KAt0A%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

[HACKERS] <> join selectivity estimate question

Вложения