Thinking About Correlated Columns (again)

Поиск

Список

Период

Сортировка

От	Shaun Thomas
Тема	Thinking About Correlated Columns (again)
Дата	15 мая 2013 г. 15:31:58
Msg-id	5193AA62.2040700@optionshouse.com обсуждение исходный текст
Ответы	Re: Thinking About Correlated Columns (again) Re: Thinking About Correlated Columns (again) Re: Thinking About Correlated Columns (again)
Список	pgsql-performance

Дерево обсуждения

Hi!

This has been a pain point for quite a while. While we've had several
discussions in the area, it always seems to just kinda trail off and
eventually vanish every time it comes up.

A really basic example of how bad the planner is here:

CREATE TABLE foo AS
SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
   FROM generate_series(1, 1000000) a(id);

CREATE INDEX idx_foo_ab ON foo (col_a, col_b);

ANALYZE foo;

EXPLAIN ANALYZE
SELECT *
   FROM foo
  WHERE col_a = 50
    AND col_b = 50;

Index Scan using idx_foo_ab on foo  (cost=0.00..6.35 rows=1 width=12)
            (actual time=0.030..3.643 rows=1000 loops=1)
   Index Cond: ((col_a = 50) AND (col_b = 50))

Hey, look! The row estimate is off by a factor of 1000. This particular
case doesn't suffer terribly from the mis-estimation, but others do.
Boy, do they ever.

Since there really is no fix for this aside from completely rewriting
the query or horribly misusing CTEs (to force sequence scans instead of
bloated nested loops, for example), I recently tried to use some
existing ALTER TABLE syntax:

ALTER TABLE foo ALTER col_a SET (n_distinct = 1);
ALTER TABLE foo ALTER col_b SET (n_distinct = 1);

The new explain output:

Index Scan using idx_foo_ab on foo  (cost=0.00..9.37 rows=2 width=12)
            (actual time=0.013..0.871 rows=1000 loops=1)
   Index Cond: ((col_a = 50) AND (col_b = 50))

Well... 2 is closer to 1000 than 1 was, but it's pretty clear the
histogram is still the basis for the cost estimates. I'm curious what
benefit overriding the n_distinct pg_stats field actually gives us in
practice.

I'm worried that without an easy answer for cases like this, more DBAs
will abuse optimization fences to get what they want and we'll end up in
a scenario that's actually worse than query hints. Theoretically, query
hints can be deprecated or have GUCs to remove their influence, but CTEs
are forever, or until the next code refactor.

I've seen conversations on this since at least 2005. There were even
proposed patches every once in a while, but never any consensus. Anyone
care to comment?

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email

В списке pgsql-performance по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Thinking About Correlated Columns (again)