Thinking About Correlated Columns (again)
От | Shaun Thomas |
---|---|
Тема | Thinking About Correlated Columns (again) |
Дата | |
Msg-id | 5193AA62.2040700@optionshouse.com обсуждение исходный текст |
Ответы |
Re: Thinking About Correlated Columns (again)
Re: Thinking About Correlated Columns (again) Re: Thinking About Correlated Columns (again) |
Список | pgsql-performance |
Hi! This has been a pain point for quite a while. While we've had several discussions in the area, it always seems to just kinda trail off and eventually vanish every time it comes up. A really basic example of how bad the planner is here: CREATE TABLE foo AS SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b FROM generate_series(1, 1000000) a(id); CREATE INDEX idx_foo_ab ON foo (col_a, col_b); ANALYZE foo; EXPLAIN ANALYZE SELECT * FROM foo WHERE col_a = 50 AND col_b = 50; Index Scan using idx_foo_ab on foo (cost=0.00..6.35 rows=1 width=12) (actual time=0.030..3.643 rows=1000 loops=1) Index Cond: ((col_a = 50) AND (col_b = 50)) Hey, look! The row estimate is off by a factor of 1000. This particular case doesn't suffer terribly from the mis-estimation, but others do. Boy, do they ever. Since there really is no fix for this aside from completely rewriting the query or horribly misusing CTEs (to force sequence scans instead of bloated nested loops, for example), I recently tried to use some existing ALTER TABLE syntax: ALTER TABLE foo ALTER col_a SET (n_distinct = 1); ALTER TABLE foo ALTER col_b SET (n_distinct = 1); The new explain output: Index Scan using idx_foo_ab on foo (cost=0.00..9.37 rows=2 width=12) (actual time=0.013..0.871 rows=1000 loops=1) Index Cond: ((col_a = 50) AND (col_b = 50)) Well... 2 is closer to 1000 than 1 was, but it's pretty clear the histogram is still the basis for the cost estimates. I'm curious what benefit overriding the n_distinct pg_stats field actually gives us in practice. I'm worried that without an easy answer for cases like this, more DBAs will abuse optimization fences to get what they want and we'll end up in a scenario that's actually worse than query hints. Theoretically, query hints can be deprecated or have GUCs to remove their influence, but CTEs are forever, or until the next code refactor. I've seen conversations on this since at least 2005. There were even proposed patches every once in a while, but never any consensus. Anyone care to comment? -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 sthomas@optionshouse.com ______________________________________________ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
В списке pgsql-performance по дате отправления: