Re: BUG #16993: Query Planner does not use index for EXISTS-query on a large table with all equal foreign key values
От | Tom Lane |
---|---|
Тема | Re: BUG #16993: Query Planner does not use index for EXISTS-query on a large table with all equal foreign key values |
Дата | |
Msg-id | 4035793.1620236861@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | BUG #16993: Query Planner does not use index for EXISTS-query on a large table with all equal foreign key values (PG Bug reporting form <noreply@postgresql.org>) |
Ответы |
Re: BUG #16993: Query Planner does not use index for EXISTS-query on a large table with all equal foreign key values
|
Список | pgsql-bugs |
PG Bug reporting form <noreply@postgresql.org> writes: > CREATE TABLE data_node (id BIGINT PRIMARY KEY); > CREATE TABLE data_entry (id BIGINT PRIMARY KEY, node_fk BIGINT NOT NULL, > FOREIGN KEY (node_fk) REFERENCES data_node (id)); > CREATE INDEX node_ix ON data_entry USING BTREE (node_fk); > INSERT INTO data_node (id) SELECT generate_series(1,10); > INSERT INTO data_entry (id, node_fk) SELECT s, 2 FROM > generate_series(1,10000000) s; > VACUUM ANALYZE data_node; > VACUUM ANALYZE data_entry; > ... > SELECT * FROM data_node > WHERE EXISTS ( > SELECT 1 FROM data_entry WHERE node_fk = data_node.id > ); I looked into this. The reason for the strange behavior is that, thanks to the extremely skewed data distribution in data_entry.node_fk, the indexscan you're hoping it would use is actually costed as being more expensive than a seqscan. That causes it to get discarded at the relation planning level. Later on, when we consider join plans, there's a special exception that understands that a semijoin with an inner indexscan that uses all the join quals is particularly efficient; which is why the desirable plan is correctly given a small cost, *if* we consider it at all. But if the indexscan was already thrown away, we won't. Of course, disabling the seqscan allows the indexscan to survive the scan-level path tournament. I'm not sure what a good fix for this would look like. We could imagine trying to mark indexscan paths that satisfy has_indexed_join_quals() at time of generation, and then giving them a free pass in add_path(). However (1) as written, has_indexed_join_quals() depends on the particular join being considered. That might be just an artifact of coding convenience, but I'm not sure. (2) add_path() is enough of a hot spot that giving it yet more things to worry about is fairly unattractive. (3) allowing more paths to survive the tournament slows things down overall. Tweaking the example shows that you need a *very* skewed data distribution to make this happen, so maybe we should write it off as a corner case that's not worth fixing. I'm for sure not eager to slow planning down overall to improve this case, so I don't much like the above proposal even if it could be made to work. A different idea is to reconsider the selectivity estimation rules with an eye to cutting the estimated cost of the indexscan. The thing that I'm wondering about here is that we're estimating that using eqsel() rather than eqjoinsel(), and it seems like the latter might give us a more useful value. AFAIR we never thought hard about that aspect when we invented parameterized plans. But for sure, changing that would be a research project with a lot of potential impact. A narrower fix would be to hack var_eq_non_const so that it doesn't assume that the comparison value must be one of the entries in the column. But it seems like whatever change we made in that line would be a very unprincipled hack, because what are you going to assume instead? Anyway, short answer is that this is more complicated than it looks, and there isn't any simple improvement we can make. regards, tom lane
В списке pgsql-bugs по дате отправления: