Re: [PATCH] Support SK_SEARCHNULL / SK_SEARCHNOTNULL for heap-only scans
From: Jacob Champion
Subject: Re: [PATCH] Support SK_SEARCHNULL / SK_SEARCHNOTNULL for heap-only scans
Date:
Msg-id: 5a8f60f7-d2dc-d781-5709-cda5ab7a0064@timescale.com
Reply to: Re: [PATCH] Support SK_SEARCHNULL / SK_SEARCHNOTNULL for heap-only scans (Jacob Champion <jchampion@timescale.com>)
List: pgsql-hackers
On 7/19/23 16:44, Jacob Champion wrote:
> This patch pushes down any forced-null and not-null Vars as ScanKeys.
> It doesn't remove the redundant quals after turning them into
> ScanKeys, so it's needlessly inefficient, but there's still a decent
> speedup for some of the basic benchmarks in 0003.
>
> Plans look something like this:
>
>   # EXPLAIN SELECT * FROM t WHERE i IS NULL;
>                          QUERY PLAN
>   ------------------------------------------------------------
>    Seq Scan on t  (cost=0.00..1393.00 rows=49530 width=4)
>      Scan Cond: (i IS NULL)
>      Filter: (i IS NULL)
>   (3 rows)

Redundant clauses are now filtered out in v3, so the new plans look
more like what you'd expect:

  =# EXPLAIN SELECT * FROM table1 WHERE a IS NOT NULL AND b = 2;
                         QUERY PLAN
  ---------------------------------------------------------
   Seq Scan on table1  (cost=0.00..3344.00 rows=1 width=4)
     Scan Cond: (a IS NOT NULL)
     Filter: (b = 2)
  (3 rows)

> The non-nullable case worries me a bit because so many things imply
> IS NOT NULL. I think I need to do some sort of cost analysis using
> the null_frac statistics -- it probably only makes sense to push an
> implicit SK_SEARCHNOTNULL down to the AM layer if some fraction of
> rows would actually be filtered out -- but I'm not really sure how to
> choose a threshold.

v3 also uses the nullfrac and the expected cost of the filter clauses
to decide whether to promote a derived IS NOT NULL condition to a
ScanKey. (Explicit IS [NOT] NULL clauses are always promoted.) I'm
still not sure how to fine-tune the expected cost of the ScanKey, but
the number I've chosen for now (`0.1 * cpu_operator_cost`) doesn't
seem to regress my benchmarks, for whatever that's worth.

I recorded several of the changes to the regression EXPLAIN output,
but I've left a few broken because I'm not sure if they're useful or
if I should just disable the optimization.
For example:

  explain (analyze, costs off, summary off, timing off)
    select * from list_part where a = list_part_fn(1) + a;
                            QUERY PLAN
  ------------------------------------------------------------------
   Append (actual rows=0 loops=1)
     ->  Seq Scan on list_part1 list_part_1 (actual rows=0 loops=1)
  +        Scan Cond: (a IS NOT NULL)
           Filter: (a = (list_part_fn(1) + a))
           Rows Removed by Filter: 1
     ->  Seq Scan on list_part2 list_part_2 (actual rows=0 loops=1)
  +        Scan Cond: (a IS NOT NULL)
           Filter: (a = (list_part_fn(1) + a))
           Rows Removed by Filter: 1
     ->  Seq Scan on list_part3 list_part_3 (actual rows=0 loops=1)
  +        Scan Cond: (a IS NOT NULL)
           Filter: (a = (list_part_fn(1) + a))
           Rows Removed by Filter: 1
     ->  Seq Scan on list_part4 list_part_4 (actual rows=0 loops=1)
  +        Scan Cond: (a IS NOT NULL)
           Filter: (a = (list_part_fn(1) + a))
           Rows Removed by Filter: 1

These new conditions are due to a lack of statistics for the tiny
partitions; the filters are considered expensive enough that the
savings against a DEFAULT_UNK_SEL proportion of NULLs would
hypothetically be worth it. Best I can tell, this is a non-issue,
since autovacuum will fix the problem by the time the table grows to
the point where the pointless ScanKey would cause a slowdown. But it
sure _looks_ like a mistake, especially since these particular
partitions can't even contain NULL. Do I need to do something about
this short-lived case?

There's still other work to do -- for instance, I think my
modifications to the filter clauses have probably broken some
isolation levels until I fix up SeqRecheck(), and better benchmarks
would be good -- but I think this is ready for early CF feedback.

Thanks,
--Jacob
Attachments