Multicolumn index scan efficiency
| От | Vitalii Tymchyshyn |
|---|---|
| Тема | Multicolumn index scan efficiency |
| Дата | |
| Msg-id | CABWW-d1Eri1zibcix1WdsgnOPnb6O=M4gkVVoSteBiNmZxY+oQ@mail.gmail.com обсуждение исходный текст |
| Ответы |
Re: Multicolumn index scan efficiency
|
| Список | pgsql-performance |
Hi.
-- 0. DISABLE SEQSCAN to ensure index is used
SET ENABLE_SEQSCAN=false;
-- 1. SETUP: Clean slate
DROP TABLE IF EXISTS application_specs;
-- 2. SCHEMA
CREATE TABLE application_specs (
namespace text NOT NULL,
application text NOT NULL,
version text NOT NULL,
latest boolean,
data text,
PRIMARY KEY (namespace, application, version) WITH (fillfactor = 90)
);
-- 3. GENERATE DATA: ~4.5 Million Rows Total
-- 150,000 Applications * ~30 versions each
INSERT INTO application_specs (namespace, application, version, latest, data)
SELECT
'default',
'test_application_pipeline_with__simulated_long_name_' || lpad(a::text, 6, '0'),
md5(random()::text || clock_timestamp()::text)::uuid::text,
(v = 30), -- Make the 30th version the 'latest' one
repeat('x', 100)
FROM generate_series(1, 150000) a
CROSS JOIN generate_series(1, 30) v;
-- 4. ANALYZE to ensure planner has accurate stats
VACUUM (ANALYZE, FREEZE) application_specs;
-- =========================================
-- THE TEST
-- Target a middle application: '...075000'
-- =========================================
---------------------------------------------------
-- TEST A: Original Row-wise Query (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default', 'test_application_pipeline_with__simulated_long_name_075000', '')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL);
---------------------------------------------------
-- TEST B: Equality Query (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE namespace = 'default'
AND application = 'test_application_pipeline_with__simulated_long_name_075000'
AND version > ''
AND (latest = true OR latest is NULL);
---------------------------------------------------
-- TEST C: (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) >= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL) AND version > '';
---------------------------------------------------
-- TEST D: (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) = ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL) AND version > '';
Since Friday I have been trying to diagnose the slowness of scanning the multicolumn index in Postgres. I figured out that multicolumns conditions like (a,b,c) > (x,y,z) with a,b and c being part of primary index work slow in postgresql 9.6 (original version in prod, I know it's old). I thought that it would be fast in 15 (one I had handy), but it was still slow. It was finally fast in 18. Note that all 3 are Google CloudSQL, but I believe those are pretty vanilla in terms of index scan internals.
I am wondering about 2 things:
1) Does anyone know which specific change / version made it fast?
2) What was the proper way to do a range index scan like WHERE (a,b,c) between (x1,y1,z1) and (x2,y2,z2) before the improvement.
Note that my tests can mostly be rewritten as equality at least for some columns (and this is what we'll do), but sometimes we do need a range scan like above, so understanding it would be important. Also I am curious :).
The full test will be towards the end, but the query I started from is
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default', 'test_application_pipeline_with__simulated_long_name_075000', '')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL);
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default', 'test_application_pipeline_with__simulated_long_name_075000', '')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL);
Plans:
--- 9.6
Index Scan using application_specs_pkey on application_specs (cost=0.56..6.33 rows=1 width=198) (actual rows=1 loops=1)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Buffers: shared hit=35442
Planning time: 0.153 ms
Execution time: 5049.123 ms
(7 rows)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Buffers: shared hit=35442
Planning time: 0.153 ms
Execution time: 5049.123 ms
(7 rows)
--- 15
Index Scan using application_specs_pkey on application_specs (cost=0.56..6.33 rows=1 width=206) (actual rows=1 loops=1)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Buffers: shared hit=45839
Planning Time: 0.171 ms
Execution Time: 8190.116 ms
(7 rows)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Buffers: shared hit=45839
Planning Time: 0.171 ms
Execution Time: 8190.116 ms
(7 rows)
--- 18
Index Scan using application_specs_pkey on application_specs (cost=0.56..6.33 rows=1 width=206) (actual rows=1.00 loops=1)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Index Searches: 1
Buffers: shared hit=35
Planning:
Buffers: shared hit=29
Planning Time: 0.243 ms
Execution Time: 0.155 ms
(10 rows)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Index Searches: 1
Buffers: shared hit=35
Planning:
Buffers: shared hit=29
Planning Time: 0.243 ms
Execution Time: 0.155 ms
(10 rows)
So, it's index scan in all cases, but for some reason index scan is very inefficient in the old versions and it has to scan a huge portion of index to find a few rows.
--- Full test
-- 0. DISABLE SEQSCAN to ensure index is used
SET ENABLE_SEQSCAN=false;
-- 1. SETUP: Clean slate
DROP TABLE IF EXISTS application_specs;
-- 2. SCHEMA
CREATE TABLE application_specs (
namespace text NOT NULL,
application text NOT NULL,
version text NOT NULL,
latest boolean,
data text,
PRIMARY KEY (namespace, application, version) WITH (fillfactor = 90)
);
-- 3. GENERATE DATA: ~4.5 Million Rows Total
-- 150,000 Applications * ~30 versions each
INSERT INTO application_specs (namespace, application, version, latest, data)
SELECT
'default',
'test_application_pipeline_with__simulated_long_name_' || lpad(a::text, 6, '0'),
md5(random()::text || clock_timestamp()::text)::uuid::text,
(v = 30), -- Make the 30th version the 'latest' one
repeat('x', 100)
FROM generate_series(1, 150000) a
CROSS JOIN generate_series(1, 30) v;
-- 4. ANALYZE to ensure planner has accurate stats
VACUUM (ANALYZE, FREEZE) application_specs;
-- =========================================
-- THE TEST
-- Target a middle application: '...075000'
-- =========================================
---------------------------------------------------
-- TEST A: Original Row-wise Query (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default', 'test_application_pipeline_with__simulated_long_name_075000', '')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL);
---------------------------------------------------
-- TEST B: Equality Query (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE namespace = 'default'
AND application = 'test_application_pipeline_with__simulated_long_name_075000'
AND version > ''
AND (latest = true OR latest is NULL);
---------------------------------------------------
-- TEST C: (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) >= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL) AND version > '';
---------------------------------------------------
-- TEST D: (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) = ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL) AND version > '';
--- End of test
Best regards, Vitalii Tymchyshyn
В списке pgsql-performance по дате отправления: