Обсуждение: Multicolumn index scan efficiency
Hi.
-- 0. DISABLE SEQSCAN to ensure index is used
SET ENABLE_SEQSCAN=false;
-- 1. SETUP: Clean slate
DROP TABLE IF EXISTS application_specs;
-- 2. SCHEMA
CREATE TABLE application_specs (
namespace text NOT NULL,
application text NOT NULL,
version text NOT NULL,
latest boolean,
data text,
PRIMARY KEY (namespace, application, version) WITH (fillfactor = 90)
);
-- 3. GENERATE DATA: ~4.5 Million Rows Total
-- 150,000 Applications * ~30 versions each
INSERT INTO application_specs (namespace, application, version, latest, data)
SELECT
'default',
'test_application_pipeline_with__simulated_long_name_' || lpad(a::text, 6, '0'),
md5(random()::text || clock_timestamp()::text)::uuid::text,
(v = 30), -- Make the 30th version the 'latest' one
repeat('x', 100)
FROM generate_series(1, 150000) a
CROSS JOIN generate_series(1, 30) v;
-- 4. ANALYZE to ensure planner has accurate stats
VACUUM (ANALYZE, FREEZE) application_specs;
-- =========================================
-- THE TEST
-- Target a middle application: '...075000'
-- =========================================
---------------------------------------------------
-- TEST A: Original Row-wise Query (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default', 'test_application_pipeline_with__simulated_long_name_075000', '')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL);
---------------------------------------------------
-- TEST B: Equality Query (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE namespace = 'default'
AND application = 'test_application_pipeline_with__simulated_long_name_075000'
AND version > ''
AND (latest = true OR latest is NULL);
---------------------------------------------------
-- TEST C: (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) >= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL) AND version > '';
---------------------------------------------------
-- TEST D: (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) = ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL) AND version > '';
Since Friday I have been trying to diagnose the slowness of scanning the multicolumn index in Postgres. I figured out that multicolumns conditions like (a,b,c) > (x,y,z) with a,b and c being part of primary index work slow in postgresql 9.6 (original version in prod, I know it's old). I thought that it would be fast in 15 (one I had handy), but it was still slow. It was finally fast in 18. Note that all 3 are Google CloudSQL, but I believe those are pretty vanilla in terms of index scan internals.
I am wondering about 2 things:
1) Does anyone know which specific change / version made it fast?
2) What was the proper way to do a range index scan like WHERE (a,b,c) between (x1,y1,z1) and (x2,y2,z2) before the improvement.
Note that my tests can mostly be rewritten as equality at least for some columns (and this is what we'll do), but sometimes we do need a range scan like above, so understanding it would be important. Also I am curious :).
The full test will be towards the end, but the query I started from is
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default', 'test_application_pipeline_with__simulated_long_name_075000', '')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL);
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default', 'test_application_pipeline_with__simulated_long_name_075000', '')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL);
Plans:
--- 9.6
Index Scan using application_specs_pkey on application_specs (cost=0.56..6.33 rows=1 width=198) (actual rows=1 loops=1)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Buffers: shared hit=35442
Planning time: 0.153 ms
Execution time: 5049.123 ms
(7 rows)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Buffers: shared hit=35442
Planning time: 0.153 ms
Execution time: 5049.123 ms
(7 rows)
--- 15
Index Scan using application_specs_pkey on application_specs (cost=0.56..6.33 rows=1 width=206) (actual rows=1 loops=1)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Buffers: shared hit=45839
Planning Time: 0.171 ms
Execution Time: 8190.116 ms
(7 rows)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Buffers: shared hit=45839
Planning Time: 0.171 ms
Execution Time: 8190.116 ms
(7 rows)
--- 18
Index Scan using application_specs_pkey on application_specs (cost=0.56..6.33 rows=1 width=206) (actual rows=1.00 loops=1)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Index Searches: 1
Buffers: shared hit=35
Planning:
Buffers: shared hit=29
Planning Time: 0.243 ms
Execution Time: 0.155 ms
(10 rows)
Index Cond: ((ROW(namespace, application, version) > ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text, ''::text)) AND (ROW(namespace, application) <= ROW('default'::text, 'test_application_pipeline_with__simulated_long_name_075000'::text)))
Filter: (latest OR (latest IS NULL))
Rows Removed by Filter: 29
Index Searches: 1
Buffers: shared hit=35
Planning:
Buffers: shared hit=29
Planning Time: 0.243 ms
Execution Time: 0.155 ms
(10 rows)
So, it's index scan in all cases, but for some reason index scan is very inefficient in the old versions and it has to scan a huge portion of index to find a few rows.
--- Full test
-- 0. DISABLE SEQSCAN to ensure index is used
SET ENABLE_SEQSCAN=false;
-- 1. SETUP: Clean slate
DROP TABLE IF EXISTS application_specs;
-- 2. SCHEMA
CREATE TABLE application_specs (
namespace text NOT NULL,
application text NOT NULL,
version text NOT NULL,
latest boolean,
data text,
PRIMARY KEY (namespace, application, version) WITH (fillfactor = 90)
);
-- 3. GENERATE DATA: ~4.5 Million Rows Total
-- 150,000 Applications * ~30 versions each
INSERT INTO application_specs (namespace, application, version, latest, data)
SELECT
'default',
'test_application_pipeline_with__simulated_long_name_' || lpad(a::text, 6, '0'),
md5(random()::text || clock_timestamp()::text)::uuid::text,
(v = 30), -- Make the 30th version the 'latest' one
repeat('x', 100)
FROM generate_series(1, 150000) a
CROSS JOIN generate_series(1, 30) v;
-- 4. ANALYZE to ensure planner has accurate stats
VACUUM (ANALYZE, FREEZE) application_specs;
-- =========================================
-- THE TEST
-- Target a middle application: '...075000'
-- =========================================
---------------------------------------------------
-- TEST A: Original Row-wise Query (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default', 'test_application_pipeline_with__simulated_long_name_075000', '')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL);
---------------------------------------------------
-- TEST B: Equality Query (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE namespace = 'default'
AND application = 'test_application_pipeline_with__simulated_long_name_075000'
AND version > ''
AND (latest = true OR latest is NULL);
---------------------------------------------------
-- TEST C: (Should be slow)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) >= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (namespace, application) <= ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL) AND version > '';
---------------------------------------------------
-- TEST D: (Should be fast)
---------------------------------------------------
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application) = ('default', 'test_application_pipeline_with__simulated_long_name_075000')
AND (latest = true OR latest is NULL) AND version > '';
--- End of test
Best regards, Vitalii Tymchyshyn
On Sun, Nov 9, 2025 at 9:44 PM Vitalii Tymchyshyn <vit@tym.im> wrote: > I am wondering about 2 things: > 1) Does anyone know which specific change / version made it fast? > 2) What was the proper way to do a range index scan like WHERE (a,b,c) between (x1,y1,z1) and (x2,y2,z2) before the improvement. > Note that my tests can mostly be rewritten as equality at least for some columns (and this is what we'll do), but sometimeswe do need a range scan like above, so understanding it would be important. Also I am curious :). This improvement you're seeing here is down to work in commit bd3f59fd. The short version is that the way we used to decide when a condition like "WHERE (a,b,c) <= (x2,y2,z2)" was needlessly conservative. If there were many "a" values equal to x2, we'd have to scan the index until we got to the next distinct/non-equal "a" value -- without realizing that we're already past the point where there cannot possibly be any more matches. See the discussion on this thread which complained about the problem, particularly my response to the complaint: https://www.postgresql.org/message-id/flat/CAH2-WzmLREy6r68A6SEHXnstg01kNs1HiQtOvSO5cTvWuaducw%40mail.gmail.com#62e393ac8bbf06f0f73598ba2ceeab69 -- Peter Geoghegan
Thank you so much for both clarifying and fixing it!
In our case (FYI, this is from http://github.com/cdapio/cdap) a lot of users have just a single namespace, so it effectively means scanning till the end of the index.
We'll fix https://github.com/cdapio/cdap/blob/develop/cdap-data-fabric/src/main/java/io/cdap/cdap/spi/data/sql/PostgreSqlStructuredTable.java to detect equality scan prefixes and make corresponding SQL. That would fix it for all postgres versions.
Best regards, Vitalii Tymchyshyn
нд, 9 лист. 2025 р. о 20:21 Peter Geoghegan <pg@bowt.ie> пише:
On Sun, Nov 9, 2025 at 9:44 PM Vitalii Tymchyshyn <vit@tym.im> wrote:
> I am wondering about 2 things:
> 1) Does anyone know which specific change / version made it fast?
> 2) What was the proper way to do a range index scan like WHERE (a,b,c) between (x1,y1,z1) and (x2,y2,z2) before the improvement.
> Note that my tests can mostly be rewritten as equality at least for some columns (and this is what we'll do), but sometimes we do need a range scan like above, so understanding it would be important. Also I am curious :).
This improvement you're seeing here is down to work in commit
bd3f59fd. The short version is that the way we used to decide when a
condition like "WHERE (a,b,c) <= (x2,y2,z2)" was needlessly
conservative. If there were many "a" values equal to x2, we'd have to
scan the index until we got to the next distinct/non-equal "a" value
-- without realizing that we're already past the point where there
cannot possibly be any more matches.
See the discussion on this thread which complained about the problem,
particularly my response to the complaint:
https://www.postgresql.org/message-id/flat/CAH2-WzmLREy6r68A6SEHXnstg01kNs1HiQtOvSO5cTvWuaducw%40mail.gmail.com#62e393ac8bbf06f0f73598ba2ceeab69
--
Peter Geoghegan
On Mon, Nov 10, 2025 at 12:12 AM Vitalii Tymchyshyn <vit@tym.im> wrote: > Thank you so much for both clarifying and fixing it! FWIW the problem is limited to row compares/row constructor comparisons that are used to decide when to end the scan. Note in particular that row compares that decide where in the index (what leaf page) the scan should *begin* from were never affected -- only those that determine where the scan should end. In other words, for a forwards scan, > and >= row compares aren't affected (but < and <= row compares are). For backwards scans/with ORDER BY a DESC, b DESC, it's exactly the other way around (it's > and >= row compares that'll end the scan/that had this problem). My guess is that this issue wasn't noticed sooner because in practice a lot of users of row compares only use them to determine where each scan begins from, in the context of apply row compares to implement keyset pagination [1]. I think that it's typical to use an ORDER BY ... LIMIT, or a FETCH FIRST ... ROWS WITH TIES to limit the size of the result set on each individual query. It was a nasty and surprising issue, but it didn't actually come up all that often. After all, if you use a < or a <= condition to end each scan, the total number of rows that'll be returned each time is unpredictable -- and potentially very large. That isn't generally desirable with keyset pagination; what users usually do is have Postgres return a more or less uniform number of rows for each individual query that fetches the next portion of the "total result set". That's kinda the natural way to do it. [1] https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf -- Peter Geoghegan