Re: Do we want a hashset type?
From:        Tomas Vondra
Subject:     Re: Do we want a hashset type?
Date:
Msg-id:      257197c1-4d28-a8c1-c311-510cbc862e03@enterprisedb.com
In reply to: Re: Do we want a hashset type? ("Joel Jacobson" <joel@compiler.org>)
Responses:   Re: Do we want a hashset type? ("Joel Jacobson" <joel@compiler.org>)
List:        pgsql-hackers
On 6/1/23 09:14, Joel Jacobson wrote:
> On Thu, Jun 1, 2023, at 09:02, Joel Jacobson wrote:
>> Here is an example using a real anonymised social network.
>
> I realised the "found" column is not necessary in this particular example,
> since we only care about the friends at the exact depth level. Simplified query:
>
> CREATE OR REPLACE VIEW friends_of_friends AS
> WITH RECURSIVE friends_of_friends AS (
>     SELECT
>         ARRAY[5867::bigint] AS current,
>         0 AS depth
>     UNION ALL
>     SELECT
>         new_current,
>         friends_of_friends.depth + 1
>     FROM
>         friends_of_friends
>     CROSS JOIN LATERAL (
>         SELECT
>             array_agg(DISTINCT edges.to_node) AS new_current
>         FROM
>             edges
>         WHERE
>             from_node = ANY(friends_of_friends.current)
>     ) q
>     WHERE
>         friends_of_friends.depth < 3
> )
> SELECT
>     depth,
>     coalesce(array_length(current, 1), 0)
> FROM
>     friends_of_friends
> WHERE
>     depth = 3;
>
> -- PostgreSQL 15.2:
>
> EXPLAIN ANALYZE SELECT * FROM friends_of_friends;
>                                                                      QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>  CTE Scan on friends_of_friends  (cost=2687.88..2688.58 rows=1 width=8) (actual time=2076.362..2076.454 rows=1 loops=1)
>    Filter: (depth = 3)
>    Rows Removed by Filter: 3
>    CTE friends_of_friends
>      ->  Recursive Union  (cost=0.00..2687.88 rows=31 width=36) (actual time=0.008..2075.073 rows=4 loops=1)
>            ->  Result  (cost=0.00..0.01 rows=1 width=36) (actual time=0.002..0.002 rows=1 loops=1)
>            ->  Subquery Scan on "*SELECT* 2"  (cost=89.44..268.75 rows=3 width=36) (actual time=518.613..518.622 rows=1 loops=4)
>                  ->  Nested Loop  (cost=89.44..268.64 rows=3 width=36) (actual time=515.523..515.523 rows=1 loops=4)
>                        ->  WorkTable Scan on friends_of_friends friends_of_friends_1  (cost=0.00..0.22 rows=3 width=36) (actual time=0.001..0.001 rows=1 loops=4)
>                              Filter: (depth < 3)
>                              Rows Removed by Filter: 0
>                        ->  Aggregate  (cost=89.44..89.45 rows=1 width=32) (actual time=687.356..687.356 rows=1 loops=3)
>                              ->  Index Only Scan using edges_pkey on edges  (cost=0.56..83.96 rows=2191 width=4) (actual time=0.139..290.996 rows=3486910 loops=3)
>                                    Index Cond: (from_node = ANY (friends_of_friends_1.current))
>                                    Heap Fetches: 0
>  Planning Time: 0.557 ms
>  Execution Time: 2076.990 ms
> (17 rows)

I've been thinking about this a bit on the way back from pgcon. Per CPU
profile it seems most of the job is actually spent on qsort, calculating
the array_agg(distinct) bit. We could replace this part by a hash-based
set aggregate, which would be faster, but I doubt it would yield a
massive improvement that'd change the fundamental cost.

I forgot I did something like that a couple years back, implementing a
count_distinct() aggregate that was meant as a faster alternative to
count(distinct). And then I mostly abandoned it because the built-in
sort-based approach improved significantly - it was still slower in some
cases, but the gap got small enough.

Anyway, I hacked together a trivial set backed by an open addressing
hash table:

    https://github.com/tvondra/hashset
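(As a rough illustration of how such an open-addressing table works,
here is a minimal, self-contained linear-probing set of int64 values.
This is a sketch of the technique only, not code from the extension;
the function names merely mirror the extension's SQL-level API, and
the real type stores its table in a varlena and resizes as it fills.)

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /*
     * Minimal open-addressing (linear probing) set of int64 keys.
     * Illustrative only: no resizing and no NULL handling, so the
     * caller must keep the table below 100% full (a real
     * implementation grows it at some fill factor).
     */
    typedef struct
    {
        int64_t    *keys;
        char       *used;       /* 1 if the slot is occupied */
        size_t      nslots;     /* must be a power of two */
        size_t      nitems;
    } hashset_t;

    static hashset_t *
    hashset_init(size_t nslots)
    {
        hashset_t  *set = malloc(sizeof(hashset_t));

        set->nslots = nslots;
        set->nitems = 0;
        set->keys = calloc(nslots, sizeof(int64_t));
        set->used = calloc(nslots, 1);
        return set;
    }

    static size_t
    hash_int64(int64_t key, size_t nslots)
    {
        /* Fibonacci hashing; any decent integer hash would do. */
        return (size_t) (((uint64_t) key * 11400714819323198485ULL)
                         & (nslots - 1));
    }

    static bool
    hashset_contains(const hashset_t *set, int64_t key)
    {
        size_t      pos = hash_int64(key, set->nslots);

        /* probe until we hit the key or an empty slot */
        while (set->used[pos])
        {
            if (set->keys[pos] == key)
                return true;
            pos = (pos + 1) & (set->nslots - 1);
        }
        return false;
    }

    static void
    hashset_add(hashset_t *set, int64_t key)
    {
        size_t      pos = hash_int64(key, set->nslots);

        while (set->used[pos])
        {
            if (set->keys[pos] == key)
                return;         /* already present, sets ignore dupes */
            pos = (pos + 1) & (set->nslots - 1);
        }
        set->used[pos] = 1;
        set->keys[pos] = key;
        set->nitems++;
    }

    int
    main(void)
    {
        hashset_t  *set = hashset_init(1024);

        hashset_add(set, 5867);
        hashset_add(set, 5867);     /* duplicate, ignored */
        return !(hashset_contains(set, 5867) && set->nitems == 1);
    }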
It's super-basic, providing just a bare minimum of functions, but
hopefully good enough for experiments:

- hashset data type - hash table in a varlena
- hashset_init - create a new hashset
- hashset_add / hashset_contains - add a value / check membership
- hashset_merge - merge two hashsets
- hashset_count - count elements
- hashset_to_array - return the stored values as an array
- hashset(int) aggregate

This allows rewriting the recursive query like this:

WITH RECURSIVE friends_of_friends AS (
    SELECT
        ARRAY[5867::bigint] AS current,
        0 AS depth
    UNION ALL
    SELECT
        new_current,
        friends_of_friends.depth + 1
    FROM
        friends_of_friends
    CROSS JOIN LATERAL (
        SELECT
            hashset_to_array(hashset(edges.to_node)) AS new_current
        FROM
            edges
        WHERE
            from_node = ANY(friends_of_friends.current)
    ) q
    WHERE
        friends_of_friends.depth < 3
)
SELECT
    depth,
    coalesce(array_length(current, 1), 0)
FROM
    friends_of_friends
WHERE
    depth = 3;

On my laptop this cuts the timing roughly in half - which is nice, but
as I said, I don't think it's a fundamental speedup. The aggregate can
also be parallelized, but I don't think that changes much. Furthermore,
this has a number of drawbacks too - e.g. it can't spill data to disk,
which might be an issue on more complex queries / larger data sets.

FWIW I wonder how representative this query is, considering it returns
~1M node IDs, i.e. about 10% of the whole set of node IDs. Seems a bit
on the high side.

I've also experimented with doing stuff from a plpgsql procedure (that's
what the non-aggregate functions are about). I saw this mostly as a way
to do stuff that'd be hard to do in the recursive CTE, but it has a lot
of additional execution overhead due to plpgsql. Maybe if we had some
smart trick to calculate adjacent nodes, we could have a SRF written in
C to get rid of the overhead (a skeleton of what that could look like
follows below).
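(To make the SRF idea concrete, here is a minimal ValuePerCall
set-returning function skeleton using PostgreSQL's funcapi.h machinery.
The function name adjacent_nodes is hypothetical, and the "neighbors"
it returns are placeholders; a real version would look up the actual
adjacency data in the first call and stash it in funcctx->user_fctx.)

    #include "postgres.h"
    #include "fmgr.h"
    #include "funcapi.h"

    PG_MODULE_MAGIC;

    /*
     * Hypothetical SQL declaration:
     *
     * CREATE FUNCTION adjacent_nodes(bigint) RETURNS SETOF bigint
     *     AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
     */
    PG_FUNCTION_INFO_V1(adjacent_nodes);

    Datum
    adjacent_nodes(PG_FUNCTION_ARGS)
    {
        FuncCallContext *funcctx;

        if (SRF_IS_FIRSTCALL())
        {
            MemoryContext oldcontext;

            funcctx = SRF_FIRSTCALL_INIT();
            oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

            /*
             * A real implementation would load the neighbors of the
             * input node here (e.g. via SPI, or from some preloaded
             * structure) and keep them in funcctx->user_fctx. This
             * sketch just emits three made-up "neighbors".
             */
            funcctx->max_calls = 3;

            MemoryContextSwitchTo(oldcontext);
        }

        funcctx = SRF_PERCALL_SETUP();

        if (funcctx->call_cntr < funcctx->max_calls)
        {
            int64       node = PG_GETARG_INT64(0);

            /* placeholder: return node+1, node+2, node+3 */
            SRF_RETURN_NEXT(funcctx,
                            Int64GetDatum(node + funcctx->call_cntr + 1));
        }

        SRF_RETURN_DONE(funcctx);
    }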
Anyway, this leads me to the question of what graph databases do for
these queries, if they're faster in answering them (and by how much?).
I'm not very familiar with that stuff, but surely they do something
smart - precalculating the data, some special indexing, duplicating the
data in multiple places?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company