Discussion: Asynchronous MergeAppend
Hello.
I'd like to make the MergeAppend node async-capable, like the Append node.
Currently, when the planner chooses a MergeAppend plan, asynchronous
execution is not possible. With the attached patches you can see plans like
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt WHERE b % 100 = 0 ORDER BY b, a;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Merge Append
   Sort Key: async_pt.b, async_pt.a
   ->  Async Foreign Scan on public.async_p1 async_pt_1
         Output: async_pt_1.a, async_pt_1.b, async_pt_1.c
         Remote SQL: SELECT a, b, c FROM public.base_tbl1 WHERE (((b % 100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
   ->  Async Foreign Scan on public.async_p2 async_pt_2
         Output: async_pt_2.a, async_pt_2.b, async_pt_2.c
         Remote SQL: SELECT a, b, c FROM public.base_tbl2 WHERE (((b % 100) = 0)) ORDER BY b ASC NULLS LAST, a ASC NULLS LAST
This can be quite profitable: in our test cases, async MergeAppend
execution on remote servers was up to two times faster. The code for
asynchronous execution in MergeAppend was mostly borrowed from the
Append node.
What differs significantly: ExecMergeAppendAsyncGetNext() must return a
tuple from the specified slot, as the subplan number determines the
tuple slot that data should be retrieved into. When a subplan is ready
to provide data, its result is cached in ms_asyncresults. Once we get a
tuple for the subplan specified in ExecMergeAppendAsyncGetNext(),
ExecMergeAppendAsyncRequest() returns true and the loop in
ExecMergeAppendAsyncGetNext() ends. We can fetch data only for subplans
which either don't have a cached result ready or have already returned
it to the upper node; this flag is stored in ms_has_asyncresults. As
data for a subplan can arrive either before or after the loop in
ExecMergeAppendAsyncRequest(), we check this flag twice in that
function.
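
To make this concrete, here's a rough sketch of how that loop could
look. This is not the patch's actual code: the helper names come from
the description above, the loop body is my assumption, and ms_slots is
the existing per-subplan slot array in MergeAppendState.

#include "postgres.h"

#include "nodes/execnodes.h"

/*
 * Sketch: fetch the next tuple for subplan 'mplan', which the merge
 * heap above MergeAppend asked for.
 */
static TupleTableSlot *
ExecMergeAppendAsyncGetNext(MergeAppendState *node, int mplan)
{
	/*
	 * Keep firing async requests and waiting for I/O until a result for
	 * the requested subplan is cached in ms_asyncresults;
	 * ExecMergeAppendAsyncRequest() returns true once that happens.
	 */
	while (!ExecMergeAppendAsyncRequest(node, mplan))
		ExecMergeAppendAsyncEventWait(node);

	/* The caller expects the tuple in this subplan's own slot. */
	return node->ms_slots[mplan];
}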
Unlike ExecAppendAsyncEventWait(), it seems that
ExecMergeAppendAsyncEventWait() doesn't need a timeout, as there's no
need to get a result from a synchronous subplan when a tuple from an
async one was explicitly requested.
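
In other words, the wait can simply block until a socket becomes ready.
A minimal sketch, assuming an ms_eventset field on MergeAppendState;
WaitEventSetWait(), EVENT_BUFFER_SIZE and WAIT_EVENT_APPEND_READY are
what the existing Append code uses:

	/*
	 * Inside ExecMergeAppendAsyncEventWait(): with no synchronous
	 * subplans to fall back to, timeout -1 ("wait forever") is fine --
	 * the caller explicitly asked for a tuple from an async subplan and
	 * can do nothing until one arrives.
	 */
	noccurred = WaitEventSetWait(node->ms_eventset,
								 -1,	/* no timeout */
								 occurred_event,
								 EVENT_BUFFER_SIZE,
								 WAIT_EVENT_APPEND_READY);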
We also had to fix postgres_fdw to avoid looking directly at Append
fields. Perhaps the accessors to Append fields look strange, but they
allow us to avoid some code duplication. I suppose duplication could be
reduced even further if we reworked the async Append implementation,
but so far I haven't tried to do this, to avoid a big diff from master.
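
For illustration only, such an accessor could dispatch on the
requestor's node type, as sketched below. The function name and the
ms_needrequest field are hypothetical; as_needrequest is the existing
AppendState field:

#include "postgres.h"

#include "nodes/execnodes.h"

/*
 * Hypothetical accessor: lets postgres_fdw ask "which subplans still
 * need an async request?" without knowing whether its requestor is an
 * Append or a MergeAppend.
 */
static Bitmapset *
ExecAsyncNeedRequest(PlanState *requestor)
{
	if (IsA(requestor, AppendState))
		return ((AppendState *) requestor)->as_needrequest;

	Assert(IsA(requestor, MergeAppendState));
	return ((MergeAppendState *) requestor)->ms_needrequest;	/* assumed field */
}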
Also, mark_async_capable() believes that the path corresponds to the
plan. This may not be true when create_[merge_]append_plan() inserts a
Sort node; in that case mark_async_capable() can treat the Sort plan
node as some other node type and crash, so there's a small fix for this.
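
A minimal sketch of the kind of guard that helps here, in the style of
the existing T_ForeignPath case in createplan.c; the exact shape of the
patch's fix may differ:

		case T_ForeignPath:
			{
				FdwRoutine *fdwroutine = path->parent->fdwroutine;

				/*
				 * create_[merge_]append_plan() may have inserted a Sort
				 * above the subplan, in which case 'plan' is not the
				 * ForeignScan this path describes; casting blindly
				 * would crash, so bail out instead.
				 */
				if (!IsA(plan, ForeignScan))
					return false;

				Assert(fdwroutine != NULL);
				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
					break;
				return false;
			}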
--
Best regards,
Alexander Pyhalov,
Postgres Professional
Hi! Thank you for your work on this subject! I think this is a very
useful optimization.

While looking through your code, I noticed some points that I think
should be taken into account. Firstly, I noticed only two tests
verifying the functionality of this feature, and I don't think that's
enough. Are you thinking about adding tests with queries involving, for
example, joins with different tables and unusual operators? In
addition, I have a question about testing your feature on a benchmark.
Are you going to do this?

On 17.07.2024 16:24, Alexander Pyhalov wrote:
> [...]

I think you should add the explanation above, of how the async
MergeAppend machinery works, to the commit message, because without it
it's hard to understand the full picture of how your code works.
--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi.
I've updated the patches for asynchronous MergeAppend. They allowed us
to significantly improve performance in practice. Earlier, a select
from a partitioned (and distributed) table could switch from an
asynchronous Append plan to a synchronous MergeAppend plan. Given that
the table could have 20+ partitions, the latter was cheaper, but much
less efficient, because the remote parts executed synchronously.
This version contains a couple of small fixes: earlier,
ExecMergeAppend() scanned all asyncplans, but it should do this only
for valid asyncplans. I've also incorporated the logic from
commit af717317a04f5217728ce296edf4a581eb7e6ea0
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed Mar 12 20:53:09 2025 +0200
Handle interrupts while waiting on Append's async subplans
into ExecMergeAppendAsyncEventWait().
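
For reference, the pattern from that commit, as it would presumably
look when carried over here; this is a sketch, and the event-set
handling details on the MergeAppend side are assumptions:

	/*
	 * When building the wait event set, also add the process latch so
	 * an interrupt can wake us up while blocked on async subplans.
	 */
	AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);

	/* ... WaitEventSetWait() as in the earlier sketch ... */

	for (i = 0; i < noccurred; i++)
	{
		WaitEvent  *w = &occurred_event[i];

		if (w->events & WL_LATCH_SET)
		{
			/* Service the pending interrupt, then keep processing. */
			ResetLatch(MyLatch);
			CHECK_FOR_INTERRUPTS();
			continue;
		}

		/* A socket event: notify the waiting async subplan. */
		ExecAsyncNotify((AsyncRequest *) w->user_data);
	}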
--
Best regards,
Alexander Pyhalov,
Postgres Professional
I noticed that this patch has gone largely unreviewed, but it needs a
rebase due to the GUC changes, so here it is again.

Thanks

--
Álvaro Herrera
PostgreSQL Developer — https://www.EnterpriseDB.com/
Hi, thanks for working on this!

On Tue Aug 20, 2024 at 6:14 AM -03, Alexander Pyhalov wrote:
>> In addition, I have a question about testing your feature on a
>> benchmark. Are you going to do this?
>
> The main reason for this work is a dramatic performance degradation
> when Append plans with async foreign scan nodes are switched to
> MergeAppend plans with synchronous foreign scans.
>
> I've performed some synthetic tests to prove the benefits of async
> Merge Append. So far tests are performed on one physical host.
>
> For tests I've deployed 3 PostgreSQL instances on ports 5432-5434.
>
> The first instance:
> create server s2 foreign data wrapper postgres_fdw OPTIONS (port
> '5433', dbname 'postgres', async_capable 'on');
> create server s3 foreign data wrapper postgres_fdw OPTIONS (port
> '5434', dbname 'postgres', async_capable 'on');
>
> create foreign table players_p1 partition of players for values with
> (modulus 4, remainder 0) server s2;
> create foreign table players_p2 partition of players for values with
> (modulus 4, remainder 1) server s2;
> create foreign table players_p3 partition of players for values with
> (modulus 4, remainder 2) server s3;
> create foreign table players_p4 partition of players for values with
> (modulus 4, remainder 3) server s3;
>
> s2 instance:
> create table players_p1 (id int, name text, score int);
> create table players_p2 (id int, name text, score int);
> create index on players_p1(score);
> create index on players_p2(score);
>
> s3 instance:
> create table players_p3 (id int, name text, score int);
> create table players_p4 (id int, name text, score int);
> create index on players_p3(score);
> create index on players_p4(score);
>
> s1 instance:
> insert into players select i, 'player_' || i, random() * 100 from
> generate_series(1,100000) i;
>
> pgbench script:
> \set rnd_offset random(0,200)
> \set rnd_limit random(10,20)
>
> select * from players order by score desc offset :rnd_offset limit
> :rnd_limit;
>
> pgbench was run as:
> pgbench -n -f 1.sql postgres -T 100 -c 16 -j 16
>
> CPU idle was about 5-10%.
>
> pgbench results:
>
> [...]
>
> However, if we set the number of threads to 1, so that CPU has idle
> cores, we'll see more evident improvements:
>
> Patched, async_capable on:
> pgbench (14.13, server 18devel)
> transaction type: 1.sql
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 20221
> latency average = 4.945 ms
> initial connection time = 7.035 ms
> tps = 202.221816 (without initial connection time)
>
> Patched, async_capable off:
> transaction type: 1.sql
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 100 s
> number of transactions actually processed: 14941
> latency average = 6.693 ms
> initial connection time = 7.037 ms
> tps = 149.415688 (without initial connection time)

I ran some benchmarks based on v4 attached by Alvaro in [1], using a
smaller number of threads so that some CPU cores would be idle, and I
also obtained better results:

Patched, async_capable on: tps = 4301.567405
Master, async_capable on: tps = 3847.084545

So I'm +1 for the idea. I know it's been a while since the last patch,
and unfortunately it hasn't received reviews since then. Do you still
plan to work on it? I still need to take a look at the code to see if I
can help with some comments.

During the tests I got compiler errors due to fce7c73fba4, so I'm
attaching a v5 with guc_parameters.dat correctly sorted. The
postgres_fdw regress tests were also failing due to some whitespace
problems; v5 also fixes this.

[1] https://www.postgresql.org/message-id/202510251154.isknefznk566%40alvherre.pgsql

--
Matheus Alcantara