Обсуждение: Re: Bypassing cursors in postgres_fdw to enable parallel plans

Поиск
Список
Период
Сортировка

Re: Bypassing cursors in postgres_fdw to enable parallel plans

От
Robert Haas
Дата:
On Mon, Jan 6, 2025 at 3:52 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> Now, to overcome this limitation, I have worked on this idea (suggested by my colleague Bernd Helmle) of bypassing
thecursors. The way it works is as follows, 
> there is a new GUC introduced postgres_fdw.use_cursor, which when unset uses the mode without the cursor. Now, it
usesPQsetChunkedRowsMode in create_cursor when non-cursor mode is used. The size of the chunk is the same as the
fetch_size.Now in fetch_more_data, when non-cursor mode is used, pgfdw_get_next_result is used to get the chunk in
PGresultand processed in the same manner as before. 
>
> Now, the issue comes when there are simultaneous queries, which is the case with the join queries where all the
tablesinvolved in the join are at the local server. Because in that case we have multiple cursors opened at the same
timeand without a cursor mechanism we do not have any information or any other structure to know what to fetch from
whichquery. To handle that case, we have a flag only_query, which is unset as soon as we have assigned the
cursor_number>= 2, in postgresBeginForeignScan. Now, in fetch_more data, when we find out that only_query is unset,
thenwe fetch all the data for the query and store it in a Tuplestore. These tuples are then transferred to the
fsstate->tuplesand then processed as usual. 
>
> So yes there is a performance drawback in the case of simultaneous queries, however, the ability to use parallel
plansis really an added advantage for the users. Plus, we can keep things as before by this new GUC -- use_cursor, in
casewe are losing more for some workloads.  So, in short I feel hopeful that this could be a good idea and a good time
toimprove postgres_fdw. 

Hi,

I think it might have been nice to credit me in this post, since I
made some relevant suggestions here off-list, in particular the idea
of using a Tuplestore when there are multiple queries running. But I
don't think this patch quite implements what I suggested. Here, you
have a flag only_query which gets set to true at some point in time
and thereafter remains true for the lifetime of a session. That means,
I think, that all future queries will use the tuplestore even though
there might not be multiple queries running any more. which doesn't
seem like what we want. And, actually, this looks like it will be set
as soon as you reach the second query in the same transaction, even if
the two queries don't overlap. I think what you want to do is test
whether, at the point where we would need to issue a new query,
whether an existing query is already running. If not, move that
query's remaining results into a Tuplestore so you can issue the new
query.

I'm not sure what the best way to implement that is, exactly. Perhaps
fsstate->conn_state needs to store some more details about the
connection, but that's just a guess. I don't think a global variable
is what you want. Not only is that session-lifetime, but it applies
globally to every connection to every server. You want to test
something that is specific to one connection to one server, so it
needs to be part of a data structure that is scoped that way.

I think you'll want to figure out a good way to test this patch. I
don't know if we need or can reasonably have automated test cases for
this new functionality, but you at least want to have a good way to do
manual testing, so that you can show that the tuplestore is used in
cases where it's necessary and not otherwise. I'm not yet sure whether
this patch needs automated test cases or whether they can reasonably
be written, but you at least want to have a good procedure for manual
validation so that you can verify that the Tuplestore is used in all
the cases where it needs to be and, hopefully, no others.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Bypassing cursors in postgres_fdw to enable parallel plans

От
Robert Haas
Дата:
On Fri, Jan 17, 2025 at 7:03 AM Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> Indeed you are right.
> Firstly, accept my apologies for not mentioning you in credits for this work. Thanks a lot for your efforts,
discussionswith you were helpful in shaping this patch and bringing it to this level. 
>
> Next, yes the last version was using tuplestore for queries within the same transaction after the second query. To
overcomethis, I came across this method to identify if there is any other simultaneous query running with the current
query;now there is an int variable num_queries which is incremented at every call of postgresBeginForeignScan and
decrementedat every call of postgresEndForeignScan. This way, if there are simultaneous queries running we get the
valueof num_queries greater than 1. Now, we check the value of num_queries and use tuplestore only when num_queries is
greaterthan 1. So, basically the understanding here is that if postgresBeginForeignScan is called and before the call
ofpostgresEndForeignScan if another call to postgresBeginForeignScan is made, then these are simultaneous queries. 

This wouldn't be true in case of error, I believe.

--
Robert Haas
EDB: http://www.enterprisedb.com