Thread: POC: postgres_fdw insert batching
Hi,

One of the issues I'm fairly regularly reminded of by users/customers is that inserting into tables sharded using FDWs is rather slow. We even get it reported on pgsql-bugs from time to time [1].

Some of the slowness / overhead is expected, due to the latency between machines in the sharded setup. Even just 1ms latency will make it way more expensive than a single instance.

But let's do a simple experiment, comparing a hash-partitioned table with regular partitions, and one with FDW partitions in the same instance. Scripts to run this are attached. The duration of inserting 1M rows into this table (average of 10 runs on my laptop) looks like this:

regular: 2872 ms
FDW: 64454 ms

Yep, it's ~20x slower. On a setup with ping latency well below 0.05ms. Imagine how it would look on sharded setups with 0.1ms or 1ms latency, which is probably where most single-DC clusters are :-(

Now, the primary reason why the performance degrades like this is that while FDW has batching for SELECT queries (i.e. we read larger chunks of data from the cursors), we don't have that for INSERTs (or other DML). Every time you insert a row, it has to go all the way down into the partition synchronously.

For some use cases this may be mitigated by having many independent connections from different users, so the per-user latency is higher but acceptable. But if you need to import larger amounts of data (say, a CSV file for analytics, ...) this may not work.

Some time ago I wrote an ugly PoC adding batching, just to see how far it would get us, and it seems quite promising - results for the same INSERT benchmarks look like this:

FDW batching: 4584 ms

So, rather nice improvement, I'd say ...

Before I spend more time hacking on this, I have a couple of open questions about the design, restrictions etc.

1) Extend the FDW API?

In the patch, the batching is simply "injected" into the existing insert API method, i.e. ExecForeignInsert et al.
I wonder if it'd be better to extend the API with a "batched" version of the method, so that we can easily determine whether the FDW supports batching or not - it would require changes in the callers, though. OTOH it might be useful for COPY, where we could do something similar to multi_insert (COPY already benefits from this patch, but it does not use the batching built into COPY).

2) What about the insert results?

I'm not sure what to do about the "result" status for the inserted rows. We only really "stash" the rows into a buffer, so we don't know if the insert will succeed or not. The patch simply assumes it will succeed, but that's clearly wrong, and it may result in reporting a wrong number of rows.

The patch also disables the batching when the insert has a RETURNING clause, because there's just a single slot (for the currently inserted row). I suppose a "batching" method would take an array of slots.

3) What about the other DML operations (DELETE/UPDATE)?

The other DML operations could probably benefit from the batching too. INSERT was good enough for a PoC, but having batching only for INSERT seems somewhat asymmetric. DELETE/UPDATE seem more complicated because of quals, but likely doable.

3) Should we do batching for COPY instead?

While looking at multi_insert, I've realized it's mostly exactly what the new "batching insert" API function would need to be. But it's only really used in COPY, so I wonder if we should just abandon the idea of batching INSERTs and do batching COPY for FDW tables instead.

For cases that can replace INSERT with COPY this would be enough, but unfortunately it does nothing for DELETE/UPDATE, so I'm hesitant to do this :-(

4) Expected consistency?

I'm not entirely sure what the consistency expectations for FDWs are. Currently the FDW nodes pointing to the same server share a connection, so the inserted rows might be visible to other nodes. But if we only stash the rows in a local buffer for a while, that's no longer true.
So maybe this breaks the consistency expectations?

But maybe that's OK - I'm not sure how the prepared statements/cursors affect this. I can imagine restricting the batching only to plans where this is not an issue (single FDW node or something), but it seems rather fragile and undesirable.

I was thinking about adding a GUC to enable/disable the batching at some level (global, server, table, ...) but it seems like a bad match because the consistency expectations likely depend on the query. There should be a GUC to set the batch size, though (it's hardcoded to 100 for now).

regards

[1] https://www.postgresql.org/message-id/CACnz%2BQ1q0%2B2KoJam9LyNMk8JmdC6qYHXWB895Wu2xcpoip18xQ%40mail.gmail.com

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
Hi Tomas, On Mon, Jun 29, 2020 at 12:10 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > Hi, > > One of the issues I'm fairly regularly reminded by users/customers is > that inserting into tables sharded using FDWs are rather slow. We do > even get it reported on pgsql-bugs from time to time [1]. > > Some of the slowness / overhead is expected, doe to the latency between > machines in the sharded setup. Even just 1ms latency will make it way > more expensive than a single instance. > > But let's do a simple experiment, comparing a hash-partitioned regular > partitions, and one with FDW partitions in the same instance. Scripts to > run this are attached. The duration of inserting 1M rows to this table > (average of 10 runs on my laptop) looks like this: > > regular: 2872 ms > FDW: 64454 ms > > Yep, it's ~20x slower. On setup with ping latency well below 0.05ms. > Imagine how would it look on sharded setups with 0.1ms or 1ms latency, > which is probably where most single-DC clusters are :-( > > Now, the primary reason why the performance degrades like this is that > while FDW has batching for SELECT queries (i.e. we read larger chunks of > data from the cursors), we don't have that for INSERTs (or other DML). > Every time you insert a row, it has to go all the way down into the > partition synchronously. > > For some use cases this may be reduced by having many independent > connnections from different users, so the per-user latency is higher but > acceptable. But if you need to import larger amounts of data (say, a CSV > file for analytics, ...) this may not work. > > Some time ago I wrote an ugly PoC adding batching, just to see how far > would it get us, and it seems quite promising - results for he same > INSERT benchmarks look like this: > > FDW batching: 4584 ms > > So, rather nice improvement, I'd say ... Very nice indeed. > Before I spend more time hacking on this, I have a couple open questions > about the design, restrictions etc. 
I think you may want to take a look at this recent proposal by Andrey Lepikhov: * [POC] Fast COPY FROM command for the table with foreign partitions * https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > FDW batching: 4584 ms > > So, rather nice improvement, I'd say ... Very nice. > > Before I spend more time hacking on this, I have a couple open questions > about the design, restrictions etc. > > > 1) Extend the FDW API? > > In the patch, the batching is simply "injected" into the existing insert > API method, i.e. ExecForeignInsert et al. I wonder if it'd be better to > extend the API with a "batched" version of the method, so that we can > easily determine whether the FDW supports batching or not - it would > require changes in the callers, though. OTOH it might be useful for > COPY, where we could do something similar to multi_insert (COPY already > benefits from this patch, but it does not use the batching built-into > COPY). Amit Langote has pointed out a related patch being discussed on hackers at [1]. That patch introduces a new API. But if we can do it without introducing a new API that will be good. FDWs which can support batching can just modify their code and don't have to implement and manage a new API. We already have a handful of those APIs. > > 2) What about the insert results? > > I'm not sure what to do about "result" status for the inserted rows. We > only really "stash" the rows into a buffer, so we don't know if it will > succeed or not. The patch simply assumes it will succeed, but that's > clearly wrong, and it may result in reporting a wrong number or rows. I didn't get this. We are executing an INSERT on the foreign server, so we get the number of rows INSERTed from that server. We should just add those up across batches. If there's a failure, it would abort the transaction, local as well as remote. > > The patch also disables the batching when the insert has a RETURNING > clause, because there's just a single slot (for the currently inserted > row). I suppose a "batching" method would take an array of slots. 
> It will be a rare case when a bulk load also has a RETURNING clause. So, we can live with this restriction. We should try to choose a design which allows that restriction to be lifted in the future. But I doubt that restriction will be a serious one. > > 3) What about the other DML operations (DELETE/UPDATE)? > > The other DML operations could probably benefit from the batching too. > INSERT was good enough for a PoC, but having batching only for INSERT > seems somewhat asmymetric. DELETE/UPDATE seem more complicated because > of quals, but likely doable. Bulk INSERTs are more common in a sharded environment because of data load in, say, OLAP systems. Bulk update/delete are rarer, although not that rare. So if an approach just supports bulk insert and not bulk UPDATE/DELETE, that will address a large number of use cases IMO. But if we can make everything work together, that would be good as well. In your patch, I see that an INSERT statement with a batch is constructed as INSERT INTO ... VALUES (...), (...) with as many values as the batch size. That won't work as is for UPDATE/DELETE, since we can't pass multiple pairs of ctids and columns to be updated for each ctid in one statement. Maybe we could build as many UPDATE/DELETE statements as the size of a batch, but that would be ugly. What we need is a feature like a batch prepared statement in libpq, similar to what JDBC supports (https://mkyong.com/jdbc/jdbc-preparedstatement-example-batch-update/). This will allow a single prepared statement to be executed with a batch of parameters, each batch corresponding to one foreign DML statement. > > > 3) Should we do batching for COPY insteads? > > While looking at multi_insert, I've realized it's mostly exactly what > the new "batching insert" API function would need to be. But it's only > really used in COPY, so I wonder if we should just abandon the idea of > batching INSERTs and do batching COPY for FDW tables. I think this won't support RETURNING as well.
But if we could somehow use the COPY protocol to send the data to the foreign server and yet treat it as INSERT, that might work. I think we have to find out which performs better, COPY or batch INSERT. > > For cases that can replace INSERT with COPY this would be enough, but > unfortunately it does nothing for DELETE/UPDATE so I'm hesitant to do > this :-( Agreed, if we want to support bulk UPDATE/DELETE as well. > > > 4) Expected consistency? > > I'm not entirely sure what are the consistency expectations for FDWs. > Currently the FDW nodes pointing to the same server share a connection, > so the inserted rows might be visible to other nodes. But if we only > stash the rows in a local buffer for a while, that's no longer true. So > maybe this breaks the consistency expectations? > > But maybe that's OK - I'm not sure how the prepared statements/cursors > affect this. I think that area is grey. Depending upon where the cursor is positioned when a DML node executes a query, the data fetched from the cursor may or may not see the effect of the DML. The cursor position is based on the batch size, so we already have problems in this area, I think. Assuming that the DML and SELECT are independent, this will work. So, the consistency problems exist; they will just be modulated by batching the DML. I doubt that's related to this feature exclusively, and it should be solved independently of this feature. > > I was thinking about adding a GUC to enable/disable the batching at some > level (global, server, table, ...) but it seems like a bad match because > the consistency expectations likely depend on a query. There should be a > GUC to set the batch size, though (it's hardcoded to 100 for now). > Similar to fetch_size, it should be a foreign server and table level setting, IMO.
[1] https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru -- Best Wishes, Ashutosh Bapat
On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > 3) What about the other DML operations (DELETE/UPDATE)? > > > > The other DML operations could probably benefit from the batching too. > > INSERT was good enough for a PoC, but having batching only for INSERT > > seems somewhat asmymetric. DELETE/UPDATE seem more complicated because > > of quals, but likely doable. > > Bulk INSERTs are more common in a sharded environment because of data > load in say OLAP systems. Bulk update/delete are rare, although not > that rare. So if an approach just supports bulk insert and not bulk > UPDATE/DELETE that will address a large number of usecases IMO. But if > we can make everything work together that would be good as well. In most cases, I think the entire UPDATE/DELETE operations would be pushed down to the remote side by DirectModify. So, I'm not sure we really need the bulk UPDATE/DELETE. > > 3) Should we do batching for COPY insteads? > > > > While looking at multi_insert, I've realized it's mostly exactly what > > the new "batching insert" API function would need to be. But it's only > > really used in COPY, so I wonder if we should just abandon the idea of > > batching INSERTs and do batching COPY for FDW tables. > I think we have find out which performs > better COPY or batch INSERT. Maybe I'm missing something, but I think the COPY patch [1] seems more promising to me, because 1) it would not get the remote side's planner and executor involved, and 2) the data would be loaded more efficiently by multi-insert on the remote side. (Yeah, COPY doesn't support RETURNING, but it's rare that RETURNING is needed in a bulk load, as you mentioned.) > [1] https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru Best regards, Etsuro Fujita
On Tue, Jun 30, 2020 at 1:22 PM Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com> wrote: > On Tue, 30 Jun 2020 at 08:47, Etsuro Fujita <etsuro.fujita@gmail.com> wrote: >> On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat >> <ashutosh.bapat.oss@gmail.com> wrote: >> > On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra >> > <tomas.vondra@2ndquadrant.com> wrote: >> >> > > 3) What about the other DML operations (DELETE/UPDATE)? >> > > >> > > The other DML operations could probably benefit from the batching too. >> > > INSERT was good enough for a PoC, but having batching only for INSERT >> > > seems somewhat asmymetric. DELETE/UPDATE seem more complicated because >> > > of quals, but likely doable. >> > >> > Bulk INSERTs are more common in a sharded environment because of data >> > load in say OLAP systems. Bulk update/delete are rare, although not >> > that rare. So if an approach just supports bulk insert and not bulk >> > UPDATE/DELETE that will address a large number of usecases IMO. But if >> > we can make everything work together that would be good as well. >> >> In most cases, I think the entire UPDATE/DELETE operations would be >> pushed down to the remote side by DirectModify. So, I'm not sure we >> really need the bulk UPDATE/DELETE. > That may not be true for a partitioned table whose partitions are foreign tables. Esp. given the work that Amit Langote is doing [1]. It really depends on the ability of postgres_fdw to detect that the DML modifying each of the partitions can be pushed down. That may not come easily. While it's true that how to accommodate the DirectModify API in the new inherited update/delete planning approach is an open question on that thread, I would eventually like to find an answer to that. That is, that work shouldn't result in losing the foreign partition's ability to use DirectModify API to optimize updates/deletes. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
On Tue, Jun 30, 2020 at 2:54 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Tue, Jun 30, 2020 at 1:22 PM Ashutosh Bapat > <ashutosh.bapat@2ndquadrant.com> wrote: > > On Tue, 30 Jun 2020 at 08:47, Etsuro Fujita <etsuro.fujita@gmail.com> wrote: > >> On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat > >> <ashutosh.bapat.oss@gmail.com> wrote: > >> > On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra > >> > <tomas.vondra@2ndquadrant.com> wrote: > >> > >> > > 3) What about the other DML operations (DELETE/UPDATE)? > >> > > > >> > > The other DML operations could probably benefit from the batching too. > >> > > INSERT was good enough for a PoC, but having batching only for INSERT > >> > > seems somewhat asmymetric. DELETE/UPDATE seem more complicated because > >> > > of quals, but likely doable. > >> > > >> > Bulk INSERTs are more common in a sharded environment because of data > >> > load in say OLAP systems. Bulk update/delete are rare, although not > >> > that rare. So if an approach just supports bulk insert and not bulk > >> > UPDATE/DELETE that will address a large number of usecases IMO. But if > >> > we can make everything work together that would be good as well. > >> > >> In most cases, I think the entire UPDATE/DELETE operations would be > >> pushed down to the remote side by DirectModify. So, I'm not sure we > >> really need the bulk UPDATE/DELETE. > > That may not be true for a partitioned table whose partitions are foreign tables. Esp. given the work that Amit Langoteis doing [1]. It really depends on the ability of postgres_fdw to detect that the DML modifying each of the partitionscan be pushed down. That may not come easily. > > While it's true that how to accommodate the DirectModify API in the > new inherited update/delete planning approach is an open question on > that thread, I would eventually like to find an answer to that. 
> That is, that work shouldn't result in losing the foreign partition's > ability to use DirectModify API to optimize updates/deletes. That would be great! Best regards, Etsuro Fujita
On Mon, Jun 29, 2020 at 04:22:15PM +0530, Ashutosh Bapat wrote: >On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: > >> >> FDW batching: 4584 ms >> >> So, rather nice improvement, I'd say ... > >Very nice. > >> >> Before I spend more time hacking on this, I have a couple open questions >> about the design, restrictions etc. >> >> >> 1) Extend the FDW API? >> >> In the patch, the batching is simply "injected" into the existing insert >> API method, i.e. ExecForeignInsert et al. I wonder if it'd be better to >> extend the API with a "batched" version of the method, so that we can >> easily determine whether the FDW supports batching or not - it would >> require changes in the callers, though. OTOH it might be useful for >> COPY, where we could do something similar to multi_insert (COPY already >> benefits from this patch, but it does not use the batching built-into >> COPY). > >Amit Langote has pointed out a related patch being discussed on hackers at [1]. > >That patch introduces a new API. But if we can do it without >introducing a new API that will be good. FDWs which can support >batching can just modify their code and don't have to implement and >manage a new API. We already have a handful of those APIs. > I don't think extending the API is a big issue - the FDW code will need changing anyway, so this seems minor. I'll take a look at the COPY patch - I agree it seems like a good idea, although it can be less convenient in various cases (e.g. I've seen a lot of INSERT ... SELECT queries in sharded systems, etc.). >> >> 2) What about the insert results? >> >> I'm not sure what to do about "result" status for the inserted rows. We >> only really "stash" the rows into a buffer, so we don't know if it will >> succeed or not. The patch simply assumes it will succeed, but that's >> clearly wrong, and it may result in reporting a wrong number or rows. > >I didn't get this.
We are executing an INSERT on the foreign server, >so we get the number of rows INSERTed from that server. We should just >add those up across batches. If there's a failure, it would abort the >transaction, local as well as remote. > True, but it's not the FDW code doing the counting - it's the caller, depending on whether the ExecForeignInsert returns a valid slot or NULL. So it's not quite possible to just return a number of inserted tuples, as returned by the remote server. >> >> The patch also disables the batching when the insert has a RETURNING >> clause, because there's just a single slot (for the currently inserted >> row). I suppose a "batching" method would take an array of slots. >> > >It will be a rare case when a bulk load also has a RETURNING clause. >So, we can leave with this restriction. We should try to choose a >design which allows that restriction to be lifted in the future. But I >doubt that restriction will be a serious one. > >> >> 3) What about the other DML operations (DELETE/UPDATE)? >> >> The other DML operations could probably benefit from the batching too. >> INSERT was good enough for a PoC, but having batching only for INSERT >> seems somewhat asmymetric. DELETE/UPDATE seem more complicated because >> of quals, but likely doable. > >Bulk INSERTs are more common in a sharded environment because of data >load in say OLAP systems. Bulk update/delete are rare, although not >that rare. So if an approach just supports bulk insert and not bulk >UPDATE/DELETE that will address a large number of usecases IMO. But if >we can make everything work together that would be good as well. > >In your patch, I see that an INSERT statement with batch is >constructed as INSERT INTO ... VALUES (...), (...) as many values as >the batch size. That won't work as is for UPDATE/DELETE since we can't >pass multiple pairs of ctids and columns to be updated for each ctid >in one statement. 
Maybe we could build as many UPDATE/DELETE >statements as the size of a batch, but that would be ugly. What we >need is a feature like a batch prepared statement in libpq similar to >what JDBC supports >((https://mkyong.com/jdbc/jdbc-preparedstatement-example-batch-update/). >This will allow a single prepared statement to be executed with a >batch of parameters, each batch corresponding to one foreign DML >statement. > I'm pretty sure we could make it work with some array/unnest tricks to build a relation, and use that as a source of data. >> >> >> 3) Should we do batching for COPY insteads? >> >> While looking at multi_insert, I've realized it's mostly exactly what >> the new "batching insert" API function would need to be. But it's only >> really used in COPY, so I wonder if we should just abandon the idea of >> batching INSERTs and do batching COPY for FDW tables. > >I think this won't support RETURNING as well. But if we could somehow >use copy protocol to send the data to the foreign server and yet treat >it as INSERT, that might work. I think we have find out which performs >better COPY or batch INSERT. > I don't see why not support both, the use cases are somewhat different I think. >> >> For cases that can replace INSERT with COPY this would be enough, but >> unfortunately it does nothing for DELETE/UPDATE so I'm hesitant to do >> this :-( > >Agreed, if we want to support bulk UPDATE/DELETE as well. > >> >> >> 4) Expected consistency? >> >> I'm not entirely sure what are the consistency expectations for FDWs. >> Currently the FDW nodes pointing to the same server share a connection, >> so the inserted rows might be visible to other nodes. But if we only >> stash the rows in a local buffer for a while, that's no longer true. So >> maybe this breaks the consistency expectations? >> >> But maybe that's OK - I'm not sure how the prepared statements/cursors >> affect this. 
I can imagine restricting the batching only to plans where >> this is not an issue (single FDW node or something), but it seems rather >> fragile and undesirable. > >I think that area is grey. Depending upon where the cursor is >positioned when a DML node executes a query, the data fetched from >cursor may or may not see the effect of DML. The cursor position is >based on the batch size so we already have problems in this area I >think. Assuming that the DML and SELECT are independent this will >work. So, the consistency problems exists, it will just be modulated >by batching DML. I doubt that's related to this feature exclusively >and should be solved independent of this feature. > OK, thanks for the feedback. >> >> I was thinking about adding a GUC to enable/disable the batching at some >> level (global, server, table, ...) but it seems like a bad match because >> the consistency expectations likely depend on a query. There should be a >> GUC to set the batch size, though (it's hardcoded to 100 for now). >> > >Similar to fetch_size, it should foreign server, table level setting, IMO. > >[1] https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru > Yeah, I agree we should have a GUC to define the batch size. What I had in mind was something that would allow us to enable/disable batching to increase the consistency guarantees, or something like that. I think simple GUCs are a poor solution for that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2020-06-28 17:10:02 +0200, Tomas Vondra wrote: > 3) Should we do batching for COPY insteads? > > While looking at multi_insert, I've realized it's mostly exactly what > the new "batching insert" API function would need to be. But it's only > really used in COPY, so I wonder if we should just abandon the idea of > batching INSERTs and do batching COPY for FDW tables. > > For cases that can replace INSERT with COPY this would be enough, but > unfortunately it does nothing for DELETE/UPDATE so I'm hesitant to do > this :-( I personally think - and I realize that that might be annoying to somebody looking to make an incremental improvement - that the nodeModifyTable.c and copy.c code dealing with DML has become too complicated to add features like this without a larger refactoring. Leading to choices like this, whether to add a feature in one place but not the other. I think before we add more complexity, we ought to centralize and clean up the DML handling code so most is shared between copy.c and nodeModifyTable.c. Then we can much more easily add batching to FDWs, to CTAS, to INSERT INTO SELECT etc, for which there are patches already. > 4) Expected consistency? > > I'm not entirely sure what are the consistency expectations for FDWs. > Currently the FDW nodes pointing to the same server share a connection, > so the inserted rows might be visible to other nodes. But if we only > stash the rows in a local buffer for a while, that's no longer true. So > maybe this breaks the consistency expectations? Given that for local queries that's not the case (since the snapshot won't have those changes visible), I think we shouldn't be too concerned about that. If anything we should be concerned about the opposite. If we are concerned, perhaps we could add functionality to flush all pending changes before executing further statements? > I was thinking about adding a GUC to enable/disable the batching at some > level (global, server, table, ...) 
but it seems like a bad match because > the consistency expectations likely depend on a query. There should be a > GUC to set the batch size, though (it's hardcoded to 100 for now). Hm. If libpq allowed to utilize pipelining ISTM the answer here would be to not batch by building a single statement with all rows as a VALUES, but issue the single INSERTs in a pipelined manner. That'd probably remove all behavioural differences. I really wish somebody would pick up that libpq patch again. Greetings, Andres Freund
On 6/28/20 8:10 PM, Tomas Vondra wrote: > Now, the primary reason why the performance degrades like this is that > while FDW has batching for SELECT queries (i.e. we read larger chunks of > data from the cursors), we don't have that for INSERTs (or other DML). > Every time you insert a row, it has to go all the way down into the > partition synchronously. You added new fields into the PgFdwModifyState struct. Why didn't you reuse the ResultRelInfo::ri_CopyMultiInsertBuffer field and the CopyMultiInsertBuffer machinery as storage for incoming tuples? -- regards, Andrey Lepikhov Postgres Professional
On Fri, Jul 10, 2020 at 09:28:44AM +0500, Andrey V. Lepikhov wrote: >On 6/28/20 8:10 PM, Tomas Vondra wrote: >>Now, the primary reason why the performance degrades like this is that >>while FDW has batching for SELECT queries (i.e. we read larger chunks of >>data from the cursors), we don't have that for INSERTs (or other DML). >>Every time you insert a row, it has to go all the way down into the >>partition synchronously. > >You added new fields into the PgFdwModifyState struct. Why didn't you >reuse the ResultRelInfo::ri_CopyMultiInsertBuffer field and the >CopyMultiInsertBuffer machinery as storage for incoming tuples? > Because I was focused on speeding up inserts, and that is not using CopyMultiInsertBuffer, I think. I agree the way the tuples are stored may be improved, of course. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello Tomas-san, Thank you for picking this up. I'm interested in this topic, too. (As an aside, we'd like to submit a bulk insert patch for ECPG in the near future.) As others referred, Andrey-san's fast COPY to foreign partitions is also promising. But I think your bulk INSERT is a separate feature and offers what COPY cannot do -- data transformation during loading with INSERT SELECT and CREATE TABLE AS SELECT. Is there anything that worries you and has stopped development? Could I give it a try to implement this? (I'm not sure I can, sorry. I'm worried about whether we can change the executor's call chain easily.) > 1) Extend the FDW API? Yes, I think, because FDWs for other DBMSs will benefit from this. (But it's questionable whether we want users to transfer data in a Postgres database to other DBMSs...) MySQL and SQL Server have the same bulk insert syntax as Postgres, i.e., INSERT INTO table VALUES (record1), (record2), ... Oracle doesn't have this syntax, but it can use a CTE as follows: INSERT INTO table WITH t AS ( SELECT record1 FROM DUAL UNION ALL SELECT record2 FROM DUAL UNION ALL ... ) SELECT * FROM t; And many DBMSs should have CTAS, INSERT SELECT, and INSERT SELECT record1 UNION ALL SELECT record2 ... The API would simply be: TupleTableSlot ** ExecForeignMultiInsert(EState *estate, ResultRelInfo *rinfo, TupleTableSlot **slot, TupleTableSlot **planSlot, int numSlots); > 2) What about the insert results? I'm wondering if we can report success or failure of each inserted row, because the remote INSERT will fail entirely. Other FDWs may be able to do it, so the API can be like above. For the same reason, support for the RETURNING clause will vary from DBMS to DBMS. > 3) What about the other DML operations (DELETE/UPDATE)? I don't think they are necessary for the time being. If we want them, they will be implemented using the libpq batch/pipelining as Andres-san said. > 3) Should we do batching for COPY instead? 
I'm thinking of issuing INSERT with multiple records as your patch does, because: * When the user executed INSERT statements, it would look strange to the user if the remote SQL is displayed as COPY. * COPY doesn't invoke rules, unlike INSERT. (I don't think the rule is a feature that users care about, though.) Also, I'm a bit concerned that there might be, or will be, other differences between INSERT and COPY. [1] Fast COPY FROM command for the table with foreign partitions https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru Regards Takayuki Tsunakawa
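The per-DBMS SQL shapes surveyed above could be generated along these lines (an illustrative sketch with inlined integer literals; a real FDW would use parameter placeholders and proper quoting):

```python
def pg_multi_row_insert(table, rows):
    """PostgreSQL/MySQL/SQL Server shape: INSERT ... VALUES (...), (...)."""
    values = ", ".join("(" + ", ".join(str(v) for v in row) + ")" for row in rows)
    return f"INSERT INTO {table} VALUES {values}"

def oracle_cte_insert(table, rows):
    """Oracle shape: a CTE of SELECT ... FROM DUAL rows glued with UNION ALL."""
    selects = " UNION ALL ".join(
        "SELECT " + ", ".join(str(v) for v in row) + " FROM DUAL" for row in rows
    )
    return f"INSERT INTO {table} WITH t AS ({selects}) SELECT * FROM t"

print(pg_multi_row_insert("tab", [(1, 2), (3, 4)]))
# INSERT INTO tab VALUES (1, 2), (3, 4)
```

Either shape lets one remote round trip carry the whole batch, which is what the proposed ExecForeignMultiInsert() callback would exploit.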
On Thu, Oct 08, 2020 at 02:40:10AM +0000, tsunakawa.takay@fujitsu.com wrote: >Hello Tomas-san, > > >Thank you for picking this up. I'm interested in this topic, too. (As an aside, we'd like to submit a bulk insert patch for ECPG in the near future.) > >As others referred, Andrey-san's fast COPY to foreign partitions is also promising. But I think your bulk INSERT is a separate feature and offers what COPY cannot do -- data transformation during loading with INSERT SELECT and CREATE TABLE AS SELECT. > >Is there anything that worries you and has stopped development? Could I give it a try to implement this? (I'm not sure I can, sorry. I'm worried about whether we can change the executor's call chain easily.) > It's primarily a matter of having too much other stuff on my plate, thus not having time to work on this feature. I was not too worried about any particular issue, but I wanted some feedback before spending more time on extending the API. I'm not sure when I'll have time to work on this again, so if you are interested and willing to work on it, please go ahead. I'll gladly do reviews and help you with it. > >> 1) Extend the FDW API? > >Yes, I think, because FDWs for other DBMSs will benefit from this. (But it's questionable whether we want users to transfer data in a Postgres database to other DBMSs...) > I think transferring data to other databases is fine - interoperability is a big advantage for users, I don't see it as something threatening the PostgreSQL project. I doubt this would make it more likely for users to migrate from PostgreSQL - there are many ways to do that already. >MySQL and SQL Server have the same bulk insert syntax as Postgres, i.e., INSERT INTO table VALUES (record1), (record2), ... Oracle doesn't have this syntax, but it can use a CTE as follows: > > INSERT INTO table > WITH t AS ( > SELECT record1 FROM DUAL UNION ALL > SELECT record2 FROM DUAL UNION ALL > ... 
> ) > SELECT * FROM t; > >And many DBMSs should have CTAS, INSERT SELECT, and INSERT SELECT record1 UNION ALL SELECT record2 ... > True. In some cases INSERT may be replaced by COPY, but it has various other features too. >The API would simply be: > >TupleTableSlot ** >ExecForeignMultiInsert(EState *estate, > ResultRelInfo *rinfo, > TupleTableSlot **slot, > TupleTableSlot **planSlot, > int numSlots); > > +1, seems quite reasonable >> 2) What about the insert results? > >I'm wondering if we can report success or failure of each inserted row, because the remote INSERT will fail entirely. OtherFDWs may be able to do it, so the API can be like above. > Yeah. I think handling complete failure should not be very difficult, but there are cases that worry me more. For example, what if there's a before trigger (on the remote db) that "skips" inserting some of the rows by returning NULL? >For the same reason, support for RETURNING clause will vary from DBMS to DBMS. > Yeah. I wonder if the FDW needs to indicate which features are supported by the ExecForeignMultiInsert, e.g. by adding a function that decides whether batch insert is supported (it might also do that internally by calling ExecForeignInsert, of course). > >> 3) What about the other DML operations (DELETE/UPDATE)? > >I don't think they are necessary for the time being. If we want them, they will be implemented using the libpq batch/pipeliningas Andres-san said. > I agree. > >> 3) Should we do batching for COPY insteads? > >I'm thinking of issuing INSERT with multiple records as your patch does, because: > >* When the user executed INSERT statements, it would look strange to the user if the remote SQL is displayed as COPY. > >* COPY doesn't invoke rules unlike INSERT. (I don't think the rule is a feature what users care about, though.) Also,I'm a bit concerned that there might be, or will be, other differences between INSERT and COPY. > I agree. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From: Tomas Vondra <tomas.vondra@2ndquadrant.com> > I'm not sure when I'll have time to work on this again, so if you are > interested and willing to work on it, please go ahead. I'll gladly do > reviews and help you with it. Thank you very much. > I think transferring data to other databases is fine - interoperability > is a big advantage for users, I don't see it as something threatening > the PostgreSQL project. I doubt this would make it more likely for users > to migrate from PostgreSQL - there are many ways to do that already. Definitely true. Users may want to use INSERT SELECT to do some data transformation in their OLTP database and load it into a non-Postgres data warehouse. > Yeah. I think handling complete failure should not be very difficult, > but there are cases that worry me more. For example, what if there's a > before trigger (on the remote db) that "skips" inserting some of the > rows by returning NULL? > Yeah. I wonder if the FDW needs to indicate which features are supported > by the ExecForeignMultiInsert, e.g. by adding a function that decides > whether batch insert is supported (it might also do that internally by > calling ExecForeignInsert, of course). Thanks for your advice. I'll try to address these points. Regards Takayuki Tsunakawa
Hello, The attached patch implements the new bulk insert routine for postgres_fdw and the executor utilizing it. It passes make check-world. I measured performance in a basic non-partitioned case by modifying Tomas-san's scripts. They perform an INSERT SELECT statement that copies one million records. The table consists of two integer columns, with a primary key on one of them. You can run the attached prepare.sql to set up once. local.sql inserts to the table directly, while fdw.sql inserts through a foreign table. The performance results, the average time of 5 runs, were as follows on a Linux host where the average round-trip time of "ping localhost" was 34 us: master, local: 6.1 seconds master, fdw: 125.3 seconds patched, fdw: 11.1 seconds (11x improvement) The patch accumulates at most 100 records in ModifyTableState before inserting in bulk. Also, when an input record is targeted for a different relation (= partition) than that for already accumulated records, insert the accumulated records and store the new record for a later insert. [Issues] 1. Do we want a GUC parameter, say, max_bulk_insert_records = (integer), to control the number of records inserted at once? The range of allowed values would be between 1 and 1,000. 1 disables bulk insert. The possible reason for the need for this kind of parameter would be to limit the amount of memory used for accumulated records, which could be prohibitively large if each record is big. I don't think this is a must, but I think we can have it. 2. Should we accumulate records per relation in ResultRelInfo instead? That is, when inserting into a partitioned table that has foreign partitions, delay insertion until a certain number of input records accumulate, and then insert accumulated records per relation (e.g., 50 records to relation A, 30 records to relation B, and 20 records to relation C.) If we do that: * The order of insertion differs from the order of input records. Is it OK? 
* Should the maximum count of accumulated records be applied per relation or per query? When many foreign partitions belong to a partitioned table, if the former is chosen, it may use much memory in total. If the latter is chosen, the records per relation could be few and thus the benefit of bulk insert could be small. Regards Takayuki Tsunakawa
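The accumulate-and-flush behavior described in this message (flush when the buffer is full or when the target relation changes) is small enough to model directly; a hypothetical Python sketch, not the patch's C code:

```python
class BatchingState:
    """One shared buffer, as described: flush when it reaches batch_size
    or when an input row targets a different relation than the buffer."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.rel = None
        self.buffer = []
        self.flushes = []  # (relation, row_count), recorded for inspection

    def flush(self):
        if self.buffer:
            self.flushes.append((self.rel, len(self.buffer)))
            self.buffer = []

    def insert(self, rel, row):
        if rel != self.rel:
            self.flush()          # relation switch forces a flush
            self.rel = rel
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()          # buffer full
```

Feeding rows that alternate between two partitions makes every switch flush a one-row batch, which is exactly what motivates the per-relation accumulation question above.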
Attachments
Hi, Thanks for working on this! On 11/10/20 1:45 AM, tsunakawa.takay@fujitsu.com wrote: > Hello, > > > The attached patch implements the new bulk insert routine for > postgres_fdw and the executor utilizing it. It passes make > check-world. > I haven't done any testing yet, just a quick review. I see the patch builds the "bulk" query in execute_foreign_modify. IMO that's something we should do earlier, when we're building the simple query (for 1-row inserts). I'd understand if you were concerned about overhead in case of 1-row inserts, trying to not plan the bulk query until necessary, but I'm not sure this actually helps. Or was the goal to build a query for every possible number of slots? I don't think that's really useful, considering it requires deallocating the old plan, preparing a new one, etc. IMO it should be sufficient to have two queries - one for 1-row inserts, one for the full batch. The last incomplete batch can be inserted using a loop of 1-row queries. That's what my patch was doing, but I'm not insisting on that - it just seems like a better approach to me. So feel free to argue why this is better. > I measured performance in a basic non-partitioned case by modifying > Tomas-san's scripts. They perform an INSERT SELECT statement that > copies one million records. The table consists of two integer > columns, with a primary key on one of those them. You can run the > attached prepare.sql to set up once. local.sql inserts to the table > directly, while fdw.sql inserts through a foreign table. > > The performance results, the average time of 5 runs, were as follows > on a Linux host where the average round-trip time of "ping localhost" > was 34 us: > > master, local: 6.1 seconds master, fdw: 125.3 seconds patched, fdw: > 11.1 seconds (11x improvement) > Nice. I think we can't really get much closer to local master, so 6.1 vs. 11.1 seconds look quite acceptable. 
> > The patch accumulates at most 100 records in ModifyTableState before > inserting in bulk. Also, when an input record is targeted for a > different relation (= partition) than that for already accumulated > records, insert the accumulated records and store the new record for > later insert. > > [Issues] > > 1. Do we want a GUC parameter, say, max_bulk_insert_records = > (integer), to control the number of records inserted at once? The > range of allowed values would be between 1 and 1,000. 1 disables > bulk insert. The possible reason of the need for this kind of > parameter would be to limit the amount of memory used for accumulated > records, which could be prohibitively large if each record is big. I > don't think this is a must, but I think we can have it. > I think it'd be good to have such GUC, even if only for testing and development. We should probably have a way to disable the batching, which the GUC could also do, I think. So +1 to have the GUC. > 2. Should we accumulate records per relation in ResultRelInfo > instead? That is, when inserting into a partitioned table that has > foreign partitions, delay insertion until a certain number of input > records accumulate, and then insert accumulated records per relation > (e.g., 50 records to relation A, 30 records to relation B, and 20 > records to relation C.) If we do that, > I think there's a chunk of text missing here? If we do that, then what? Anyway, I don't see why accumulating the records in ResultRelInfo would be better than what the patch does now. It seems to me like fairly specific to FDWs, so keeping it int FDW state seems appropriate. What would be the advantage of stashing it in ResultRelInfo? > * The order of insertion differs from the order of input records. Is > it OK? > I think that's OK for most use cases, and if it's not (e.g. when there's something requiring the exact order of writes) then it's not possible to use batching. 
That's one of the reasons why I think we should have a GUC to disable the batching. > * Should the maximum count of accumulated records be applied per > relation or the query? When many foreign partitions belong to a > partitioned table, if the former is chosen, it may use much memory in > total. If the latter is chosen, the records per relation could be > few and thus the benefit of bulk insert could be small. > I think it needs to be applied per relation, because that's the level at which we can do it easily and consistently. The whole point is to send data in sufficiently large chunks to minimize the communication overhead (latency etc.), but if you enforce it "per query" that seems hard. Imagine you're inserting data into a table with many partitions - how do you pick the number of rows to accumulate? The table may have 10 or 1000 partitions, we may be inserting into all partitions or just a small subset, not all partitions may be foreign, etc. It seems pretty difficult to pick and enforce a reliable limit at the query level. But maybe I'm missing something and it's easier than I think? Of course, you're entirely correct enforcing this at the partition level may require a lot of memory. Sadly, I don't see a way around that, except for (a) disabling batching or (b) ordering the data to insert data into one partition at a time. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
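For contrast, per-relation accumulation (buffers hanging off each target relation, as in the ResultRelInfo idea discussed above) could be modeled like this — again a hypothetical sketch, which also makes the memory concern visible: every relation with pending rows can hold up to a full batch:

```python
from collections import defaultdict

class PerRelationBatching:
    """One buffer per target relation, flushed independently, so
    interleaved partition input no longer forces a flush on every switch."""

    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffers = defaultdict(list)
        self.flushes = []  # (relation, row_count), recorded for inspection

    def insert(self, rel, row):
        buf = self.buffers[rel]
        buf.append(row)
        if len(buf) >= self.batch_size:
            self.flushes.append((rel, len(buf)))
            buf.clear()

    def finish(self):
        # Flush leftovers when the input ends.
        for rel, buf in self.buffers.items():
            if buf:
                self.flushes.append((rel, len(buf)))
                buf.clear()
```

With 1000 foreign partitions this keeps up to 1000 × batch_size rows in memory at once, which is the trade-off (a) and (b) above try to avoid.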
On 11/10/20 4:05 PM, Tomas Vondra wrote: > Hi, > > Thanks for working on this! > > On 11/10/20 1:45 AM, tsunakawa.takay@fujitsu.com wrote: >> Hello, >> >> >> The attached patch implements the new bulk insert routine for >> postgres_fdw and the executor utilizing it. It passes make >> check-world. >> > > I haven't done any testing yet, just a quick review. > > I see the patch builds the "bulk" query in execute_foreign_modify. IMO > that's something we should do earlier, when we're building the simple > query (for 1-row inserts). I'd understand if you were concerned about > overhead in case of 1-row inserts, trying to not plan the bulk query > until necessary, but I'm not sure this actually helps. > > Or was the goal to build a query for every possible number of slots? I > don't think that's really useful, considering it requires deallocating > the old plan, preparing a new one, etc. IMO it should be sufficient to > have two queries - one for 1-row inserts, one for the full batch. The > last incomplete batch can be inserted using a loop of 1-row queries. > > That's what my patch was doing, but I'm not insisting on that - it just > seems like a better approach to me. So feel free to argue why this is > better. > > >> I measured performance in a basic non-partitioned case by modifying >> Tomas-san's scripts. They perform an INSERT SELECT statement that >> copies one million records. The table consists of two integer >> columns, with a primary key on one of those them. You can run the >> attached prepare.sql to set up once. local.sql inserts to the table >> directly, while fdw.sql inserts through a foreign table. >> >> The performance results, the average time of 5 runs, were as follows >> on a Linux host where the average round-trip time of "ping localhost" >> was 34 us: >> >> master, local: 6.1 seconds master, fdw: 125.3 seconds patched, fdw: >> 11.1 seconds (11x improvement) >> > > Nice. I think we can't really get much closer to local master, so 6.1 > vs. 
11.1 seconds look quite acceptable. > >> >> The patch accumulates at most 100 records in ModifyTableState before >> inserting in bulk. Also, when an input record is targeted for a >> different relation (= partition) than that for already accumulated >> records, insert the accumulated records and store the new record for >> later insert. >> >> [Issues] >> >> 1. Do we want a GUC parameter, say, max_bulk_insert_records = >> (integer), to control the number of records inserted at once? The >> range of allowed values would be between 1 and 1,000. 1 disables >> bulk insert. The possible reason of the need for this kind of >> parameter would be to limit the amount of memory used for accumulated >> records, which could be prohibitively large if each record is big. I >> don't think this is a must, but I think we can have it. >> > > I think it'd be good to have such GUC, even if only for testing and > development. We should probably have a way to disable the batching, > which the GUC could also do, I think. So +1 to have the GUC. > >> 2. Should we accumulate records per relation in ResultRelInfo >> instead? That is, when inserting into a partitioned table that has >> foreign partitions, delay insertion until a certain number of input >> records accumulate, and then insert accumulated records per relation >> (e.g., 50 records to relation A, 30 records to relation B, and 20 >> records to relation C.) If we do that, >> > > I think there's a chunk of text missing here? If we do that, then what? > > Anyway, I don't see why accumulating the records in ResultRelInfo would > be better than what the patch does now. It seems to me like fairly > specific to FDWs, so keeping it int FDW state seems appropriate. What > would be the advantage of stashing it in ResultRelInfo? > >> * The order of insertion differs from the order of input records. Is >> it OK? >> > > I think that's OK for most use cases, and if it's not (e.g. 
when there's > something requiring the exact order of writes) then it's not possible to > use batching. That's one of the reasons why I think we should have a GUC > to disable the batching. > >> * Should the maximum count of accumulated records be applied per >> relation or the query? When many foreign partitions belong to a >> partitioned table, if the former is chosen, it may use much memory in >> total. If the latter is chosen, the records per relation could be >> few and thus the benefit of bulk insert could be small. >> > > I think it needs to be applied per relation, because that's the level at > which we can do it easily and consistently. The whole point is to send > data in sufficiently large chunks to minimize the communication overhead > (latency etc.), but if you enforce it "per query" that seems hard. > > Imagine you're inserting data into a table with many partitions - how do > you pick the number of rows to accumulate? The table may have 10 or 1000 > partitions, we may be inserting into all partitions or just a small > subset, not all partitions may be foreign, etc. It seems pretty > difficult to pick and enforce a reliable limit at the query level. But > maybe I'm missing something and it's easier than I think? > > Of course, you're entirely correct enforcing this at the partition level > may require a lot of memory. Sadly, I don't see a way around that, > except for (a) disabling batching or (b) ordering the data to insert > data into one partition at a time. > Two more comments regarding this: 1) If we want to be more strict about the memory consumption, we should probably set the limit in terms of memory, not number of rows. Currently the 100 rows may be 10kB or 10MB, there's no way to know. Of course, this is not the only place with this issue. 2) I wonder what the COPY FROM patch [1] does in this regard. I don't have time to check right now, but I suggest we try to do the same thing, if only to be consistent. 
[1] https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
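Tomas's point 1 — bounding the batch by memory rather than row count — could look roughly like this (a hypothetical policy with a toy row-size estimate; not something any posted patch implements):

```python
def batch_rows(rows, max_rows=100, max_bytes=65536,
               sizeof=lambda row: sum(len(str(v)) for v in row)):
    """Yield batches bounded by both a row count and an approximate byte
    budget, so 100 wide rows cannot balloon into tens of megabytes.
    The sizeof estimator here is deliberately crude, for illustration."""
    batch, nbytes = [], 0
    for row in rows:
        rbytes = sizeof(row)
        if batch and (len(batch) >= max_rows or nbytes + rbytes > max_bytes):
            yield batch
            batch, nbytes = [], 0
        batch.append(row)
        nbytes += rbytes
    if batch:
        yield batch
```

A single oversized row still forms its own batch, so the byte budget is a soft cap rather than a hard guarantee.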
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > I see the patch builds the "bulk" query in execute_foreign_modify. IMO > that's something we should do earlier, when we're building the simple > query (for 1-row inserts). I'd understand if you were concerned about > overhead in case of 1-row inserts, trying to not plan the bulk query > until necessary, but I'm not sure this actually helps. > > Or was the goal to build a query for every possible number of slots? I > don't think that's really useful, considering it requires deallocating > the old plan, preparing a new one, etc. IMO it should be sufficient to > have two queries - one for 1-row inserts, one for the full batch. The > last incomplete batch can be inserted using a loop of 1-row queries. > > That's what my patch was doing, but I'm not insisting on that - it just > seems like a better approach to me. So feel free to argue why this is > better. Don't be concerned, the processing is not changed for 1-row inserts: the INSERT query string is built in PlanForeignModify(), and the remote statement is prepared in execute_foreign_modify() during the first call to ExecForeignInsert() and it's reused for subsequent ExecForeignInsert() calls. The re-creation of the INSERT query string and its corresponding PREPARE happens when the number of tuples to be inserted is different from the previous call to ExecForeignInsert()/ExecForeignBulkInsert(). That's because we don't know how many tuples will be inserted during planning (PlanForeignModify) or execution (until the scan ends for SELECT). For example, if we insert 10,030 rows with the bulk size 100, the flow is: PlanForeignModify(): build the INSERT query string for 1 row ExecForeignBulkInsert(100): drop the INSERT query string and prepared statement for 1 row build the query string and prepare the statement for a 100-row INSERT execute it ExecForeignBulkInsert(100): reuse the prepared statement for the 100-row INSERT and execute it ... 
ExecForeignBulkInsert(30): drop the INSERT query string and prepared statement for 100 rows build the query string and prepare the statement for a 30-row INSERT execute it > I think it'd be good to have such GUC, even if only for testing and > development. We should probably have a way to disable the batching, > which the GUC could also do, I think. So +1 to have the GUC. OK, I'll add it. The name would be max_bulk_insert_tuples, because a) it might cover bulk insert for local relations in the future, and b) "tuple" is used in cpu_(index_)tuple_cost and parallel_tuple_cost, while "row" or "record" is not used in GUCs (except for row_security). The valid range would be between 1 and 1,000 (I said 10,000 previously, but I think it's an overreaction and am a bit worried about unforeseen trouble too many tuples might cause.) 1 disables the bulk processing and uses the traditional ExecForeignInsert(). The default value is 100 (would 1 be sensible as a default value to avoid surprising users by increased memory usage?) > > 2. Should we accumulate records per relation in ResultRelInfo > > instead? That is, when inserting into a partitioned table that has > > foreign partitions, delay insertion until a certain number of input > > records accumulate, and then insert accumulated records per relation > > (e.g., 50 records to relation A, 30 records to relation B, and 20 > > records to relation C.) If we do that, > > > > I think there's a chunk of text missing here? If we do that, then what? Sorry, the two bullets below there are what follows. Perhaps I should have written ":" instead of ",". > Anyway, I don't see why accumulating the records in ResultRelInfo would > be better than what the patch does now. It seems to me like fairly > specific to FDWs, so keeping it in FDW state seems appropriate. What > would be the advantage of stashing it in ResultRelInfo? I thought of distributing input records to their corresponding partitions' ResultRelInfos. 
For example, an input record for partition 1 comes, store it in the ResultRelInfo for partition 1, then an input record for partition 2 comes, store it in the ResultRelInfo for partition 2. When a ResultRelInfo accumulates some number of rows, insert the accumulated rows therein into the partition. When the input ends, perform bulk inserts for ResultRelInfos that have accumulated rows. > I think that's OK for most use cases, and if it's not (e.g. when there's > something requiring the exact order of writes) then it's not possible to > use batching. That's one of the reasons why I think we should have a GUC > to disable the batching. Agreed. > > * Should the maximum count of accumulated records be applied per > > relation or the query? When many foreign partitions belong to a > > partitioned table, if the former is chosen, it may use much memory in > > total. If the latter is chosen, the records per relation could be > > few and thus the benefit of bulk insert could be small. > > > > I think it needs to be applied per relation, because that's the level at > which we can do it easily and consistently. The whole point is to send > data in sufficiently large chunks to minimize the communication overhead > (latency etc.), but if you enforce it "per query" that seems hard. > > Imagine you're inserting data into a table with many partitions - how do > you pick the number of rows to accumulate? The table may have 10 or 1000 > partitions, we may be inserting into all partitions or just a small > subset, not all partitions may be foreign, etc. It seems pretty > difficult to pick and enforce a reliable limit at the query level. But > maybe I'm missing something and it's easier than I think? > > Of course, you're entirely correct enforcing this at the partition level > may require a lot of memory. Sadly, I don't see a way around that, > except for (a) disabling batching or (b) ordering the data to insert > data into one partition at a time. 
OK, I think I'll try doing it like that, after waiting some days for other opinions. > Two more comments regarding this: > > 1) If we want to be more strict about the memory consumption, we should > probably set the limit in terms of memory, not number of rows. Currently > the 100 rows may be 10kB or 10MB, there's no way to know. Of course, > this is not the only place with this issue. > > 2) I wonder what the COPY FROM patch [1] does in this regard. I don't > have time to check right now, but I suggest we try to do the same thing, > if only to be consistent. > > [1] > https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-909 > 86e55489f%40postgrespro.ru That COPY FROM patch uses the tuple accumulation mechanism for local tables as-is. That is, it accumulates at most 1,000 tuples per partition. /* * No more than this many tuples per CopyMultiInsertBuffer * * Caution: Don't make this too big, as we could end up with this many * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's * multiInsertBuffers list. Increasing this can cause quadratic growth in * memory requirements during copies into partitioned tables with a large * number of partitions. */ #define MAX_BUFFERED_TUPLES 1000 Regards Takayuki Tsunakawa
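The re-preparation flow described in this message amounts to a one-entry statement cache keyed by batch size; a sketch (hypothetical names, not the patch's actual code):

```python
class PreparedInsertCache:
    """Keep one remote prepared statement; re-prepare only when the
    incoming batch size differs from the size it was prepared for."""

    def __init__(self):
        self.prepared_for = 1   # PlanForeignModify() builds the 1-row query
        self.prepare_count = 1

    def execute(self, num_rows):
        if num_rows != self.prepared_for:
            # Corresponds to a DEALLOCATE + PREPARE on the remote server.
            self.prepare_count += 1
            self.prepared_for = num_rows
        # ... EXECUTE the prepared statement with num_rows * ncols params
```

For the 10,030-row example with a bulk size of 100, that is 100 executions of the 100-row statement plus one execution of a 30-row statement, i.e. only three PREPAREs in total.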
On Wed, 11 Nov 2020, tsunakawa.takay@fujitsu.com wrote: > From: Tomas Vondra <tomas.vondra@enterprisedb.com> >> I see the patch builds the "bulk" query in execute_foreign_modify. IMO >> that's something we should do earlier, when we're building the simple >> query (for 1-row inserts). I'd understand if you were concerned about >> overhead in case of 1-row inserts, trying to not plan the bulk query >> until necessary, but I'm not sure this actually helps. >> >> Or was the goal to build a query for every possible number of slots? I >> don't think that's really useful, considering it requires deallocating >> the old plan, preparing a new one, etc. IMO it should be sufficient to >> have two queries - one for 1-row inserts, one for the full batch. The >> last incomplete batch can be inserted using a loop of 1-row queries. >> >> That's what my patch was doing, but I'm not insisting on that - it just >> seems like a better approach to me. So feel free to argue why this is >> better. > > Don't be concerned, the processing is not changed for 1-row inserts: the INSERT query string is built in PlanForeignModify(), and the remote statement is prepared in execute_foreign_modify() during the first call to ExecForeignInsert() and it's reused for subsequent ExecForeignInsert() calls. > > The re-creation of the INSERT query string and its corresponding PREPARE happens when the number of tuples to be inserted is different from the previous call to ExecForeignInsert()/ExecForeignBulkInsert(). That's because we don't know how many tuples will be inserted during planning (PlanForeignModify) or execution (until the scan ends for SELECT). 
For example, if we insert 10,030 rows with the bulk size 100, the flow is: > > PlanForeignModify(): > build the INSERT query string for 1 row > ExecForeignBulkInsert(100): > drop the INSERT query string and prepared statement for 1 row > build the query string and prepare statement for 100 row INSERT > execute it > ExecForeignBulkInsert(100): > reuse the prepared statement for 100 row INSERT and execute it > ... > ExecForeignBulkInsert(30): > drop the INSERT query string and prepared statement for 100 row > build the query string and prepare statement for 30 row INSERT > execute it > > >> I think it'd be good to have such GUC, even if only for testing and >> development. We should probably have a way to disable the batching, >> which the GUC could also do, I think. So +1 to have the GUC. > > OK, I'll add it. The name would be max_bulk_insert_tuples, because a) it might cover bulk insert for local relations in the future, and b) "tuple" is used in cpu_(index_)tuple_cost and parallel_tuple_cost, while "row" or "record" is not used in GUCs (except for row_security). > > The valid range would be between 1 and 1,000 (I said 10,000 previously, but I think it's an overreaction and am a bit worried about unforeseen trouble too many tuples might cause.) 1 disables the bulk processing and uses the traditional ExecForeignInsert(). The default value is 100 (would 1 be sensible as a default value to avoid surprising users by increased memory usage?) > > >>> 2. Should we accumulate records per relation in ResultRelInfo >>> instead? That is, when inserting into a partitioned table that has >>> foreign partitions, delay insertion until a certain number of input >>> records accumulate, and then insert accumulated records per relation >>> (e.g., 50 records to relation A, 30 records to relation B, and 20 >>> records to relation C.) If we do that, >>> >> >> I think there's a chunk of text missing here? If we do that, then what? > > Sorry, the two bullets below there are what follows. 
Perhaps I should have written ":" instead of ",". > > >> Anyway, I don't see why accumulating the records in ResultRelInfo would >> be better than what the patch does now. It seems to me fairly >> specific to FDWs, so keeping it in FDW state seems appropriate. What >> would be the advantage of stashing it in ResultRelInfo? > > I thought of distributing input records to their corresponding partitions' ResultRelInfos. For example, an input record for partition 1 comes, store it in the ResultRelInfo for partition 1; then an input record for partition 2 comes, store it in the ResultRelInfo for partition 2. When a ResultRelInfo accumulates some number of rows, insert the accumulated rows therein into the partition. When the input ends, perform bulk inserts for ResultRelInfos that have accumulated rows. > > > >> I think that's OK for most use cases, and if it's not (e.g. when there's >> something requiring the exact order of writes) then it's not possible to >> use batching. That's one of the reasons why I think we should have a GUC >> to disable the batching. > > Agreed. > > >>> * Should the maximum count of accumulated records be applied per >>> relation or per query? When many foreign partitions belong to a >>> partitioned table, if the former is chosen, it may use much memory in >>> total. If the latter is chosen, the records per relation could be >>> few and thus the benefit of bulk insert could be small. >>> >> >> I think it needs to be applied per relation, because that's the level at >> which we can do it easily and consistently. The whole point is to send >> data in sufficiently large chunks to minimize the communication overhead >> (latency etc.), but if you enforce it "per query" that seems hard. >> >> Imagine you're inserting data into a table with many partitions - how do >> you pick the number of rows to accumulate?
The table may have 10 or 1000 >> partitions, we may be inserting into all partitions or just a small >> subset, not all partitions may be foreign, etc. It seems pretty >> difficult to pick and enforce a reliable limit at the query level. But >> maybe I'm missing something and it's easier than I think? >> >> Of course, you're entirely correct that enforcing this at the partition level >> may require a lot of memory. Sadly, I don't see a way around that, >> except for (a) disabling batching or (b) ordering the data to insert >> data into one partition at a time. > > OK, I think I'll try doing it like that, after waiting some days for other opinions. > > >> Two more comments regarding this: >> >> 1) If we want to be more strict about the memory consumption, we should >> probably set the limit in terms of memory, not number of rows. Currently >> the 100 rows may be 10kB or 10MB, there's no way to know. Of course, >> this is not the only place with this issue. >> >> 2) I wonder what the COPY FROM patch [1] does in this regard. I don't >> have time to check right now, but I suggest we try to do the same thing, >> if only to be consistent. >> >> [1] >> https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru > > That COPY FROM patch uses the tuple accumulation mechanism for local tables as-is. That is, it accumulates at most 1,000 tuples per partition. > > /* > * No more than this many tuples per CopyMultiInsertBuffer > * > * Caution: Don't make this too big, as we could end up with this many > * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's > * multiInsertBuffers list. Increasing this can cause quadratic growth in > * memory requirements during copies into partitioned tables with a large > * number of partitions. > */ > #define MAX_BUFFERED_TUPLES 1000 > > > Regards > Takayuki Tsunakawa > > Does this patch affect trigger semantics on the base table?
At the moment when I insert 1000 rows into a postgres_fdw table using a single insert statement (e.g. INSERT INTO fdw_foo SELECT ... FROM bar) I naively expect a "statement level" trigger on the base table to fire once. But this is not the case. The postgres_fdw implements this operation as 1000 separate insert statements on the base table, so the trigger fires 1000 times instead of once. Hence there is no distinction between using a statement level and a row level trigger on the base table in this context. So would this patch change the behaviour so that only 10 separate insert statements (each of 100 rows) would be made against the base table? If so, that's useful, as it means improving performance using statement level triggers becomes possible. But it would also result in more obscure semantics and might break user processes dependent on the existing behaviour after the patch is applied. BTW, is this subtlety documented? I haven't found anything, but I'm happy to be proved wrong. Tim
From: timc@corona.is.ed.ac.uk <timc@corona.is.ed.ac.uk> On Behalf Of > Does this patch affect trigger semantics on the base table? > > At the moment when I insert 1000 rows into a postgres_fdw table using a > single insert statement (e.g. INSERT INTO fdw_foo SELECT ... FROM bar) I > naively expect a "statement level" trigger on the base table to fire > once. But this is not the case. The postgres_fdw implements this > operation as 1000 separate insert statements on the base table, so the > trigger fires 1000 times instead of once. Hence there is no > distinction between using a statement level and a row level trigger on > the base table in this context. > > So would this patch change the behaviour so that only 10 separate insert > statements (each of 100 rows) would be made against the base table? > If so, that's useful, as it means improving performance using statement > level triggers becomes possible. But it would also result in more > obscure semantics and might break user processes dependent on the > existing behaviour after the patch is applied. Yes, the number of times the statement trigger defined on the base (remote) table fires will be reduced, as you said. > BTW, is this subtlety documented? I haven't found anything, but I'm happy > to be proved wrong. Unfortunately, there doesn't seem to be any description of triggers on base tables. For example, if the local foreign table has an AFTER ROW trigger and its remote base table has a BEFORE ROW trigger that modifies the input record, it seems that the AFTER ROW trigger doesn't see the modified record. Regards Takayuki Tsunakawa
Hello, I've modified the patch as discussed with Tomas-san. The performance results of loading one million records into a hash-partitioned table with 8 partitions are as follows: unpatched, local: 8.6 seconds unpatched, fdw: 113.7 seconds patched, fdw: 12.5 seconds (9x improvement) The test scripts are also attached. Run prepare.sql once to set up tables and source data. Run local_part.sql and fdw_part.sql to load source data into a partitioned table with local partitions and a partitioned table with foreign tables, respectively. Regards Takayuki Tsunakawa
Attachments
On 11/17/20 10:11 AM, tsunakawa.takay@fujitsu.com wrote: > Hello, > > > I've modified the patch as discussed with Tomas-san. The performance > results of loading one million records into a hash-partitioned table > with 8 partitions are as follows: > > unpatched, local: 8.6 seconds unpatched, fdw: 113.7 seconds patched, > fdw: 12.5 seconds (9x improvement) > > The test scripts are also attached. Run prepare.sql once to set up > tables and source data. Run local_part.sql and fdw_part.sql to load > source data into a partitioned table with local partitions and a > partitioned table with foreign tables respectively. > Unfortunately, this does not compile for me, because nodeModifyTable calls ExecGetTouchedPartitions, which is not defined anywhere. Not sure what that's about, so I simply commented this out. That probably fails the partitioned cases, but it allowed me to do some review and testing. As for the patch, I have a couple of comments: 1) As I mentioned before, I really don't think we should be doing deparsing in execute_foreign_modify - that's something that should happen earlier, and should be in a deparse.c function. 2) I think the GUC should be replaced with a server/table option, similar to fetch_size. The attached patch tries to address both of these points. Firstly, it adds a new deparseBulkInsertSql function that builds a query for the "full" batch, and then uses those two queries - when we get a full batch we use the bulk query, otherwise we use the single-row query in a loop. IMO this is cleaner than deparsing queries ad hoc in execute_foreign_modify. Of course, this might be worse when we don't have a full batch, e.g. for a query that inserts only 50 rows with batch_size=100. If this case is common, one option would be lowering the batch_size accordingly. If we really want to improve this case too, I suggest we pass more info than just a position of the VALUES clause - that seems a bit too hackish.
Secondly, it adds the batch_size option to server/foreign table, and uses that. This is not complete, though. postgresPlanForeignModify passes a hard-coded value at the moment; it needs to look up the correct value for the server/table from RelOptInfo or something. And I suppose the ModifyTable infrastructure will need to determine the value in order to pass the correct number of slots to the FDW API. There are a couple of other smaller changes. E.g. it undoes the changes to finish_foreign_modify, and instead calls separate functions to prepare the bulk statement. It also adds list_make5/list_make6 macros, so as to not have to do strange stuff with the parameter lists. And finally, this should probably add a bunch of regression tests. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > Unfortunately, this does not compile for me, because nodeModifyTable calls > ExecGetTouchedPartitions, which is not defined anywhere. Not sure what > that's about, so I simply commented this out. That probably fails the partitioned > cases, but it allowed me to do some review and testing. Ouch, sorry. I'm ashamed to have forgotten to include execPartition.c. > There are a couple of other smaller changes. E.g. it undoes the changes to > finish_foreign_modify, and instead calls separate functions to prepare the bulk > statement. It also adds list_make5/list_make6 macros, so as to not have to do > strange stuff with the parameter lists. Thanks, I'll take them thankfully! I wonder why I didn't think of separating deallocate_query() from finish_foreign_modify()... perhaps my brain was dying. As for list_make5/6(), I saw your first patch avoided adding them, so I thought you found them ugly (and I felt so, too.) But thinking now, there's no reason to hesitate. > And finally, this should probably add a bunch of regression tests. Sure. > 1) As I mentioned before, I really don't think we should be doing deparsing in > execute_foreign_modify - that's something that should happen earlier, and > should be in a deparse.c function. ... > The attached patch tries to address both of these points. > > Firstly, it adds a new deparseBulkInsertSql function that builds a query for the > "full" batch, and then uses those two queries - when we get a full batch we use > the bulk query, otherwise we use the single-row query in a loop. IMO this is > cleaner than deparsing queries ad hoc in execute_foreign_modify. ... > Of course, this might be worse when we don't have a full batch, e.g. for a query > that inserts only 50 rows with batch_size=100. If this case is common, one > option would be lowering the batch_size accordingly.
If we really want to > improve this case too, I suggest we pass more info than just a position of the > VALUES clause - that seems a bit too hackish. ... > Secondly, it adds the batch_size option to server/foreign table, and uses that. > This is not complete, though. postgresPlanForeignModify passes a > hard-coded value at the moment; it needs to look up the correct value for the > server/table from RelOptInfo or something. And I suppose the ModifyTable > infrastructure will need to determine the value in order to pass the correct > number of slots to the FDW API. I can sort of understand your feeling, but I'd like to reconstruct the query and prepare it in execute_foreign_modify() because: * Some of our customers use bulk insert in ECPG (INSERT ... VALUES (record1), (record2), ...) to insert a variable number of records per query. (Oracle's Pro*C has such a feature.) So, I want to be prepared to enable such a thing with FDW. * The number of records to insert is not known during planning (in general), so it feels natural to get prepared during the execution phase, or at least not unnatural. * I wanted to avoid the overhead of building the full query string for a 100-record insert statement during query planning, because it may be a bit costly for usual 1-record inserts. (The overhead may be hidden behind the high communication cost of postgres_fdw, though.) So, in terms of code cleanness, how about moving my code for rebuilding the query string from execute_foreign_modify() to some new function in deparse.c? > 2) I think the GUC should be replaced with a server/table option, similar to > fetch_size. Hmm, batch_size differs from fetch_size. fetch_size is a postgres_fdw-specific feature with no relevant FDW routine, while batch_size is a configuration parameter for all FDWs that implement ExecForeignBulkInsert(). The ideas I can think of are: 1. Follow JDBC/ODBC and add standard FDW properties.
For example, the JDBC standard defines standard connection pool properties such as maxPoolSize and minPoolSize. JDBC drivers have to provide them with those defined names. Likewise, the FDW interface requires FDW implementors to handle the foreign server option name "max_bulk_insert_tuples" if they want to provide the bulk insert feature and implement ExecForeignBulkInsert(). The core executor gets that setting from the FDW by calling a new FDW routine like GetMaxBulkInsertTuples(). Sigh... 2. Add a new max_bulk_insert_tuples reloption to CREATE/ALTER FOREIGN TABLE. The executor gets the value from Relation and uses it. (But is this a table-specific configuration? I don't think so, sigh...) 3. Adopt the current USERSET GUC max_bulk_insert_tuples. I think this is enough because the user can change the setting per session, application, and database. Regards Takayuki Tsunakawa
On 11/19/20 3:43 AM, tsunakawa.takay@fujitsu.com wrote: > From: Tomas Vondra <tomas.vondra@enterprisedb.com> >> Unfortunately, this does not compile for me, because >> nodeModifyTable calls ExecGetTouchedPartitions, which is not >> defined anywhere. Not sure what that's about, so I simply >> commented this out. That probably fails the partitioned cases, but >> it allowed me to do some review and testing. > > Ouch, sorry. I'm ashamed to have forgotten to include > execPartition.c. > No reason to feel ashamed. Mistakes do happen from time to time. > >> There are a couple of other smaller changes. E.g. it undoes the changes to >> finish_foreign_modify, and instead calls separate functions to >> prepare the bulk statement. It also adds list_make5/list_make6 >> macros, so as to not have to do strange stuff with the parameter >> lists. > > Thanks, I'll take them thankfully! I wonder why I didn't think of > separating deallocate_query() from finish_foreign_modify() ... > perhaps my brain was dying. As for list_make5/6(), I saw your first > patch avoided adding them, so I thought you found them ugly (and I felt > so, too.) But thinking now, there's no reason to hesitate. > I think it's often easier to look at changes like deallocate_query with a bit of distance, not while hacking on the patch and just trying to make it work somehow. For the list_make# stuff, I think I decided to do the simplest thing possible in the extension, without having to recompile the server. But I think for a proper patch it's better to keep it more readable. > ... > >> 1) As I mentioned before, I really don't think we should be doing >> deparsing in execute_foreign_modify - that's something that should >> happen earlier, and should be in a deparse.c function. > ... >> The attached patch tries to address both of these points.
>> >> Firstly, it adds a new deparseBulkInsertSql function that builds a >> query for the "full" batch, and then uses those two queries - when >> we get a full batch we use the bulk query, otherwise we use the >> single-row query in a loop. IMO this is cleaner than deparsing >> queries ad hoc in execute_foreign_modify. > ... >> Of course, this might be worse when we don't have a full batch, >> e.g. for a query that inserts only 50 rows with batch_size=100. If >> this case is common, one option would be lowering the batch_size >> accordingly. If we really want to improve this case too, I suggest >> we pass more info than just a position of the VALUES clause - that >> seems a bit too hackish. > ... >> Secondly, it adds the batch_size option to server/foreign table, >> and uses that. This is not complete, though. >> postgresPlanForeignModify passes a hard-coded value at >> the moment; it needs to look up the correct value for the >> server/table from RelOptInfo or something. And I suppose the >> ModifyTable infrastructure will need to determine the value in >> order to pass the correct number of slots to the FDW API. > > I can sort of understand your feeling, but I'd like to reconstruct > the query and prepare it in execute_foreign_modify() because: > > * Some of our customers use bulk insert in ECPG (INSERT ... > VALUES (record1), (record2), ...) to insert a variable number of records > per query. (Oracle's Pro*C has such a feature.) So, I want to be > prepared to enable such a thing with FDW. > > * The number of records to insert is not known during planning (in > general), so it feels natural to get prepared during the execution phase, > or at least not unnatural. > I think we should differentiate between "deparsing" and "preparing". > * I wanted to avoid the overhead of building the full query string > for a 100-record insert statement during query planning, because it may > be a bit costly for usual 1-record inserts.
(The overhead may be > hidden behind the high communication cost of postgres_fdw, though.) > Hmm, ok. I haven't checked how expensive that would be, but my assumption was it's much cheaper than the latency we save. But maybe I'm wrong. > So, in terms of code cleanness, how about moving my code for > rebuilding the query string from execute_foreign_modify() to some new > function in deparse.c? > That might work, yeah. I suggest we do this: 1) try to use the same approach for both single-row inserts and larger batches, to not have a lot of different branches 2) modify deparseInsertSql to produce not the "final" query but some intermediate representation useful to generate queries inserting an arbitrary number of rows 3) in execute_foreign_modify remember the last number of rows, and only rebuild/replan the query when it changes > >> 2) I think the GUC should be replaced with a server/table option, >> similar to fetch_size. > > Hmm, batch_size differs from fetch_size. fetch_size is a > postgres_fdw-specific feature with no relevant FDW routine, while > batch_size is a configuration parameter for all FDWs that implement > ExecForeignBulkInsert(). The ideas I can think of are: > > 1. Follow JDBC/ODBC and add standard FDW properties. For example, > the JDBC standard defines standard connection pool properties such as > maxPoolSize and minPoolSize. JDBC drivers have to provide them with > those defined names. Likewise, the FDW interface requires FDW > implementors to handle the foreign server option name > "max_bulk_insert_tuples" if they want to provide the bulk insert > feature and implement ExecForeignBulkInsert(). The core executor > gets that setting from the FDW by calling a new FDW routine like > GetMaxBulkInsertTuples(). Sigh... > > 2. Add a new max_bulk_insert_tuples reloption to CREATE/ALTER FOREIGN > TABLE. The executor gets the value from Relation and uses it. (But is > this a table-specific configuration? I don't think so, sigh...)
> I do agree there's a difference between fetch_size and batch_size. For fetch_size, it's internal to postgres_fdw - no external code needs to know about it. For batch_size that's not the case, the ModifyTable core code needs to be aware of that. That means the "batch_size" is becoming part of the API, and IMO the way to do that is by exposing it as an explicit API method. So +1 to add something like GetMaxBulkInsertTuples. It still needs to be configurable at the server/table level, though. The new API method should only inform ModifyTable about the final max batch size the FDW decided to use. > 3. Adopt the current USERSET GUC max_bulk_insert_tuples. I think > this is enough because the user can change the setting per session, > application, and database. > I don't think this is usable in practice, because a single session may be using multiple FDW servers, with different implementations, latency to the data nodes, etc. It's unlikely a single GUC value will be suitable for all of them. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > I don't think this is usable in practice, because a single session may > be using multiple FDW servers, with different implementations, latency > to the data nodes, etc. It's unlikely a single GUC value will be > suitable for all of them. That makes sense. The row size varies from table to table, so the user may want to tune this option to reduce memory consumption. I think the attached patch has reflected all your comments. I hope this will pass... Regards Takayuki Tsunakawa
Attachments
On 11/23/20 3:17 AM, tsunakawa.takay@fujitsu.com wrote: > From: Tomas Vondra <tomas.vondra@enterprisedb.com> >> I don't think this is usable in practice, because a single session >> may be using multiple FDW servers, with different implementations, >> latency to the data nodes, etc. It's unlikely a single GUC value >> will be suitable for all of them. > > That makes sense. The row size varies from table to table, so the > user may want to tune this option to reduce memory consumption. > > I think the attached patch has reflected all your comments. I hope > this will pass... > Thanks - I didn't have time for a thorough review, so I only skimmed through the diff and did a couple of very simple tests. And I think overall it looks quite nice. A couple of minor comments/questions: 1) We're calling it "batch_size" but the API function is named postgresGetMaxBulkInsertTuples(). Perhaps we should rename the function to postgresGetModifyBatchSize()? That has the advantage it'd work if we ever add support for batching to UPDATE/DELETE. 2) Do we have to look up the batch_size in create_foreign_modify (in server/table options)? I'd have expected to look it up while planning the modify and then pass it through the list, just like the other FdwModifyPrivateIndex stuff. But maybe that's not possible. 3) That reminds me - should we show the batching info in EXPLAIN? That seems like a fairly interesting thing to show to the user. Perhaps showing the average batch size would also be useful? Or maybe not, we create the batches as large as possible, with the last one smaller. 4) It seems that ExecInsert executes GetMaxBulkInsertTuples() over and over for every tuple. I don't know if that has measurable impact, but it seems a bit excessive IMO. I don't think we should support the batch size changing during execution (seems tricky). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > 1) We're calling it "batch_size" but the API function is named > postgresGetMaxBulkInsertTuples(). Perhaps we should rename the function > to postgresGetModifyBatchSize()? That has the advantage it'd work if we > ever add support for batching to UPDATE/DELETE. Actually, I was in two minds about whether the term batch or bulk is better. Oracle uses "bulk insert" and "bulk fetch", like in FETCH cur BULK COLLECT INTO array and FORALL in array INSERT INTO, while JDBC uses batch, as in "batch updates" and its API method names (addBatch, executeBatch). But it seems better and more common to use batch, according to the etymology and the following Stack Overflow page: https://english.stackexchange.com/questions/141884/which-is-a-better-and-commonly-used-word-bulk-or-batch OTOH, as for the name GetModifyBatchSize() you suggest, I think GetInsertBatchSize may be better. That is, this API deals with multiple records in a single INSERT statement. Your GetModifyBatchSize will be reserved for statement batching when libpq has supported batch/pipelining to execute multiple INSERT/UPDATE/DELETE statements, as in the following JDBC batch updates. What do you think? CODE EXAMPLE 14-1 Creating and executing a batch of insert statements -------------------------------------------------- Statement stmt = con.createStatement(); stmt.addBatch("INSERT INTO employees VALUES (1000, 'Joe Jones')"); stmt.addBatch("INSERT INTO departments VALUES (260, 'Shoe')"); stmt.addBatch("INSERT INTO emp_dept VALUES (1000, 260)"); // submit a batch of update commands for execution int[] updateCounts = stmt.executeBatch(); -------------------------------------------------- > 2) Do we have to look up the batch_size in create_foreign_modify (in > server/table options)? I'd have expected to look it up while planning > the modify and then pass it through the list, just like the other > FdwModifyPrivateIndex stuff. But maybe that's not possible.
Don't worry, create_foreign_modify() is called from PlanForeignModify() during planning. Unfortunately, it's also called from BeginForeignInsert(), but the other stuff passed to create_foreign_modify(), including the query string, is constructed there. > 3) That reminds me - should we show the batching info in EXPLAIN? That > seems like a fairly interesting thing to show to the user. Perhaps > showing the average batch size would also be useful? Or maybe not, we > create the batches as large as possible, with the last one smaller. Hmm, maybe batch_size is not for EXPLAIN, because its value doesn't change dynamically based on the planning or system state, unlike shared buffers and parallel workers. OTOH, I sometimes want to see what configuration parameter values the user set, such as work_mem, enable_*, and shared_buffers, together with the query plan (EXPLAIN and auto_explain). For example, it'd be nice if EXPLAIN (parameters on) could do that. Some relevant FDW-related parameters could be included in that output. > 4) It seems that ExecInsert executes GetMaxBulkInsertTuples() over and > over for every tuple. I don't know if that has measurable impact, but it > seems a bit excessive IMO. I don't think we should support the batch > size changing during execution (seems tricky). Don't worry about this, too. GetMaxBulkInsertTuples() just returns a value that was already saved in a struct in create_foreign_modify(). Regards Takayuki Tsunakawa
On 11/24/20 9:45 AM, tsunakawa.takay@fujitsu.com wrote: > From: Tomas Vondra <tomas.vondra@enterprisedb.com> >> 1) We're calling it "batch_size" but the API function is named >> postgresGetMaxBulkInsertTuples(). Perhaps we should rename the function >> to postgresGetModifyBatchSize()? That has the advantage it'd work if we >> ever add support for batching to UPDATE/DELETE. > > Actually, I was in two minds about whether the term batch or bulk is better. Oracle uses "bulk insert" and "bulk fetch", like in FETCH cur BULK COLLECT INTO array and FORALL in array INSERT INTO, while JDBC uses batch, as in "batch updates" and its API method names (addBatch, executeBatch). > > But it seems better and more common to use batch, according to the etymology and the following Stack Overflow page: > > https://english.stackexchange.com/questions/141884/which-is-a-better-and-commonly-used-word-bulk-or-batch > > OTOH, as for the name GetModifyBatchSize() you suggest, I think GetInsertBatchSize may be better. That is, this API deals with multiple records in a single INSERT statement. Your GetModifyBatchSize will be reserved for statement batching when libpq has supported batch/pipelining to execute multiple INSERT/UPDATE/DELETE statements, as in the following JDBC batch updates. What do you think? > I don't know. I was really only thinking about batching in the context of a single DML command, not about batching of multiple commands at the protocol level. IMHO it's far more likely we'll add support for batching for DELETE/UPDATE than libpq pipelining, which seems rather different from how the FDW API works. Which is why I was suggesting to use a name that would work for all DML commands, not just for inserts.
> CODE EXAMPLE 14-1 Creating and executing a batch of insert statements > -------------------------------------------------- > Statement stmt = con.createStatement(); > stmt.addBatch("INSERT INTO employees VALUES (1000, 'Joe Jones')"); > stmt.addBatch("INSERT INTO departments VALUES (260, 'Shoe')"); > stmt.addBatch("INSERT INTO emp_dept VALUES (1000, 260)"); > > // submit a batch of update commands for execution > int[] updateCounts = stmt.executeBatch(); > -------------------------------------------------- > Sure. We already have a patch to support something like this at the libpq level, IIRC. But I'm not sure how well that matches the FDW API approach in general. > >> 2) Do we have to look up the batch_size in create_foreign_modify (in >> server/table options)? I'd have expected to look it up while planning >> the modify and then pass it through the list, just like the other >> FdwModifyPrivateIndex stuff. But maybe that's not possible. > > Don't worry, create_foreign_modify() is called from PlanForeignModify() during planning. Unfortunately, it's also called from BeginForeignInsert(), but the other stuff passed to create_foreign_modify(), including the query string, is constructed there. > Hmm, ok. > >> 3) That reminds me - should we show the batching info in EXPLAIN? That >> seems like a fairly interesting thing to show to the user. Perhaps >> showing the average batch size would also be useful? Or maybe not, we >> create the batches as large as possible, with the last one smaller. > > Hmm, maybe batch_size is not for EXPLAIN, because its value doesn't change dynamically based on the planning or system state, unlike shared buffers and parallel workers. OTOH, I sometimes want to see what configuration parameter values the user set, such as work_mem, enable_*, and shared_buffers, together with the query plan (EXPLAIN and auto_explain). For example, it'd be nice if EXPLAIN (parameters on) could do that. Some relevant FDW-related parameters could be included in that output.
> Not sure, but I'd guess knowing whether batching is used would be useful. We only print the single-row SQL query, which kinda gives the impression that there's no batching. >> 4) It seems that ExecInsert executes GetMaxBulkInsertTuples() over and >> over for every tuple. I don't know if that has measurable impact, but it >> seems a bit excessive IMO. I don't think we should support the batch >> size changing during execution (seems tricky). > > Don't worry about this, too. GetMaxBulkInsertTuples() just returns a value that was already saved in a struct in create_foreign_modify(). > Well, I do worry, for two reasons. Firstly, the fact that in postgres_fdw the call is cheap does not mean it'll be like that in every other FDW. Presumably, the other FDWs might cache it in the struct and do the same thing, of course. But the fact that we're calling it over and over for each row kinda seems like we allow the value to change during execution, but I very much doubt the code is expecting that. I haven't tried, but assume the function first returns 10 and then 100. ISTM the code will allocate ri_Slots with 25 slots, but then we'll try stashing 100 tuples there. That can't end well. Sure, we can claim it's a bug in the FDW extension, but it's also due to the API design. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thank you for picking this up. I'm interested in this topic, too. (As an aside, we'd like to submit a bulk insert patch for ECPG in the near future.)
As others mentioned, Andrey-san's fast COPY to foreign partitions is also promising. But I think your bulk INSERT is a separate feature and offers something COPY cannot do -- data transformation during loading with INSERT SELECT and CREATE TABLE AS SELECT.
Is there anything that worries you and is stopping development? Could I give implementing this a try? (I'm not sure I can, sorry. I'm worried about whether we can change the executor's call chain easily.)
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > On 11/24/20 9:45 AM, tsunakawa.takay@fujitsu.com wrote: > > OTOH, as for the name GetModifyBatchSize() you suggest, I think > GetInsertBatchSize may be better. That is, this API deals with multiple > records in a single INSERT statement. Your GetModifyBatchSize will be > reserved for statement batching when libpq has supported batch/pipelining to > execute multiple INSERT/UPDATE/DELETE statements, as in the following > JDBC batch updates. What do you think? > > > > I don't know. I was really only thinking about batching in the context > of a single DML command, not about batching of multiple commands at the > protocol level. IMHO it's far more likely we'll add support for batching > for DELETE/UPDATE than libpq pipelining, which seems rather different > from how the FDW API works. Which is why I was suggesting to use a name > that would work for all DML commands, not just for inserts. Right, I can't imagine now how the interaction among the client, server core and FDWs would be regarding the statement batching. So I'll take your suggested name. > Not sure, but I'd guess knowing whether batching is used would be > useful. We only print the single-row SQL query, which kinda gives the > impression that there's no batching. Added in postgres_fdw like "Remote SQL" when EXPLAIN VERBOSE is run. > > Don't worry about this, too. GetMaxBulkInsertTuples() just returns a value > that was already saved in a struct in create_foreign_modify(). > > > > Well, I do worry for two reasons. > > Firstly, the fact that in postgres_fdw the call is cheap does not mean > it'll be like that in every other FDW. Presumably, the other FDWs might > cache it in the struct and do the same thing, of course. > > But the fact that we're calling it over and over for each row kinda > seems like we allow the value to change during execution, but I very > much doubt the code is expecting that. 
I haven't tried, but assume the > function first returns 10 and then 100. ISTM the code will allocate > ri_Slots with 25 slots, but then we'll try stashing 100 tuples there. > That can't end well. Sure, we can claim it's a bug in the FDW extension, > but it's also due to the API design. You worried about other FDWs than postgres_fdw. That's reasonable. I insisted in other threads that PG developers care only about postgres_fdw, not other FDWs, when designing the FDW interface, but I myself made the same mistake. I made changes so that the executor calls GetModifyBatchSize() once per relation per statement. Regards Takayuki Tsunakawa
Attachments
From: Craig Ringer <craig.ringer@enterprisedb.com> > I suggest that when developing this, you keep in mind the ongoing work on the libpq pipelining/batching enhancements, and also the way many interfaces to foreign data sources support asynchronous, concurrent operations. Yes, thank you, I bear it in mind. I understand it's a feature for batching multiple kinds of SQL statements like JDBC's batch updates. > I'd argue it's pretty much vital for decent performance when talking to a cloud database from an on-prem server for example, or any other time that round-trip-time reduction is important. Yeah, I'm thinking of data migration and integration as the prominent use case. Regards Takayuki Tsunakawa
On 11/25/20 7:31 AM, tsunakawa.takay@fujitsu.com wrote: > From: Craig Ringer <craig.ringer@enterprisedb.com> >> I suggest that when developing this, you keep in mind the ongoing >> work on the libpq pipelining/batching enhancements, and also the >> way many interfaces to foreign data sources support asynchronous, >> concurrent operations. > > Yes, thank you, I bear it in mind. I understand it's a feature for > batching multiple kinds of SQL statements like JDBC's batch updates. > I haven't followed the libpq pipelining thread very closely. It does seem related, but I'm not sure if it's a good match for this patch, or how far it is from being committable ... > >> I'd argue it's pretty much vital for decent performance when >> talking to a cloud database from an on-prem server for example, or >> any other time that round-trip-time reduction is important. > > Yeah, I'm thinking of the data migration and integration as the > prominent use case. > Well, good that we all agree this is a useful feature to have (in general). The question is whether postgres_fdw should be doing batching on its own (per this thread) or rely on some other feature (libpq pipelining). I haven't followed the other thread, so I don't have an opinion on that. Note however we're doing two things here, actually - we're implementing custom batching for postgres_fdw, but we're also extending the FDW API to allow other implementations to do the same thing. And most of them won't be able to rely on the connection library providing that, I believe. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > Well, good that we all agree this is a useful feature to have (in > general). The question is whether postgres_fdw should be doing batching > on its own (per this thread) or rely on some other feature (libpq > pipelining). I haven't followed the other thread, so I don't have an > opinion on that. Well, as someone said in this thread, I think bulk insert is much more common than updates/deletes. Thus, major DBMSs have INSERT VALUES(record1), (record2)... and INSERT SELECT. Oracle has direct path INSERT in addition. As for the comparison of INSERT with multiple records and libpq batching (= multiple INSERTs), I think the former is more efficient because the amount of data transfer is less and the parsing-planning of INSERT for each record is eliminated. I never deny the usefulness of libpq batch/pipelining, but I'm not sure if app developers would really use it. If they want to reduce the client-server round-trips, won't they use traditional stored procedures? Yes, the stored procedure language is very DBMS-specific. Then, I'd like to know what kind of well-known applications are using a standard batching API like JDBC's batch updates. (Sorry, I think that should be discussed in the libpq batch/pipelining thread and this thread should not be polluted.) > Note however we're doing two things here, actually - we're implementing > custom batching for postgres_fdw, but we're also extending the FDW API > to allow other implementations do the same thing. And most of them won't > be able to rely on the connection library providing that, I believe. I'm afraid so, too. Then, postgres_fdw would be an example that other FDW developers would look at when they use INSERT with multiple records. Regards Takayuki Tsunakawa
On 11/26/20 2:48 AM, tsunakawa.takay@fujitsu.com wrote: > From: Tomas Vondra <tomas.vondra@enterprisedb.com> >> Well, good that we all agree this is a useful feature to have (in >> general). The question is whether postgres_fdw should be doing >> batching on its own (per this thread) or rely on some other >> feature (libpq pipelining). I haven't followed the other thread, >> so I don't have an opinion on that. > > Well, as someone said in this thread, I think bulk insert is much > more common than updates/deletes. Thus, major DBMSs have INSERT > VALUES(record1), (record2)... and INSERT SELECT. Oracle has direct > path INSERT in addition. As for the comparison of INSERT with > multiple records and libpq batching (= multiple INSERTs), I think > the former is more efficient because the amount of data transfer is > less and the parsing-planning of INSERT for each record is > eliminated. > > I never deny the usefulness of libpq batch/pipelining, but I'm not > sure if app developers would really use it. If they want to reduce > the client-server round-trips, won't they use traditional stored > procedures? Yes, the stored procedure language is very > DBMS-specific. Then, I'd like to know what kind of well-known > applications are using standard batching API like JDBC's batch > updates. (Sorry, I think that should be discussed in libpq > batch/pipelining thread and this thread should not be polluted.) > Not sure how this is related to app developers? I think the idea was that the libpq features might be useful between the two PostgreSQL instances. I.e. the postgres_fdw would use the libpq batching to send chunks of data to the other side. > >> Note however we're doing two things here, actually - we're >> implementing custom batching for postgres_fdw, but we're also >> extending the FDW API to allow other implementations do the same >> thing. And most of them won't be able to rely on the connection >> library providing that, I believe. > > I'm afraid so, too. 
Then, postgres_fdw would be an example that > other FDW developers would look at when they use INSERT with > multiple records. > Well, my point was that we could keep the API, but maybe it should be implemented using the proposed libpq batching. They could still use the postgres_fdw example of how to use the API, but the internals would need to be different, of course. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > Not sure how this is related to app developers? I think the idea was > that the libpq features might be useful between the two PostgreSQL > instances. I.e. the postgres_fdw would use the libpq batching to send > chunks of data to the other side. > Well, my point was that we could keep the API, but maybe it should be > implemented using the proposed libpq batching. They could still use the > postgres_fdw example how to use the API, but the internals would need to > be different, of course. Yes, I understand them. I just wondered if app developers use the statement batching API for libpq or JDBC in what kind of apps. That is, I talked about the batching API itself, not related to FDW. (So, I mentioned I think I should ask such a question in the libpq batching thread.) I expect postgresExecForeignBatchInsert() would be able to use the libpq batching API, because it receives an array of tuples and can generate and issue an INSERT statement for each tuple. But I'm not sure either if the libpq batching is likely to be committed in the near future. (The thread looks too long...) Anyway, this thread's batch insert can be progressed (and hopefully committed), and once the libpq batching has been committed, we can give it a try to use it and modify postgres_fdw to see if we can get a further performance boost. Regards Takayuki Tsunakawa
From: Craig Ringer <craig.ringer@enterprisedb.com> > But in the libpq pipelining patch I demonstrated a 300 times (3000%) performance improvement on a test workload... Wow, impressive number. I've just seen it in the beginning of the libpq pipelining thread (oh, already four years ago..!) Could you share the workload and the network latency (ping time)? I'm sorry I'm just overlooking it. Thank you for your (always) concise explanation. I'd like to check other DBMSs and your rich references for the FDW interface. (My first intuition is that many major DBMSs might not have client C APIs that can be used to implement an async pipelining FDW interface. Also, I'm afraid it requires major surgery or reform of the executor. I don't want it to delay the release of reasonably good (10x) improvement with the synchronous interface.) (It'd be kind of you to send emails in text format. I've changed the format of this reply from HTML to text.) Regards Takayuki Tsunakawa
On 11/27/20 7:05 AM, tsunakawa.takay@fujitsu.com wrote: > From: Craig Ringer <craig.ringer@enterprisedb.com> >> But in the libpq pipelining patch I demonstrated a 300 times >> (3000%) performance improvement on a test workload... > > Wow, impressive number. I've just seen it in the beginning of the > libpq pipelining thread (oh, already four years ago..!) Could you > share the workload and the network latency (ping time)? I'm sorry > I'm just overlooking it. > > Thank you for your (always) concise explanation. I'd like to check > other DBMSs and your rich references for the FDW interface. (My > first intuition is that many major DBMSs might not have client C APIs > that can be used to implement an async pipelining FDW interface. > Also, I'm afraid it requires major surgery or reform of executor. I > don't want it to delay the release of reasonably good (10x) > improvement with the synchronous interface.) > I do agree that pipelining is nice, and can bring huge improvements. However, the FDW interface as it's implemented today is not designed to allow that, I believe (we pretty much just invoke the FDW callbacks as if it was a local AM). It assumes the calls are synchronous, and redesigning it to work in an async way is a much larger/complex patch than what's being discussed here. I do think the FDW extension proposed here (adding the bulk-insert callback) is useful in general, for two reasons: (a) even if most client libraries support some sort of pipelining, some don't, and (b) I'd bet it's still more efficient to send one large insert than pipelining many individual inserts. That being said, I'm against expanding the scope of this patch to also require redesign of the whole FDW infrastructure - that would likely mean no such improvement landing in PG14. If the libpq pipelining patch seems likely to get committed, we can try using it for the bulk insert callback (instead of the current multi-value stuff). > (It'd be kind of you to send emails in text format. 
I've changed the > format of this reply from HTML to text.) > Craig's client is sending messages in both text/plain and text/html. You probably need to tell your client to prefer that over html, somehow. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, 27 Nov 2020, 14:06 tsunakawa.takay@fujitsu.com, <tsunakawa.takay@fujitsu.com> wrote: > > Also, I'm afraid it requires major surgery or reform of executor. I don't want it to delay the release of reasonably good (10x) improvement with the synchronous interface.) Totally sensible. If it isn't feasible without significant executor change that's all that needs to be said. I was afraid that'd be the case given the executor's pull flow but just didn't know enough. It was not my intention to hold this patch up or greatly expand its scope. I'll spend some time testing it out and have a closer read soon to see if I can help progress it. I know Andres did some initial work on executor parallelism and pipelining. I should take a look. > > But in the libpq pipelining patch I demonstrated a 300 times (3000%) performance improvement on a test workload... > > Wow, impressive number. I've just seen it in the beginning of the libpq pipelining thread (oh, already four years ago..!) Yikes. > Could you share the workload and the network latency (ping time)? I'm sorry I'm just overlooking it. I thought I gave it at the time, and a demo program. IIRC it was just doing small multi row inserts or single row inserts. Latency would've been a couple of hundred ms probably, I think I did something like running on my laptop (Australia, ADSL) to a server on AWS US or EU. > Thank you for your (always) concise explanation. You joke! I am many things but despite my best efforts concise is rarely one of them. > I'd like to check other DBMSs and your rich references for the FDW interface. (My first intuition is that many major DBMSsmight not have client C APIs that can be used to implement an async pipelining FDW interface. Likely correct for C APIs of other traditional DBMSes. I'd be less sure about newer non SQL ones, especially cloud oriented. 
For example DynamoDB supports at least async requests in the Java client [3] and C++ client [4]; it's not immediately clear if requests can be pipelined, but the API suggests they can. Most things with a REST-like API can do a fair bit of concurrency though. Multiple async nonblocking HTTP connections can be serviced at once. Or HTTP/1.1 pipelining can be used [1], or even better HTTP/2.0 streams [2]. This is relevant for any REST-like API. > (It'd be kind of you to send emails in text format. I've changed the format of this reply from HTML to text.) I try to remember. Stupid Gmail. Sorry. On mobile it offers very little control over format but I'll do my best when I can. [1] https://en.wikipedia.org/wiki/HTTP_pipelining [2] https://blog.restcase.com/http2-benefits-for-rest-apis/ [3] https://aws.amazon.com/blogs/developer/asynchronous-requests-with-the-aws-sdk-for-java/ [4] https://sdk.amazonaws.com/cpp/api/LATEST/class_aws_1_1_dynamo_d_b_1_1_dynamo_d_b_client.html#ab631edaccca5f3f8988af15e7e9aa4f0
On Wed, Nov 25, 2020 at 05:04:36AM +0000, tsunakawa.takay@fujitsu.com wrote: > From: Tomas Vondra <tomas.vondra@enterprisedb.com> > > On 11/24/20 9:45 AM, tsunakawa.takay@fujitsu.com wrote: > > > OTOH, as for the name GetModifyBatchSize() you suggest, I think > > GetInsertBatchSize may be better. That is, this API deals with multiple > > records in a single INSERT statement. Your GetModifyBatchSize will be > > reserved for statement batching when libpq has supported batch/pipelining to > > execute multiple INSERT/UPDATE/DELETE statements, as in the following > > JDBC batch updates. What do you think? > > > > > > > I don't know. I was really only thinking about batching in the context > > of a single DML command, not about batching of multiple commands at the > > protocol level. IMHO it's far more likely we'll add support for batching > > for DELETE/UPDATE than libpq pipelining, which seems rather different > > from how the FDW API works. Which is why I was suggesting to use a name > > that would work for all DML commands, not just for inserts. > > Right, I can't imagine now how the interaction among the client, server core and FDWs would be regarding the statement batching. So I'll take your suggested name. > > > > Not sure, but I'd guess knowing whether batching is used would be > > useful. We only print the single-row SQL query, which kinda gives the > > impression that there's no batching. > > Added in postgres_fdw like "Remote SQL" when EXPLAIN VERBOSE is run. > > > > > Don't worry about this, too. GetMaxBulkInsertTuples() just returns a value > > that was already saved in a struct in create_foreign_modify(). > > > > > > > Well, I do worry for two reasons. > > > > Firstly, the fact that in postgres_fdw the call is cheap does not mean > > it'll be like that in every other FDW. Presumably, the other FDWs might > > cache it in the struct and do the same thing, of course. 
> > > > But the fact that we're calling it over and over for each row kinda > > seems like we allow the value to change during execution, but I very > > much doubt the code is expecting that. I haven't tried, but assume the > > function first returns 10 and then 100. ISTM the code will allocate > > ri_Slots with 25 slots, but then we'll try stashing 100 tuples there. > > That can't end well. Sure, we can claim it's a bug in the FDW extension, > > but it's also due to the API design. > > You worried about other FDWs than postgres_fdw. That's reasonable. I insisted in other threads that PG developers care only about postgres_fdw, not other FDWs, when designing the FDW interface, but I myself made the same mistake. I made changes so that the executor calls GetModifyBatchSize() once per relation per statement. Please pardon me for barging in late in this discussion, but if we're going to be using a bulk API here, wouldn't it make more sense to use COPY, except where RETURNING is specified, in place of INSERT? Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
From: Craig Ringer <craig.ringer@enterprisedb.com> > It was not my intention to hold this patch up or greatly expand its > scope. I'll spend some time testing it out and have a closer read soon > to see if I can help progress it. Thank you, I'm relieved to hear that. Last weekend, I was scared of a possible mood that's something like "We won't accept the insert speedup patch for foreign tables unless you take full advantage of pipelining and achieve maximum conceivable speed!" > I thought I gave it at the time, and a demo program. IIRC it was just > doing small multi row inserts or single row inserts. Latency would've > been a couple of hundred ms probably, I think I did something like > running on my laptop (Australia, ADSL) to a server on AWS US or EU. A couple of hundred ms, so that would be dominant in each prepare-send-execute-receive cycle, possibly even for batch insert with hundreds of rows in each batch. Then, the synchronous batch insert of the current patch may achieve a few hundred times speedup compared to single row inserts when the batch size is hundreds or more. > > I'd like to check other DBMSs and your rich references for the FDW interface. > (My first intuition is that many major DBMSs might not have client C APIs that > can be used to implement an async pipelining FDW interface. > > Likely correct for C APIs of other traditional DBMSes. I'd be less > sure about newer non SQL ones, especially cloud oriented. For example > DynamoDB supports at least async requests in the Java client [3] and > C++ client [4]; it's not immediately clear if requests can be > pipelined, but the API suggests they can. I've checked ODBC, MySQL, Microsoft Synapse Analytics, Redshift, and BigQuery, guessing that the data warehouses may have asynchronous/pipelining APIs that enable efficient data integration/migration. But none of them had one. (I seem to have spent too long and am a bit tired... but it was a bit fun as well.) 
They all support INSERT with multiple records in the VALUES clause. So, it will be useful to provide a synchronous batch insert FDW API. I guess Oracle's OCI has an asynchronous API, but I didn't check it. As an aside, MySQL 8.0.16 added support for asynchronous execution in its C API, but it allows only one active SQL statement in each connection. Likewise, although the ODBC standard defines asynchronous execution (SQLSetStmtAttr(SQL_ASYNC_ENABLE) and SQLCompleteAsync), SQL Server and Synapse Analytics allow only one active statement per connection. psqlODBC doesn't support asynchronous execution. > Most things with a REST-like API can do a fair bit of concurrency > though. Multiple async nonblocking HTTP connections can be serviced at > once. Or HTTP/1.1 pipelining can be used [1], or even better HTTP/2.0 > streams [2]. This is relevant for any REST-like API. I'm not sure if this is related, but Google deprecated the Batch HTTP API [1]. [1] https://cloud.google.com/bigquery/batch Regards Takayuki Tsunakawa
From: David Fetter <david@fetter.org> > Please pardon me for barging in late in this discussion, but if we're > going to be using a bulk API here, wouldn't it make more sense to use > COPY, except where RETURNING is specified, in place of INSERT? Please do not hesitate. I mentioned earlier in this thread that I think INSERT is better because: -------------------------------------------------- * When the user executed INSERT statements, it would look strange to the user if the remote SQL is displayed as COPY. * COPY doesn't invoke rules, unlike INSERT. (I don't think rules are a feature that users care about, though.) Also, I'm a bit concerned that there might be, or will be, other differences between INSERT and COPY. -------------------------------------------------- Also, COPY to foreign tables currently uses INSERTs; the improvement of using COPY instead of INSERT is in progress [1]. Keeping the "COPY uses COPY, INSERT uses INSERT" correspondence seems natural, and it makes COPY's high-speed advantage stand out. [1] Fast COPY FROM command for the table with foreign partitions https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru Regards Takayuki Tsunakawa
Hi, Attached is a v6 of this patch, rebased to current master and with some minor improvements (mostly comments and renaming the "end" struct field to "values_end" which I think is more descriptive). The one thing that keeps bugging me is convert_prep_stmt_params - it does the right thing, but the code is somewhat confusing. AFAICS the discussions about making this use COPY and/or libpq pipelining (neither of which is committed yet) ended with the conclusion that those changes are somewhat independent, and that it's worth getting this committed in the current form. Barring objections, I'll push this within the next couple days. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > Attached is a v6 of this patch, rebased to current master and with some minor > improvements (mostly comments and renaming the "end" struct field to > "values_end" which I think is more descriptive). Thank you very much. In fact, my initial patches used values_end, and I changed it to len considering that it may be used for bulk UPDATE and DELETE in the future. But I also think values_end makes its role easier to understand. Regards Takayuki Tsunakawa
Hi Tomas, Tsunakawa-san, Thanks for your work on this. On Tue, Jan 12, 2021 at 11:06 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > AFAICS the discussions about making this use COPY and/or libpq > pipelining (neither of which is committed yet) ended with the conclusion > that those changes are somewhat independent, and that it's worth getting > this committed in the current form. Barring objections, I'll push this > within the next couple days. I was trying this out today (been meaning to do so for a while) and noticed that this fails when there are AFTER ROW triggers on the foreign table. Here's an example: create extension postgres_fdw ; create server lb foreign data wrapper postgres_fdw ; create user mapping for current_user server lb; create table p (a numeric primary key); create foreign table fp (a int) server lb options (table_name 'p'); create function print_row () returns trigger as $$ begin raise notice '%', new; return null; end; $$ language plpgsql; create trigger after_insert_trig after insert on fp for each row execute function print_row(); insert into fp select generate_series (1, 10); <crashes> Apparently, the new code seems to assume that batching wouldn't be active when the original query contains RETURNING clause but some parts fail to account for the case where RETURNING is added to the query to retrieve the tuple to pass to the AFTER TRIGGER. Specifically, the Assert in the following block in execute_foreign_modify() is problematic: /* Check number of rows affected, and fetch RETURNING tuple if any */ if (fmstate->has_returning) { Assert(*numSlots == 1); n_rows = PQntuples(res); if (n_rows > 0) store_returning_result(fmstate, slots[0], res); } -- Amit Langote EDB: http://www.enterprisedb.com
On 1/13/21 10:15 AM, Amit Langote wrote: > Hi Tomas, Tsunakawa-san, > > Thanks for your work on this. > > On Tue, Jan 12, 2021 at 11:06 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> AFAICS the discussions about making this use COPY and/or libpq >> pipelining (neither of which is committed yet) ended with the conclusion >> that those changes are somewhat independent, and that it's worth getting >> this committed in the current form. Barring objections, I'll push this >> within the next couple days. > > I was trying this out today (been meaning to do so for a while) and > noticed that this fails when there are AFTER ROW triggers on the > foreign table. Here's an example: > > create extension postgres_fdw ; > create server lb foreign data wrapper postgres_fdw ; > create user mapping for current_user server lb; > create table p (a numeric primary key); > create foreign table fp (a int) server lb options (table_name 'p'); > create function print_row () returns trigger as $$ begin raise notice > '%', new; return null; end; $$ language plpgsql; > create trigger after_insert_trig after insert on fp for each row > execute function print_row(); > insert into fp select generate_series (1, 10); > <crashes> > > Apparently, the new code seems to assume that batching wouldn't be > active when the original query contains RETURNING clause but some > parts fail to account for the case where RETURNING is added to the > query to retrieve the tuple to pass to the AFTER TRIGGER. > Specifically, the Assert in the following block in > execute_foreign_modify() is problematic: > > /* Check number of rows affected, and fetch RETURNING tuple if any */ > if (fmstate->has_returning) > { > Assert(*numSlots == 1); > n_rows = PQntuples(res); > if (n_rows > 0) > store_returning_result(fmstate, slots[0], res); > } > Thanks for the report. Yeah, I think there's a missing check in ExecInsert. Adding (!resultRelInfo->ri_TrigDesc->trig_insert_after_row) solves this. 
But now I'm wondering if this is the wrong place to make this decision. I mean, why should we make the decision here, when the decision whether to have a RETURNING clause is made in postgres_fdw in deparseReturningList? We don't really know what the other FDWs will do, for example. So I think we should just move all of this into GetModifyBatchSize. We can start with ri_BatchSize = 0. And then do if (resultRelInfo->ri_BatchSize == 0) resultRelInfo->ri_BatchSize = resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo); if (resultRelInfo->ri_BatchSize > 1) { ... do batching ... } The GetModifyBatchSize would always return value > 0, so either 1 (no batching) or >1 (batching). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 1/13/21 3:43 PM, Tomas Vondra wrote: > > ... > > Thanks for the report. Yeah, I think there's a missing check in > ExecInsert. Adding > > (!resultRelInfo->ri_TrigDesc->trig_insert_after_row) > > solves this. But now I'm wondering if this is the wrong place to make > this decision. I mean, why should we make the decision here, when the > decision whether to have a RETURNING clause is made in postgres_fdw in > deparseReturningList? We don't really know what the other FDWs will do, > for example. > > So I think we should just move all of this into GetModifyBatchSize. We > can start with ri_BatchSize = 0. And then do > > if (resultRelInfo->ri_BatchSize == 0) > resultRelInfo->ri_BatchSize = > resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo); > > if (resultRelInfo->ri_BatchSize > 1) > { > ... do batching ... > } > > The GetModifyBatchSize would always return value > 0, so either 1 (no > batching) or >1 (batching). > FWIW the attached v8 patch does this - most of the conditions are moved to the GetModifyBatchSize() callback. I've removed the check for the BatchInsert callback, though - the FDW knows whether it supports that, and it seems a bit pointless at the moment as there are no other batch callbacks. Maybe we should add an Assert somewhere, though? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > FWIW the attached v8 patch does this - most of the conditions are moved to the > GetModifyBatchSize() callback. I've removed the check for the BatchInsert > callback, though - the FDW knows whether it supports that, and it seems a bit > pointless at the moment as there are no other batch callbacks. Maybe we > should add an Assert somewhere, though? Thank you. I'm in favor of this idea that the decision to support RETURNING and triggers is left to the FDW. I don't see the need for another Assert, as the caller has one for the returned batch size. Regards Takayuki Tsunakawa
Hi, On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 1/13/21 3:43 PM, Tomas Vondra wrote: > > Thanks for the report. Yeah, I think there's a missing check in > > ExecInsert. Adding > > > > (!resultRelInfo->ri_TrigDesc->trig_insert_after_row) > > > > solves this. But now I'm wondering if this is the wrong place to make > > this decision. I mean, why should we make the decision here, when the > > decision whether to have a RETURNING clause is made in postgres_fdw in > > deparseReturningList? We don't really know what the other FDWs will do, > > for example. > > > > So I think we should just move all of this into GetModifyBatchSize. We > > can start with ri_BatchSize = 0. And then do > > > > if (resultRelInfo->ri_BatchSize == 0) > > resultRelInfo->ri_BatchSize = > > resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo); > > > > if (resultRelInfo->ri_BatchSize > 1) > > { > > ... do batching ... > > } > > > > The GetModifyBatchSize would always return value > 0, so either 1 (no > > batching) or >1 (batching). > > > > FWIW the attached v8 patch does this - most of the conditions are moved > to the GetModifyBatchSize() callback. Thanks. A few comments: * I agree with leaving it up to an FDW to look at the properties of the table and of the operation being performed to decide whether or not to use batching, although maybe BeginForeignModify() is a better place for putting that logic instead of GetModifyBatchSize()? So, in create_foreign_modify(), instead of PgFdwModifyState.batch_size simply being set to match the table's or the server's value for the batch_size option, make it also consider the things that prevent batching and set the execution state's batch_size based on that. GetModifyBatchSize() simply returns that value. 
* Regarding the timing of calling GetModifyBatchSize() to set ri_BatchSize, I wonder if it wouldn't be better to call it just once, say from ExecInitModifyTable(), right after BeginForeignModify() returns? I don't quite understand why it is being called from ExecInsert(). Can the batch size change once the execution starts? * Lastly, how about calling it GetForeignModifyBatchSize() to be consistent with other nearby callbacks? > I've removed the check for the > BatchInsert callback, though - the FDW knows whether it supports that, > and it seems a bit pointless at the moment as there are no other batch > callbacks. Maybe we should add an Assert somewhere, though? Hmm, not checking whether BatchInsert() exists may not be a good idea, because if an FDW's GetModifyBatchSize() returns a value > 1 but there's no BatchInsert() function to call, ExecBatchInsert() would trip. I don't see the newly added documentation telling FDW authors to either define both or none. Regarding how this plays with partitions, I don't think we need ExecGetTouchedPartitions(), because you can get the routed-to partitions using es_tuple_routing_result_relations. Also, perhaps it's a good idea to put the "finishing" ExecBatchInsert() calls into a function ExecFinishBatchInsert(). Maybe the logic to choose the relations to perform the finishing calls on will get complicated in the future as batching is added for updates/deletes too and it seems better to encapsulate that in a separate function than have it out in the open in ExecModifyTable(). (Sorry about being so late reviewing this.) -- Amit Langote EDB: http://www.enterprisedb.com
On 1/14/21 9:58 AM, Amit Langote wrote: > Hi, > > On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 1/13/21 3:43 PM, Tomas Vondra wrote: >>> Thanks for the report. Yeah, I think there's a missing check in >>> ExecInsert. Adding >>> >>> (!resultRelInfo->ri_TrigDesc->trig_insert_after_row) >>> >>> solves this. But now I'm wondering if this is the wrong place to make >>> this decision. I mean, why should we make the decision here, when the >>> decision whether to have a RETURNING clause is made in postgres_fdw in >>> deparseReturningList? We don't really know what the other FDWs will do, >>> for example. >>> >>> So I think we should just move all of this into GetModifyBatchSize. We >>> can start with ri_BatchSize = 0. And then do >>> >>> if (resultRelInfo->ri_BatchSize == 0) >>> resultRelInfo->ri_BatchSize = >>> resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo); >>> >>> if (resultRelInfo->ri_BatchSize > 1) >>> { >>> ... do batching ... >>> } >>> >>> The GetModifyBatchSize would always return value > 0, so either 1 (no >>> batching) or >1 (batching). >>> >> >> FWIW the attached v8 patch does this - most of the conditions are moved >> to the GetModifyBatchSize() callback. > > Thanks. A few comments: > > * I agree with leaving it up to an FDW to look at the properties of > the table and of the operation being performed to decide whether or > not to use batching, although maybe BeginForeignModify() is a better > place for putting that logic instead of GetModifyBatchSize()? So, in > create_foreign_modify(), instead of PgFdwModifyState.batch_size simply > being set to match the table's or the server's value for the > batch_size option, make it also consider the things that prevent > batching and set the execution state's batch_size based on that. > GetModifyBatchSize() simply returns that value. 
> > * Regarding the timing of calling GetModifyBatchSize() to set > ri_BatchSize, I wonder if it wouldn't be better to call it just once, > say from ExecInitModifyTable(), right after BeginForeignModify() > returns? I don't quite understand why it is being called from > ExecInsert(). Can the batch size change once the execution starts? > But it should be called just once. The idea is that initially we have batch_size=0, and the first call returns a value that is >= 1. So we never call it again. But maybe it could be called from BeginForeignModify, in which case we'd not need this logic with first setting it to 0 etc. > * Lastly, how about calling it GetForeignModifyBatchSize() to be > consistent with other nearby callbacks? > Yeah, good point. >> I've removed the check for the >> BatchInsert callback, though - the FDW knows whether it supports that, >> and it seems a bit pointless at the moment as there are no other batch >> callbacks. Maybe we should add an Assert somewhere, though? > > Hmm, not checking whether BatchInsert() exists may not be a good idea, > because if an FDW's GetModifyBatchSize() returns a value > 1 but > there's no BatchInsert() function to call, ExecBatchInsert() would > trip. I don't see the newly added documentation telling FDW authors > to either define both or none. > Hmm. The BatchInsert check seemed somewhat unnecessary to me, but OTOH it can't hurt, I guess. I'll add it back. > Regarding how this plays with partitions, I don't think we need > ExecGetTouchedPartitions(), because you can get the routed-to > partitions using es_tuple_routing_result_relations. Also, perhaps I'm not very familiar with es_tuple_routing_result_relations, but that doesn't seem to work. I've replaced the flushing code at the end of ExecModifyTable with a loop over es_tuple_routing_result_relations, but then some of the rows are missing (i.e. not flushed). > it's a good idea to put the "finishing" ExecBatchInsert() calls into a > function ExecFinishBatchInsert(). 
Maybe the logic to choose the > relations to perform the finishing calls on will get complicated in > the future as batching is added for updates/deletes too and it seems > better to encapsulate that in a separate function than have it out > in the open in ExecModifyTable(). > IMO that'd be over-engineering at this point. We don't need such a separate function yet, so why complicate the API? If we need it in the future, we can add it. > (Sorry about being so late reviewing this.) thanks -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/14/21 2:57 PM, Amit Langote wrote: > On Thu, Jan 14, 2021 at 21:57 Tomas Vondra > <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>> > wrote: > > On 1/14/21 9:58 AM, Amit Langote wrote: > > Hi, > > > > On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com > <mailto:tomas.vondra@enterprisedb.com>> wrote: > >> On 1/13/21 3:43 PM, Tomas Vondra wrote: > >>> Thanks for the report. Yeah, I think there's a missing check in > >>> ExecInsert. Adding > >>> > >>> (!resultRelInfo->ri_TrigDesc->trig_insert_after_row) > >>> > >>> solves this. But now I'm wondering if this is the wrong place to > make > >>> this decision. I mean, why should we make the decision here, > when the > >>> decision whether to have a RETURNING clause is made in > postgres_fdw in > >>> deparseReturningList? We don't really know what the other FDWs > will do, > >>> for example. > >>> > >>> So I think we should just move all of this into > GetModifyBatchSize. We > >>> can start with ri_BatchSize = 0. And then do > >>> > >>> if (resultRelInfo->ri_BatchSize == 0) > >>> resultRelInfo->ri_BatchSize = > >>> > resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo); > >>> > >>> if (resultRelInfo->ri_BatchSize > 1) > >>> { > >>> ... do batching ... > >>> } > >>> > >>> The GetModifyBatchSize would always return value > 0, so either > 1 (no > >>> batching) or >1 (batching). > >>> > >> > >> FWIW the attached v8 patch does this - most of the conditions are > moved > >> to the GetModifyBatchSize() callback. > > > > Thanks. A few comments: > > > > * I agree with leaving it up to an FDW to look at the properties of > > the table and of the operation being performed to decide whether or > > not to use batching, although maybe BeginForeignModify() is a better > > place for putting that logic instead of GetModifyBatchSize()? 
So, in > > create_foreign_modify(), instead of PgFdwModifyState.batch_size simply > > being set to match the table's or the server's value for the > > batch_size option, make it also consider the things that prevent > > batching and set the execution state's batch_size based on that. > > GetModifyBatchSize() simply returns that value. > > > > * Regarding the timing of calling GetModifyBatchSize() to set > > ri_BatchSize, I wonder if it wouldn't be better to call it just once, > > say from ExecInitModifyTable(), right after BeginForeignModify() > > returns? I don't quite understand why it is being called from > > ExecInsert(). Can the batch size change once the execution starts? > > > > But it should be called just once. The idea is that initially we have > batch_size=0, and the fist call returns value that is >= 1. So we never > call it again. But maybe it could be called from BeginForeignModify, in > which case we'd not need this logic with first setting it to 0 etc. > > > Right, although I was thinking that maybe ri_BatchSize itself is not to > be written to by the FDW. Not to say that’s doing anything wrong though. > > > * Lastly, how about calling it GetForeignModifyBatchSize() to be > > consistent with other nearby callbacks? > > > > Yeah, good point. > > >> I've removed the check for the > >> BatchInsert callback, though - the FDW knows whether it supports > that, > >> and it seems a bit pointless at the moment as there are no other > batch > >> callbacks. Maybe we should add an Assert somewhere, though? > > > > Hmm, not checking whether BatchInsert() exists may not be good idea, > > because if an FDW's GetModifyBatchSize() returns a value > 1 but > > there's no BatchInsert() function to call, ExecBatchInsert() would > > trip. I don't see the newly added documentation telling FDW authors > > to either define both or none. > > > > Hmm. The BatchInsert check seemed somewhat unnecessary to me, but OTOH > it can't hurt, I guess. I'll ad it back. 
> > > Regarding how this plays with partitions, I don't think we need > > ExecGetTouchedPartitions(), because you can get the routed-to > > partitions using es_tuple_routing_result_relations. Also, perhaps > > I'm not very familiar with es_tuple_routing_result_relations, but that > doesn't seem to work. I've replaced the flushing code at the end of > ExecModifyTable with a loop over es_tuple_routing_result_relations, but > then some of the rows are missing (i.e. not flushed). > > > I should’ve mentioned es_opened_result_relations too which contain > non-routing result relations. So I really meant if (proute) then use > es_tuple_routing_result_relations, else es_opened_result_relations. > This should work as long as batching is only used for inserts. > Ah, right. That did the trick. Attached is v9 with all of those tweaks, except for moving the BatchSize call to BeginForeignModify - I tried that, but it did not seem like an improvement, because we'd still need the checks for API callbacks in ExecInsert for example. So I decided not to do that. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > Attached is v9 with all of those tweaks, except for moving the BatchSize call to > BeginForeignModify - I tried that, but it did not seem like an improvement, > because we'd still need the checks for API callbacks in ExecInsert for example. > So I decided not to do that. Thanks, Tomas-san. The patch looks good again. Amit-san, thank you for teaching us about es_tuple_routing_result_relations and es_opened_result_relations. Regards Takayuki Tsunakawa
On Fri, Jan 15, 2021 at 12:05 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Attached is v9 with all of those tweaks, Thanks. > except for moving the BatchSize > call to BeginForeignModify - I tried that, but it did not seem like an > improvement, because we'd still need the checks for API callbacks in > ExecInsert for example. So I decided not to do that. Okay, so maybe not moving the whole logic into the FDW's BeginForeignModify(), but at least if we move this... @@ -441,6 +449,72 @@ ExecInsert(ModifyTableState *mtstate, + /* + * Determine if the FDW supports batch insert and determine the batch + * size (a FDW may support batching, but it may be disabled for the + * server/table). Do this only once, at the beginning - we don't want + * the batch size to change during execution. + */ + if (resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && + resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert && + resultRelInfo->ri_BatchSize == 0) + resultRelInfo->ri_BatchSize = + resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); ...into ExecInitModifyTable(), ExecInsert() only needs the following block: /* + * If the FDW supports batching, and batching is requested, accumulate + * rows and insert them in batches. Otherwise use the per-row inserts. + */ + if (resultRelInfo->ri_BatchSize > 1) + { + ... AFAICS, I don't see anything that will cause ri_BatchSize to become 0 once set so don't see the point of checking whether it needs to be set again on every ExecInsert() call. Also, maybe not that important, but shaving off 3 comparisons for every tuple would add up nicely IMHO especially given that we're targeting bulk loads. -- Amit Langote EDB: http://www.enterprisedb.com
From: Amit Langote <amitlangote09@gmail.com> > Okay, so maybe not moving the whole logic into the FDW's > BeginForeignModify(), but at least if we move this... > > @@ -441,6 +449,72 @@ ExecInsert(ModifyTableState *mtstate, > + /* > + * Determine if the FDW supports batch insert and determine the > batch > + * size (a FDW may support batching, but it may be disabled for the > + * server/table). Do this only once, at the beginning - we don't want > + * the batch size to change during execution. > + */ > + if (resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && > + resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert && > + resultRelInfo->ri_BatchSize == 0) > + resultRelInfo->ri_BatchSize = > + > resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); > > ...into ExecInitModifyTable(), ExecInsert() only needs the following block: Does ExecInitModifyTable() know all leaf partitions where the tuples produced by VALUES or SELECT go? Doesn't ExecInsert() find the target leaf partition for the first time through the call to ExecPrepareTupleRouting()? Leaf partitions can have different batch_size settings. Regards Takayuki Tsunakawa
On Sat, Jan 16, 2021 at 12:00 AM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > From: Amit Langote <amitlangote09@gmail.com> > > Okay, so maybe not moving the whole logic into the FDW's > > BeginForeignModify(), but at least if we move this... > > > > @@ -441,6 +449,72 @@ ExecInsert(ModifyTableState *mtstate, > > + /* > > + * Determine if the FDW supports batch insert and determine the > > batch > > + * size (a FDW may support batching, but it may be disabled for the > > + * server/table). Do this only once, at the beginning - we don't want > > + * the batch size to change during execution. > > + */ > > + if (resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && > > + resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert && > > + resultRelInfo->ri_BatchSize == 0) > > + resultRelInfo->ri_BatchSize = > > + > > resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); > > > > ...into ExecInitModifyTable(), ExecInsert() only needs the following block: > > Does ExecInitModifyTable() know all leaf partitions where the tuples produced by VALUES or SELECT go? ExecInsert() doesn'tfind the target leaf partition for the first time through the call to ExecPrepareTupleRouting()? Leaf partitionscan have different batch_size settings. Good thing you reminded me that this is about inserts, and in that case no, ExecInitModifyTable() doesn't know all leaf partitions, it only sees the root table whose batch_size doesn't really matter. So it's really ExecInitRoutingInfo() that I would recommend to set ri_BatchSize; right after this block: /* * If the partition is a foreign table, let the FDW init itself for * routing tuples to the partition. */ if (partRelInfo->ri_FdwRoutine != NULL && partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL) partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo); Note that ExecInitRoutingInfo() is called only once for a partition when it is initialized after being inserted into for the first time. 
For non-partitioned targets, I'd still say set ri_BatchSize in ExecInitModifyTable(). -- Amit Langote EDB: http://www.enterprisedb.com
Tomas-san, From: Amit Langote <amitlangote09@gmail.com> > Good thing you reminded me that this is about inserts, and in that > case no, ExecInitModifyTable() doesn't know all leaf partitions, it > only sees the root table whose batch_size doesn't really matter. So > it's really ExecInitRoutingInfo() that I would recommend to set > ri_BatchSize; right after this block: > > /* > * If the partition is a foreign table, let the FDW init itself for > * routing tuples to the partition. > */ > if (partRelInfo->ri_FdwRoutine != NULL && > partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL) > partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo); > > Note that ExecInitRoutingInfo() is called only once for a partition > when it is initialized after being inserted into for the first time. > > For non-partitioned targets, I'd still say set ri_BatchSize in > ExecInitModifyTable(). Attached is the patch that added a call to GetModifyBatchSize() to the above two places. The regression test passes. (FWIW, frankly, I prefer the previous version because the code is a bit smaller... Maybe we should refactor the code someday to reduce similar processing in both the partitioned and non-partitioned cases.) Regards Takayuki Tsunakawa
Attachments
On 1/18/21 7:51 AM, tsunakawa.takay@fujitsu.com wrote: > Tomas-san, > > From: Amit Langote <amitlangote09@gmail.com> >> Good thing you reminded me that this is about inserts, and in that >> case no, ExecInitModifyTable() doesn't know all leaf partitions, >> it only sees the root table whose batch_size doesn't really matter. >> So it's really ExecInitRoutingInfo() that I would recommend to set >> ri_BatchSize; right after this block: >> >> /* * If the partition is a foreign table, let the FDW init itself >> for * routing tuples to the partition. */ if >> (partRelInfo->ri_FdwRoutine != NULL && >> partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL) >> partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, >> partRelInfo); >> >> Note that ExecInitRoutingInfo() is called only once for a >> partition when it is initialized after being inserted into for the >> first time. >> >> For a non-partitioned targets, I'd still say set ri_BatchSize in >> ExecInitModifyTable(). > > Attached is the patch that added call to GetModifyBatchSize() to the > above two places. The regression test passes. > > (FWIW, frankly, I prefer the previous version because the code is a > bit smaller... Maybe we should refactor the code someday to reduce > similar processings in both the partitioned case and non-partitioned > case.) > Less code would be nice, but it's not always the right thing to do, unfortunately :-( I took a look at this - there's a bit of bitrot due to 708d165ddb92c, so attached is a rebased patch (0001) fixing that. 0002 adds a couple comments and minor tweaks 0003 addresses a couple shortcomings related to explain - we haven't been showing the batch size for EXPLAIN (VERBOSE), because there'd be no FdwState, so this tries to fix that. Furthermore, there were no tests for EXPLAIN output with batch size, so I added a couple. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > I took a look at this - there's a bit of bitrot due to 708d165ddb92c, so attached is > a rebased patch (0001) fixing that. > > 0002 adds a couple comments and minor tweaks > > 0003 addresses a couple shortcomings related to explain - we haven't been > showing the batch size for EXPLAIN (VERBOSE), because there'd be no > FdwState, so this tries to fix that. Furthermore, there were no tests for EXPLAIN > output with batch size, so I added a couple. Thank you, good additions. They all look good. Only one point: I think the code for retrieving batch_size in create_foreign_modify() can be replaced with a call to the new function in 0003. God bless us. Regards Takayuki Tsunakawa
Tomas-san, Zhihong-san, From: Zhihong Yu <zyu@yugabyte.com> > + if (batch_size <= 0) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("%s requires a non-negative integer value", > > It seems the message doesn't match the check w.r.t. the batch size of 0. Ah, "non-negative" should be "positive". The message for the existing fetch_size should be fixed too. Tomas-san, could you include this as well? I'm sorry to trouble you. > + int numInserted = numSlots; > > Since numInserted is filled by ExecForeignBatchInsert(), the initialization can be done with 0. No, the code is correct, since the batch function requires the number of rows to insert as input. Regards Takayuki Tsunakawa
On 1/19/21 2:28 AM, tsunakawa.takay@fujitsu.com wrote: > From: Tomas Vondra <tomas.vondra@enterprisedb.com> >> I took a look at this - there's a bit of bitrot due to >> 708d165ddb92c, so attached is a rebased patch (0001) fixing that. >> >> 0002 adds a couple comments and minor tweaks >> >> 0003 addresses a couple shortcomings related to explain - we >> haven't been showing the batch size for EXPLAIN (VERBOSE), because >> there'd be no FdwState, so this tries to fix that. Furthermore, >> there were no tests for EXPLAIN output with batch size, so I added >> a couple. > > Thank you, good additions. They all look good. Only one point: I > think the code for retrieving batch_size in create_foreign_modify() > can be replaced with a call to the new function in 0003. > OK. Can you prepare a final patch, squashing all the commits into a single one, and perhaps use the function in create_foreign_modify? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > OK. Can you prepare a final patch, squashing all the commits into a > single one, and perhaps use the function in create_foreign_modify? Attached, including the message fix pointed out by Zaihong-san. Regards Takayuki Tsunakawa
Attachments
Tsunakawa-san, On Tue, Jan 19, 2021 at 12:50 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > From: Tomas Vondra <tomas.vondra@enterprisedb.com> > > OK. Can you prepare a final patch, squashing all the commits into a > > single one, and perhaps use the function in create_foreign_modify? > > Attached, including the message fix pointed by Zaihong-san. Thanks for adopting my suggestions regarding GetForeignModifyBatchSize(). I apologize in advance for being maybe overly pedantic, but I noticed that, in ExecInitModifyTable(), you decided to place the call outside the loop that goes over resultRelations (shown below), although my intent was to ask to place it next to the BeginForeignModify() in that loop. resultRelInfo = mtstate->resultRelInfo; i = 0; forboth(l, node->resultRelations, l1, node->plans) { ... /* Also let FDWs init themselves for foreign-table result rels */ if (!resultRelInfo->ri_usesFdwDirectModify && resultRelInfo->ri_FdwRoutine != NULL && resultRelInfo->ri_FdwRoutine->BeginForeignModify != NULL) { List *fdw_private = (List *) list_nth(node->fdwPrivLists, i); resultRelInfo->ri_FdwRoutine->BeginForeignModify(mtstate, resultRelInfo, fdw_private, i, eflags); } Maybe it's fine today because we only care about inserts and there's always only one entry in the resultRelations list in that case, but that may not remain the case in the future. -- Amit Langote EDB: http://www.enterprisedb.com
From: Amit Langote <amitlangote09@gmail.com> > I apologize in advance for being maybe overly pedantic, but I noticed > that, in ExecInitModifyTable(), you decided to place the call outside > the loop that goes over resultRelations (shown below), although my > intent was to ask to place it next to the BeginForeignModify() in that > loop. Actually, I tried to do it (adding the GetModifyBatchSize() call after BeginForeignModify()), but it failed. Because postgresfdwGetModifyBatchSize() wants to know if RETURNING is specified, and ResultRelInfo->projectReturning is created after the above part. Considering the context where GetModifyBatchSize() implementations may want to know the environment, I placed the call as late as possible in the initialization phase. As for the future(?) multi-target DML statements, I think we can change this together with other many(?) parts that assume a single target table. Regards Takayuki Tsunakawa
On Tue, Jan 19, 2021 at 2:06 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > From: Amit Langote <amitlangote09@gmail.com> > > I apologize in advance for being maybe overly pedantic, but I noticed > > that, in ExecInitModifyTable(), you decided to place the call outside > > the loop that goes over resultRelations (shown below), although my > > intent was to ask to place it next to the BeginForeignModify() in that > > loop. > > Actually, I tried to do it (adding the GetModifyBatchSize() call after BeginForeignModify()), but it failed. Because postgresfdwGetModifyBatchSize() wants to know if RETURNING is specified, and ResultRelInfo->projectReturning is created after the above part. Considering the context where GetModifyBatchSize() implementations may want to know the environment, I placed the call as late as possible in the initialization phase. As for the future(?) multi-target DML statements, I think we can change this together with other many(?) parts that assume a single target table. Okay, sometime later then. I wasn't sure if bringing it up here would be appropriate, but there's a patch by me to refactor ModifyTable result relation allocation that will have to remember to move this code along to an appropriate place [1]. Thanks for the tip about the dependency on how RETURNING is handled. I will remember it when rebasing my patch over this. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://commitfest.postgresql.org/31/2621/
On 1/19/21 7:23 AM, Amit Langote wrote: > On Tue, Jan 19, 2021 at 2:06 PM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: >> From: Amit Langote <amitlangote09@gmail.com> >>> I apologize in advance for being maybe overly pedantic, but I noticed >>> that, in ExecInitModifyTable(), you decided to place the call outside >>> the loop that goes over resultRelations (shown below), although my >>> intent was to ask to place it next to the BeginForeignModify() in that >>> loop. >> >> Actually, I tried to do it (adding the GetModifyBatchSize() call after BeginForeignModify()), but it failed. Because postgresfdwGetModifyBatchSize() wants to know if RETURNING is specified, and ResultRelInfo->projectReturning is created after the above part. Considering the context where GetModifyBatchSize() implementations may want to know the environment, I placed the call as late as possible in the initialization phase. As for the future(?) multi-target DML statements, I think we can change this together with other many(?) parts that assume a single target table. > > Okay, sometime later then. > > I wasn't sure if bringing it up here would be appropriate, but there's > a patch by me to refactor ModifyTable result relation allocation that > will have to remember to move this code along to an appropriate place > [1]. Thanks for the tip about the dependency on how RETURNING is > handled. I will remember it when rebasing my patch over this. > Thanks. The last version (v12) should be addressing all the comments and seems fine to me, so barring objections I'll get that pushed shortly. 
One thing that seems a bit annoying is that with the partitioned table the explain (verbose) looks like this: QUERY PLAN ----------------------------------------------------- Insert on public.batch_table -> Function Scan on pg_catalog.generate_series i Output: i.i Function Call: generate_series(1, 66) (4 rows) That is, there's no information about the batch size :-( But AFAICS that's due to how explain shows (or rather does not) partitions in this type of plan. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 20, 2021 at 1:01 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 1/19/21 7:23 AM, Amit Langote wrote: > > On Tue, Jan 19, 2021 at 2:06 PM tsunakawa.takay@fujitsu.com > >> Actually, I tried to do it (adding the GetModifyBatchSize() call after BeginForeignModify()), but it failed. Because postgresfdwGetModifyBatchSize() wants to know if RETURNING is specified, and ResultRelInfo->projectReturning is created after the above part. Considering the context where GetModifyBatchSize() implementations may want to know the environment, I placed the call as late as possible in the initialization phase. As for the future(?) multi-target DML statements, I think we can change this together with other many(?) parts that assume a single target table. > > > > Okay, sometime later then. > > > > I wasn't sure if bringing it up here would be appropriate, but there's > > a patch by me to refactor ModifyTable result relation allocation that > > will have to remember to move this code along to an appropriate place > > [1]. Thanks for the tip about the dependency on how RETURNING is > > handled. I will remember it when rebasing my patch over this. > > > > Thanks. The last version (v12) should be addressing all the comments and > seems fine to me, so barring objections I'll get that pushed shortly. +1 > One thing that seems a bit annoying is that with the partitioned table > the explain (verbose) looks like this: > > QUERY PLAN > ----------------------------------------------------- > Insert on public.batch_table > -> Function Scan on pg_catalog.generate_series i > Output: i.i > Function Call: generate_series(1, 66) > (4 rows) > > That is, there's no information about the batch size :-( But AFAICS > that's due to how explain shows (or rather does not) partitions in this > type of plan. Yeah. 
Partition result relations are always lazily allocated for INSERT, so EXPLAIN (without ANALYZE) has no idea what to show for them, nor does it know which partitions will be used in the first place. With ANALYZE however, you could get them from es_tuple_routing_result_relations and maybe list them if you want, but that sounds like a project on its own. -- Amit Langote EDB: http://www.enterprisedb.com
OK, pushed after a little bit of additional polishing (mostly comments). Thanks everyone! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hmm, seems that florican doesn't like this :-( https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 It's an i386 machine running FreeBSD, so not sure what exactly it's picky about. But when I tried running this under valgrind, I get some strange failures in the new chunk in ExecInitModifyTable: /* * Determine if the FDW supports batch insert and determine the batch * size (a FDW may support batching, but it may be disabled for the * server/table). */ if (!resultRelInfo->ri_usesFdwDirectModify && operation == CMD_INSERT && resultRelInfo->ri_FdwRoutine != NULL && resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert) resultRelInfo->ri_BatchSize = resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); else resultRelInfo->ri_BatchSize = 1; Assert(resultRelInfo->ri_BatchSize >= 1); It seems as if the resultRelInfo is not initialized, or something like that. I wouldn't be surprised if the 32-bit machine was pickier and failing because of that. A sample of the valgrind log is attached. It's pretty much just repetitions of these three reports. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > OK, pushed after a little bit of additional polishing (mostly comments). > Thanks everyone! florican reports this is seriously broken on 32-bit hardware: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 First guess is incorrect memory-allocation computations ... regards, tom lane
index 9c36860704..a6a814454d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2798,17 +2798,17 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
* size (a FDW may support batching, but it may be disabled for the
* server/table).
*/
- if (!resultRelInfo->ri_usesFdwDirectModify &&
+ if (!mtstate->resultRelInfo->ri_usesFdwDirectModify &&
operation == CMD_INSERT &&
- resultRelInfo->ri_FdwRoutine != NULL &&
- resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
- resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
- resultRelInfo->ri_BatchSize =
- resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
+ mtstate->resultRelInfo->ri_FdwRoutine != NULL &&
+ mtstate->resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
+ mtstate->resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
+ mtstate->resultRelInfo->ri_BatchSize =
+ mtstate->resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(mtstate->resultRelInfo);
else
- resultRelInfo->ri_BatchSize = 1;
+ mtstate->resultRelInfo->ri_BatchSize = 1;
- Assert(resultRelInfo->ri_BatchSize >= 1);
+ Assert(mtstate->resultRelInfo->ri_BatchSize >= 1);
/*
* Lastly, if this is not the primary (canSetTag) ModifyTable node, add it
On 1/21/21 12:52 AM, Tomas Vondra wrote: > Hmm, seems that florican doesn't like this :-( > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 > > > It's a i386 machine running FreeBSD, so not sure what exactly it's picky > about. But when I tried running this under valgrind, I get some strange > failures in the new chunk in ExecInitModifyTable: > > /* > * Determine if the FDW supports batch insert and determine the batch > * size (a FDW may support batching, but it may be disabled for the > * server/table). > */ > if (!resultRelInfo->ri_usesFdwDirectModify && > operation == CMD_INSERT && > resultRelInfo->ri_FdwRoutine != NULL && > resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize && > resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert) > resultRelInfo->ri_BatchSize = > > resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo); > else > resultRelInfo->ri_BatchSize = 1; > > Assert(resultRelInfo->ri_BatchSize >= 1); > > It seems as if the resultRelInfo is not initialized, or something like > that. I wouldn't be surprised if the 32-bit machine was pickier and > failing because of that. > > A sample of the valgrind log is attached. It's pretty much just > repetitions of these three reports. > OK, it's definitely accessing uninitialized memory, because the resultRelInfo (on line 2801, i.e. 
the "if" condition) looks like this:

(gdb) p resultRelInfo
$1 = (ResultRelInfo *) 0xe595988
(gdb) p *resultRelInfo
$2 = {type = 2139062142, ri_RangeTableIndex = 2139062143, ri_RelationDesc = 0x7f7f7f7f7f7f7f7f, ri_NumIndices = 2139062143, ri_IndexRelationDescs = 0x7f7f7f7f7f7f7f7f, ri_IndexRelationInfo = 0x7f7f7f7f7f7f7f7f, ri_TrigDesc = 0x7f7f7f7f7f7f7f7f, ri_TrigFunctions = 0x7f7f7f7f7f7f7f7f, ri_TrigWhenExprs = 0x7f7f7f7f7f7f7f7f, ri_TrigInstrument = 0x7f7f7f7f7f7f7f7f, ri_ReturningSlot = 0x7f7f7f7f7f7f7f7f, ri_TrigOldSlot = 0x7f7f7f7f7f7f7f7f, ri_TrigNewSlot = 0x7f7f7f7f7f7f7f7f, ri_FdwRoutine = 0x7f7f7f7f7f7f7f7f, ri_FdwState = 0x7f7f7f7f7f7f7f7f, ri_usesFdwDirectModify = 127, ri_NumSlots = 2139062143, ri_BatchSize = 2139062143, ri_Slots = 0x7f7f7f7f7f7f7f7f, ri_PlanSlots = 0x7f7f7f7f7f7f7f7f, ri_WithCheckOptions = 0x7f7f7f7f7f7f7f7f, ri_WithCheckOptionExprs = 0x7f7f7f7f7f7f7f7f, ri_ConstraintExprs = 0x7f7f7f7f7f7f7f7f, ri_GeneratedExprs = 0x7f7f7f7f7f7f7f7f, ri_NumGeneratedNeeded = 2139062143, ri_junkFilter = 0x7f7f7f7f7f7f7f7f, ri_returningList = 0x7f7f7f7f7f7f7f7f, ri_projectReturning = 0x7f7f7f7f7f7f7f7f, ri_onConflictArbiterIndexes = 0x7f7f7f7f7f7f7f7f, ri_onConflict = 0x7f7f7f7f7f7f7f7f, ri_PartitionCheckExpr = 0x7f7f7f7f7f7f7f7f, ri_PartitionRoot = 0x7f7f7f7f7f7f7f7f, ri_RootToPartitionMap = 0x8, ri_PartitionTupleSlot = 0x8, ri_ChildToRootMap = 0xe5952b0, ri_CopyMultiInsertBuffer = 0xe596740}
(gdb)

I may be wrong, but the most likely explanation seems to be this is due to the junk filter initialization, which simply moves past the end of the mtstate->resultRelInfo array. It kinda seems the GetForeignModifyBatchSize call should happen before that block. The attached patch fixes this for me (i.e. regression tests pass with no valgrind reports). Or did I get that wrong? -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 1/21/21 12:59 AM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@enterprisedb.com> writes: >> OK, pushed after a little bit of additional polishing (mostly comments). >> Thanks everyone! > > florican reports this is seriously broken on 32-bit hardware: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2021-01-20%2023%3A08%3A15 > > First guess is incorrect memory-allocation computations ... > I know, although it seems more like an access to uninitialized memory. I've already posted a patch that resolves that for me on 64-bits (per valgrind, I suppose it's the same issue). I'm working on reproducing it on 32-bits, hopefully it won't take long. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/21/21 1:17 AM, Zhihong Yu wrote: > Hi, > The assignment to resultRelInfo is done when junk_filter_needed is true: > > if (junk_filter_needed) > { > resultRelInfo = mtstate->resultRelInfo; > > Should the code for determining batch size access mtstate->resultRelInfo > directly ? > IMO the issue is that code iterates over all plans and moves to the next for each one: resultRelInfo++; so it ends up pointing past the last element, hence the failures. So yeah, either the code needs to move before the loop (per my patch), or we need to access mtstate->resultRelInfo directly. I'm pretty amazed this did not crash during any of the many regression runs I did recently. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 1/21/21 2:02 AM, Zhihong Yu wrote: > Hi, Tomas: > In my opinion, my patch is a little better. > Suppose one of the conditions in the if block changes in between the > start of loop and the end of the loop: > > * Determine if the FDW supports batch insert and determine the batch > * size (a FDW may support batching, but it may be disabled for the > * server/table). > > My patch would reflect that change. I guess this was the reason the if / > else block was placed there in the first place. > But can it change? All the loop does is extracting junk attributes from the plans, it does not modify anything related to the batching. Or maybe I just don't understand what you mean. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > I may be wrong, but the most likely explanation seems to be this is due > to the junk filter initialization, which simply moves past the end of > the mtstate->resultRelInfo array. resultRelInfo is certainly pointing at garbage at that point. > It kinda seems the GetForeignModifyBatchSize call should happen before > that block. The attached patch fixes this for me (i.e. regression tests > pass with no valgrind reports. > Or did I get that wrong? Don't we need to initialize ri_BatchSize for *each* resultrelinfo, not merely the first one? That is, this new code needs to be somewhere inside a loop over the result rels. regards, tom lane
On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 1/21/21 1:17 AM, Zhihong Yu wrote: > > Hi, > > The assignment to resultRelInfo is done when junk_filter_needed is true: > > > > if (junk_filter_needed) > > { > > resultRelInfo = mtstate->resultRelInfo; > > > > Should the code for determining batch size access mtstate->resultRelInfo > > directly ? > > > > IMO the issue is that code iterates over all plans and moves to the next > for each one: > > resultRelInfo++; > > so it ends up pointing past the last element, hence the failures. So > yeah, either the code needs to move before the loop (per my patch), or > we need to access mtstate->resultRelInfo directly. Accessing mtstate->resultRelInfo directly would do. The only constraint on where this block should be placed is that ri_projectReturning must be valid as of calling GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread. So, after this block in ExecInitModifyTable: /* * Initialize RETURNING projections if needed. */ if (node->returningLists) { .... /* * Build a projection for each result rel. */ resultRelInfo = mtstate->resultRelInfo; foreach(l, node->returningLists) { List *rlist = (List *) lfirst(l); resultRelInfo->ri_returningList = rlist; resultRelInfo->ri_projectReturning = ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps, resultRelInfo->ri_RelationDesc->rd_att); resultRelInfo++; } } -- Amit Langote EDB: http://www.enterprisedb.com
From: Zhihong Yu <zyu@yugabyte.com>
> Do we need to consider how this part of code inside ExecInitModifyTable() would evolve ?
> I think placing the compound condition toward the end of ExecInitModifyTable() is reasonable because it checks the latest information.
+1 for Zaihong-san's idea. But instead of rewriting every resultRelInfo to mtstate->resultRelInfo, which makes it a bit harder to read, I'd like to suggest just adding "resultRelInfo = mtstate->resultRelInfo;" immediately before the if block.
Thanks a lot, all for helping to solve the problem quickly!
Regards
Takayuki Tsunakawa
On 1/21/21 2:24 AM, Amit Langote wrote: > On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 1/21/21 1:17 AM, Zhihong Yu wrote: >>> Hi, >>> The assignment to resultRelInfo is done when junk_filter_needed is true: >>> >>> if (junk_filter_needed) >>> { >>> resultRelInfo = mtstate->resultRelInfo; >>> >>> Should the code for determining batch size access mtstate->resultRelInfo >>> directly ? >>> >> >> IMO the issue is that code iterates over all plans and moves to the next >> for each one: >> >> resultRelInfo++; >> >> so it ends up pointing past the last element, hence the failures. So >> yeah, either the code needs to move before the loop (per my patch), or >> we need to access mtstate->resultRelInfo directly. > > Accessing mtstate->resultRelInfo directly would do. The only > constraint on where this block should be placed is that > ri_projectReturning must be valid as of calling > GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread. > So, after this block in ExecInitModifyTable: > > /* > * Initialize RETURNING projections if needed. > */ > if (node->returningLists) > { > .... > /* > * Build a projection for each result rel. > */ > resultRelInfo = mtstate->resultRelInfo; > foreach(l, node->returningLists) > { > List *rlist = (List *) lfirst(l); > > resultRelInfo->ri_returningList = rlist; > resultRelInfo->ri_projectReturning = > ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps, > resultRelInfo->ri_RelationDesc->rd_att); > resultRelInfo++; > } > } > Right. But I think Tom is right this should initialize ri_BatchSize for all the resultRelInfo elements, not just the first one. Per the attached patch, which resolves the issue both on x86_64 and armv7l for me. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Вложения
On 1/21/21 2:22 AM, Tom Lane wrote: > Tomas Vondra <tomas.vondra@enterprisedb.com> writes: >> I may be wrong, but the most likely explanation seems to be this is due >> to the junk filter initialization, which simply moves past the end of >> the mtstate->resultRelInfo array. > > resultRelInfo is certainly pointing at garbage at that point. > Yup. It's pretty amazing the x86 machines seem to be mostly OK with it. >> It kinda seems the GetForeignModifyBatchSize call should happen before >> that block. The attached patch fixes this for me (i.e. regression tests >> pass with no valgrind reports. > >> Or did I get that wrong? > > Don't we need to initialize ri_BatchSize for *each* resultrelinfo, > not merely the first one? That is, this new code needs to be > somewhere inside a loop over the result rels. > Yeah, I think you're right. That's an embarrassing oversight :-( regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > Right. But I think Tom is right this should initialize ri_BatchSize for all the > resultRelInfo elements, not just the first one. Per the attached patch, which > resolves the issue both on x86_64 and armv7l for me. I think your patch is perfect in the sense that it's ready for the future multi-target DML support. +1 Just for learning, could anyone tell me what this loop is for? I thought current Postgres's DML supports a single target table, so it's enough to handle the first element of mtstate->resultRelInfo. In that sense, Amit-san and I agreed that we don't put the if block in the for loop yet. Regards Takayuki Tsunakawa
On Thu, Jan 21, 2021 at 10:42 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 1/21/21 2:24 AM, Amit Langote wrote: > > On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra > > <tomas.vondra@enterprisedb.com> wrote: > >> On 1/21/21 1:17 AM, Zhihong Yu wrote: > >>> Hi, > >>> The assignment to resultRelInfo is done when junk_filter_needed is true: > >>> > >>> if (junk_filter_needed) > >>> { > >>> resultRelInfo = mtstate->resultRelInfo; > >>> > >>> Should the code for determining batch size access mtstate->resultRelInfo > >>> directly ? > >>> > >> > >> IMO the issue is that code iterates over all plans and moves to the next > >> for each one: > >> > >> resultRelInfo++; > >> > >> so it ends up pointing past the last element, hence the failures. So > >> yeah, either the code needs to move before the loop (per my patch), or > >> we need to access mtstate->resultRelInfo directly. > > > > Accessing mtstate->resultRelInfo directly would do. The only > > constraint on where this block should be placed is that > > ri_projectReturning must be valid as of calling > > GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread. > > So, after this block in ExecInitModifyTable: > > > > /* > > * Initialize RETURNING projections if needed. > > */ > > if (node->returningLists) > > { > > .... > > /* > > * Build a projection for each result rel. > > */ > > resultRelInfo = mtstate->resultRelInfo; > > foreach(l, node->returningLists) > > { > > List *rlist = (List *) lfirst(l); > > > > resultRelInfo->ri_returningList = rlist; > > resultRelInfo->ri_projectReturning = > > ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps, > > resultRelInfo->ri_RelationDesc->rd_att); > > resultRelInfo++; > > } > > } > > > > Right. But I think Tom is right this should initialize ri_BatchSize for > all the resultRelInfo elements, not just the first one. Per the attached > patch, which resolves the issue both on x86_64 and armv7l for me. +1 in general. 
To avoid looping uselessly in the case of UPDATE/DELETE where batching can't be used today, I'd suggest putting if (operation == CMD_INSERT) around the loop. -- Amit Langote EDB: http://www.enterprisedb.com
From: Zhihong Yu <zyu@yugabyte.com>
> My first name is Zhihong.
> You can call me Ted if you want to save some typing :-)
Ah, I'm very sorry. Thank you, let me call you Ted then. That can't be mistaken.
Regards
Takayuki Tsunakawa
On 1/21/21 2:53 AM, Amit Langote wrote: > On Thu, Jan 21, 2021 at 10:42 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 1/21/21 2:24 AM, Amit Langote wrote: >>> On Thu, Jan 21, 2021 at 9:56 AM Tomas Vondra >>> <tomas.vondra@enterprisedb.com> wrote: >>>> On 1/21/21 1:17 AM, Zhihong Yu wrote: >>>>> Hi, >>>>> The assignment to resultRelInfo is done when junk_filter_needed is true: >>>>> >>>>> if (junk_filter_needed) >>>>> { >>>>> resultRelInfo = mtstate->resultRelInfo; >>>>> >>>>> Should the code for determining batch size access mtstate->resultRelInfo >>>>> directly ? >>>>> >>>> >>>> IMO the issue is that code iterates over all plans and moves to the next >>>> for each one: >>>> >>>> resultRelInfo++; >>>> >>>> so it ends up pointing past the last element, hence the failures. So >>>> yeah, either the code needs to move before the loop (per my patch), or >>>> we need to access mtstate->resultRelInfo directly. >>> >>> Accessing mtstate->resultRelInfo directly would do. The only >>> constraint on where this block should be placed is that >>> ri_projectReturning must be valid as of calling >>> GetForeignModifyBatchSize(), as Tsunakawa-san pointed out upthread. >>> So, after this block in ExecInitModifyTable: >>> >>> /* >>> * Initialize RETURNING projections if needed. >>> */ >>> if (node->returningLists) >>> { >>> .... >>> /* >>> * Build a projection for each result rel. >>> */ >>> resultRelInfo = mtstate->resultRelInfo; >>> foreach(l, node->returningLists) >>> { >>> List *rlist = (List *) lfirst(l); >>> >>> resultRelInfo->ri_returningList = rlist; >>> resultRelInfo->ri_projectReturning = >>> ExecBuildProjectionInfo(rlist, econtext, slot, &mtstate->ps, >>> resultRelInfo->ri_RelationDesc->rd_att); >>> resultRelInfo++; >>> } >>> } >>> >> >> Right. But I think Tom is right this should initialize ri_BatchSize for >> all the resultRelInfo elements, not just the first one. Per the attached >> patch, which resolves the issue both on x86_64 and armv7l for me. 
> > +1 in general. To avoid looping uselessly in the case of > UPDATE/DELETE where batching can't be used today, I'd suggest putting > if (operation == CMD_INSERT) around the loop. > Right, that's pretty much what I ended up doing (without the CMD_INSERT check it'd add batching info to explain for updates too, for example). I'll do a bit more testing on the attached patch, but I think that's the right fix to push. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
"tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com> writes: > Just for learning, could anyone tell me what this loop is for? I thought current Postgres's DML supports a single target table, so it's enough to handle the first element of mtstate->resultRelInfo. The "single target table" could be partitioned, in which case there'll be multiple resultrelinfos, some of which could be foreign tables. regards, tom lane
From: Tomas Vondra <tomas.vondra@enterprisedb.com> > Right, that's pretty much what I ended up doing (without the CMD_INSERT > check it'd add batching info to explain for updates too, for example). > I'll do a bit more testing on the attached patch, but I think that's the right fix to > push. Thanks to the outer check for operation == CMD_INSERT, the inner one became unnecessary. Regards Takayuki Tsunakawa
From: Tom Lane <tgl@sss.pgh.pa.us> > The "single target table" could be partitioned, in which case there'll be > multiple resultrelinfos, some of which could be foreign tables. Thank you. I thought so at first, but later I found that ExecInsert() only handles one element in mtstate->resultRelInfo. So I thought just the first element is processed in INSERT case. I understood (guessed) the for loop is for UPDATE and DELETE. EXPLAIN without ANALYZE UPDATE/DELETE on a partitioned tableshows partitions, which would be mtstate->resultRelInfo. EXPLAIN on INSERT doesn't show partitions, so I think INSERTwill find relevant partitions based on input rows during execution. Regards Takayuki Tsunakawa
On 1/21/21 3:09 AM, tsunakawa.takay@fujitsu.com wrote: > From: Tomas Vondra <tomas.vondra@enterprisedb.com> >> Right, that's pretty much what I ended up doing (without the CMD_INSERT >> check it'd add batching info to explain for updates too, for example). >> I'll do a bit more testing on the attached patch, but I think that's the right fix to >> push. > > Thanks to the outer check for operation == CMD_INSERT, the inner one became unnecessary. > Right. I've pushed the fix, hopefully buildfarm will get happy again. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
OK, pushed after a little bit of additional polishing (mostly comments).
Thanks everyone!
is:
TupleTableSlot **
ExecForeignBatchInsert(EState *estate,
ResultRelInfo *rinfo,
TupleTableSlot **slots,
TupleTableSlot *planSlots,
int *numSlots);
should be:
TupleTableSlot **
ExecForeignBatchInsert(EState *estate,
ResultRelInfo *rinfo,
TupleTableSlot **slots,
TupleTableSlot **planSlots,
int *numSlots);
(Trivial patch attached).
--
Вложения
On Thu, Jan 21, 2021 at 11:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 1/21/21 3:09 AM, tsunakawa.takay@fujitsu.com wrote: > > From: Tomas Vondra <tomas.vondra@enterprisedb.com> > >> Right, that's pretty much what I ended up doing (without the CMD_INSERT > >> check it'd add batching info to explain for updates too, for example). > >> I'll do a bit more testing on the attached patch, but I think that's the right fix to > >> push. > > > > Thanks to the outer check for operation == CMD_INSERT, the inner one became unnecessary. > > > > Right. I've pushed the fix, hopefully buildfarm will get happy again. I was looking at this and it looks like we've got a problematic case where postgresGetForeignModifyBatchSize() is called from ExecInitRoutingInfo(). That case is when the insert is performed as part of a cross-partition update of a partitioned table containing postgres_fdw foreign table partitions. Because we don't check the operation in ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such inserts attempt to use batching. However the ResultRelInfo may be one for the original update operation, so ri_FdwState contains a PgFdwModifyState with batch_size set to 0, because updates don't support batching. As things stand now, postgresGetForeignModifyBatchSize() simply returns that, tripping the following Assert in the caller. 
Assert(partRelInfo->ri_BatchSize >= 1); Use this example to see the crash: create table p (a int) partition by list (a); create table p1 (like p); create extension postgres_fdw; create server lb foreign data wrapper postgres_fdw ; create user mapping for current_user server lb; create foreign table fp1 (a int) server lb options (table_name 'p1'); alter table p attach partition fp1 for values in (1); create or replace function report_trig_details() returns trigger as $$ begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op = 'DELETE' then return old; end if; return new; end; $$ language plpgsql; create trigger trig before update on fp1 for each row execute function report_trig_details(); create table p2 partition of p for values in (2); insert into p values (2); update p set a = 1; -- crashes So we let's check mtstate->operation == CMD_INSERT in ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize() in cross-update situations where mtstate->operation would be CMD_UPDATE. I've attached a patch. -- Amit Langote EDB: http://www.enterprisedb.com
Вложения
On Thu, Jan 21, 2021 at 11:36 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> On 1/21/21 3:09 AM, tsunakawa.takay@fujitsu.com wrote:
> > From: Tomas Vondra <tomas.vondra@enterprisedb.com>
> >> Right, that's pretty much what I ended up doing (without the CMD_INSERT
> >> check it'd add batching info to explain for updates too, for example).
> >> I'll do a bit more testing on the attached patch, but I think that's the right fix to
> >> push.
> >
> > Thanks to the outer check for operation == CMD_INSERT, the inner one became unnecessary.
> >
>
> Right. I've pushed the fix, hopefully buildfarm will get happy again.
I was looking at this and it looks like we've got a problematic case
where postgresGetForeignModifyBatchSize() is called from
ExecInitRoutingInfo().
That case is when the insert is performed as part of a cross-partition
update of a partitioned table containing postgres_fdw foreign table
partitions. Because we don't check the operation in
ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such
inserts attempt to use batching. However the ResultRelInfo may be one
for the original update operation, so ri_FdwState contains a
PgFdwModifyState with batch_size set to 0, because updates don't
support batching. As things stand now,
postgresGetForeignModifyBatchSize() simply returns that, tripping the
following Assert in the caller.
Assert(partRelInfo->ri_BatchSize >= 1);
Use this example to see the crash:
create table p (a int) partition by list (a);
create table p1 (like p);
create extension postgres_fdw;
create server lb foreign data wrapper postgres_fdw ;
create user mapping for current_user server lb;
create foreign table fp1 (a int) server lb options (table_name 'p1');
alter table p attach partition fp1 for values in (1);
create or replace function report_trig_details() returns trigger as $$
begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op =
'DELETE' then return old; end if; return new; end; $$ language
plpgsql;
create trigger trig before update on fp1 for each row execute function
report_trig_details();
create table p2 partition of p for values in (2);
insert into p values (2);
update p set a = 1; -- crashes
So we let's check mtstate->operation == CMD_INSERT in
ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize()
in cross-update situations where mtstate->operation would be
CMD_UPDATE.
I've attached a patch.
--
Amit Langote
EDB: http://www.enterprisedb.com
On 1/23/21 9:31 AM, Amit Langote wrote: > On Thu, Jan 21, 2021 at 11:36 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 1/21/21 3:09 AM, tsunakawa.takay@fujitsu.com wrote: >>> From: Tomas Vondra <tomas.vondra@enterprisedb.com> >>>> Right, that's pretty much what I ended up doing (without the CMD_INSERT >>>> check it'd add batching info to explain for updates too, for example). >>>> I'll do a bit more testing on the attached patch, but I think that's the right fix to >>>> push. >>> >>> Thanks to the outer check for operation == CMD_INSERT, the inner one became unnecessary. >>> >> >> Right. I've pushed the fix, hopefully buildfarm will get happy again. > > I was looking at this and it looks like we've got a problematic case > where postgresGetForeignModifyBatchSize() is called from > ExecInitRoutingInfo(). > > That case is when the insert is performed as part of a cross-partition > update of a partitioned table containing postgres_fdw foreign table > partitions. Because we don't check the operation in > ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such > inserts attempt to use batching. However the ResultRelInfo may be one > for the original update operation, so ri_FdwState contains a > PgFdwModifyState with batch_size set to 0, because updates don't > support batching. As things stand now, > postgresGetForeignModifyBatchSize() simply returns that, tripping the > following Assert in the caller. 
> > Assert(partRelInfo->ri_BatchSize >= 1); > > Use this example to see the crash: > > create table p (a int) partition by list (a); > create table p1 (like p); > create extension postgres_fdw; > create server lb foreign data wrapper postgres_fdw ; > create user mapping for current_user server lb; > create foreign table fp1 (a int) server lb options (table_name 'p1'); > alter table p attach partition fp1 for values in (1); > create or replace function report_trig_details() returns trigger as $$ > begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op = > 'DELETE' then return old; end if; return new; end; $$ language > plpgsql; > create trigger trig before update on fp1 for each row execute function > report_trig_details(); > create table p2 partition of p for values in (2); > insert into p values (2); > update p set a = 1; -- crashes > > So we let's check mtstate->operation == CMD_INSERT in > ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize() > in cross-update situations where mtstate->operation would be > CMD_UPDATE. > > I've attached a patch. > Thanks for catching this. I think it'd be good if the fix included a regression test. The example seems like a good starting point, not sure if it can be simplified further. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jan 24, 2021 at 2:17 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 1/23/21 9:31 AM, Amit Langote wrote: > > I was looking at this and it looks like we've got a problematic case > > where postgresGetForeignModifyBatchSize() is called from > > ExecInitRoutingInfo(). > > > > That case is when the insert is performed as part of a cross-partition > > update of a partitioned table containing postgres_fdw foreign table > > partitions. Because we don't check the operation in > > ExecInitRoutingInfo() when calling GetForeignModifyBatchSize(), such > > inserts attempt to use batching. However the ResultRelInfo may be one > > for the original update operation, so ri_FdwState contains a > > PgFdwModifyState with batch_size set to 0, because updates don't > > support batching. As things stand now, > > postgresGetForeignModifyBatchSize() simply returns that, tripping the > > following Assert in the caller. > > > > Assert(partRelInfo->ri_BatchSize >= 1); > > > > Use this example to see the crash: > > > > create table p (a int) partition by list (a); > > create table p1 (like p); > > create extension postgres_fdw; > > create server lb foreign data wrapper postgres_fdw ; > > create user mapping for current_user server lb; > > create foreign table fp1 (a int) server lb options (table_name 'p1'); > > alter table p attach partition fp1 for values in (1); > > create or replace function report_trig_details() returns trigger as $$ > > begin raise notice '% % on %', tg_when, tg_op, tg_relname; if tg_op = > > 'DELETE' then return old; end if; return new; end; $$ language > > plpgsql; > > create trigger trig before update on fp1 for each row execute function > > report_trig_details(); > > create table p2 partition of p for values in (2); > > insert into p values (2); > > update p set a = 1; -- crashes > > > > So we let's check mtstate->operation == CMD_INSERT in > > ExecInitRoutingInfo() to prevent calling GetForeignModifyBatchSize() > > in cross-update situations 
where mtstate->operation would be > > CMD_UPDATE. > > > > I've attached a patch. > > Thanks for catching this. I think it'd be good if the fix included a > regression test. The example seems like a good starting point, not sure > if it can be simplified further. Yes, it can be simplified by using a local join to prevent the update of the foreign partition from being pushed to the remote server, for which my example in the previous email used a local trigger. Note that the update of the foreign partition to be done locally is a prerequisite for this bug to occur. I've added that simplified test case in the attached updated patch. -- Amit Langote EDB: http://www.enterprisedb.com
Вложения
From: Amit Langote <amitlangote09@gmail.com> > Yes, it can be simplified by using a local join to prevent the update of the foreign > partition from being pushed to the remote server, for which my example in the > previous email used a local trigger. Note that the update of the foreign > partition to be done locally is a prerequisite for this bug to occur. Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway. Good catch (and my bad miss.) + PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ? + (PgFdwModifyState *) resultRelInfo->ri_FdwState : + NULL; This can be written as: + PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState; Regards Takayuki Tsunakawa
Hi2021年1月21日(木) 8:00 Tomas Vondra <tomas.vondra@enterprisedb.com>:OK, pushed after a little bit of additional polishing (mostly comments).
Thanks everyone!There's a minor typo in the doc's version of the ExecForeignBatchInsert() declaration;
is:
TupleTableSlot **
ExecForeignBatchInsert(EState *estate,
ResultRelInfo *rinfo,
TupleTableSlot **slots,
TupleTableSlot *planSlots,
int *numSlots);
should be:
TupleTableSlot **
ExecForeignBatchInsert(EState *estate,
ResultRelInfo *rinfo,
TupleTableSlot **slots,
TupleTableSlot **planSlots,
int *numSlots);
(Trivial patch attached).RegardsIan Barwick
--EnterpriseDB: https://www.enterprisedb.com
--
Hi2021年1月21日(木) 8:00 Tomas Vondra <tomas.vondra@enterprisedb.com>:OK, pushed after a little bit of additional polishing (mostly comments).
Thanks everyone!There's a minor typo in the doc's version of the ExecForeignBatchInsert() declaration;
is:
TupleTableSlot **
ExecForeignBatchInsert(EState *estate,
ResultRelInfo *rinfo,
TupleTableSlot **slots,
TupleTableSlot *planSlots,
int *numSlots);
should be:
TupleTableSlot **
ExecForeignBatchInsert(EState *estate,
ResultRelInfo *rinfo,
TupleTableSlot **slots,
TupleTableSlot **planSlots,
int *numSlots);
(Trivial patch attached).
Tsunakwa-san, On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.takay@fujitsu.com <tsunakawa.takay@fujitsu.com> wrote: > From: Amit Langote <amitlangote09@gmail.com> > > Yes, it can be simplified by using a local join to prevent the update of the foreign > > partition from being pushed to the remote server, for which my example in the > > previous email used a local trigger. Note that the update of the foreign > > partition to be done locally is a prerequisite for this bug to occur. > > Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway. Good catch (and my bad miss.) It appears I had missed your reply, sorry. > + PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ? > + (PgFdwModifyState *) resultRelInfo->ri_FdwState : > + NULL; > > This can be written as: > > + PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState; Facepalm, yes. Patch updated. Thanks for the review. -- Amit Langote EDB: http://www.enterprisedb.com
Вложения
From: Amit Langote <amitlangote09@gmail.com> > It appears I had missed your reply, sorry. > > > + PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ? > > + > (PgFdwModifyState *) resultRelInfo->ri_FdwState : > > + NULL; > > > > This can be written as: > > > > + PgFdwModifyState *fmstate = (PgFdwModifyState *) > > + resultRelInfo->ri_FdwState; > > Facepalm, yes. > > Patch updated. Thanks for the review. Thank you for picking this up. Regards Takayuki Tsunakawa
On 2/5/21 2:55 AM, Ian Lawrence Barwick wrote: > ... > > There's a minor typo in the doc's version of the > ExecForeignBatchInsert() declaration; > is: > > TupleTableSlot ** > ExecForeignBatchInsert(EState *estate, > ResultRelInfo *rinfo, > TupleTableSlot **slots, > TupleTableSlot *planSlots, > int *numSlots); > > should be: > > TupleTableSlot ** > ExecForeignBatchInsert(EState *estate, > ResultRelInfo *rinfo, > TupleTableSlot **slots, > TupleTableSlot **planSlots, > int *numSlots); > > (Trivial patch attached). > > > Forgot to mention the relevant doc link: > > > https://www.postgresql.org/docs/devel/fdw-callbacks.html#FDW-CALLBACKS-UPDATE > <https://www.postgresql.org/docs/devel/fdw-callbacks.html#FDW-CALLBACKS-UPDATE> > Thanks, I'll get this fixed. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2/5/21 3:52 AM, Amit Langote wrote: > Tsunakwa-san, > > On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.takay@fujitsu.com > <tsunakawa.takay@fujitsu.com> wrote: >> From: Amit Langote <amitlangote09@gmail.com> >>> Yes, it can be simplified by using a local join to prevent the update of the foreign >>> partition from being pushed to the remote server, for which my example in the >>> previous email used a local trigger. Note that the update of the foreign >>> partition to be done locally is a prerequisite for this bug to occur. >> >> Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway. Good catch (and my bad miss.) > > It appears I had missed your reply, sorry. > >> + PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ? >> + (PgFdwModifyState *) resultRelInfo->ri_FdwState : >> + NULL; >> >> This can be written as: >> >> + PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState; > > Facepalm, yes. > > Patch updated. Thanks for the review. > Thanks for the patch, it seems fine to me. I wonder it the commit message needs some tweaks, though. At the moment it says: Prevent FDW insert batching during cross-partition updates but what the patch seems to be doing is simply initializing the info only for CMD_INSERT operations. Which does the trick, but it affects everything, i.e. all updates, no? Not just cross-partition updates. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 16, 2021 at 1:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 2/5/21 3:52 AM, Amit Langote wrote: > > Tsunakwa-san, > > > > On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.takay@fujitsu.com > > <tsunakawa.takay@fujitsu.com> wrote: > >> From: Amit Langote <amitlangote09@gmail.com> > >>> Yes, it can be simplified by using a local join to prevent the update of the foreign > >>> partition from being pushed to the remote server, for which my example in the > >>> previous email used a local trigger. Note that the update of the foreign > >>> partition to be done locally is a prerequisite for this bug to occur. > >> > >> Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway. Good catch (and my bad miss.) > > > > It appears I had missed your reply, sorry. > > > >> + PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ? > >> + (PgFdwModifyState *) resultRelInfo->ri_FdwState : > >> + NULL; > >> > >> This can be written as: > >> > >> + PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState; > > > > Facepalm, yes. > > > > Patch updated. Thanks for the review. > > > > Thanks for the patch, it seems fine to me. Thanks for checking. > I wonder it the commit > message needs some tweaks, though. At the moment it says: > > Prevent FDW insert batching during cross-partition updates > > but what the patch seems to be doing is simply initializing the info > only for CMD_INSERT operations. Which does the trick, but it affects > everything, i.e. all updates, no? Not just cross-partition updates. You're right. Please check the message in the updated patch. -- Amit Langote EDB: http://www.enterprisedb.com
Вложения
On 2/16/21 10:25 AM, Amit Langote wrote: > On Tue, Feb 16, 2021 at 1:36 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 2/5/21 3:52 AM, Amit Langote wrote: >>> Tsunakwa-san, >>> >>> On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.takay@fujitsu.com >>> <tsunakawa.takay@fujitsu.com> wrote: >>>> From: Amit Langote <amitlangote09@gmail.com> >>>>> Yes, it can be simplified by using a local join to prevent the update of the foreign >>>>> partition from being pushed to the remote server, for which my example in the >>>>> previous email used a local trigger. Note that the update of the foreign >>>>> partition to be done locally is a prerequisite for this bug to occur. >>>> >>>> Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it partway. Good catch (and my bad miss.) >>> >>> It appears I had missed your reply, sorry. >>> >>>> + PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ? >>>> + (PgFdwModifyState *) resultRelInfo->ri_FdwState : >>>> + NULL; >>>> >>>> This can be written as: >>>> >>>> + PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState; >>> >>> Facepalm, yes. >>> >>> Patch updated. Thanks for the review. >>> >> >> Thanks for the patch, it seems fine to me. > > Thanks for checking. > >> I wonder it the commit >> message needs some tweaks, though. At the moment it says: >> >> Prevent FDW insert batching during cross-partition updates >> >> but what the patch seems to be doing is simply initializing the info >> only for CMD_INSERT operations. Which does the trick, but it affects >> everything, i.e. all updates, no? Not just cross-partition updates. > > You're right. Please check the message in the updated patch. > Thanks. I'm not sure I understand what "FDW may not be able to handle both the original update operation and the batched insert operation being performed at the same time" means. 
I mean, if we translate the UPDATE into DELETE+INSERT, then we don't run both the update and insert at the same time, right? What exactly is the problem with allowing batching for inserts in cross-partition updates? On a closer look, it seems the problem actually lies in a small inconsistency between create_foreign_modify and ExecInitRoutingInfo. The former only set batch_size for CMD_INSERT while the latter called the BatchSize() for all operations, expecting >= 1 result. So we may either relax create_foreign_modify and set batch_size for all DML, or make ExecInitRoutingInfo stricter (which is what the patches here do). Is there a reason not to do the first thing, allowing batching of inserts during cross-partition updates? I tried to do that, but it dawned on me that we can't mix batched and un-batched operations, e.g. DELETE + INSERT, because that'd break the order of execution, leading to bogus results in case the same row is modified repeatedly, etc. Am I getting this right? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 17, 2021 at 5:46 PM Amit Langote <amitlangote09@gmail.com> wrote: > On Wed, Feb 17, 2021 at 12:04 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > On 2/16/21 10:25 AM, Amit Langote wrote: > > > On Tue, Feb 16, 2021 at 1:36 AM Tomas Vondra > > >> Thanks for the patch, it seems fine to me. > > > > > > Thanks for checking. > > > > > >> I wonder it the commit > > >> message needs some tweaks, though. At the moment it says: > > >> > > >> Prevent FDW insert batching during cross-partition updates > > >> > > >> but what the patch seems to be doing is simply initializing the info > > >> only for CMD_INSERT operations. Which does the trick, but it affects > > >> everything, i.e. all updates, no? Not just cross-partition updates. > > > > > > You're right. Please check the message in the updated patch. > > > > Thanks. I'm not sure I understand what "FDW may not be able to handle > > both the original update operation and the batched insert operation > > being performed at the same time" means. I mean, if we translate the > > UPDATE into DELETE+INSERT, then we don't run both the update and insert > > at the same time, right? What exactly is the problem with allowing > > batching for inserts in cross-partition updates? > > Sorry, I hadn't shared enough details of my investigations when I > originally ran into this. Such as that I had considered implementing > the use of batching for these inserts too but had given up. > > Now that you mention it, I think I gave a less convincing reason for > why we should avoid doing it at all. Maybe it would have been more > right to say that it is the core code, not necessarily the FDWs, that > currently fails to deal with the use of batching by the insert > component of a cross-partition update. Those failures could be > addressed as I'll describe below. 
> > For postgres_fdw, postgresGetForeignModifyBatchSize() could be taught > to simply use the PgFdwModifyTable that is installed to handle the > insert component of a cross-partition update (one can get that one via > aux_fmstate field of the original PgFdwModifyState). However, even > though that's fine for postgres_fdw to do, what worries (had worried) > me is that it also results in scribbling on ri_BatchSize that the core > code may see to determine what to do with a particular tuple, and I > just have to hope that nodeModifyTable.c doesn't end up doing anything > unwarranted with the original update based on seeing a non-zero > ri_BatchSize. AFAICS, we are fine on that front. > > That said, there are some deficiencies in the code that have to be > addressed before we can let postgres_fdw do as mentioned above. For > example, the code in ExecModifyTable() that runs after breaking out of > the loop to insert any remaining batched tuples appears to miss the > tuples batched by such inserts. Apparently, that is because the > ResultRelInfos used by those inserts are not present in > es_tuple_routing_result_relations. Turns out I had forgotten that > execPartition.c doesn't add the ResultRelInfos to that list if they > are made by ExecInitModifyTable() for the original update operation > and simply reused by ExecFindPartition() when tuples were routed to > those partitions. It can be "fixed" by reverting to the original > design in Tsunakawa-san's patch where the tuple routing result > relations were obtained from the PartitionTupleRouting data structure, > which fortunately stores all tuple routing result relations. (Sorry, > I gave wrong advice in [1] in retrospect.) > > > On a closer look, it seems the problem actually lies in a small > > inconsistency between create_foreign_modify and ExecInitRoutingInfo. The > > former only set batch_size for CMD_INSERT while the latter called the > > BatchSize() for all operations, expecting >= 1 result. 
So we may either > > relax create_foreign_modify and set batch_size for all DML, or make > > ExecInitRoutingInfo stricter (which is what the patches here do). > > I think we should be fine if we make > postgresGetForeignModifyBatchSize() use the correct PgFdwModifyState > as described above. We can be sure that we are not mixing the > information used by the batched insert with that of the original > unbatched update. > > > Is there a reason not to do the first thing, allowing batching of > > inserts during cross-partition updates? I tried to do that, but it > > dawned on me that we can't mix batched and un-batched operations, e.g. > > DELETE + INSERT, because that'd break the order of execution, leading to > > bogus results in case the same row is modified repeatedly, etc. > > Actually, postgres_fdw only supports moving a row into a partition (as > part of a cross-partition update that is) if it has already finished > performing any updates on it. So there is no worry of rows that are > moved into a partition subsequently getting updated due to the > original command. > > The attached patch implements the changes necessary to make these > inserts use batching too. > > [1] https://www.postgresql.org/message-id/CA%2BHiwqEbnhwVJMsukTP-S9Kv1ynC7Da3yuqSPZC0Y7oWWOwoHQ%40mail.gmail.com Oops, I had mistakenly not hit "Reply All". Attaching the patch again. -- Amit Langote EDB: http://www.enterprisedb.com
Вложения
On 2/17/21 9:51 AM, Amit Langote wrote: > On Wed, Feb 17, 2021 at 5:46 PM Amit Langote <amitlangote09@gmail.com> wrote: >> On Wed, Feb 17, 2021 at 12:04 AM Tomas Vondra >> <tomas.vondra@enterprisedb.com> wrote: >>> On 2/16/21 10:25 AM, Amit Langote wrote: >>>> On Tue, Feb 16, 2021 at 1:36 AM Tomas Vondra >>>>> Thanks for the patch, it seems fine to me. >>>> >>>> Thanks for checking. >>>> >>>>> I wonder it the commit >>>>> message needs some tweaks, though. At the moment it says: >>>>> >>>>> Prevent FDW insert batching during cross-partition updates >>>>> >>>>> but what the patch seems to be doing is simply initializing the info >>>>> only for CMD_INSERT operations. Which does the trick, but it affects >>>>> everything, i.e. all updates, no? Not just cross-partition updates. >>>> >>>> You're right. Please check the message in the updated patch. >>> >>> Thanks. I'm not sure I understand what "FDW may not be able to handle >>> both the original update operation and the batched insert operation >>> being performed at the same time" means. I mean, if we translate the >>> UPDATE into DELETE+INSERT, then we don't run both the update and insert >>> at the same time, right? What exactly is the problem with allowing >>> batching for inserts in cross-partition updates? >> >> Sorry, I hadn't shared enough details of my investigations when I >> originally ran into this. Such as that I had considered implementing >> the use of batching for these inserts too but had given up. >> >> Now that you mention it, I think I gave a less convincing reason for >> why we should avoid doing it at all. Maybe it would have been more >> right to say that it is the core code, not necessarily the FDWs, that >> currently fails to deal with the use of batching by the insert >> component of a cross-partition update. Those failures could be >> addressed as I'll describe below. 
>> >> For postgres_fdw, postgresGetForeignModifyBatchSize() could be taught >> to simply use the PgFdwModifyTable that is installed to handle the >> insert component of a cross-partition update (one can get that one via >> aux_fmstate field of the original PgFdwModifyState). However, even >> though that's fine for postgres_fdw to do, what worries (had worried) >> me is that it also results in scribbling on ri_BatchSize that the core >> code may see to determine what to do with a particular tuple, and I >> just have to hope that nodeModifyTable.c doesn't end up doing anything >> unwarranted with the original update based on seeing a non-zero >> ri_BatchSize. AFAICS, we are fine on that front. >> >> That said, there are some deficiencies in the code that have to be >> addressed before we can let postgres_fdw do as mentioned above. For >> example, the code in ExecModifyTable() that runs after breaking out of >> the loop to insert any remaining batched tuples appears to miss the >> tuples batched by such inserts. Apparently, that is because the >> ResultRelInfos used by those inserts are not present in >> es_tuple_routing_result_relations. Turns out I had forgotten that >> execPartition.c doesn't add the ResultRelInfos to that list if they >> are made by ExecInitModifyTable() for the original update operation >> and simply reused by ExecFindPartition() when tuples were routed to >> those partitions. It can be "fixed" by reverting to the original >> design in Tsunakawa-san's patch where the tuple routing result >> relations were obtained from the PartitionTupleRouting data structure, >> which fortunately stores all tuple routing result relations. (Sorry, >> I gave wrong advice in [1] in retrospect.) >> >>> On a closer look, it seems the problem actually lies in a small >>> inconsistency between create_foreign_modify and ExecInitRoutingInfo. 
The >>> former only set batch_size for CMD_INSERT while the latter called the >>> BatchSize() for all operations, expecting >= 1 result. So we may either >>> relax create_foreign_modify and set batch_size for all DML, or make >>> ExecInitRoutingInfo stricter (which is what the patches here do). >> >> I think we should be fine if we make >> postgresGetForeignModifyBatchSize() use the correct PgFdwModifyState >> as described above. We can be sure that we are not mixing the >> information used by the batched insert with that of the original >> unbatched update. >> >>> Is there a reason not to do the first thing, allowing batching of >>> inserts during cross-partition updates? I tried to do that, but it >>> dawned on me that we can't mix batched and un-batched operations, e.g. >>> DELETE + INSERT, because that'd break the order of execution, leading to >>> bogus results in case the same row is modified repeatedly, etc. >> >> Actually, postgres_fdw only supports moving a row into a partition (as >> part of a cross-partition update that is) if it has already finished >> performing any updates on it. So there is no worry of rows that are >> moved into a partition subsequently getting updated due to the >> original command. >> >> The attached patch implements the changes necessary to make these >> inserts use batching too. >> >> [1] https://www.postgresql.org/message-id/CA%2BHiwqEbnhwVJMsukTP-S9Kv1ynC7Da3yuqSPZC0Y7oWWOwoHQ%40mail.gmail.com > > Oops, I had mistakenly not hit "Reply All". Attaching the patch again. > Thanks. The patch seems reasonable, but it's a bit too large for a fix, so I'll go ahead and push one of the previous fixes restricting batching to plain INSERT commands. But this seems useful, so I suggest adding it to the next commitfest. One thing that surprised me is that we only move the rows *to* the foreign partition, not from it (even on pg13, i.e. before the batching etc.). 
I mean, using the example you posted earlier, with one foreign and one local partition, consider this:

delete from p;
insert into p values (2);

test=# select * from p2;
 a
---
 2
(1 row)

test=# update p set a = 1;
UPDATE 1

test=# select * from p1;
 a
---
 1
(1 row)

OK, so it was moved to the foreign partition, which is for rows with value in (1). So far so good. Let's do another update:

test=# update p set a = 2;
UPDATE 1
test=# select * from p1;
 a
---
 2
(1 row)

So now the foreign partition contains value (2), which is however wrong with respect to the partitioning rules - this should be in p2, the local partition. This however causes a pretty annoying issue:

test=# explain analyze select * from p where a = 2;

                           QUERY PLAN
---------------------------------------------------------------
 Seq Scan on p2 p  (cost=0.00..41.88 rows=13 width=4)
                   (actual time=0.024..0.028 rows=0 loops=1)
   Filter: (a = 2)
 Planning Time: 0.355 ms
 Execution Time: 0.089 ms
(4 rows)

That is, we fail to find the row, because we eliminate the partition.

Now, maybe this is expected, but it seems like a rather mind-boggling violation of the POLA principle. I've checked if postgres_fdw mentions this somewhere, but all I see about row movement is this:

    Note also that postgres_fdw supports row movement invoked by UPDATE
    statements executed on partitioned tables, but it currently does not
    handle the case where a remote partition chosen to insert a moved row
    into is also an UPDATE target partition that will be updated later.

and if I understand that correctly, that's about something different.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2/17/21 8:36 PM, Tomas Vondra wrote:
>
> ...
>
> Thanks. The patch seems reasonable, but it's a bit too large for a fix,
> so I'll go ahead and push one of the previous fixes restricting batching
> to plain INSERT commands. But this seems useful, so I suggest adding it
> to the next commitfest.

I've pushed the v4 fix, adding the CMD_INSERT check to execPartition.

I think we may need to revise the relationship between the FDW and the places that (may) call GetForeignModifyBatchSize. Currently, these places need to be quite well synchronized - in a way, the issue was (partially) due to postgres_fdw and core not agreeing on the details.

In particular, create_foreign_modify sets batch_size for CMD_INSERT and leaves it 0 otherwise. So GetForeignModifyBatchSize() returned 0 later, triggering an assert.

It's probably better to require GetForeignModifyBatchSize() to always return a valid batch size (>= 1). If batching is not supported, just return 1.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 18, 2021 at 8:38 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 2/17/21 8:36 PM, Tomas Vondra wrote:
> > Thanks. The patch seems reasonable, but it's a bit too large for a fix,
> > so I'll go ahead and push one of the previous fixes restricting batching
> > to plain INSERT commands. But this seems useful, so I suggest adding it
> > to the next commitfest.
>
> I've pushed the v4 fix, adding the CMD_INSERT to execPartition.
>
> I think we may need to revise the relationship between FDW and places
> that (may) call GetForeignModifyBatchSize. Currently, these places need
> to be quite well synchronized - in a way, the issue was (partially) due
> to postgres_fdw and core not agreeing on the details.

Agreed.

> In particular, create_foreign_modify sets batch_size for CMD_INSERT and
> leaves it 0 otherwise. So GetForeignModifyBatchSize() returned 0 later,
> triggering an assert.
>
> It's probably better to require GetForeignModifyBatchSize() to always
> return a valid batch size (>= 1). If batching is not supported, just
> return 1.

That makes sense. How about the attached?

--
Amit Langote
EDB: http://www.enterprisedb.com
On Thu, Feb 18, 2021 at 4:36 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 2/17/21 9:51 AM, Amit Langote wrote:
> > On Wed, Feb 17, 2021 at 5:46 PM Amit Langote <amitlangote09@gmail.com> wrote:
> >> Sorry, I hadn't shared enough details of my investigations when I originally ran into this. Such as that I had considered implementing the use of batching for these inserts too but had given up.
> >>
> >> Now that you mention it, I think I gave a less convincing reason for why we should avoid doing it at all. Maybe it would have been more right to say that it is the core code, not necessarily the FDWs, that currently fails to deal with the use of batching by the insert component of a cross-partition update. Those failures could be addressed as I'll describe below.
> >>
> >> For postgres_fdw, postgresGetForeignModifyBatchSize() could be taught to simply use the PgFdwModifyState that is installed to handle the insert component of a cross-partition update (one can get that one via the aux_fmstate field of the original PgFdwModifyState). However, even though that's fine for postgres_fdw to do, what worries (had worried) me is that it also results in scribbling on ri_BatchSize that the core code may see to determine what to do with a particular tuple, and I just have to hope that nodeModifyTable.c doesn't end up doing anything unwarranted with the original update based on seeing a non-zero ri_BatchSize. AFAICS, we are fine on that front.
> >>
> >> That said, there are some deficiencies in the code that have to be addressed before we can let postgres_fdw do as mentioned above. For example, the code in ExecModifyTable() that runs after breaking out of the loop to insert any remaining batched tuples appears to miss the tuples batched by such inserts. Apparently, that is because the ResultRelInfos used by those inserts are not present in es_tuple_routing_result_relations. Turns out I had forgotten that execPartition.c doesn't add the ResultRelInfos to that list if they are made by ExecInitModifyTable() for the original update operation and simply reused by ExecFindPartition() when tuples are routed to those partitions. It can be "fixed" by reverting to the original design in Tsunakawa-san's patch, where the tuple routing result relations were obtained from the PartitionTupleRouting data structure, which fortunately stores all tuple routing result relations. (Sorry, I gave wrong advice in [1] in retrospect.)
> >>
> >>> On a closer look, it seems the problem actually lies in a small inconsistency between create_foreign_modify and ExecInitRoutingInfo. The former only set batch_size for CMD_INSERT, while the latter called the BatchSize() callback for all operations, expecting a result >= 1. So we may either relax create_foreign_modify and set batch_size for all DML, or make ExecInitRoutingInfo stricter (which is what the patches here do).
> >>
> >> I think we should be fine if we make postgresGetForeignModifyBatchSize() use the correct PgFdwModifyState as described above. We can be sure that we are not mixing the information used by the batched insert with that of the original unbatched update.
> >>
> >>> Is there a reason not to do the first thing, allowing batching of inserts during cross-partition updates? I tried to do that, but it dawned on me that we can't mix batched and un-batched operations, e.g. DELETE + INSERT, because that'd break the order of execution, leading to bogus results in case the same row is modified repeatedly, etc.
> >>
> >> Actually, postgres_fdw only supports moving a row into a partition (as part of a cross-partition update, that is) if it has already finished performing any updates on it. So there is no worry of rows that are moved into a partition subsequently getting updated due to the original command.
> >>
> >> The attached patch implements the changes necessary to make these inserts use batching too.
> >>
> >> [1] https://www.postgresql.org/message-id/CA%2BHiwqEbnhwVJMsukTP-S9Kv1ynC7Da3yuqSPZC0Y7oWWOwoHQ%40mail.gmail.com
> >
> > Oops, I had mistakenly not hit "Reply All". Attaching the patch again.
>
> Thanks. The patch seems reasonable, but it's a bit too large for a fix, so I'll go ahead and push one of the previous fixes restricting batching to plain INSERT commands. But this seems useful, so I suggest adding it to the next commitfest.

Okay, will post the rebased patch to a new thread.

> One thing that surprised me is that we only move the rows *to* the foreign partition, not from it (even on pg13, i.e. before the batching etc.). I mean, using the example you posted earlier, with one foreign and one local partition, consider this:
>
> delete from p;
> insert into p values (2);
>
> test=# select * from p2;
>  a
> ---
>  2
> (1 row)
>
> test=# update p set a = 1;
> UPDATE 1
>
> test=# select * from p1;
>  a
> ---
>  1
> (1 row)
>
> OK, so it was moved to the foreign partition, which is for rows with value in (1). So far so good. Let's do another update:
>
> test=# update p set a = 2;
> UPDATE 1
> test=# select * from p1;
>  a
> ---
>  2
> (1 row)
>
> So now the foreign partition contains value (2), which is however wrong with respect to the partitioning rules - this should be in p2, the local partition.
> This however causes a pretty annoying issue:
>
> test=# explain analyze select * from p where a = 2;
>
>                            QUERY PLAN
> ---------------------------------------------------------------
>  Seq Scan on p2 p  (cost=0.00..41.88 rows=13 width=4)
>                    (actual time=0.024..0.028 rows=0 loops=1)
>    Filter: (a = 2)
>  Planning Time: 0.355 ms
>  Execution Time: 0.089 ms
> (4 rows)
>
> That is, we fail to find the row, because we eliminate the partition.
>
> Now, maybe this is expected, but it seems like a rather mind-boggling violation of the POLA principle.

Yeah, I think we knowingly allow this behavior. The documentation states that a foreign table's constraints are not enforced by the core server nor by the FDW, but I suppose we still allow declaring them mostly for the planner's consumption:

https://www.postgresql.org/docs/current/sql-createforeigntable.html

"Constraints on foreign tables (such as CHECK or NOT NULL clauses) are not enforced by the core PostgreSQL system, and most foreign data wrappers do not attempt to enforce them either; that is, the constraint is simply assumed to hold true. There would be little point in such enforcement since it would only apply to rows inserted or updated via the foreign table, and not to rows modified by other means, such as directly on the remote server. Instead, a constraint attached to a foreign table should represent a constraint that is being enforced by the remote server."

Partitioning constraints are not treated any differently, for those same reasons. It's a good idea to declare a CHECK constraint on the remote table matching the local partition constraint, though IIRC we don't mention that advice anywhere in our documentation.
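That advice might look like the following in practice. This is only a sketch matching the p/p1 example above; the remote table, server, and constraint names are made up, and the exact partition bound is assumed from the earlier messages:

```sql
-- On the remote server: back the foreign partition with a table whose
-- CHECK constraint mirrors the local partition bound FOR VALUES IN (1),
-- so the remote side itself rejects rows outside that bound.
CREATE TABLE p1_remote (a int CHECK (a = 1));

-- On the local server: attach the remote table as the foreign partition.
CREATE FOREIGN TABLE p1 PARTITION OF p FOR VALUES IN (1)
    SERVER remote_server OPTIONS (table_name 'p1_remote');
```

With the remote CHECK in place, the second UPDATE in the example above would be rejected by the remote server instead of silently leaving a mis-routed row in the foreign partition.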
> I've checked if postgres_fdw mentions this somewhere, but all I see about row movement is this:
>
> Note also that postgres_fdw supports row movement invoked by UPDATE
> statements executed on partitioned tables, but it currently does not
> handle the case where a remote partition chosen to insert a moved row
> into is also an UPDATE target partition that will be updated later.
>
> and if I understand that correctly, that's about something different.

Yeah, that's a note saying that while we do support moving a row from a local partition to a postgres_fdw foreign partition, it's only allowed if the foreign partition won't subsequently be updated. So to reiterate, the cases we don't support are:

* Moving a row from a foreign partition to a local one
* Moving a row from a local partition to a foreign one if the latter will be updated subsequent to moving a row into it

postgres_fdw detects the second case with the following code in postgresBeginForeignInsert():

    /*
     * If the foreign table we are about to insert routed rows into is also an
     * UPDATE subplan result rel that will be updated later, proceeding with
     * the INSERT will result in the later UPDATE incorrectly modifying those
     * routed rows, so prevent the INSERT --- it would be nice if we could
     * handle this case; but for now, throw an error for safety.
     */
    if (plan && plan->operation == CMD_UPDATE &&
        (resultRelInfo->ri_usesFdwDirectModify ||
         resultRelInfo->ri_FdwState) &&
        resultRelInfo > mtstate->resultRelInfo + mtstate->mt_whichplan)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot route tuples into foreign table to be updated \"%s\"",
                        RelationGetRelationName(rel))));

--
Amit Langote
EDB: http://www.enterprisedb.com