Discussion: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
[HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Alexey Kondratov
Date:
Hi pgsql-hackers,
I'm planning to apply to GSOC'17, and my proposal currently consists of two parts:
(1) Add error handling to COPY as a minimum program
Motivation: Having used PG on a daily basis for years, I have found cases where you need to load (e.g. for further analytics) a bunch of not entirely consistent records with occasional type/column mismatches. Since PG throws an exception on the first error, currently the only solution is to preformat your data with some other tool and then load it into PG. However, it is frequently easier to drop the few bad records than to do such preprocessing for every data source you have.
I have done some research and found the relevant item in PG's TODO https://wiki.postgresql.org/wiki/Todo#COPY and a previous attempt to push a similar patch https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com#603c8f070909141218i291bc983t501507ebc996a531@mail.gmail.com. There were no negative responses to that patch; it seems it was simply forgotten and never finalized.
As an example of the general idea, consider the read_csv method of the Python package pandas (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). It uses a C parser which throws an error on the first column mismatch. However, it has two flags, error_bad_lines and warn_bad_lines, which when set to False drop bad lines or even suppress the warning messages about them.
(2) Parallel COPY execution as a maximum program
I guess nothing needs to be said about motivation: it should simply be faster on multicore CPUs.
There is also an entry about parallel COPY in PG's wiki https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it is always better to have well-performing core functionality out of the box.
My main concerns here are:
1) Is there anyone in the PG community who would be interested in such a project and could be a mentor?
2) These two points share a general idea – to simplify working with large amounts of data from different sources – but maybe it would be better to focus on a single task?
3) Is it realistic to mostly finish both parts during the 3+ months of almost full-time work, or am I being too presumptuous?
I would greatly appreciate any comments and criticism.
P.S. I know about the very interesting ready-made projects from the PG community https://wiki.postgresql.org/wiki/GSoC_2017, but it is always more interesting to solve your own problems, issues and questions, which are the product of your experience with software. That's why I dare to propose my own project.
P.P.S. A few words about me: I'm a PhD student in theoretical physics from Moscow, Russia, and have been heavily involved in software development since 2010. I have good skills in Python, Ruby, JavaScript, MATLAB, C and Fortran development and a basic understanding of algorithm design and analysis.
Best regards,
Alexey
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Pavel Stehule
Date:
Hi
2017-03-23 12:33 GMT+01:00 Alexey Kondratov <kondratov.aleksey@gmail.com>:
> [...]
> 1) Is there anyone in the PG community who would be interested in such a project and could be a mentor?
> 2) These two points share a general idea – to simplify working with large amounts of data from different sources – but maybe it would be better to focus on a single task?
I spent a lot of time on implementing (1) - maybe I can find the patch somewhere. Both tasks have something in common - you have to divide the import into more batches.

> 3) Is it realistic to mostly finish both parts during the 3+ months of almost full-time work, or am I being too presumptuous?

I think it is possible - I am not sure about all the details, but a basic implementation can be done in 3 months.
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Pavel Stehule
Date:
> > 1) Is there anyone in the PG community who would be interested in such a project and could be a mentor?
> > 2) These two points share a general idea – to simplify working with large amounts of data from different sources – but maybe it would be better to focus on a single task?
>
> I spent a lot of time on implementing (1) - maybe I can find the patch somewhere. Both tasks have something in common - you have to divide the import into more batches.

The patch is in /dev/null :( - My implementation was based on subtransactions of 1000 rows. When some check failed, I threw the subtransaction away and imported every row from that block in its own subtransaction. It was a prototype - I didn't look for a smarter implementation.

> > 3) Is it realistic to mostly finish both parts during the 3+ months of almost full-time work, or am I being too presumptuous?
>
> I think it is possible - I am not sure about all the details, but a basic implementation can be done in 3 months.

Some data and some checks depend on order - that can be a problem in parallel processing - you should define the corner cases.
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Craig Ringer
Date:
On 23 March 2017 at 19:33, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:

> (1) Add error handling to COPY as a minimum program

Huge +1 if you can do it in an efficient way.

I think the main barrier to doing so is that the naïve approach creates a subtransaction for every row, which is pretty dire in performance terms and burns transaction IDs very rapidly.

Most of our datatype I/O functions, etc, have no facility for being invoked in a mode where they fail nicely and clean up after themselves. We rely on unwinding the subtransaction's memory context for error handling, for releasing any LWLocks that were taken, etc. There's no try_timestamptz_in function or anything, just timestamptz_in, and it ERRORs if it doesn't like its input. You cannot safely PG_TRY / PG_CATCH such an exception and continue processing to, say, write another row.

Currently we also don't have a way to differentiate between

* "this row is structurally invalid" (wrong number of columns, etc)
* "this row is structurally valid but has fields we could not parse into their data types"
* "this row looks structurally valid and has data types we could parse, but does not satisfy a constraint on the destination table"

Nor do we have a way to write to any kind of failure-log table in the database, since a simple approach relies on aborting subtransactions to clean up failed inserts, so it can't write anything for failed rows. Not without starting a 2nd subxact to record the failure, anyway.

So, having said why it's hard, I don't really have much for you in terms of suggestions for ways forward. User-defined data types, user-defined constraints and triggers, etc mean anything involving significant interface changes will be a hard sell, especially in something pretty performance-sensitive like COPY.

I guess it'd be worth setting out your goals first. Do you want to handle all the kinds of problems above? Malformed rows, rows with malformed field values, and rows that fail to satisfy a constraint? Or just some subset?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Stas Kelvich
Date:
> On 23 Mar 2017, at 15:53, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> I think the main barrier to doing so is that the naïve approach
> creates a subtransaction for every row, which is pretty dire in
> performance terms and burns transaction IDs very rapidly.
> [...]
> Nor do we have a way to write to any kind of failure-log table in the
> database, since a simple approach relies on aborting subtransactions
> to clean up failed inserts so it can't write anything for failed rows.
> Not without starting a 2nd subxact to record the failure, anyway.

If we are optimising COPY for the case with a small number of bad rows, then a 2nd subtransaction does not seem like a bad idea. We can try to apply a batch in a subtx; if it fails on some row N, then insert rows [1, N) in the next subtx, report an error, and commit that subtx. Row N+1 can be treated as the beginning of the next batch (a rough sketch of this loop follows this message).

But if there are problems with handling everything with subtransactions, and since parallelism is mentioned anyway, what about starting a bgworker that will perform the data insertion and will be controlled by the backend? For example, the backend can do the following:

* Start a bgworker (or even a parallel worker)
* Get a chunk of rows out of the file and try to apply them in a batch as a subtransaction in the bgworker.
* If that fails, we can open a transaction in the backend itself and raise a notice / move the failed rows to a special errors table.
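Rough control-flow sketch of the subtransaction batching loop described above (not the bgworker variant). Everything here is hypothetical: Row, insert_rows_in_subxact() and report_bad_row() do not exist in PostgreSQL and only illustrate the shape of the loop; it also assumes the rows of the current batch are already buffered in memory.

    typedef struct Row Row;         /* a parsed input line (placeholder) */

    /*
     * Assumed helper: insert rows[first..last) in one subtransaction and
     * return the index of the first failing row, or -1 if the whole range
     * committed cleanly.
     */
    static int  insert_rows_in_subxact(Row **rows, int first, int last);
    static void report_bad_row(Row *row);   /* WARNING or errors table */

    static void
    copy_with_error_handling(Row **rows, int nrows, int batch_size)
    {
        int         start = 0;

        while (start < nrows)
        {
            int         end = Min(start + batch_size, nrows);   /* Min() from c.h */
            int         bad = insert_rows_in_subxact(rows, start, end);

            if (bad < 0)
            {
                start = end;        /* whole batch committed, move on */
                continue;
            }

            /* Rows [start, bad) are known to be good: re-insert just those. */
            if (bad > start)
                (void) insert_rows_in_subxact(rows, start, bad);

            report_bad_row(rows[bad]);  /* skip the bad row ... */
            start = bad + 1;            /* ... and resume right after it */
        }
    }

The obvious trade-off, discussed further downthread, is that each failed batch costs one extra subtransaction plus the work of re-inserting the good prefix.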
Re: GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Alexey Kondratov
Date:
Pavel, Craig and Stas,
Thank you for your responses and valuable comments!
I have written a draft proposal: https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit
It seems that COPY is currently able to return the first error line and the error type (extra or missing columns, type parse error, etc).
Thus, an approach similar to what Stas wrote should work and, being optimised for a small number of error rows, should not
affect COPY performance in that case.
I will be glad to receive any critical remarks and suggestions.
Alexey
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Alexander Korotkov
Date:
Hi, Alexey!
On Tue, Mar 28, 2017 at 1:54 AM, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:
> Thank you for your responses and valuable comments! I have written a draft
> proposal: https://docs.google.com/document/d/1Y4mc_PCvRTjLsae-_fhevYfepv4sxaqwhOo4rlxvK1c/edit
> [...]
I have the following questions about your proposal.
1. Suppose we have to insert N records
2. We create subtransaction with these N records
3. Error is raised on k-th line
4. Then, we can safely insert all lines from the 1st till the (k - 1)th
5. Report, save to errors table or silently drop k-th line
6. Next, try to insert lines from (k + 1) till N with another subtransaction
7. Repeat until the end of file
Do you assume that we start a new subtransaction in step 4, since the subtransaction we started in step 2 is rolled back?

> I am planning to use background worker processes for parallel COPY execution. Each process will receive an equal piece of the input file. Since the file is split by size, not by lines, each worker will start its import from the first new line so as not to hit a broken line.

I think the situation where the backend directly reads a file during COPY is not typical. The more typical case is the psql \copy command. In that case "COPY ... FROM stdin;" is actually executed while psql streams the data.
How can we apply parallel COPY in this case?
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Alexander!
I missed your reply, since the proposal submission deadline passed last Monday and I haven't been checking the hackers mailing list very frequently.
(1) It seems that starting a new subtransaction at step 4 is not necessary. We can just gather all error lines in one pass and, at the end of the input, start a single additional subtransaction with all safe lines at once: [1, ..., k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error line number.
But I assume the only viable use case is when the number of errors is relatively small compared to the total number of rows, because if the input is in a totally inconsistent format it seems useless to import it into the db. Thus, it is not 100% clear to me whether there would be any real difference in performance if one starts a new subtransaction at step 4 or not.
(2) Hmm, good question. As far as I know it is impossible to get the stdin input size, thus it is impossible to distribute stdin directly to the parallel workers. The first approach that comes to mind is to store the stdin input in some kind of buffer/queue and then have the workers read it in parallel. The question is how that would perform in the case of a large file - poorly, I guess, at least from the memory consumption perspective. And whether parallel execution would still be faster is the next question.
Alexey
On Mon, Apr 10, 2017 at 11:39 AM, Alex K <kondratov.aleksey@gmail.com> wrote:
> (1) It seems that starting a new subtransaction at step 4 is not necessary. We
> can just gather all error lines in one pass and, at the end of the input, start
> a single additional subtransaction with all safe lines at once: [1, ...,
> k1 - 1, k1 + 1, ..., k2 - 1, k2 + 1, ...], where ki is an error line number.

The only way to recover from an error is to abort the subtransaction, or to abort the toplevel transaction. Anything else is unsafe.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Alexey Kondratov
Date:
Yes, sure, I don't doubt it. The question was about step 4 in the following possible algorithm:

1. Suppose we have to insert N records
2. Start a subtransaction with these N records
3. An error is raised on the k-th line
4. Then, we know that we can safely insert all lines from the 1st till the (k - 1)th
5. Report, save to an errors table, or silently drop the k-th line
6. Next, try to insert lines from the (k + 1)th till the Nth with another subtransaction
7. Repeat until the end of file

One can start a subtransaction with those (k - 1) safe lines and repeat it after each error line, OR iterate till the end of file and start only one subtransaction with all lines except the error lines.

Alexey

> On 10 Apr 2017, at 19:55, Robert Haas <robertmhaas@gmail.com> wrote:
>
> The only way to recover from an error is to abort the subtransaction,
> or to abort the toplevel transaction. Anything else is unsafe.
On Mon, Apr 10, 2017 at 2:46 PM, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:
> Yes, sure, I don't doubt it. The question was about step 4 in the following possible algorithm:
> [...]
> One can start a subtransaction with those (k - 1) safe lines and repeat it after each error line

I don't understand what you mean by that.

> OR
> iterate till the end of file and start only one subtransaction with all lines except the error lines.

That could involve buffering a huge file. Imagine a 300GB load.

Also consider how many XIDs whatever design is proposed will blow through when loading 300GB of data. There's a nasty trade-off here between XID consumption (and the aggressive vacuums it eventually causes) and preserving performance in the face of errors - e.g. if you make k = 100,000 you consume 100x fewer XIDs than if you make k = 1000, but you also have 100x the work to redo (on average) every time you hit an error. If the data quality is poor (say, 50% of lines have errors) it's almost impossible to avoid runaway XID consumption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Nicolas Barbier
Date:
2017-04-11 Robert Haas <robertmhaas@gmail.com>:

> There's a nasty trade-off here between XID consumption (and the
> aggressive vacuums it eventually causes) and preserving performance in
> the face of errors - e.g. if you make k = 100,000 you consume 100x
> fewer XIDs than if you make k = 1000, but you also have 100x the work
> to redo (on average) every time you hit an error.

You could make it dynamic: Commit the subtransaction even when not encountering any error after N lines (N starts out at 1), then double N and continue. When encountering an error, roll the current subtransaction back and re-insert all the known good rows that have been rolled back (plus maybe the erroneous row into a separate table or whatever) in one new subtransaction and commit; then reset N to 1 and continue processing the rest of the file.

That would work reasonably well whenever the ratio of erroneous rows is not extremely high: whether the erroneous rows are all clumped together, entirely randomly spread out over the file, or a combination of both.

> If the data quality is poor (say, 50% of lines have errors) it's
> almost impossible to avoid runaway XID consumption.

Yup, that seems difficult to work around with anything similar to the proposed. So the docs might need to suggest not to insert a 300 GB file with 50% erroneous lines :-).

Greetings,

Nicolas
On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
> You could make it dynamic: Commit the subtransaction even when not
> encountering any error after N lines (N starts out at 1), then double
> N and continue. When encountering an error, roll the current
> subtransaction back and re-insert all the known good rows [...] in one
> new subtransaction and commit; then reset N to 1 and continue
> processing the rest of the file.

Right. I wouldn't suggest the exact algorithm you proposed; I think you ought to vary between some lower limit >1, maybe 10, and some upper limit, maybe 1,000,000, ratcheting up and down based on how often you hit errors in some way that might not be as simple as doubling. But something along those lines.

>> If the data quality is poor (say, 50% of lines have errors) it's
>> almost impossible to avoid runaway XID consumption.
>
> Yup, that seems difficult to work around with anything similar to the
> proposed. So the docs might need to suggest not to insert a 300 GB
> file with 50% erroneous lines :-).

Yep. But it does seem reasonably likely that someone might shoot themselves in the foot anyway. Maybe we just live with that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
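A minimal sketch of such a bounded, adaptive batch-size policy. The constants and the doubling factor are arbitrary placeholders (roughly following Nicolas's doubling and Robert's suggested 10..1,000,000 bounds), not taken from any posted patch; bool here is the backend's, from c.h via postgres.h.

    #define COPY_MIN_BATCH_SIZE     10
    #define COPY_MAX_BATCH_SIZE     1000000

    /* Grow the batch while imports succeed, back off after a failed batch. */
    static int
    next_batch_size(int current, bool had_error)
    {
        if (had_error)
            return COPY_MIN_BATCH_SIZE;         /* shrink after an error */
        if (current < COPY_MAX_BATCH_SIZE / 2)
            return current * 2;                 /* ratchet up on clean data */
        return COPY_MAX_BATCH_SIZE;
    }

The caller would feed the result back into whatever batching loop it uses, so long clean stretches burn few XIDs while error-dense regions only lose small batches.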
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Stas Kelvich
Date:
> On 12 Apr 2017, at 20:23, Robert Haas <robertmhaas@gmail.com> wrote:
>
>>> If the data quality is poor (say, 50% of lines have errors) it's
>>> almost impossible to avoid runaway XID consumption.
>>
>> Yup, that seems difficult to work around with anything similar to the
>> proposed. So the docs might need to suggest not to insert a 300 GB
>> file with 50% erroneous lines :-).
>
> Yep. But it does seem reasonably likely that someone might shoot
> themselves in the foot anyway. Maybe we just live with that.

Moreover, if that file consists of one-byte lines (plus one byte of newline char), then during its import an xid wraparound will happen 18 times =)

I think it is reasonable at least to have something like a max_errors parameter for COPY, set by default to 1000, for example. If the user hits that limit, then it is a good moment to print a warning about possible xid consumption in case of a bigger limit.

However, I think it is worth a quick research whether it is possible to create a special code path for COPY in which errors don't cancel the transaction. At least when COPY is called outside of a transaction block.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Craig Ringer
Date:
On 13 April 2017 at 01:57, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

> However, I think it is worth a quick research whether it is possible to create
> a special code path for COPY in which errors don't cancel the transaction.

Not really. Anything at any layer of the system expects to be able to ERROR:

* datatype input functions
* CHECK constraints
* FK constraints
* unique indexes
* user defined functions run by triggers
* interrupt signalling (think deadlock detector)
* ...

and we rely on ERROR unwinding any relevant memory contexts, releasing lwlocks, etc.

When an xact aborts it may leave all sorts of mess on disk. Nothing gets deleted, it's just ignored due to an aborted xmin.

Maybe some xid burn could be saved by trying harder to pre-validate batches of data as much as possible before we write anything to the heap, sorting obviously faulty data into buffers and doing as much work as possible before allocating a new (sub)xid and writing to the heap. We'd still abort but we'd only be aborting a vtxid.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi pgsql-hackers,
Thank you again for all these replies. I have started working on this project
and learnt a lot of new stuff over the last month, so here are some new thoughts about
error handling in COPY. I decided to stick to the same thread, since it
has a neutral subject.
(1) One of my mentors, Alvaro Herrera, suggested that I have a look at UPSERT.
It may be a good point to be able to achieve the same functionality
as with ON CONFLICT DO NOTHING when COPY actually inserts tuples
and error handling is turned on. It could additionally reduce the number of failed
subtransactions and reduce XID consumption, while still ignoring some common
errors like unique index violations.
Adding full support for ON CONFLICT DO NOTHING/UPDATE to COPY seems
to be a large separate task and is out of the current project's scope, but maybe there is
a relatively simple way to perform the tuple inserts internally with
ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
I understand he is the major contributor to UPSERT in PostgreSQL. It would be great
if he could answer this question.
(2) Otherwise, I am still going to use subtransactions via BeginInternalSubTransaction
and PG_TRY / PG_CATCH with
ReleaseCurrentSubTransaction / RollbackAndReleaseCurrentSubTransaction.
To minimize XID consumption I will try to insert tuples in batches and pre-validate
them as much as possible (as was suggested in the thread before); a minimal sketch of
that combination follows.
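The sketch below only shows the idiom, not the eventual patch: the per-batch work is a placeholder callback, and the error-handling dance mirrors what PL/pgSQL does for EXCEPTION blocks (see exec_stmt_block() in pl_exec.c).

    #include "postgres.h"
    #include "access/xact.h"        /* BeginInternalSubTransaction() etc. */
    #include "utils/memutils.h"
    #include "utils/resowner.h"

    /*
     * Run one batch of COPY work inside its own subtransaction and turn any
     * ERROR into a WARNING.  Returns true if the batch committed.
     */
    static bool
    copy_batch_in_subxact(void (*copy_one_batch) (void *arg), void *arg)
    {
        MemoryContext oldcontext = CurrentMemoryContext;
        ResourceOwner oldowner = CurrentResourceOwner;
        bool        ok = true;

        BeginInternalSubTransaction(NULL);
        /* keep running in the caller's memory context */
        MemoryContextSwitchTo(oldcontext);

        PG_TRY();
        {
            copy_one_batch(arg);    /* placeholder for the real per-batch work */

            ReleaseCurrentSubTransaction();
            MemoryContextSwitchTo(oldcontext);
            CurrentResourceOwner = oldowner;
        }
        PG_CATCH();
        {
            ErrorData  *edata;

            /* Save the error before destroying the failed subtransaction. */
            MemoryContextSwitchTo(oldcontext);
            edata = CopyErrorData();
            FlushErrorState();

            RollbackAndReleaseCurrentSubTransaction();
            MemoryContextSwitchTo(oldcontext);
            CurrentResourceOwner = oldowner;

            /* Report the failed batch; the caller decides how to retry it. */
            ereport(WARNING,
                    (errmsg("COPY batch failed and was rolled back: %s",
                            edata->message)));
            FreeErrorData(edata);
            ok = false;
        }
        PG_END_TRY();

        return ok;
    }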
Alexey
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Peter Geoghegan
Date:
On Wed, Jun 7, 2017 at 12:34 PM, Alex K <kondratov.aleksey@gmail.com> wrote:
> (1) One of my mentors, Alvaro Herrera, suggested that I have a look at UPSERT.
> It may be a good point to be able to achieve the same functionality
> as with ON CONFLICT DO NOTHING when COPY actually inserts tuples
> and error handling is turned on. It could additionally reduce the number of
> failed subtransactions and reduce XID consumption, while still ignoring some
> common errors like unique index violations.

Alvaro and I talked about this informally at PGCon.

> Adding full support for ON CONFLICT DO NOTHING/UPDATE to COPY seems
> to be a large separate task and is out of the current project's scope, but
> maybe there is a relatively simple way to perform the tuple inserts internally
> with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
> I understand he is the major contributor to UPSERT in PostgreSQL. It would
> be great if he could answer this question.

I think that there is a way of making COPY use "speculative insertion", so that it behaves the same as ON CONFLICT DO NOTHING with no inference specification. Whether or not this is useful depends on a lot of things.

You seem to be talking about doing this as an optimization on top of a base feature that does the main thing you want (captures all errors within an implementation-level subxact without passing them to the client). That could make sense, as a way of preventing extreme bloat for a very bad case where almost all inserts have conflicts. (This seems quite possible, whereas it seems much less likely that users would have an input file simply full of ill-formed tuples.)

I think that you need to more formally identify what errors your new COPY error handling will need to swallow.

I'm not sure if it's possible to avoid using subtransactions altogether, but speculative insertion would help if you find that you can do it without subtransactions. Using subtransactions is always going to be a bit ugly, because you'll need to continually reassess whether or not you're batching insertions together at the right granularity (that is, that you've weighed the rate of XID consumption against how much work you lose when a batched transaction has to be "replayed" to include things that are known to be valid). And, if you care about duplicate violations, then you can't really be sure that replaying a "known good" tuple will stay good from one moment to the next.

My advice right now is: see if you can figure out a way of doing what you want without subtransactions at all, possibly by cutting some scope. For example, maybe it would be satisfactory to have the implementation just ignore constraint violations, but still raise errors for invalid input for types.

Is there really much value in ignoring errors due to invalid encoding? It's not as if such problems can be reliably detected today. If you use the wrong encoding, and ignore some errors that COPY would generally raise, then there is an excellent chance that you'll still insert some remaining rows with text that has been incorrectly interpreted as valid in the database encoding -- some text datums are bound to accidentally appear valid. There are probably similar issues with other types. It's not clear what the point is at which the user is no longer helped by ignoring problems, because we cannot reliably detect *all* problems at the level of each row.
If you must ignore errors within the input functions of types, then maybe you can optionally let the user do that by way of a "dry run", where the entire input file is examined for basic structural soundness ahead of considering constraints. Any errors are saved then and there, in a format that can be used to make sure that those entries are skipped on a later COPY. As a further enhancement, in the future, the user might then be able to define special transform functions that correct the errors for those rows only. You kind of need to see all the faulty rows together to do something like that, so a dry run could make a lot of sense.

--
Peter Geoghegan
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Alexey Kondratov
Date:
Thank you for your comments, Peter; there are some points that I had not thought about before.
> On 9 Jun 2017, at 01:09, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> Adding full support for ON CONFLICT DO NOTHING/UPDATE to COPY seems
>> to be a large separate task and is out of the current project's scope, but
>> maybe there is a relatively simple way to perform the tuple inserts internally
>> with ON CONFLICT DO NOTHING? I have added Peter Geoghegan to cc, as
>> I understand he is the major contributor to UPSERT in PostgreSQL. It would
>> be great if he could answer this question.
>
> I think that there is a way of making COPY use "speculative
> insertion", so that it behaves the same as ON CONFLICT DO NOTHING with
> no inference specification. Whether or not this is useful depends on a
> lot of things.
I am not going to start with "speculative insertion" right now, but it would be very
useful if you could give me a pointer on where to start. Maybe I will at least try to evaluate
the complexity of the problem.
> I think that you need to more formally identify what errors your new
> COPY error handling will need to swallow.
> ...
> My advice right now is: see if you can figure out a way of doing what
> you want without subtransactions at all, possibly by cutting some
> scope. For example, maybe it would be satisfactory to have the
> implementation just ignore constraint violations, but still raise
> errors for invalid input for types.
Initially I was thinking only about malformed rows, e.g. ones with fewer or extra columns.
Honestly, I did not know that there are so many levels and ways in which an error
can occur. So currently (and especially after your comments) I prefer to focus
only on the following list of errors:
1) File format issues
a. Fewer columns than needed
b. Extra columns
2) I have doubts about type mismatches. It is possible to imagine a situation where,
e.g., some integers are exported as int and some as "int", but I am not sure
that it is a common situation.
3) Some constraint violations, e.g. unique index.
The first appears to be easily achievable without subtransactions. I have created a
proof-of-concept version of COPY where the error handling is turned on by default.
Please see the small patch attached (applicable to 76b11e8a43eca4612dfccfe7f3ebd293fb8a46ec)
or the diff on GitHub https://github.com/ololobus/postgres/pull/1/files.
It throws warnings instead of errors for malformed lines with fewer/extra columns
and reports the line number.
The second is probably achievable without subtransactions via PG_TRY/PG_CATCH
around heap_form_tuple, since at that point the tuple has not yet been inserted into the heap.
But the third is questionable without subtransactions, since even if we check
constraints once, there may be various before/after triggers which can modify the
tuple so that it no longer satisfies them. The corresponding comment inside copy.c states:
"Note that a BR trigger might modify tuple such that the partition constraint is
no longer satisfied, so we need to check in that case." Thus, there may be different
situations here, as I understand it. However, it is a point where "speculative insertion"
is able to help.
These three cases should cover most real-life scenarios.
> Is there really much value in ignoring errors due to invalid encoding?
Now I have some doubts about it too. If there is an encoding problem,
it probably affects the whole file, not only a few rows.
Alexey
Attachments
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Peter Geoghegan
Date:
On Mon, Jun 12, 2017 at 3:52 AM, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:
> I am not going to start with "speculative insertion" right now, but it would
> be very useful if you could give me a pointer on where to start. Maybe I will
> at least try to evaluate the complexity of the problem.

Speculative insertion has the following special entry points to heapam.c and execIndexing.c, currently only called within nodeModifyTable.c:

* SpeculativeInsertionLockAcquire()
* HeapTupleHeaderSetSpeculativeToken()
* heap_insert() called with HEAP_INSERT_SPECULATIVE argument
* ExecInsertIndexTuples() with specInsert = true
* heap_finish_speculative()
* heap_abort_speculative()

Offhand, it doesn't seem like it would be that hard to teach another heap_insert() caller the same tricks.

>> My advice right now is: see if you can figure out a way of doing what
>> you want without subtransactions at all, possibly by cutting some
>> scope. For example, maybe it would be satisfactory to have the
>> implementation just ignore constraint violations, but still raise
>> errors for invalid input for types.
>
> Initially I was thinking only about malformed rows, e.g. ones with fewer or
> extra columns. Honestly, I did not know that there are so many levels and
> ways in which an error can occur.

My sense is that it's going to be hard to sell a committer on any design that consumes subtransactions in a way that's not fairly obvious to the user, and doesn't have a pretty easily understood worst case. But, that's just my opinion, and it's possible that someone else will disagree. Try to get a second opinion.

Limiting the feature to just skip rows on the basis of a formally defined constraint failing (not including type input failure, or a trigger throwing an error, and probably not including foreign key failures because they're really triggers) might be a good approach. MySQL's INSERT IGNORE is a bit like that, I think. (It doesn't *just* ignore duplicate violations, unlike our ON CONFLICT DO NOTHING feature.)

I haven't thought about this very carefully, but I guess you could do something like passing a flag to ExecConstraints() that indicates "don't throw an error; instead, just return false so I know not to proceed". Plus maybe one or two other cases, like using speculative insertion to back out of a unique violation without consuming a subxact.

--
Peter Geoghegan
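For reference, a condensed reading of that sequence as it appears in ExecInsert() (nodeModifyTable.c, PostgreSQL 10 era), which a COPY code path would have to mimic. The conflict wait/retry loop and trigger handling are omitted, and the function wrapper here is only illustrative.

    #include "postgres.h"
    #include "access/heapam.h"          /* heap_insert, heap_*_speculative */
    #include "access/htup_details.h"    /* HeapTupleHeaderSetSpeculativeToken */
    #include "access/xact.h"            /* GetCurrentTransactionId */
    #include "executor/executor.h"      /* ExecInsertIndexTuples */
    #include "storage/lmgr.h"           /* SpeculativeInsertionLock* */

    static bool
    copy_insert_speculative(Relation rel, HeapTuple tuple, TupleTableSlot *slot,
                            EState *estate, List *arbiterIndexes)
    {
        uint32      specToken;
        bool        specConflict = false;
        List       *recheckIndexes;

        /* Take the speculative-insertion lock and stamp the tuple with our token */
        specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());
        HeapTupleHeaderSetSpeculativeToken(tuple->t_data, specToken);

        /* Insert the heap tuple as "speculative": not yet decided */
        heap_insert(rel, tuple, estate->es_output_cid,
                    HEAP_INSERT_SPECULATIVE, NULL);

        /* noDupErr = true: conflicts come back via specConflict, not ERROR */
        recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self), estate,
                                               true, &specConflict, arbiterIndexes);

        if (specConflict)
            heap_abort_speculative(rel, tuple);     /* back out, no subxact burned */
        else
            heap_finish_speculative(rel, tuple);    /* confirm the insertion */

        /* Wake up anyone waiting to learn whether we succeeded */
        SpeculativeInsertionLockRelease(GetCurrentTransactionId());

        list_free(recheckIndexes);
        return !specConflict;
    }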
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Alexey Kondratov
Date:
> On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> I am not going to start with "speculative insertion" right now, but it would
>> be very useful if you could give me a pointer on where to start. Maybe I will
>> at least try to evaluate the complexity of the problem.
>
> Speculative insertion has the following special entry points to
> heapam.c and execIndexing.c, currently only called within
> nodeModifyTable.c
>
> Offhand, it doesn't seem like it would be that hard to teach another
> heap_insert() caller the same tricks.
I went through the nodeModifyTable.c code and it seems not to be so
difficult to do the same inside COPY.
> My sense is that it's going to be hard to sell a committer on any
> design that consumes subtransactions in a way that's not fairly
> obvious to the user, and doesn't have a pretty easily understood worst
> case.
Yes, and the worst case will probably be quite a frequent one, since it is not possible to do heap_multi_insert if BEFORE/INSTEAD triggers or partitioning exist (according to the current copy.c code). Thus, it will frequently fall back to a single heap_insert, and wrapping each one in a subtransaction will consume XIDs too greedily and seriously affect performance. I like my previous idea less and less.
> I haven't thought about this very carefully, but I guess you could do
> something like passing a flag to ExecConstraints() that indicates
> "don't throw an error; instead, just return false so I know not to
> proceed"
I have updated my patch (rebased onto the topmost master commit 94da2a6a9a05776953524424a3d8079e54bc5d94). Please find the patch file attached, or the always up-to-date version on GitHub: https://github.com/ololobus/postgres/pull/1/files
Currently, it catches all major errors in the input data:
1) Rows with fewer/extra columns cause WARNINGs and are skipped
2) I found that input type format errors are thrown from the InputFunctionCall; and wrapped it up with PG_TRY/CATCH. I am not 100%
Alexey
Attachments
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Alexey Kondratov
Date:
Sorry for the previous email, I accidentally sent it unfinished.
> On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Speculative insertion has the following special entry points to
> heapam.c and execIndexing.c, currently only called within
> nodeModifyTable.c
>
> Offhand, it doesn't seem like it would be that hard to teach another
> heap_insert() caller the same tricks.
I went through the nodeModifyTable.c code and it seems not to be so
difficult to do the same inside COPY.
> My sense is that it's going to be hard to sell a committer on any
> design that consumes subtransactions in a way that's not fairly
> obvious to the user, and doesn't have a pretty easily understood worst
> case.
Yes, and the worst case will probably be quite a frequent one, since it is not
possible to do heap_multi_insert when BEFORE/INSTEAD triggers or partitioning
exist (according to the current copy.c code). Thus, it will frequently fall back
to a single heap_insert, and wrapping each one in a subtransaction will
consume XIDs too greedily and seriously affect performance. I like my initial
idea less and less.
By the way, is it possible to use heap_multi_insert with speculative insertion too?
> I haven't thought about this very carefully, but I guess you could do
> something like passing a flag to ExecConstraints() that indicates
> "don't throw an error; instead, just return false so I know not to
> proceed"
it would be wise from my side to modify its behaviour.
I have updated my patch (rebased onto the topmost master commit
94da2a6a9a05776953524424a3d8079e54bc5d94). Please find the patch
file attached, or the always up-to-date version on GitHub:
https://github.com/ololobus/postgres/pull/1/files
It catches all major errors in the input data:
1) Rows with fewer/extra columns cause WARNINGs and are skipped
2) I found that input type format errors are thrown from
InputFunctionCall, and I wrapped it in PG_TRY/PG_CATCH.
I am not sure that it is 100% transactionally safe, but it seems so,
since all these errors are handled before the point
where the current COPY implementation has a mechanism to skip a tuple.
I use the same skip_tuple flag.
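The rough shape of that wrapper is sketched below; variable names are approximate and loosely follow NextCopyFrom() in copy.c, not the patch verbatim. As Craig noted upthread, catching an ERROR without a surrounding subtransaction is only safe if the failing input function has not left locks or other backend state behind, which is not guaranteed for arbitrary (user-defined) data types.

    /* att is the pg_attribute entry for column m; values, in_functions,
     * typioparams and string are the usual NextCopyFrom() locals. */
    MemoryContext oldcontext = CurrentMemoryContext;

    PG_TRY();
    {
        values[m] = InputFunctionCall(&in_functions[m], string,
                                      typioparams[m], att->atttypmod);
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        /* Leave ErrorContext before copying and clearing the error. */
        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();
        FlushErrorState();

        ereport(WARNING,
                (errmsg("skipping row with malformed value in column \"%s\": %s",
                        NameStr(att->attname), edata->message)));
        FreeErrorData(edata);

        skip_tuple = true;          /* reuse COPY's existing skip mechanism */
    }
    PG_END_TRY();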
The patch passes all regression tests, except a few that fail due to the slightly
changed error message texts.
Now, I think that it may be a good idea to separate all possible errors
into two groups:
– Malformed input data
– DB conflicts during insertion
The first is solved (I hope) well by the current patch. I can add, e.g., a
MAXERRORS flag to COPY, which will limit the number of errors.
The second may be solved with speculative insertion, using the same
ON CONFLICT DO syntax as in the INSERT statement.
Following this path, we do not use subtransactions at all, and we keep
the behaviour of INSERT and COPY predictable and consistent across the
database. To me that sounds much better than just swallowing all errors
indiscriminately, without any distinction or logic.
Alexey
Attachments
> On 16 Jun 2017, at 21:30, Alexey Kondratov <kondratov.aleksey@gmail.com> wrote:
>
>> On 13 Jun 2017, at 01:44, Peter Geoghegan <pg@bowt.ie> wrote:
>>
>> Speculative insertion has the following special entry points to
>> heapam.c and execIndexing.c, currently only called within
>> nodeModifyTable.c
>>
>> Offhand, it doesn't seem like it would be that hard to teach another
>> heap_insert() caller the same tricks.
>
> I went through the nodeModifyTable.c code and it seems not to be so
> difficult to do the same inside COPY.

After a more precise look, I have figured out at least one difficulty: COPY and INSERT follow different execution paths. INSERT goes through the planner, while COPY does not. That leads to the absence of some required attributes like arbiterIndexes, which are available during INSERT via PlanState/ModifyTableState. It is probably possible to get the same in COPY, but it is not clear to me how.

Anyway, adding 'speculative insertion' to COPY is worth a separate patch, and I would be glad to try implementing it.

At the same time I have prepared a complete working patch with:

- ignoring of input data formatting errors
- an IGNORE_ERRORS parameter in the COPY options
- updated regression tests

Please find the patch attached, or check the web UI diff on GitHub as always:
https://github.com/ololobus/postgres/pull/1/files

Alexey
Attachments
Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
From
Michael Paquier
Date:
On Wed, Jun 21, 2017 at 11:37 PM, Alex K <kondratov.aleksey@gmail.com> wrote:
> [...]
> At the same time I have prepared a complete working patch with:
>
> - ignoring of input data formatting errors
> - an IGNORE_ERRORS parameter in the COPY options
> - updated regression tests
>
> Please find the patch attached, or check the web UI diff on GitHub as always:
> https://github.com/ololobus/postgres/pull/1/files

"git diff master --check" complains heavily, and the patch does not apply anymore. The last patch is 5 months old as well, so I am marking the patch as returned with feedback.

--
Michael