Обсуждение: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

Поиск
Список
Период
Сортировка

Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
"David G. Johnston"
Дата:
Hi,

The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic.  There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null.  I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.

David J.

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
jian he
Дата:
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> Hi,
>
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic.  There would seem to be two relevant
waysto ignore bad column input data - drop the entire row or just set the column value to null.  I can see us wanting
toprovide the set to null option and in any case having the option name be explicit that it ignores the row seems like
agood idea. 

two issue I found out while playing around with it;
create table x1(a int not null, b int not null );

you can only do:
COPY x1 from stdin (on_error 'null');
but you cannot do
COPY x1 from stdin (on_error null);
we need to hack the gram.y to escape the "null". I don't know how to
make it work.
related post I found:
https://stackoverflow.com/questions/31786611/how-to-escape-flex-keyword

another issue:
COPY x1 from stdin (on_error null);

when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
"David G. Johnston"
Дата:


On Sun, Jan 28, 2024 at 4:51 PM jian he <jian.universality@gmail.com> wrote:
On Fri, Jan 26, 2024 at 11:09 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> Hi,
>
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly generic.  There would seem to be two relevant ways to ignore bad column input data - drop the entire row or just set the column value to null.  I can see us wanting to provide the set to null option and in any case having the option name be explicit that it ignores the row seems like a good idea.

two issue I found out while playing around with it;
create table x1(a int not null, b int not null );

another issue:
COPY x1 from stdin (on_error null);

when we already have `not null` top level constraint for table x1.
Do we need an error immediately?
"on_error null" seems to conflict with `not null` constraint (assume
refers to the same column).
it may fail while doing bulk inserts while on_error is set to null
because of violating a not null constraint.

You should not error immediately since whether or not there is a problem is table and data dependent.  I would not check for the case of all columns being defined not null and just let the mismatch happen.

That said, maybe with this being a string we can accept something like: 'null, ignore'

And so if attempting to place any one null fails, assuming we can make that a soft error too, we would then ignore the entire row.

David J.

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
Yugo NAGATA
Дата:
On Fri, 26 Jan 2024 08:08:29 -0700
"David G. Johnston" <david.g.johnston@gmail.com> wrote:

> Hi,
> 
> The option choice of "ignore" in the COPY ON_ERROR clause seems overly
> generic.  There would seem to be two relevant ways to ignore bad column
> input data - drop the entire row or just set the column value to null.  I
> can see us wanting to provide the set to null option and in any case having
> the option name be explicit that it ignores the row seems like a good idea.

I am not in favour of renaming the option name "ignore", instead we can
use another style of name for the option to set the column value to NULL,
for example, "set_to_null".

(Maybe, we can make a more generic option "set_to (col, val)" that can set
the value of column specified by "col" value to the specified value "val" 
(e.g. 'N/A') on a soft error, although the syntax would be a bit complex...) 

IMO, it is more simple to define "ignore" as to skip the entire row rather
than having variety of "ignore". Once defined it so, the option to set the
column value to NULL should not be called "ignore" because values in other
columns will be inserted.

Regards,
Yugo Nagata

> 
> David J.


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
jian he
Дата:
The idea of on_error is to tolerate errors, I think.
if a column has a not null constraint, let it cannot be used with
(on_error 'null')

Based on this, I've made a patch.
based on COPY Synopsis: ON_ERROR 'error_action'
on_error 'null', the  keyword NULL should be single quoted.

demo:
COPY check_ign_err FROM STDIN WITH (on_error 'null');
1 {1} a
2 {2} 1
3 {3} 2
4 {4} b
a {5} c
\.

\pset null NULL

SELECT * FROM check_ign_err;
  n   |  m  |  k
------+-----+------
    1 | {1} | NULL
    2 | {2} |    1
    3 | {3} |    2
    4 | {4} | NULL
 NULL | {5} | NULL

Вложения

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
torikoshia
Дата:
Hi,

On 2024-02-03 15:22, jian he wrote:
> The idea of on_error is to tolerate errors, I think.
> if a column has a not null constraint, let it cannot be used with
> (on_error 'null')

> +       /*
> +        * we can specify on_error 'null', but it can only apply to 
> columns
> +        * don't have not null constraint.
> +       */
> +       if (att->attnotnull && cstate->opts.on_error == 
> COPY_ON_ERROR_NULL)
> +           ereport(ERROR,
> +                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> +                    errmsg("copy on_error 'null' cannot be used with 
> not null constraint column")));

This means we cannot use ON_ERROR 'null' even when there is one column 
which have NOT NULL constraint, i.e. primary key, right?
IMHO this is strong constraint and will decrease the opportunity to use 
this feature.

It might be better to allow error_action 'null' for tables which have 
NOT NULL constraint columns, and when facing soft errors for those rows, 
skip that row or stop COPY.

> Based on this, I've made a patch.
> based on COPY Synopsis: ON_ERROR 'error_action'
> on_error 'null', the  keyword NULL should be single quoted.

As you mentioned, single quotation seems a little odd..

I'm not sure what is the best name and syntax for this feature, but 
since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
might not be appropriate.

> demo:
> COPY check_ign_err FROM STDIN WITH (on_error 'null');
> 1 {1} a
> 2 {2} 1
> 3 {3} 2
> 4 {4} b
> a {5} c
> \.
> 
> \pset null NULL
> 
> SELECT * FROM check_ign_err;
>   n   |  m  |  k
> ------+-----+------
>     1 | {1} | NULL
>     2 | {2} |    1
>     3 | {3} |    2
>     4 | {4} | NULL
>  NULL | {5} | NULL

Since we notice the number of ignored rows when ON_ERROR is 'ignore', 
users may want to know the number of rows which was changed to NULL when 
using ON_ERROR 'null'.

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
jian he
Дата:
On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
> Hi,
>
> On 2024-02-03 15:22, jian he wrote:
> > The idea of on_error is to tolerate errors, I think.
> > if a column has a not null constraint, let it cannot be used with
> > (on_error 'null')
>
> > +       /*
> > +        * we can specify on_error 'null', but it can only apply to
> > columns
> > +        * don't have not null constraint.
> > +       */
> > +       if (att->attnotnull && cstate->opts.on_error ==
> > COPY_ON_ERROR_NULL)
> > +           ereport(ERROR,
> > +                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> > +                    errmsg("copy on_error 'null' cannot be used with
> > not null constraint column")));
>
> This means we cannot use ON_ERROR 'null' even when there is one column
> which have NOT NULL constraint, i.e. primary key, right?
> IMHO this is strong constraint and will decrease the opportunity to use
> this feature.

I don't want to fail in the middle of bulk inserts,
so I thought immediately erroring out would be a great idea.
Let's see what other people think.



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
Yugo NAGATA
Дата:
On Mon, 05 Feb 2024 11:28:59 +0900
torikoshia <torikoshia@oss.nttdata.com> wrote:

> > Based on this, I've made a patch.
> > based on COPY Synopsis: ON_ERROR 'error_action'
> > on_error 'null', the  keyword NULL should be single quoted.
> 
> As you mentioned, single quotation seems a little odd..
> 
> I'm not sure what is the best name and syntax for this feature, but 
> since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> might not be appropriate.

I am not in favour of using 'null' either, so I suggested to use
"set_to_null" or more generic syntax like  "set_to (col, val)" in my
previous post[1], although I'm not convinced what is the best either.

[1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
Kyotaro Horiguchi
Дата:
At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> On Mon, 05 Feb 2024 11:28:59 +0900
> torikoshia <torikoshia@oss.nttdata.com> wrote:
> 
> > > Based on this, I've made a patch.
> > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > on_error 'null', the  keyword NULL should be single quoted.
> > 
> > As you mentioned, single quotation seems a little odd..
> > 
> > I'm not sure what is the best name and syntax for this feature, but 
> > since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> > might not be appropriate.
> 
> I am not in favour of using 'null' either, so I suggested to use
> "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> previous post[1], although I'm not convinced what is the best either.
> 
> [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp

Tom sugggested using a separate option, and I agree with the
suggestion. Taking this into consideration, I imagined something like
the following, for example.  Although I'm not sure we are actually
going to do whole-tuple replacement, the action name in this example
has the suffix '-column'.

COPY (on_error 'replace-colomn', replacement 'null') ..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
Yugo NAGATA
Дата:
On Tue, 06 Feb 2024 09:39:09 +0900 (JST)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in 
> > On Mon, 05 Feb 2024 11:28:59 +0900
> > torikoshia <torikoshia@oss.nttdata.com> wrote:
> > 
> > > > Based on this, I've made a patch.
> > > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > > on_error 'null', the  keyword NULL should be single quoted.
> > > 
> > > As you mentioned, single quotation seems a little odd..
> > > 
> > > I'm not sure what is the best name and syntax for this feature, but 
> > > since current error_action are verbs('stop' and 'ignore'), I feel 'null' 
> > > might not be appropriate.
> > 
> > I am not in favour of using 'null' either, so I suggested to use
> > "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> > previous post[1], although I'm not convinced what is the best either.
> > 
> > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp
> 
> Tom sugggested using a separate option, and I agree with the
> suggestion. Taking this into consideration, I imagined something like
> the following, for example.  Although I'm not sure we are actually
> going to do whole-tuple replacement, the action name in this example
> has the suffix '-column'.
> 
> COPY (on_error 'replace-colomn', replacement 'null') ..

Thank you for your information. I've found a post[1] you mentioned, 
where adding a separate option for error log destination was suggested. 

Considering consistency with other options, adding a separate option
would be better if we want to specify a value to replace the invalid
value, without introducing a complex syntax that allows options with
more than one parameters. Maybe, if we allow to use values for the
replacement other than NULL, we have to also add a option to specify
a column (or a type)  for each replacement value. Or,  we may add a
option to specify a list of replacement values as many as the number of
columns, each of whose default is NULL.

Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option
value.

[1] https://www.postgresql.org/message-id/2070915.1705527477%40sss.pgh.pa.us

Regards,
Yugo Nagata


> regards.
> 
> -- 
> Kyotaro Horiguchi
> NTT Open Source Software Center


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
jian he
Дата:
On Tue, Feb 6, 2024 at 3:46 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote:
>
> On Tue, 06 Feb 2024 09:39:09 +0900 (JST)
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> > At Mon, 5 Feb 2024 17:22:56 +0900, Yugo NAGATA <nagata@sraoss.co.jp> wrote in
> > > On Mon, 05 Feb 2024 11:28:59 +0900
> > > torikoshia <torikoshia@oss.nttdata.com> wrote:
> > >
> > > > > Based on this, I've made a patch.
> > > > > based on COPY Synopsis: ON_ERROR 'error_action'
> > > > > on_error 'null', the  keyword NULL should be single quoted.
> > > >
> > > > As you mentioned, single quotation seems a little odd..
> > > >
> > > > I'm not sure what is the best name and syntax for this feature, but
> > > > since current error_action are verbs('stop' and 'ignore'), I feel 'null'
> > > > might not be appropriate.
> > >
> > > I am not in favour of using 'null' either, so I suggested to use
> > > "set_to_null" or more generic syntax like  "set_to (col, val)" in my
> > > previous post[1], although I'm not convinced what is the best either.
> > >
> > > [1] https://www.postgresql.org/message-id/20240129172858.ccb6c77c3be95a295e2b2b44%40sraoss.co.jp
> >
> > Tom sugggested using a separate option, and I agree with the
> > suggestion. Taking this into consideration, I imagined something like
> > the following, for example.  Although I'm not sure we are actually
> > going to do whole-tuple replacement, the action name in this example
> > has the suffix '-column'.
> >
> > COPY (on_error 'replace-colomn', replacement 'null') ..
>
> Thank you for your information. I've found a post[1] you mentioned,
> where adding a separate option for error log destination was suggested.
>
> Considering consistency with other options, adding a separate option
> would be better if we want to specify a value to replace the invalid
> value, without introducing a complex syntax that allows options with
> more than one parameters. Maybe, if we allow to use values for the
> replacement other than NULL, we have to also add a option to specify
> a column (or a type)  for each replacement value. Or,  we may add a
> option to specify a list of replacement values as many as the number of
> columns, each of whose default is NULL.
>
> Anyway, I prefer 'replace" (or 'set_to') to just 'null' as the option
> value.
>

Let's say tabe t column (a,b,c)
if we support set_to_null(a,b), what should we do if column c has an error.
should we ignore this row or error out immediately?
also I am not sure it's doable to just extract columnList from the
function defGetCopyOnErrorChoice.

to make `COPY x from stdin (on_error set_to_null(a,b);` work,
we may need to refactor to gram.y, in a similar way we do force null

i am ok with
COPY x from stdin (on_error set_to_null);



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
Yugo NAGATA
Дата:
On Mon, 5 Feb 2024 14:26:46 +0800
jian he <jian.universality@gmail.com> wrote:

> On Mon, Feb 5, 2024 at 10:29 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> >
> > Hi,
> >
> > On 2024-02-03 15:22, jian he wrote:
> > > The idea of on_error is to tolerate errors, I think.
> > > if a column has a not null constraint, let it cannot be used with
> > > (on_error 'null')
> >
> > > +       /*
> > > +        * we can specify on_error 'null', but it can only apply to
> > > columns
> > > +        * don't have not null constraint.
> > > +       */
> > > +       if (att->attnotnull && cstate->opts.on_error ==
> > > COPY_ON_ERROR_NULL)
> > > +           ereport(ERROR,
> > > +                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> > > +                    errmsg("copy on_error 'null' cannot be used with
> > > not null constraint column")));
> >
> > This means we cannot use ON_ERROR 'null' even when there is one column
> > which have NOT NULL constraint, i.e. primary key, right?
> > IMHO this is strong constraint and will decrease the opportunity to use
> > this feature.
> 
> I don't want to fail in the middle of bulk inserts,
> so I thought immediately erroring out would be a great idea.
> Let's see what other people think.

I also think this restriction is too strong because it is very
common that a table has a primary key, unless there is some way
to specify columns that can be set to NULL. Even when ON_ERROR
is specified, any constraint violation errors cannot be generally
ignored, so we cannot elimiate the posibility COPY FROM fails in
the middle due to invalid data, anyway.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
jian he
Дата:
attached v2.
syntax: `on_error set_to_null`
based on upthread discussion, now if you specified `on_error
set_to_null` and your column has `not
null` constraint, we convert the error field to null, so it may error
while bulk inserting for violating NOT NULL constraint.

Вложения

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
Jim Jones
Дата:
Hi!

On 12.02.24 01:00, jian he wrote:
> attached v2.
> syntax: `on_error set_to_null`
> based on upthread discussion, now if you specified `on_error
> set_to_null` and your column has `not
> null` constraint, we convert the error field to null, so it may error
> while bulk inserting for violating NOT NULL constraint.
That's a very nice feature. Thanks for implementing it!

v2 applies cleanly and works as described.

\pset null '(NULL)'

CREATE TEMPORARY TABLE t1 (a int, b int);
COPY t1 (a,b) FROM STDIN;
1    a
2    1
3    2
4    b
a    c
\.
SELECT * FROM t1;

CONTEXT:  COPY t1, line 1, column b: "a"
 a | b
---+---
(0 rows)


CREATE TEMPORARY TABLE t2 (a int, b int);
COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null);
1    a
2    1
3    2
4    b
a    c
\.
SELECT * FROM t2;

psql:test-copy-on_error-2.sql:12: NOTICE:  some columns of 3 rows, value
were converted to NULL due to data type incompatibility
COPY 5
   a    |   b    
--------+--------
      1 | (NULL)
      2 |      1
      3 |      2
      4 | (NULL)
 (NULL) | (NULL)
(5 rows)


I have one question though:

In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all? See t2 example above.
I'm not sure if these records would be of any use in the table. What do
you think?

Since the parameter is already called "set_to_null", maybe it is not
necessary to mention in the NOTICE message that the values have been set
to null.
Perhaps something like "XX records were only partially copied due to
data type incompatibility"


-- 
Jim




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
"David G. Johnston"
Дата:
On Fri, Feb 16, 2024 at 1:16 PM Jim Jones <jim.jones@uni-muenster.de> wrote:
In case all columns of a record have been set to null due to data type
incompatibility, should we insert it at all?

Yes.  In particular not all columns in the table need be specified in the copy command so while the parsed input data is all nulls the record itself may not be.

The system should allow the user to exclude rows with incomplete data by ignoring a not null constraint violation.

In short we shouldn't judge non-usefulness and instead give tools to the user to decide for themselves.

David J.

Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

От
Jim Jones
Дата:

On 16.02.24 21:31, David G. Johnston wrote:
> Yes.  In particular not all columns in the table need be specified in
> the copy command so while the parsed input data is all nulls the
> record itself may not be.

Yeah, you have a point there.
I guess if users want to avoid it to happen they can rely on NOT NULL
constraints.

Thanks

-- 
Jim