Thread: Add new COPY option REJECT_LIMIT


Add new COPY option REJECT_LIMIT

From: torikoshia
Date:
Hi,

9e2d870 enabled the COPY command to skip soft errors, and I think we can 
add another option that specifies the maximum tolerable number of soft 
errors.

I remember this was discussed in [1], and I feel it would be useful when 
loading 'dirty' data where there is a limit to how dirty it can be.

Attached a patch for this.

What do you think?


[1] 
https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us


-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachments

Re: Add new COPY option REJECT_LIMIT

From: "David G. Johnston"
Date:
On Fri, Jan 26, 2024 at 2:49 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
Hi,

9e2d870 enabled the COPY command to skip soft errors, and I think we can
add another option that specifies the maximum tolerable number of soft
errors.

I remember this was discussed in [1], and I feel it would be useful when
loading 'dirty' data where there is a limit to how dirty it can be.

Attached a patch for this.

What do you think?


I'm opposed to adding this particular feature.

When implementing this kind of business rule I'd need the option to specify a percentage, not just an absolute value.

I would focus on trying to put the data required to make this kind of determination into a place where applications implementing such business rules and monitoring can readily get at it.  Something like an "ERRORS TO" option, and maybe a corresponding "STATS TO" option, where a table can be specified for the system to place the problematic data and stats about the copy itself.

David J.

Re: Add new COPY option REJECT_LIMIT

From: torikoshia
Date:
On 2024-01-27 00:20, David G. Johnston wrote:

Thanks for your comments!

> On Fri, Jan 26, 2024 at 2:49 AM torikoshia
> <torikoshia@oss.nttdata.com> wrote:
> 
>> Hi,
>> 
>> 9e2d870 enabled the COPY command to skip soft error, and I think we
>> can
>> add another option which specifies the maximum tolerable number of
>> soft
>> errors.
>> 
>> I remember this was discussed in [1], and feel it would be useful
>> when
>> loading 'dirty' data but there is a limit to how dirty it can be.
>> 
>> Attached a patch for this.
>> 
>> What do you think?
> 
> I'm opposed to adding this particular feature.
> 
> When implementing this kind of business rule I'd need the option to
> specify a percentage, not just an absolute value.

Yeah, it seems useful for some cases.
Actually, Greenplum allows specifying not only the maximum number of bad 
rows but also a percentage[1].

I may be wrong, but considering that some data loaders support something 
like reject_limit (Redshift supports MAXERROR[2], pg_bulkload supports 
PARSE_ERRORS[3]), specifying the "number" of bad rows might also be 
useful.

I think we can implement reject_limit specified as a percentage by 
simply calculating the ratio of skipped to processed rows at the end of 
CopyFrom(), like this:

   if (cstate->opts.reject_limit > 0 &&
       (double) skipped / (processed + skipped) > cstate->opts.reject_limit_percent)
           ereport(ERROR,
                   (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                    errmsg("exceeded the ratio specified by ..
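To illustrate the check outside the COPY machinery, here is a minimal standalone sketch of the same ratio test. The field names (reject_limit, reject_limit_percent) follow the proposal and are not committed PostgreSQL API, and the helper function itself is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the check above: return true when the
 * ratio of skipped rows to total rows strictly exceeds the given limit.
 * A zero total (nothing processed, nothing skipped) never exceeds it.
 */
static bool
reject_ratio_exceeded(uint64_t skipped, uint64_t processed,
                      double reject_limit_percent)
{
    uint64_t total = processed + skipped;

    if (total == 0)
        return false;
    return (double) skipped / (double) total > reject_limit_percent;
}
```

Note that the limit is exclusive here (strictly greater than), matching the `>` in the snippet above, so a load that skips exactly the allowed ratio would still succeed.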


> I would focus on trying to put the data required to make this kind of
> determination into a place where applications implementing such
> business rules and monitoring can readily get at it.  Something like
> an "ERRORS TO" option, and maybe a corresponding "STATS TO" option,
> where a table can be specified for the system to place the problematic
> data and stats about the copy itself.

It'd be nice to have such informative tables, but I believe the benefit 
of reject_limit is that it makes the entire load fail when the threshold 
is exceeded.
I imagine that if we only had error and stats information tables for 
COPY, users would have to delete the loaded rows themselves after 
confirming there were too many errors in these tables.



[1]https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-load-topics-g-handling-load-errors.html
[2]https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html
[3]https://ossc-db.github.io/pg_bulkload/pg_bulkload.html

-- 
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation