Discussion: Add new COPY option REJECT_LIMIT
Hi,

9e2d870 enabled the COPY command to skip soft errors, and I think we can add another option which specifies the maximum tolerable number of soft errors.

I remember this was discussed in [1], and I feel it would be useful when loading 'dirty' data where there is a limit to how dirty it can be.

Attached is a patch for this.

What do you think?

[1] https://www.postgresql.org/message-id/752672.1699474336%40sss.pgh.pa.us

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
Attachments
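As a rough sketch of the proposed semantics (not the patch itself; all names here are illustrative, not the patch's actual identifiers), a loader that skips soft errors but aborts the whole load once the skip count exceeds the limit might behave like this:

```python
# Hypothetical sketch of REJECT_LIMIT semantics: skip rows that fail
# conversion (soft errors), but abort the entire load once the number
# of skipped rows exceeds the limit. Names are illustrative only.

class RejectLimitExceeded(Exception):
    pass

def copy_from(rows, convert, reject_limit):
    """Load rows, skipping soft errors up to reject_limit, then fail."""
    loaded, skipped = [], 0
    for row in rows:
        try:
            loaded.append(convert(row))
        except ValueError:  # stand-in for a soft (data-conversion) error
            skipped += 1
            if skipped > reject_limit:
                raise RejectLimitExceeded(
                    f"skipped {skipped} rows, limit is {reject_limit}")
    return loaded, skipped
```

For example, `copy_from(["1", "x", "2"], int, 1)` tolerates the one bad row and returns the two good ones, while a limit of 0 makes the same input fail the whole load.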
On Fri, Jan 26, 2024 at 2:49 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
> Hi,
>
> 9e2d870 enabled the COPY command to skip soft errors, and I think we can
> add another option which specifies the maximum tolerable number of soft
> errors.
>
> I remember this was discussed in [1], and feel it would be useful when
> loading 'dirty' data but there is a limit to how dirty it can be.
>
> Attached a patch for this.
>
> What do you think?
I'm opposed to adding this particular feature.
When implementing this kind of business rule, I'd need the option to specify a percentage, not just an absolute value.

I would focus on trying to put the data required to make this kind of determination into a place where applications implementing such business rules and monitoring can readily get at it: something like an "ERRORS TO" option, and maybe a corresponding "STATS TO" option, where a table can be specified for the system to place the problematic data and stats about the copy itself.
David J.
On 2024-01-27 00:20, David G. Johnston wrote:

Thanks for your comments!

> On Fri, Jan 26, 2024 at 2:49 AM torikoshia <torikoshia@oss.nttdata.com> wrote:
>
>> Hi,
>>
>> 9e2d870 enabled the COPY command to skip soft errors, and I think we can
>> add another option which specifies the maximum tolerable number of soft
>> errors.
>>
>> I remember this was discussed in [1], and feel it would be useful when
>> loading 'dirty' data but there is a limit to how dirty it can be.
>>
>> Attached a patch for this.
>>
>> What do you think?
>
> I'm opposed to adding this particular feature.
>
> When implementing this kind of business rule I'd need the option to
> specify a percentage, not just an absolute value.

Yeah, it seems useful for some cases. Actually, Greenplum lets you specify not only the max number of bad rows but also its percentage[1].

I may be wrong, but considering that some data loaders support something like reject_limit (Redshift supports MAXERROR[2], pg_bulkload supports PARSE_ERRORS[3]), specifying the "number" of bad rows might also be useful.

I think we can implement reject_limit specified by percentage by simply calculating the ratio of skipped to processed rows at the end of CopyFrom(), like this:

    if (cstate->opts.reject_limit > 0 &&
        (double) skipped / (processed + skipped) > cstate->opts.reject_limit_percent)
        ereport(ERROR,
                (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
                 errmsg("exceeded the ratio specified by ..

> I would focus on trying to put the data required to make this kind of
> determination into a place where applications implementing such
> business rules and monitoring can readily get at it. The "ERRORS TO"
> and maybe a corresponding "STATS TO" option where a table can be
> specified for the system to place the problematic data and stats about
> the copy itself.

It'd be nice to have such informative tables, but I believe the benefit of reject_limit is that it fails the entire load when the threshold is exceeded.
I imagine that if we just have error and stats information tables for COPY, users would have to delete the loaded rows themselves after confirming there were too many errors in these tables.

[1] https://docs.vmware.com/en/VMware-Greenplum/7/greenplum-database/admin_guide-load-topics-g-handling-load-errors.html
[2] https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html
[3] https://ossc-db.github.io/pg_bulkload/pg_bulkload.html

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
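The end-of-load ratio check discussed in the message above can be sketched in isolation (a Python analogue of the C fragment, with illustrative names; the actual patch's fields and error handling may differ):

```python
# Sketch of an end-of-load percentage check, mirroring the ratio
# calculation described in the thread. reject_limit_percent is a
# fraction (e.g. 0.2 for 20%). Names are illustrative only, not the
# patch's actual fields.

def check_reject_ratio(processed, skipped, reject_limit_percent):
    """Raise if the fraction of skipped rows exceeds the limit."""
    total = processed + skipped
    if total == 0:
        return  # nothing was loaded or skipped; nothing to check
    ratio = skipped / total
    if ratio > reject_limit_percent:
        raise ValueError(
            f"skipped ratio {ratio:.2%} exceeded limit "
            f"{reject_limit_percent:.2%}")
```

With a 20% limit, loading 90 good rows and skipping 10 passes (ratio 10%), while 50 good and 50 skipped fails (ratio 50%). Doing this once at the end, as suggested, means a percentage limit cannot abort the load early, unlike an absolute limit checked per row.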