Обсуждение: pgsql: Refactor COPY FROM to use format callback functions.
Refactor COPY FROM to use format callback functions. This commit introduces a new CopyFromRoutine struct, which is a set of callback routines to read tuples in a specific format. It also makes COPY FROM with the existing formats (text, CSV, and binary) utilize these format callbacks. This change is a preliminary step towards making the COPY FROM command extensible in terms of input formats. Similar to 2e4127b6d2d, this refactoring contributes to a performance improvement by reducing the number of "if" branches that need to be checked on a per-row basis when sending field representations in text or CSV mode. The performance benchmark results showed ~5% performance gain in text or CSV mode. Author: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/7717f63006935de00fafd000bff450280508adf1 Modified Files -------------- src/backend/commands/copyfrom.c | 192 ++++++++++--- src/backend/commands/copyfromparse.c | 446 +++++++++++++++++-------------- src/include/commands/copy.h | 2 - src/include/commands/copyapi.h | 50 +++- src/include/commands/copyfrom_internal.h | 11 + src/tools/pgindent/typedefs.list | 1 + 6 files changed, 459 insertions(+), 243 deletions(-)
On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote:
Refactor COPY FROM to use format callback functions. This commit introduces a new CopyFromRoutine struct, which is a set of callback routines to read tuples in a specific format. It also makes COPY FROM with the existing formats (text, CSV, and binary) utilize these format callbacks. This change is a preliminary step towards making the COPY FROM command extensible in terms of input formats. Similar to 2e4127b6d2d, this refactoring contributes to a performance improvement by reducing the number of "if" branches that need to be checked on a per-row basis when sending field representations in text or CSV mode. The performance benchmark results showed ~5% performance gain in text or CSV mode. Author: Sutou Kouhei <kou@clear-code.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com
This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API is not a good thing.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote: > > Refactor COPY FROM to use format callback functions. > > This commit introduces a new CopyFromRoutine struct, which is a set of > callback routines to read tuples in a specific format. It also makes > COPY FROM with the existing formats (text, CSV, and binary) utilize > these format callbacks. > > This change is a preliminary step towards making the COPY FROM command > extensible in terms of input formats. > > Similar to 2e4127b6d2d, this refactoring contributes to a performance > improvement by reducing the number of "if" branches that need to be > checked on a per-row basis when sending field representations in text > or CSV mode. The performance benchmark results showed ~5% performance > gain in text or CSV mode. > > Author: Sutou Kouhei <kou@clear-code.com> > Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> > Reviewed-by: Michael Paquier <michael@paquier.xyz> > Reviewed-by: Andres Freund <andres@anarazel.de> > Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> > Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> > Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com > > > > This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API isnot a good thing. > Thank you for pointing it out. I've just posted my analysis[1] and am planning to revive that API (Sutou-san already proposed an idea). Could you please check if the idea would work for file_text_array_fdw? Regards, [1] https://www.postgresql.org/message-id/CAD21AoDr13%3Ddx%2Bk8gmQnR5_bY%2BNskyN4mbSWN0KhQncL6xuPMA%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote: > On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote: >> >> On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote: >> >> Refactor COPY FROM to use format callback functions. >> >> This commit introduces a new CopyFromRoutine struct, which is a set of >> callback routines to read tuples in a specific format. It also makes >> COPY FROM with the existing formats (text, CSV, and binary) utilize >> these format callbacks. >> >> This change is a preliminary step towards making the COPY FROM command >> extensible in terms of input formats. >> >> Similar to 2e4127b6d2d, this refactoring contributes to a performance >> improvement by reducing the number of "if" branches that need to be >> checked on a per-row basis when sending field representations in text >> or CSV mode. The performance benchmark results showed ~5% performance >> gain in text or CSV mode. >> >> Author: Sutou Kouhei <kou@clear-code.com> >> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> >> Reviewed-by: Michael Paquier <michael@paquier.xyz> >> Reviewed-by: Andres Freund <andres@anarazel.de> >> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> >> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> >> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com >> >> >> >> This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from API isnot a good thing. >> > Thank you for pointing it out. > > I've just posted my analysis[1] and am planning to revive that API > (Sutou-san already proposed an idea). Could you please check if the > idea would work for file_text_array_fdw? > Looks OK, I think. You could even use the Internal function further down in the file and avoid a function call. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Feb 28, 2025 at 12:14 PM Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 2025-02-28 Fr 2:55 PM, Masahiko Sawada wrote: > > On Fri, Feb 28, 2025 at 11:47 AM Andrew Dunstan <andrew@dunslane.net> wrote: > >> > >> On 2025-02-28 Fr 1:31 PM, Masahiko Sawada wrote: > >> > >> Refactor COPY FROM to use format callback functions. > >> > >> This commit introduces a new CopyFromRoutine struct, which is a set of > >> callback routines to read tuples in a specific format. It also makes > >> COPY FROM with the existing formats (text, CSV, and binary) utilize > >> these format callbacks. > >> > >> This change is a preliminary step towards making the COPY FROM command > >> extensible in terms of input formats. > >> > >> Similar to 2e4127b6d2d, this refactoring contributes to a performance > >> improvement by reducing the number of "if" branches that need to be > >> checked on a per-row basis when sending field representations in text > >> or CSV mode. The performance benchmark results showed ~5% performance > >> gain in text or CSV mode. > >> > >> Author: Sutou Kouhei <kou@clear-code.com> > >> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> > >> Reviewed-by: Michael Paquier <michael@paquier.xyz> > >> Reviewed-by: Andres Freund <andres@anarazel.de> > >> Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com> > >> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> > >> Discussion: https://postgr.es/m/20231204.153548.2126325458835528809.kou@clear-code.com > >> > >> > >> > >> This patch has completely broken the file_textarray fdw, which uses NextCopyFromRawFields(). Removing that from APIis not a good thing. > >> > > Thank you for pointing it out. > > > > I've just posted my analysis[1] and am planning to revive that API > > (Sutou-san already proposed an idea). Could you please check if the > > idea would work for file_text_array_fdw? > > > > Looks OK, I think. You could even use the Internal function further down > in the file and avoid a function call. Right. I've attached the updated patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Вложения
Hi,
Thanks for following up the patch!
In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
"Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Right. I've attached the updated patch.
In general, this approach will work but could you keep
the same signature for backward compatibility?
> --- a/src/backend/commands/copyfromparse.c
> +++ b/src/backend/commands/copyfromparse.c
> +bool
> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> +{
> + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> +}
NextCopyFromRawFields() uses
NextCopyFromRawFields(CopyFromState cstate, char ***fields,
int *nfields) (no "bool is_csv") before we remove it. So
could you use the no "bool is_csv" signature for backward
compatibility?
bool
NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
{
return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
}
> @@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
> /*
> * Read raw fields in the next line for COPY FROM in text or csv mode.
> * Return false if no more lines.
> + */
> +bool
> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> +{
> + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> +}
> +
> +/*
> + * Workhorse for NextCopyFromRawFields().
It may be better that we don't split docs for
NextCopyFromRawFields() and
NextCopyFromRawFieldsInternal(). How about referring the
NextCopyFromRawFieldsInternal() doc from the
NextCopyFromRawFields() doc something like the following?
/*
* See NextCopyFromRawFieldsInternal() for details.
*/
bool
NextCopyFromRawFields(...)
{
...
}
/*
* Workhorse for NextCopyFromRawFields().
*
* Read raw fields ...
*
* ...
*/
static pg_attribute_always_inline bool
NextCopyFromRawFieldsInternal()
{
...
}
Thanks,
--
kou
On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:
>
> Hi,
>
> Thanks for following up the patch!
>
> In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
> "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
> Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> > Right. I've attached the updated patch.
>
> In general, this approach will work but could you keep
> the same signature for backward compatibility?
>
> > --- a/src/backend/commands/copyfromparse.c
> > +++ b/src/backend/commands/copyfromparse.c
>
> > +bool
> > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> > +{
> > + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> > +}
>
> NextCopyFromRawFields() uses
> NextCopyFromRawFields(CopyFromState cstate, char ***fields,
> int *nfields) (no "bool is_csv") before we remove it. So
> could you use the no "bool is_csv" signature for backward
> compatibility?
>
> bool
> NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
> {
> return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
> }
I like this idea.
>
>
> > @@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
> > /*
> > * Read raw fields in the next line for COPY FROM in text or csv mode.
> > * Return false if no more lines.
> > + */
> > +bool
> > +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> > +{
> > + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> > +}
> > +
> > +/*
> > + * Workhorse for NextCopyFromRawFields().
>
> It may be better that we don't split docs for
> NextCopyFromRawFields() and
> NextCopyFromRawFieldsInternal(). How about referring the
> NextCopyFromRawFieldsInternal() doc from the
> NextCopyFromRawFields() doc something like the following?
>
> /*
> * See NextCopyFromRawFieldsInternal() for details.
> */
> bool
> NextCopyFromRawFields(...)
> {
> ...
> }
>
> /*
> * Workhorse for NextCopyFromRawFields().
> *
> * Read raw fields ...
> *
> * ...
> */
> static pg_attribute_always_inline bool
> NextCopyFromRawFieldsInternal()
> {
> ...
> }
>
Agreed.
I've updated the patch. I'm going to push it, barring any objections
and further comments.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Вложения
On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote:
On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:Hi, Thanks for following up the patch! In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com> "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800, Masahiko Sawada <sawada.mshk@gmail.com> wrote:Right. I've attached the updated patch.In general, this approach will work but could you keep the same signature for backward compatibility?--- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c+bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +}NextCopyFromRawFields() uses NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) (no "bool is_csv") before we remove it. So could you use the no "bool is_csv" signature for backward compatibility? bool NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields) { return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode); }I like this idea.@@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes) /* * Read raw fields in the next line for COPY FROM in text or csv mode. * Return false if no more lines. + */ +bool +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv) +{ + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv); +} + +/* + * Workhorse for NextCopyFromRawFields().It may be better that we don't split docs for NextCopyFromRawFields() and NextCopyFromRawFieldsInternal(). How about referring the NextCopyFromRawFieldsInternal() doc from the NextCopyFromRawFields() doc something like the following? /* * See NextCopyFromRawFieldsInternal() for details. */ bool NextCopyFromRawFields(...) { ... } /* * Workhorse for NextCopyFromRawFields(). * * Read raw fields ... * * ... */ static pg_attribute_always_inline bool NextCopyFromRawFieldsInternal() { ... }Agreed. I've updated the patch. I'm going to push it, barring any objections and further comments.
I'm OK either way - I have made changes to adapt to the API change, and tested them.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Feb 28, 2025 at 3:06 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 2025-02-28 Fr 5:39 PM, Masahiko Sawada wrote:
>
> On Fri, Feb 28, 2025 at 2:16 PM Sutou Kouhei <kou@clear-code.com> wrote:
>
> Hi,
>
> Thanks for following up the patch!
>
> In <CAD21AoBA414Q76LthY65NJfWbjOxXn1bdFFsD_NBhT2wPUS1SQ@mail.gmail.com>
> "Re: pgsql: Refactor COPY FROM to use format callback functions." on Fri, 28 Feb 2025 12:56:19 -0800,
> Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Right. I've attached the updated patch.
>
> In general, this approach will work but could you keep
> the same signature for backward compatibility?
>
> --- a/src/backend/commands/copyfromparse.c
> +++ b/src/backend/commands/copyfromparse.c
>
> +bool
> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> +{
> + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> +}
>
> NextCopyFromRawFields() uses
> NextCopyFromRawFields(CopyFromState cstate, char ***fields,
> int *nfields) (no "bool is_csv") before we remove it. So
> could you use the no "bool is_csv" signature for backward
> compatibility?
>
> bool
> NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields)
> {
> return NextCopyFromRawFieldsInternal(cstate, fields, nfields, cstate->opts.csv_mode);
> }
>
> I like this idea.
>
>
> @@ -738,6 +742,15 @@ CopyReadBinaryData(CopyFromState cstate, char *dest, int nbytes)
> /*
> * Read raw fields in the next line for COPY FROM in text or csv mode.
> * Return false if no more lines.
> + */
> +bool
> +NextCopyFromRawFields(CopyFromState cstate, char ***fields, int *nfields, bool is_csv)
> +{
> + return NextCopyFromRawFieldsInternal(cstate, fields, nfields, is_csv);
> +}
> +
> +/*
> + * Workhorse for NextCopyFromRawFields().
>
> It may be better that we don't split docs for
> NextCopyFromRawFields() and
> NextCopyFromRawFieldsInternal(). How about referring the
> NextCopyFromRawFieldsInternal() doc from the
> NextCopyFromRawFields() doc something like the following?
>
> /*
> * See NextCopyFromRawFieldsInternal() for details.
> */
> bool
> NextCopyFromRawFields(...)
> {
> ...
> }
>
> /*
> * Workhorse for NextCopyFromRawFields().
> *
> * Read raw fields ...
> *
> * ...
> */
> static pg_attribute_always_inline bool
> NextCopyFromRawFieldsInternal()
> {
> ...
> }
>
> Agreed.
>
> I've updated the patch. I'm going to push it, barring any objections
> and further comments.
>
>
>
> I'm OK either way - I have made changes to adapt to the API change, and tested them.
>
Thank you for the confirmation. Pushed the fix.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com