Thread: zstd compression for pg_dump


zstd compression for pg_dump

From: Justin Pryzby
Date:
This is a draft patch - review is welcome and would help to get this
ready to be considered for v16, if desired.

I'm going to add this thread to the old CF entry.
https://commitfest.postgresql.org/31/2888/

-- 
Justin

Attachments

Re: zstd compression for pg_dump

From: Michael Paquier
Date:
On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considered for v16, if desired.
>
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/

Patch 0003 adds support for the --long option of zstd, meaning that it
"enables long distance matching with #windowLog".  What's the benefit
of that when it is applied to dumps and base backup contents?
--
Michael

Attachments

Re: zstd compression for pg_dump

From: Justin Pryzby
Date:
On Sat, Feb 25, 2023 at 01:44:36PM +0900, Michael Paquier wrote:
> On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> > This is a draft patch - review is welcome and would help to get this
> > ready to be considered for v16, if desired.
> > 
> > I'm going to add this thread to the old CF entry.
> > https://commitfest.postgresql.org/31/2888/
> 
> Patch 0003 adds support for the --long option of zstd, meaning that it
> "enables long distance matching with #windowLog".  What's the benefit
> of that when it is applied to dumps and base backup contents?

It can make the output smaller.

+           The <literal>long</literal> keyword enables long-distance matching
+           mode, for improved compression ratio, at the expense of higher memory
+           use.  Long-distance mode is supported only for 

+        With zstd compression, <literal>long</literal> mode may allow dumps
+        to be significantly smaller, but it might not reduce the size of
+        custom or directory format dumps, whose fields are separately compressed.
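
For reference, long mode maps onto libzstd's advanced API (stable since
v1.4.0) roughly like this - a minimal sketch, with a made-up helper name
and error handling reduced to passing the code back:

    #include <zstd.h>

    /*
     * Illustrative helper: enable long-distance matching on a compression
     * context.  windowLog = 0 keeps the library default; larger values
     * trade memory for the chance of finding matches farther back.
     */
    static size_t
    enable_zstd_long_mode(ZSTD_CCtx *cctx, int windowLog)
    {
        size_t      ret;

        ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
        if (ZSTD_isError(ret))
            return ret;

        return ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, windowLog);
    }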

Note that I included that here as 003, but I also have a pre-existing
patch for adding that just to basebackup.

-- 
Justin



Re: zstd compression for pg_dump

From: Tomas Vondra
Date:
On 2/24/23 20:18, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considered for v16, if desired.
> 
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/
> 

Thanks. Sadly cfbot is unhappy - the windows and cplusplus builds failed
because of some issue in pg_backup_archiver.h. But it's a bit bizarre
because the patch does not modify that file at all ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From: "Euler Taveira"
Date:
On Sat, Feb 25, 2023, at 7:31 AM, Tomas Vondra wrote:
> On 2/24/23 20:18, Justin Pryzby wrote:
> > This is a draft patch - review is welcome and would help to get this
> > ready to be considered for v16, if desired.
> >
> > I'm going to add this thread to the old CF entry.
>
> Thanks. Sadly cfbot is unhappy - the windows and cplusplus builds failed
> because of some issue in pg_backup_archiver.h. But it's a bit bizarre
> because the patch does not modify that file at all ...

cpluspluscheck says

        # pg_dump is not C++-clean because it uses "public" and "namespace"
        # as field names, which is unfortunate but we won't change it now.

Hence, the patch should exclude the new header file from it.

--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -153,6 +153,7 @@ do
    test "$f" = src/bin/pg_dump/compress_gzip.h && continue
    test "$f" = src/bin/pg_dump/compress_io.h && continue
    test "$f" = src/bin/pg_dump/compress_lz4.h && continue
+   test "$f" = src/bin/pg_dump/compress_zstd.h && continue
    test "$f" = src/bin/pg_dump/compress_none.h && continue
    test "$f" = src/bin/pg_dump/parallel.h && continue
    test "$f" = src/bin/pg_dump/pg_backup_archiver.h && continue


--
Euler Taveira

Re: zstd compression for pg_dump

From: Justin Pryzby
Date:
On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considered for v16, if desired.
> 
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/

This resolves cfbot warnings: windows and cppcheck.
And refactors zstd routines.
And updates docs.
And includes some fixes for earlier patches that these patches conflict
with/depend on.

Attachments

Re: zstd compression for pg_dump

From: Jacob Champion
Date:
On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> This resolves cfbot warnings: windows and cppcheck.
> And refactors zstd routines.
> And updates docs.
> And includes some fixes for earlier patches that these patches conflict
> with/depend on.

This'll need a rebase (cfbot took a while to catch up). The patchset
includes basebackup modifications, which are part of a different CF
entry; was that intended?

I tried this on a local, 3.5GB, mostly-text table (from the UK Price
Paid dataset [1]) and the comparison against the other methods was
impressive. (I'm no good at constructing compression benchmarks, so
this is a super naive setup. Client's on the same laptop as the
server.)

    $ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z zstd > /tmp/zstd.dump
    real    1m17.632s
    user    0m35.521s
    sys    0m2.683s

    $ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z lz4 > /tmp/lz4.dump
    real    1m13.125s
    user    0m19.795s
    sys    0m3.370s

    $ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z gzip > /tmp/gzip.dump
    real    2m24.523s
    user    2m22.114s
    sys    0m1.848s

    $ ls -l /tmp/*.dump
    -rw-rw-r-- 1 jacob jacob 1331493925 Mar  3 09:45 /tmp/gzip.dump
    -rw-rw-r-- 1 jacob jacob 2125998939 Mar  3 09:42 /tmp/lz4.dump
    -rw-rw-r-- 1 jacob jacob 1215834718 Mar  3 09:40 /tmp/zstd.dump

Default gzip was the only method that bottlenecked on pg_dump rather
than the server, and default zstd outcompressed it at a fraction of
the CPU time. So, naively, this looks really good.

With this particular dataset, I don't see much improvement with
zstd:long. (At nearly double the CPU time, I get a <1% improvement in
compression size.) I assume it's heavily data dependent, but from the
notes on --long [2] it seems like they expect you to play around with
the window size to further tailor it to your data. Does it make sense
to provide the long option without the windowLog parameter?

Thanks,
--Jacob

[1] https://landregistry.data.gov.uk/
[2] https://github.com/facebook/zstd/releases/tag/v1.3.2



Re: zstd compression for pg_dump

From: Justin Pryzby
Date:
On Fri, Mar 03, 2023 at 10:32:53AM -0800, Jacob Champion wrote:
> On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > This resolves cfbot warnings: windows and cppcheck.
> > And refactors zstd routines.
> > And updates docs.
> > And includes some fixes for earlier patches that these patches conflict
> > with/depend on.
> 
> This'll need a rebase (cfbot took a while to catch up).

Soon.

> The patchset includes basebackup modifications, which are part of a
> different CF entry; was that intended?

Yes, it's intentional - if zstd:long mode were to be merged first, then
this patch should include long mode from the start.
Or, if pgdump+zstd were merged first, then long mode could be added to
both places.

> I tried this on a local, 3.5GB, mostly-text table (from the UK Price

Thanks for looking.  If your zstd library is compiled with thread
support, could you also try with :workers=N?  I believe this is working
correctly, but I'm going to ask for help verifying that...
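
For context, :workers=N presumably boils down to the following in
libzstd - a sketch only, helper name made up.  Note that a library built
without ZSTD_MULTITHREAD rejects the parameter:

    #include <stdbool.h>
    #include <zstd.h>

    /*
     * Illustrative helper: request N worker threads for compression.
     * If libzstd was built without ZSTD_MULTITHREAD, setting
     * ZSTD_c_nbWorkers >= 1 fails ("Unsupported parameter").
     */
    static bool
    set_zstd_workers(ZSTD_CCtx *cctx, int nworkers)
    {
        size_t      ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nworkers);

        return !ZSTD_isError(ret);
    }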

It'd be especially useful to test under windows, where pgdump/restore
use threads instead of forking...  If you have a windows environment but
not set up for development, I think it's possible to get cirrusci to
compile a patch for you and then retrieve the binaries provided as an
"artifact" (credit/blame for this idea should be directed to Thomas
Munro).

> With this particular dataset, I don't see much improvement with
> zstd:long.

Yeah.  I think this could be because either 1) you already got very good
compression without looking at more data; and/or 2) the neighboring data
is already very similar, maybe as similar as or more similar than the
more distant data, so there's nothing to gain from looking further back.

> (At nearly double the CPU time, I get a <1% improvement in
> compression size.) I assume it's heavily data dependent, but from the
> notes on --long [2] it seems like they expect you to play around with
> the window size to further tailor it to your data. Does it make sense
> to provide the long option without the windowLog parameter?

I don't want to start exposing lots of fine-grained parameters at this
point.  In the immediate case, it looks like it may require more than
just adding another parameter:

    Note: If windowLog is set to larger than 27, --long=windowLog or
    --memory=windowSize needs to be passed to the decompressor.

-- 
Justin



Re: zstd compression for pg_dump

From: Jacob Champion
Date:
On Fri, Mar 3, 2023 at 10:55 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Thanks for looking.  If your zstd library is compiled with thread
> support, could you also try with :workers=N ?  I believe this is working
> correctly, but I'm going to ask for help verifying that...

Unfortunately not (Ubuntu 20.04):

    pg_dump: error: could not set compression parameter: Unsupported parameter

But that lets me review the error! I think these error messages should
say which options caused them.
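
Something along these lines, maybe - a sketch only, with the helper name
and message wording being mine rather than the patch's:

    static void
    set_zstd_param_or_die(ZSTD_CCtx *cctx, ZSTD_cParameter param, int value,
                          const char *paramname)
    {
        size_t      ret = ZSTD_CCtx_setParameter(cctx, param, value);

        /* name the offending option instead of failing generically */
        if (ZSTD_isError(ret))
            pg_fatal("could not set compression parameter \"%s\": %s",
                     paramname, ZSTD_getErrorName(ret));
    }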

> It'd be especially useful to test under windows, where pgdump/restore
> use threads instead of forking...  If you have a windows environment but
> not set up for development, I think it's possible to get cirrusci to
> compile a patch for you and then retrieve the binaries provided as an
> "artifact" (credit/blame for this idea should be directed to Thomas
> Munro).

I should be able to do that next week.

> > With this particular dataset, I don't see much improvement with
> > zstd:long.
>
> Yeah.  I think this could be because either 1) you already got very good
> compression without looking at more data; and/or 2) the neighboring data
> is already very similar, maybe as similar as or more similar than the
> more distant data, so there's nothing to gain from looking further back.

What kinds of improvements do you see with your setup? I'm wondering
when we would suggest that people use it.

> I don't want to start exposing lots of fine-grained parameters at this
> point.  In the immediate case, it looks like it may require more than
> just adding another parameter:
>
>     Note: If windowLog is set to larger than 27, --long=windowLog or
>     --memory=windowSize needs to be passed to the decompressor.

Hm. That would complicate things.

Thanks,
--Jacob



Re: zstd compression for pg_dump

From: Justin Pryzby
Date:
On Fri, Mar 03, 2023 at 01:38:05PM -0800, Jacob Champion wrote:
> > > With this particular dataset, I don't see much improvement with
> > > zstd:long.
> >
> > Yeah.  I think this could be because either 1) you already got very good
> > compression without looking at more data; and/or 2) the neighboring data
> > is already very similar, maybe as similar as or more similar than the
> > more distant data, so there's nothing to gain from looking further back.
> 
> What kinds of improvements do you see with your setup? I'm wondering
> when we would suggest that people use it.

On customer data, I see small improvements - below 10%.

But on my first two tries, I made synthetic data sets where it's a lot:

$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long |wc -c
286107
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long=0 |wc -c
1709695

That's just 6 identical tables like:
pryzbyj=# CREATE TABLE t1 AS SELECT generate_series(1,999999);

In this case, "custom" format doesn't see that benefit, because the
greatest similarity is across tables, which don't share compressor
state.  But I think the note that I wrote in the docs about that should
be removed - custom format could see a big benefit, as long as the table
is big enough, and there's more similarity/repetition at longer
distances.

Here's one where custom format *does* benefit, due to long-distance
repetition within a single table.  The data is contrived, but the schema
of ID => data is not.  What's notable isn't how compressible the data
is, but how much *more* compressible it is with long-distance matching.

pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM generate_series(1,444)i,generate_series(1,99999)j GROUP BY 1;
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c
82023
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c
1048267

-- 
Justin



Re: zstd compression for pg_dump

From: Justin Pryzby
Date:
On Sat, Feb 25, 2023 at 07:22:27PM -0600, Justin Pryzby wrote:
> On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> > This is a draft patch - review is welcome and would help to get this
> > ready to be considered for v16, if desired.
> > 
> > I'm going to add this thread to the old CF entry.
> > https://commitfest.postgresql.org/31/2888/
> 
> This resolves cfbot warnings: windows and cppcheck.
> And refactors zstd routines.
> And updates docs.
> And includes some fixes for earlier patches that these patches conflict
> with/depend on.

This rebases over the TAP and doc fixes to LZ4.
And adds necessary ENV to makefile and meson.
And adds an annoying boilerplate header.
And removes supports_compression(), which is what I think Tomas meant
when referring to "annoying unsupported cases".
And updates zstd.c: fix an off-by-one, allocate in init depending on
readF/writeF, do not reset the input buffer on each iteration, and show
parameter name in errors.

I'd appreciate help checking that this is doing the right things and
works correctly with zstd threaded workers.  The zstd API says: "use one
different context per thread for parallel execution" and "For parallel
execution, use one separate ZSTD_CStream per thread".
https://github.com/facebook/zstd/blob/dev/lib/zstd.h

I understand that to mean that, if pg_dump *itself* were using threads,
then each thread would need to call ZSTD_createCStream().  pg_dump isn't
threaded, so there's nothing special needed, right?

Except that, under windows, pg_dump -Fd -j actually uses threads instead
of forking.  I *think* that's still safe, since the pgdump threads are
created *before* calling zstd functions (see _PrintTocData and
_StartData of the custom and directory formats), so it happens naturally
that there's a separate zstd stream for each thread of pgdump.

-- 
Justin

Attachments

Re: zstd compression for pg_dump

From: Jacob Champion
Date:
On Sat, Mar 4, 2023 at 8:57 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM generate_series(1,444)i,generate_series(1,99999)j GROUP BY 1;
> $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c
> 82023
> $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c
> 1048267

Nice!

I did some smoke testing against zstd's GitHub release on Windows. To
build against it, I had to construct an import library, and put that
and the DLL into the `lib` folder expected by the MSVC scripts...
which makes me wonder if I've chosen a harder way than necessary?

Parallel zstd dumps seem to work as expected, in that the resulting
pg_restore output is identical to uncompressed dumps and nothing
explodes. I haven't inspected the threading implementation for safety
yet, as you mentioned. And I still wasn't able to test :workers, since
it looks like the official libzstd for Windows isn't built for
multithreading. That'll be another day's project.

--Jacob



Re: zstd compression for pg_dump

From: Jacob Champion
Date:
Hi,

This'll need another rebase over the meson ICU changes.

On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com>
wrote:
> I did some smoke testing against zstd's GitHub release on Windows. To
> build against it, I had to construct an import library, and put that
> and the DLL into the `lib` folder expected by the MSVC scripts...
> which makes me wonder if I've chosen a harder way than necessary?

A meson wrap made this much easier! It looks like pg_dump's meson.build
is missing dependencies on zstd (meson couldn't find the headers in the
subproject without them).

> Parallel zstd dumps seem to work as expected, in that the resulting
> pg_restore output is identical to uncompressed dumps and nothing
> explodes. I haven't inspected the threading implementation for safety
> yet, as you mentioned.

Hm. Best I can tell, the CloneArchive() machinery is supposed to be
handling safety for this, by isolating each thread's state. I don't feel
comfortable pronouncing this new addition safe or not, because I'm not
sure I understand what the comments in the format-specific _Clone()
callbacks are saying yet.

> And I still wasn't able to test :workers, since
> it looks like the official libzstd for Windows isn't built for
> multithreading. That'll be another day's project.

The wrapped installation enabled threading too, so I was able to try
with :workers=8. Everything seems to work, but I didn't have a dataset
that showed speed improvements at the time. It did seem to affect the
overall compressibility negatively -- which makes sense, I think,
assuming each thread is looking at a separate and/or smaller window.

On to code (not a complete review):

>     if (hasSuffix(fname, ".gz"))
>         compression_spec.algorithm = PG_COMPRESSION_GZIP;
>     else
>     {
>         bool        exists;
> 
>         exists = (stat(path, &st) == 0);
>         /* avoid unused warning if it is not built with compression */
>         if (exists)
>             compression_spec.algorithm = PG_COMPRESSION_NONE;
> -#ifdef HAVE_LIBZ
> -       if (!exists)
> -       {
> -           free_keep_errno(fname);
> -           fname = psprintf("%s.gz", path);
> -           exists = (stat(fname, &st) == 0);
> -
> -           if (exists)
> -               compression_spec.algorithm = PG_COMPRESSION_GZIP;
> -       }
> -#endif
> -#ifdef USE_LZ4
> -       if (!exists)
> -       {
> -           free_keep_errno(fname);
> -           fname = psprintf("%s.lz4", path);
> -           exists = (stat(fname, &st) == 0);
> -
> -           if (exists)
> -               compression_spec.algorithm = PG_COMPRESSION_LZ4;
> -       }
> -#endif
> +       else if (check_compressed_file(path, &fname, "gz"))
> +           compression_spec.algorithm = PG_COMPRESSION_GZIP;
> +       else if (check_compressed_file(path, &fname, "lz4"))
> +           compression_spec.algorithm = PG_COMPRESSION_LZ4;
> +       else if (check_compressed_file(path, &fname, "zst"))
> +           compression_spec.algorithm = PG_COMPRESSION_ZSTD;
>     }

This function lost some coherence, I think. Should there be a hasSuffix
check at the top for ".zstd" (and, for that matter, ".lz4")? And the
comment references an unused warning, which is only possible with the
#ifdef blocks that were removed.

I'm a little suspicious of the replacement of supports_compression()
with parse_compress_specification(). For example:

> -   errmsg = supports_compression(AH->compression_spec);
> -   if (errmsg)
> +   parse_compress_specification(AH->compression_spec.algorithm,
> +           NULL, &compress_spec);
> +   if (compress_spec.parse_error != NULL)
>     {
>         pg_log_warning("archive is compressed, but this installation does not support compression (%s
> -                       errmsg);
> -       pg_free(errmsg);
> +                       compress_spec.parse_error);
> +       pg_free(compress_spec.parse_error);
>     }

The top-level error here is "does not support compression", but wouldn't
a bad specification option with a supported compression method trip this
path too?

> +static void
> +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream,
> +                         ZSTD_cParameter param, int value, char *paramname)

IMO we should avoid stepping on the ZSTD_ namespace with our own
internal function names.

> +   if (cs->readF != NULL)
> +   {
> +       zstdcs->dstream = ZSTD_createDStream();
> +       if (zstdcs->dstream == NULL)
> +           pg_fatal("could not initialize compression library");
> +
> +       zstdcs->input.size = ZSTD_DStreamInSize();
> +       zstdcs->input.src = pg_malloc(zstdcs->input.size);
> +
> +       zstdcs->output.size = ZSTD_DStreamOutSize();
> +       zstdcs->output.dst = pg_malloc(zstdcs->output.size + 1);
> +   }
> +
> +   if (cs->writeF != NULL)
> +   {
> +       zstdcs->cstream = ZstdCStreamParams(cs->compression_spec);
> +
> +       zstdcs->output.size = ZSTD_CStreamOutSize();
> +       zstdcs->output.dst = pg_malloc(zstdcs->output.size);
> +       zstdcs->output.pos = 0;
> +   }

This seems to suggest that both cs->readF and cs->writeF could be set,
but in that case, the output buffer gets reallocated.

I was curious about the extra byte allocated in the decompression case.
I see that ReadDataFromArchiveZstd() is null-terminating the buffer
before handing it to ahwrite(), but why does it need to do that?

> +static const char *
> +Zstd_get_error(CompressFileHandle *CFH)
> +{
> +       return strerror(errno);
> +}

Seems like this should be using the zstderror stored in the handle?
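
Roughly what I'd expect instead is something like the following; the
struct and field names here are my guesses at the patch's internals:

    static const char *
    Zstd_get_error(CompressFileHandle *CFH)
    {
        ZstdCompressorState *zstdcs = (ZstdCompressorState *) CFH->private_data;

        /* report the saved zstd error rather than whatever errno holds */
        return ZSTD_getErrorName(zstdcs->zstderror);
    }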

In ReadDataFromArchiveZstd():

> +   size_t      input_allocated_size = ZSTD_DStreamInSize();
> +   size_t      res;
> +
> +   for (;;)
> +   {
> +       size_t      cnt;
> +
> +       /*
> +        * Read compressed data.  Note that readF can resize the buffer; the
> +        * new size is tracked and used for future loops.
> +        */
> +       input->size = input_allocated_size;
> +       cnt = cs->readF(AH, (char **) unconstify(void **, &input->src), &input->size);
> +       input_allocated_size = input->size;
> +       input->size = cnt;

This is pretty complex for what it's doing. I'm a little worried that we
let the reallocated buffer escape to the caller while losing track of
how big it is. I think that works today, since there's only ever one
call per handle, but any future refactoring that allowed cs->readData()
to be called more than once would subtly break this code.

In ZstdWriteCommon():

> +       /*
> +        * Extra paranoia: avoid zero-length chunks, since a zero length chunk
> +        * is the EOF marker in the custom format. This should never happen
> +        * but...
> +        */
> +       if (output->pos > 0)
> +           cs->writeF(AH, output->dst, output->pos);
> +
> +       output->pos = 0;

Elsewhere, output->pos is set to zero before compressing, but here we do
it after, which I think leads to subtle differences in the function
preconditions. If that's an intentional difference, can the reason be
called out in a comment?

--Jacob



Re: zstd compression for pg_dump

From: Justin Pryzby
Date:
On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. To
> > build against it, I had to construct an import library, and put that
> > and the DLL into the `lib` folder expected by the MSVC scripts...
> > which makes me wonder if I've chosen a harder way than necessary?
> 
> It looks like pg_dump's meson.build is missing dependencies on zstd
> (meson couldn't find the headers in the subproject without them).

I saw that this was added for LZ4, but I hadn't added it for zstd since
I didn't run into an issue without it.  Could you check that what I've
added works for your case ?

> > Parallel zstd dumps seem to work as expected, in that the resulting
> > pg_restore output is identical to uncompressed dumps and nothing
> > explodes. I haven't inspected the threading implementation for safety
> > yet, as you mentioned.
> 
> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> handling safety for this, by isolating each thread's state. I don't feel
> comfortable pronouncing this new addition safe or not, because I'm not
> sure I understand what the comments in the format-specific _Clone()
> callbacks are saying yet.

My line of reasoning for unix is that pg_dump forks before any calls to
zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
doesn't apply to pg_dump under windows.  This is an open question.  If
there's no solid answer, I could disable/ignore the option (maybe only
under windows).

> On to code (not a complete review):
> 
> >     if (hasSuffix(fname, ".gz"))
> >         compression_spec.algorithm = PG_COMPRESSION_GZIP;
> >     else
> >     {
> >         bool        exists;
> > 
> >         exists = (stat(path, &st) == 0);
> >         /* avoid unused warning if it is not built with compression */
> >         if (exists)
> >             compression_spec.algorithm = PG_COMPRESSION_NONE;
> > -#ifdef HAVE_LIBZ
> > -       if (!exists)
> > -       {
> > -           free_keep_errno(fname);
> > -           fname = psprintf("%s.gz", path);
> > -           exists = (stat(fname, &st) == 0);
> > -
> > -           if (exists)
> > -               compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > -       }
> > -#endif
> > -#ifdef USE_LZ4
> > -       if (!exists)
> > -       {
> > -           free_keep_errno(fname);
> > -           fname = psprintf("%s.lz4", path);
> > -           exists = (stat(fname, &st) == 0);
> > -
> > -           if (exists)
> > -               compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > -       }
> > -#endif
> > +       else if (check_compressed_file(path, &fname, "gz"))
> > +           compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > +       else if (check_compressed_file(path, &fname, "lz4"))
> > +           compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > +       else if (check_compressed_file(path, &fname, "zst"))
> > +           compression_spec.algorithm = PG_COMPRESSION_ZSTD;
> >     }
> 
> This function lost some coherence, I think. Should there be a hasSuffix
> check at the top for ".zstd" (and, for that matter, ".lz4")?

The function is first checking if it was passed a filename which already
has a suffix.  And if not, it searches through a list of suffixes,
testing for an existing file with each suffix.  The search with stat()
doesn't happen if it has a suffix.  I'm having trouble seeing how the
hasSuffix() branch isn't dead code.  Another open question.

> I'm a little suspicious of the replacement of supports_compression()
> with parse_compress_specification(). For example:
> 
> > -   errmsg = supports_compression(AH->compression_spec);
> > -   if (errmsg)
> > +   parse_compress_specification(AH->compression_spec.algorithm,
> > +           NULL, &compress_spec);
> > +   if (compress_spec.parse_error != NULL)
> >     {
> >         pg_log_warning("archive is compressed, but this installation does not support compression (%s
> > -                       errmsg);
> > -       pg_free(errmsg);
> > +                       compress_spec.parse_error);
> > +       pg_free(compress_spec.parse_error);
> >     }
> 
> The top-level error here is "does not support compression", but wouldn't
> a bad specification option with a supported compression method trip this
> path too?

No - since the 2nd argument is passed as NULL, it just checks whether
the compression is supported.  Maybe there ought to be a more
direct/clean way to do it.  But up to now evidently nobody needed to do
that.

> > +static void
> > +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream,
> > +                         ZSTD_cParameter param, int value, char *paramname)
> 
> IMO we should avoid stepping on the ZSTD_ namespace with our own
> internal function names.

done

> > +   if (cs->readF != NULL)
> > +
> > +   if (cs->writeF != NULL)
> 
> This seems to suggest that both cs->readF and cs->writeF could be set,
> but in that case, the output buffer gets reallocated.

I put back an assertion that exactly one of them was set, since that's
true of how it currently works.

> I was curious about the extra byte allocated in the decompression case.
> I see that ReadDataFromArchiveZstd() is null-terminating the buffer
> before handing it to ahwrite(), but why does it need to do that?

I was trying to figure that out, too.  I think the unterminated case
might be for ExecuteSqlCommandBuf(), and that may only (have) been
needed to allow pg_restore to handle ancient/development versions of
pg_dump...  It's not currently hit.
https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_db.c.gcov.html#470

I found that the terminator for the uncompressed case was added at
e8f69be05 and removed in bf9aa490d.

> > +Zstd_get_error(CompressFileHandle *CFH)
> 
> Seems like this should be using the zstderror stored in the handle?

Yes - I'd already addressed that locally.

> In ReadDataFromArchiveZstd():
> 
> > +        * Read compressed data.  Note that readF can resize the buffer; the
> > +        * new size is tracked and used for future loops.
> This is pretty complex for what it's doing. I'm a little worried that we
> let the reallocated buffer escape to the caller while losing track of
> how big it is. I think that works today, since there's only ever one
> call per handle, but any future refactoring that allowed cs->readData()
> to be called more than once would subtly break this code.

Note that nothing bad happens if we lose track of how big it is (well,
assuming that readF doesn't *shrink* the buffer).

The previous patch version didn't keep track of its new size, and the only
consequence is that readF() might resize it again on a future iteration,
even if it was already sufficiently large.

When I originally wrote it (and up until that patch version), I left
this as an XXX comment about reusing the resized buffer.  But it seemed
easy enough to fix so I did.

> In ZstdWriteCommon():
> 
> Elsewhere, output->pos is set to zero before compressing, but here we do
> it after, which I think leads to subtle differences in the function
> preconditions. If that's an intentional difference, can the reason be
> called out in a comment?

It's not deliberate.  I think it had no effect, but changed - thanks.

-- 
Justin

Attachments

Re: zstd compression for pg_dump

From: Tomas Vondra
Date:

On 3/16/23 05:50, Justin Pryzby wrote:
> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>> build against it, I had to construct an import library, and put that
>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>> which makes me wonder if I've chosen a harder way than necessary?
>>
>> It looks like pg_dump's meson.build is missing dependencies on zstd
>> (meson couldn't find the headers in the subproject without them).
> 
> I saw that this was added for LZ4, but I hadn't added it for zstd since
> I didn't run into an issue without it.  Could you check that what I've
> added works for your case ?
> 
>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>> pg_restore output is identical to uncompressed dumps and nothing
>>> explodes. I haven't inspected the threading implementation for safety
>>> yet, as you mentioned.
>>
>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>> handling safety for this, by isolating each thread's state. I don't feel
>> comfortable pronouncing this new addition safe or not, because I'm not
>> sure I understand what the comments in the format-specific _Clone()
>> callbacks are saying yet.
> 
> My line of reasoning for unix is that pg_dump forks before any calls to
> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> doesn't apply to pg_dump under windows.  This is an open question.  If
> there's no solid answer, I could disable/ignore the option (maybe only
> under windows).
> 

I may be missing something, but why would the patch affect this? Why
would it even affect safety of the parallel dump? And I don't see any
changes to the clone stuff ...

>> On to code (not a complete review):
>>
>>>     if (hasSuffix(fname, ".gz"))
>>>         compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>>     else
>>>     {
>>>         bool        exists;
>>>
>>>         exists = (stat(path, &st) == 0);
>>>         /* avoid unused warning if it is not built with compression */
>>>         if (exists)
>>>             compression_spec.algorithm = PG_COMPRESSION_NONE;
>>> -#ifdef HAVE_LIBZ
>>> -       if (!exists)
>>> -       {
>>> -           free_keep_errno(fname);
>>> -           fname = psprintf("%s.gz", path);
>>> -           exists = (stat(fname, &st) == 0);
>>> -
>>> -           if (exists)
>>> -               compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> -       }
>>> -#endif
>>> -#ifdef USE_LZ4
>>> -       if (!exists)
>>> -       {
>>> -           free_keep_errno(fname);
>>> -           fname = psprintf("%s.lz4", path);
>>> -           exists = (stat(fname, &st) == 0);
>>> -
>>> -           if (exists)
>>> -               compression_spec.algorithm = PG_COMPRESSION_LZ4;
>>> -       }
>>> -#endif
>>> +       else if (check_compressed_file(path, &fname, "gz"))
>>> +           compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> +       else if (check_compressed_file(path, &fname, "lz4"))
>>> +           compression_spec.algorithm = PG_COMPRESSION_LZ4;
>>> +       else if (check_compressed_file(path, &fname, "zst"))
>>> +           compression_spec.algorithm = PG_COMPRESSION_ZSTD;
>>>     }
>>
>> This function lost some coherence, I think. Should there be a hasSuffix
>> check at the top for ".zstd" (and, for that matter, ".lz4")?
> 

This was discussed in the lz4 thread a couple of days ago, and I think
there should be hasSuffix() cases for lz4/zstd too, not just for .gz.

> The function is first checking if it was passed a filename which already
> has a suffix.  And if not, it searches through a list of suffixes,
> testing for an existing file with each suffix.  The search with stat()
> doesn't happen if it has a suffix.  I'm having trouble seeing how the
> hasSuffix() branch isn't dead code.  Another open question.
> 

AFAICS it's done this way because of this comment in pg_backup_directory

 * ...
 * ".gz" suffix is added to the filenames. The TOC files are never
 * compressed by pg_dump, however they are accepted with the .gz suffix
 * too, in case the user has manually compressed them with 'gzip'.

I haven't tried, but I believe that if you manually compress the
directory, it may hit this branch. And IMO if we support that for gzip,
the other compression methods should do that too for consistency.

In any case, it's a tiny amount of code and I don't feel like ripping
that out when it might break some currently supported use case.

>> I'm a little suspicious of the replacement of supports_compression()
>> with parse_compress_specification(). For example:
>>
>>> -   errmsg = supports_compression(AH->compression_spec);
>>> -   if (errmsg)
>>> +   parse_compress_specification(AH->compression_spec.algorithm,
>>> +           NULL, &compress_spec);
>>> +   if (compress_spec.parse_error != NULL)
>>>     {
>>>         pg_log_warning("archive is compressed, but this installation does not support compression (%s
>>> -                       errmsg);
>>> -       pg_free(errmsg);
>>> +                       compress_spec.parse_error);
>>> +       pg_free(compress_spec.parse_error);
>>>     }
>>
>> The top-level error here is "does not support compression", but wouldn't
>> a bad specification option with a supported compression method trip this
>> path too?
> 
> No - since the 2nd argument is passed as NULL, it just checks whether
> the compression is supported.  Maybe there ought to be a more
> direct/clean way to do it.  But up to now evidently nobody needed to do
> that.
> 

I don't think the patch can use parse_compress_specification() to
replace supports_compression(). The parsing simply determines if the
build has the library, it doesn't say if a particular tool was modified
to support the algorithm. I might build --with-zstd and yet pg_dump does
not support that algorithm yet.

Even after we add zstd to pg_dump, it's quite likely other compression
algorithms may not be supported by pg_dump from day 1.
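
For readers following along, the helper in question looks roughly like
this (modeled on src/bin/pg_dump/compress_io.c as of this thread; the
zstd branch is what the patch would add, and details may differ):

    /*
     * Returns NULL if this build of pg_dump supports the algorithm,
     * otherwise an error message the caller must free.
     */
    char *
    supports_compression(const pg_compress_specification compression_spec)
    {
        const pg_compress_algorithm algorithm = compression_spec.algorithm;
        bool        supported = false;

        if (algorithm == PG_COMPRESSION_NONE)
            supported = true;
    #ifdef HAVE_LIBZ
        if (algorithm == PG_COMPRESSION_GZIP)
            supported = true;
    #endif
    #ifdef USE_LZ4
        if (algorithm == PG_COMPRESSION_LZ4)
            supported = true;
    #endif
    #ifdef USE_ZSTD
        if (algorithm == PG_COMPRESSION_ZSTD)
            supported = true;
    #endif

        if (!supported)
            return psprintf("this build does not support compression with %s",
                            get_compress_algorithm_name(algorithm));

        return NULL;
    }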


I haven't looked at / tested the patch yet, but I wonder if you have any
thoughts regarding the size_t / int tweaks. I don't know what types the
zstd library uses, how it reports errors, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From: Tomas Vondra
Date:
Hi,

On 3/17/23 03:43, Tomas Vondra wrote:
> 
> ...
>
>>> I'm a little suspicious of the replacement of supports_compression()
>>> with parse_compress_specification(). For example:
>>>
>>>> -   errmsg = supports_compression(AH->compression_spec);
>>>> -   if (errmsg)
>>>> +   parse_compress_specification(AH->compression_spec.algorithm,
>>>> +           NULL, &compress_spec);
>>>> +   if (compress_spec.parse_error != NULL)
>>>>     {
>>>>         pg_log_warning("archive is compressed, but this installation does not support compression (%s
>>>> -                       errmsg);
>>>> -       pg_free(errmsg);
>>>> +                       compress_spec.parse_error);
>>>> +       pg_free(compress_spec.parse_error);
>>>>     }
>>>
>>> The top-level error here is "does not support compression", but wouldn't
>>> a bad specification option with a supported compression method trip this
>>> path too?
>>
>> No - since the 2nd argument is passed as NULL, it just checks whether
>> the compression is supported.  Maybe there ought to be a more
>> direct/clean way to do it.  But up to now evidently nobody needed to do
>> that.
>>
> 
> I don't think the patch can use parse_compress_specification() to
> replace supports_compression(). The parsing simply determines if the
> build has the library, it doesn't say if a particular tool was modified
> to support the algorithm. I might build --with-zstd and yet pg_dump does
> not support that algorithm yet.
> 
> Even after we add zstd to pg_dump, it's quite likely other compression
> algorithms may not be supported by pg_dump from day 1.
> 
> 
> I haven't looked at / tested the patch yet, but I wonder if you have any
> thoughts regarding the size_t / int tweaks. I don't know what types the
> zstd library uses, how it reports errors, etc.
> 

Any thoughts regarding my comments on removing supports_compression()?

Also, this patch needs a rebase to adapt it to the API changes from last
week. The sooner the better, considering we're getting fairly close to
the end of the CF and code freeze.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From: Justin Pryzby
Date:
On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> On 3/16/23 05:50, Justin Pryzby wrote:
> > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> >>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>> build against it, I had to construct an import library, and put that
> >>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>> which makes me wonder if I've chosen a harder way than necessary?
> >>
> >> It looks like pg_dump's meson.build is missing dependencies on zstd
> >> (meson couldn't find the headers in the subproject without them).
> > 
> > I saw that this was added for LZ4, but I hadn't added it for zstd since
> > I didn't run into an issue without it.  Could you check that what I've
> > added works for your case ?
> > 
> >>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>> pg_restore output is identical to uncompressed dumps and nothing
> >>> explodes. I haven't inspected the threading implementation for safety
> >>> yet, as you mentioned.
> >>
> >> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >> handling safety for this, by isolating each thread's state. I don't feel
> >> comfortable pronouncing this new addition safe or not, because I'm not
> >> sure I understand what the comments in the format-specific _Clone()
> >> callbacks are saying yet.
> > 
> > My line of reasoning for unix is that pg_dump forks before any calls to
> > zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> > doesn't apply to pg_dump under windows.  This is an open question.  If
> > there's no solid answer, I could disable/ignore the option (maybe only
> > under windows).
> 
> I may be missing something, but why would the patch affect this? Why
> would it even affect safety of the parallel dump? And I don't see any
> changes to the clone stuff ...

zstd supports using threads during compression, with -Z zstd:workers=N.
When unix forks, the child processes can't do anything to mess up the
state of the parent processes.  

But windows pg_dump uses threads instead of forking, so it seems
possible that the pg_dump -j threads that then spawn zstd threads could
"leak threads" and break the main thread.  I suspect there's no issue,
but we still ought to verify that before declaring it safe.

> > The function is first checking if it was passed a filename which already
> > has a suffix.  And if not, it searches through a list of suffixes,
> > testing for an existing file with each suffix.  The search with stat()
> > doesn't happen if it has a suffix.  I'm having trouble seeing how the
> > hasSuffix() branch isn't dead code.  Another open question.
> 
> AFAICS it's done this way because of this comment in pg_backup_directory
> 
>  * ...
>  * ".gz" suffix is added to the filenames. The TOC files are never
>  * compressed by pg_dump, however they are accepted with the .gz suffix
>  * too, in case the user has manually compressed them with 'gzip'.
> 
> I haven't tried, but I believe that if you manually compress the
> directory, it may hit this branch.

That would make sense, but when I tried, it didn't work like that.
The filenames were all uncompressed names.  Maybe it worked differently
in an older release.  Or maybe it changed during development of the
parallel-directory-dump patch and it's actually dead code.

This is rebased over the updated compression API.

It seems like I misunderstood something you said before, so now I put
back "supports_compression()".

-- 
Justin

Attachments

Re: zstd compression for pg_dump

From: Tomas Vondra
Date:

On 3/27/23 19:28, Justin Pryzby wrote:
> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>> On 3/16/23 05:50, Justin Pryzby wrote:
>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
>>>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>>>> build against it, I had to construct an import library, and put that
>>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>>>> which makes me wonder if I've chosen a harder way than necessary?
>>>>
>>>> It looks like pg_dump's meson.build is missing dependencies on zstd
>>>> (meson couldn't find the headers in the subproject without them).
>>>
>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
>>> I didn't run into an issue without it.  Could you check that what I've
>>> added works for your case ?
>>>
>>>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>>>> pg_restore output is identical to uncompressed dumps and nothing
>>>>> explodes. I haven't inspected the threading implementation for safety
>>>>> yet, as you mentioned.
>>>>
>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>>>> handling safety for this, by isolating each thread's state. I don't feel
>>>> comfortable pronouncing this new addition safe or not, because I'm not
>>>> sure I understand what the comments in the format-specific _Clone()
>>>> callbacks are saying yet.
>>>
>>> My line of reasoning for unix is that pg_dump forks before any calls to
>>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
>>> doesn't apply to pg_dump under windows.  This is an open question.  If
>>> there's no solid answer, I could disable/ignore the option (maybe only
>>> under windows).
>>
>> I may be missing something, but why would the patch affect this? Why
>> would it even affect safety of the parallel dump? And I don't see any
>> changes to the clone stuff ...
> 
> zstd supports using threads during compression, with -Z zstd:workers=N.
> When unix forks, the child processes can't do anything to mess up the
> state of the parent processes.  
> 
> But windows pg_dump uses threads instead of forking, so it seems
> possible that the pg_dump -j threads that then spawn zstd threads could
> "leak threads" and break the main thread.  I suspect there's no issue,
> but we still ought to verify that before declaring it safe.
> 

OK. I don't have access to a Windows machine so I can't test that. Is it
possible to disable the zstd threading, until we figure this out?

>>> The function is first checking if it was passed a filename which already
>>> has a suffix.  And if not, it searches through a list of suffixes,
>>> testing for an existing file with each suffix.  The search with stat()
>>> doesn't happen if it has a suffix.  I'm having trouble seeing how the
>>> hasSuffix() branch isn't dead code.  Another open question.
>>
>> AFAICS it's done this way because of this comment in pg_backup_directory
>>
>>  * ...
>>  * ".gz" suffix is added to the filenames. The TOC files are never
>>  * compressed by pg_dump, however they are accepted with the .gz suffix
>>  * too, in case the user has manually compressed them with 'gzip'.
>>
>> I haven't tried, but I believe that if you manually compress the
>> directory, it may hit this branch.
> 
> That would make sense, but when I tried, it didn't work like that.
> The filenames were all uncompressed names.  Maybe it worked differently
> in an older release.  Or maybe it changed during development of the
> parallel-directory-dump patch and it's actually dead code.
> 

Interesting. Would be good to find out. I wonder if a little bit of
git-log digging could tell us more. Anyway, until we confirm it's dead
code, we should probably do what .gz does and have the same check for
.lz4 and .zst files.

> This is rebased over the updated compression API.
> 
> It seems like I misunderstood something you said before, so now I put
> back "supports_compression()".
> 

Thanks! I need to do a bit more testing and review, but it seems pretty
much RFC to me, assuming we can figure out what to do about threading.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From: Kirk Wolak
Date:
On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 3/27/23 19:28, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >> On 3/16/23 05:50, Justin Pryzby wrote:
> >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> >>>>> I did some smoke testing against zstd's GitHub release on Windows. To
> ...
> OK. I don't have access to a Windows machine so I can't test that. Is it
> possible to disable the zstd threading, until we figure this out?

Tomas, since I appear to be one of the few windows users (I use both), can I help?
I can test pg_dump... for you, easy to do.  I do about 5-10 pg_dumps a day on windows while developing.

Also, I have an AWS instance I created to build PG/Win with readline back in November.
I could give you access to that...  (you are not the only person who has made this statement here).
I've made such instances available for other Open Source developers, to support them.

Obviously I would share connection credentials privately.

Regards, Kirk

Re: zstd compression for pg_dump

From: Tomas Vondra
Date:
On 3/28/23 20:03, Kirk Wolak wrote:
> On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
> 
>     On 3/27/23 19:28, Justin Pryzby wrote:
>     > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>     >> On 3/16/23 05:50, Justin Pryzby wrote:
>     >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>     >>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion
>     <jchampion@timescale.com <mailto:jchampion@timescale.com>> wrote:
>     >>>>> I did some smoke testing against zstd's GitHub release on
>     Windows. To
>     ...
>     OK. I don't have access to a Windows machine so I can't test that. Is it
>     possible to disable the zstd threading, until we figure this out?
> 
> Tomas, since I appear to be one of the few windows users (I use both),
> can I help?
> I can test pg_dump... for you, easy to do.  I do about 5-10 pg_dumps a
> day on windows while developing.
> 

Perhaps. But I'll leave the details up to Justin - it's his patch, and
I'm not sure how to verify the threading is OK.

I'd try applying this patch, build with --with-zstd and then run the
pg_dump TAP tests, and perhaps do some manual tests.

And perhaps do the same for --with-lz4 - there's a thread [1] suggesting
we don't detect lz4 stuff on Windows, so the TAP tests do nothing.

[1] https://www.postgresql.org/message-id/ZAjL96N9ZW84U59p@msg.df7cb.de

> Also, I have an AWS instance I created to build PG/Win with readline
> back in November.
> I could give you access to that...  (you are not the only person who has
> made this statement here).
> I've made such instances available for other Open Source developers, to
> support them.
> 
>  Obviously I would share connection credentials privately.
> 

I'd rather leave the Windows stuff up to someone with more experience
with that platform. I have plenty of other stuff on my plate atm.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From: Jacob Champion
Date:
On Wed, Mar 15, 2023 at 9:50 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> > It looks like pg_dump's meson.build is missing dependencies on zstd
> > (meson couldn't find the headers in the subproject without them).
>
> I saw that this was added for LZ4, but I hadn't added it for zstd since
> I didn't run into an issue without it.  Could you check that what I've
> added works for your case ?

I thought I replied to this, sorry -- your newest patchset builds
correctly with subprojects, so the new dependency looks good to me.
Thanks!

> > Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> > handling safety for this, by isolating each thread's state. I don't feel
> > comfortable pronouncing this new addition safe or not, because I'm not
> > sure I understand what the comments in the format-specific _Clone()
> > callbacks are saying yet.
>
> My line of reasoning for unix is that pg_dump forks before any calls to
> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> doesn't apply to pg_dump under windows.  This is an open question.  If
> there's no solid answer, I could disable/ignore the option (maybe only
> under windows).

To (maybe?) move this forward a bit, note that pg_backup_custom's
_Clone() function makes sure that there is no active compressor state
at the beginning of the new thread. pg_backup_directory's
implementation has no such provision. And I don't think it can,
because the parent thread might have concurrently set one up -- see
the directory-specific implementation of _CloseArchive(). Perhaps we
should just NULL out those fields after the copy, instead?
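
Concretely, the suggestion is something like the following for
pg_backup_directory.c; field names follow my reading of the sources, so
treat it as an untested sketch rather than a fix:

    static void
    _Clone(ArchiveHandle *AH)
    {
        lclContext *ctx = (lclContext *) AH->formatData;

        AH->formatData = pg_malloc(sizeof(lclContext));
        memcpy(AH->formatData, ctx, sizeof(lclContext));
        ctx = (lclContext *) AH->formatData;

        /*
         * Don't inherit handles the parent thread may still be using; the
         * child must open its own files (and compressor state) from scratch.
         */
        ctx->dataFH = NULL;
        ctx->LOsTocFH = NULL;
    }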

To illustrate why I think this is tough to characterize: if I've read
the code correctly, the _Clone() and CloneArchive() implementations
are running concurrently with code that is actively modifying the
ArchiveHandle and the lclContext. So safety is only ensured to the
extent that we keep track of which fields threads are allowed to
touch, and I don't have that mental model.

--Jacob



Re: zstd compression for pg_dump

From: Justin Pryzby
Date:
On Tue, Mar 28, 2023 at 02:03:49PM -0400, Kirk Wolak wrote:
> On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > On 3/27/23 19:28, Justin Pryzby wrote:
> > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> > >> On 3/16/23 05:50, Justin Pryzby wrote:
> > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> > >>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> > >>>>> I did some smoke testing against zstd's GitHub release on Windows. To
> > ...
> > OK. I don't have access to a Windows machine so I can't test that. Is it
> > possible to disable the zstd threading, until we figure this out?
>
> Tomas, since I appear to be one of the few windows users (I use both), can I help?
> I can test pg_dump... for you, easy to do.  I do about 5-10 pg_dumps a day
> on windows while developing.

It'd be great if you'd exercise this and other changes to
pg_dump/restore.  Tomas just pushed a bugfix, so be sure to "git pull"
before testing, or else you might rediscover the bug.

If you have a zstd library with thread support, you could test with
-Z zstd:workers=3.  But I think threads aren't enabled in the common
libzstd packages.  Jacob figured out how to compile libzstd easily using
"meson wraps" - but I don't know the details.

-- 
Justin



Re: zstd compression for pg_dump

From: Jacob Champion
Date:
On Wed, Mar 29, 2023 at 6:35 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> If you have a zstd library with thread support, you could test with
> -Z zstd:workers=3.  But I think threads aren't enabled in the common
> libzstd packages.  Jacob figured out how to compile libzstd easily using
> "meson wraps" - but I don't know the details.

From the source root,

    $ mkdir subprojects
    $ meson wrap install zstd

From then on, Meson was pretty automagical about it during the ninja
build. The subproject's settings are themselves inspectable and
settable via `meson configure`:

    $ meson configure -Dzstd:<option>=<value>

--Jacob



Re: zstd compression for pg_dump

From: Justin Pryzby
Date:
On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
> On 3/27/23 19:28, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >> On 3/16/23 05:50, Justin Pryzby wrote:
> >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> >>>>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>>>> build against it, I had to construct an import library, and put that
> >>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>>>> which makes me wonder if I've chosen a harder way than necessary?
> >>>>
> >>>> It looks like pg_dump's meson.build is missing dependencies on zstd
> >>>> (meson couldn't find the headers in the subproject without them).
> >>>
> >>> I saw that this was added for LZ4, but I hadn't added it for zstd since
> >>> I didn't run into an issue without it.  Could you check that what I've
> >>> added works for your case ?
> >>>
> >>>>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>>>> pg_restore output is identical to uncompressed dumps and nothing
> >>>>> explodes. I haven't inspected the threading implementation for safety
> >>>>> yet, as you mentioned.
> >>>>
> >>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >>>> handling safety for this, by isolating each thread's state. I don't feel
> >>>> comfortable pronouncing this new addition safe or not, because I'm not
> >>>> sure I understand what the comments in the format-specific _Clone()
> >>>> callbacks are saying yet.
> >>>
> >>> My line of reasoning for unix is that pg_dump forks before any calls to
> >>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> >>> doesn't apply to pg_dump under windows.  This is an open question.  If
> >>> there's no solid answer, I could disable/ignore the option (maybe only
> >>> under windows).
> >>
> >> I may be missing something, but why would the patch affect this? Why
> >> would it even affect safety of the parallel dump? And I don't see any
> >> changes to the clone stuff ...
> > 
> > zstd supports using threads during compression, with -Z zstd:workers=N.
> > When unix forks, the child processes can't do anything to mess up the
> > state of the parent processes.  
> > 
> > But windows pg_dump uses threads instead of forking, so it seems
> > possible that the pg_dump -j threads that then spawn zstd threads could
> > "leak threads" and break the main thread.  I suspect there's no issue,
> > but we still ought to verify that before declaring it safe.
> 
> OK. I don't have access to a Windows machine so I can't test that. Is it
> possible to disable the zstd threading, until we figure this out?

I think that's what's best.  I made it issue a warning if "workers" was
specified.  It could also be an error, or just ignored.
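
Concretely, the handling amounts to something like this (a sketch only,
assuming the pg_compress_specification fields that pg_basebackup already
uses; the committed message wording may differ):

    /* sketch: warn about and ignore requested zstd worker threads */
    if ((compression_spec.options & PG_COMPRESSION_OPTION_WORKERS) != 0)
    {
        pg_log_warning("compression option \"workers\" is not currently supported by pg_dump");
        compression_spec.workers = 0;   /* proceed single-threaded */
    }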

I considered disabling workers only for windows, but realized that I
haven't tested with threads myself - my local zstd package is compiled
without threading, and I remember having some issue recompiling it with
threading.  Jacob's recipe for using meson wraps works well, but it
still seems better to leave it as a future feature.  I used that recipe
to enable zstd with threading on CI (except for linux/autoconf).

> >>> The function is first checking if it was passed a filename which already
> >>> has a suffix.  And if not, it searches through a list of suffixes,
> >>> testing for an existing file with each suffix.  The search with stat()
> >>> doesn't happen if it has a suffix.  I'm having trouble seeing how the
> >>> hasSuffix() branch isn't dead code.  Another open question.
> >>
> >> AFAICS it's done this way because of this comment in pg_backup_directory
> >>
> >>  * ...
> >>  * ".gz" suffix is added to the filenames. The TOC files are never
> >>  * compressed by pg_dump, however they are accepted with the .gz suffix
> >>  * too, in case the user has manually compressed them with 'gzip'.
> >>
> >> I haven't tried, but I believe that if you manually compress the
> >> directory, it may hit this branch.
> > 
> > That would make sense, but when I tried, it didn't work like that.
> > The filenames were all uncompressed names.  Maybe it worked differently
> > in an older release.  Or maybe it changed during development of the
> > parallel-directory-dump patch and it's actually dead code.
> 
> Interesting. Would be good to find out. I wonder if a little bit of
> git-log digging could tell us more. Anyway, until we confirm it's dead
> code, we should probably do what .gz does and have the same check for
> .lz4 and .zst files.

I found that hasSuffix() and cfopen() originated in the refactored patch
Heikki sent here; there's no history beyond that.

https://www.postgresql.org/message-id/4D3954C7.9060503%40enterprisedb.com

The patch published there appends the .gz within cfopen(), and the
caller writes into the TOC the filename without ".gz".  It seems like
maybe a few hours prior, Heikki may have been appending the ".gz" suffix
in the caller, and then writing the TOC with filename.gz.

The only way I've been able to get a "filename.gz" passed to hasSuffix
is to write a directory-format dump, with LOs, and without compression,
and then compress the blobs with "gzip", and *also* edit the blobs.toc
file to say ".gz" (which isn't necessary since, if the original file
isn't found, the restore would search for files with compressed
suffixes).
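
For reference, that restore-side search amounts to the sketch below
(illustrative only - the helper name is mine, not the actual pg_dump
code; it assumes the usual frontend headers, i.e. c.h for
MAXPGPATH/lengthof and common/fe_memutils.h for pg_strdup):

    /*
     * Try the filename stored in the TOC first; if it doesn't exist,
     * probe for variants with known compressed suffixes appended.
     */
    static char *
    find_file_with_suffix(const char *path)
    {
        static const char *const suffixes[] = {".gz", ".lz4", ".zst"};
        char        fname[MAXPGPATH];
        struct stat st;

        if (stat(path, &st) == 0)
            return pg_strdup(path);         /* stored name exists as-is */

        for (int i = 0; i < lengthof(suffixes); i++)
        {
            snprintf(fname, sizeof(fname), "%s%s", path, suffixes[i]);
            if (stat(fname, &st) == 0)
                return pg_strdup(fname);    /* compressed variant found */
        }
        return NULL;                        /* caller reports the error */
    }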

So .. it's not *technically* unreachable, but I can't see why it'd be
useful to support editing the *content* of the blob TOC (other than
compressing it).  I might give some weight to the idea if it were also
possible to edit the non-blob TOC; but, it's a binary file, so no.

For now, I made the change to make zstd and lz4 behave the same here
as .gz, unless Heikki has a memory or a git reflog going back far enough
to further support the idea that the code path isn't useful.

I'm going to set the patch as RFC as a hint to anyone who would want to
make a final review.

-- 
Justin

Attachments

Re: zstd compression for pg_dump

From
Tomas Vondra
Date:
On 4/1/23 01:16, Justin Pryzby wrote:
> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>> On 3/27/23 19:28, Justin Pryzby wrote:
>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>>>> On 3/16/23 05:50, Justin Pryzby wrote:
>>>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
>>>>>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>>>>>> build against it, I had to construct an import library, and put that
>>>>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>>>>>> which makes me wonder if I've chosen a harder way than necessary?
>>>>>>
>>>>>> It looks like pg_dump's meson.build is missing dependencies on zstd
>>>>>> (meson couldn't find the headers in the subproject without them).
>>>>>
>>>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
>>>>> I didn't run into an issue without it.  Could you check that what I've
>>>>> added works for your case?
>>>>>
>>>>>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>>>>>> pg_restore output is identical to uncompressed dumps and nothing
>>>>>>> explodes. I haven't inspected the threading implementation for safety
>>>>>>> yet, as you mentioned.
>>>>>>
>>>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>>>>>> handling safety for this, by isolating each thread's state. I don't feel
>>>>>> comfortable pronouncing this new addition safe or not, because I'm not
>>>>>> sure I understand what the comments in the format-specific _Clone()
>>>>>> callbacks are saying yet.
>>>>>
>>>>> My line of reasoning for unix is that pg_dump forks before any calls to
>>>>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
>>>>> doesn't apply to pg_dump under windows.  This is an open question.  If
>>>>> there's no solid answer, I could disable/ignore the option (maybe only
>>>>> under windows).
>>>>
>>>> I may be missing something, but why would the patch affect this? Why
>>>> would it even affect safety of the parallel dump? And I don't see any
>>>> changes to the clone stuff ...
>>>
>>> zstd supports using threads during compression, with -Z zstd:workers=N.
>>> When unix forks, the child processes can't do anything to mess up the
>>> state of the parent processes.  
>>>
>>> But windows pg_dump uses threads instead of forking, so it seems
>>> possible that the pg_dump -j threads that then spawn zstd threads could
>>> "leak threads" and break the main thread.  I suspect there's no issue,
>>> but we still ought to verify that before declaring it safe.
>>
>> OK. I don't have access to a Windows machine so I can't test that. Is it
>> possible to disable the zstd threading, until we figure this out?
> 
> I think that's what's best.  I made it issue a warning if "workers" was
> specified.  It could also be an error, or just ignored.
> 
> I considered disabling workers only for windows, but realized that I
> haven't tested with threads myself - my local zstd package is compiled
> without threading, and I remember having some issue recompiling it with
> threading.  Jacob's recipe for using meson wraps works well, but it
> still seems better to leave it as a future feature.  I used that recipe
> to enable zstd with threading on CI (except for linux/autoconf).
> 

+1 to disable this if we're unsure it works correctly. I agree it's
better to just error out if workers are requested - I rather dislike
when a tool just ignores an explicit parameter. And AFAICS it's what
zstd does too, when someone requests workers on an incompatible build.

FWIW I've been thinking about this a bit more and I don't quite see why
the threading would cause issues (except for Windows). I forgot
pg_basebackup already supports zstd, including the worker threading, so
why would it work there and not in pg_dump? Sure, pg_basebackup is not
parallel, but with separate pg_dump processes that shouldn't be an issue
(although I'm not sure when zstd creates threads).

The one thing I'm wondering about is at which point the worker threads
are spawned - but presumably not before the pg_dump processes fork.

I'll try building zstd with threading enabled, and do some tests over
the weekend.

>>>>> The function is first checking if it was passed a filename which already
>>>>> has a suffix.  And if not, it searches through a list of suffixes,
>>>>> testing for an existing file with each suffix.  The search with stat()
>>>>> doesn't happen if it has a suffix.  I'm having trouble seeing how the
>>>>> hasSuffix() branch isn't dead code.  Another open question.
>>>>
>>>> AFAICS it's done this way because of this comment in pg_backup_directory
>>>>
>>>>  * ...
>>>>  * ".gz" suffix is added to the filenames. The TOC files are never
>>>>  * compressed by pg_dump, however they are accepted with the .gz suffix
>>>>  * too, in case the user has manually compressed them with 'gzip'.
>>>>
>>>> I haven't tried, but I believe that if you manually compress the
>>>> directory, it may hit this branch.
>>>
>>> That would make sense, but when I tried, it didn't work like that.
>>> The filenames were all uncompressed names.  Maybe it worked differently
>>> in an older release.  Or maybe it changed during development of the
>>> parallel-directory-dump patch and it's actually dead code.
>>
>> Interesting. Would be good to find out. I wonder if a little bit of
>> git-log digging could tell us more. Anyway, until we confirm it's dead
>> code, we should probably do what .gz does and have the same check for
>> .lz4 and .zst files.
> 
> I found that hasSuffix() and cfopen() originated in the refactored patch
> Heikki sent here; there's no history beyond that.
> 
> https://www.postgresql.org/message-id/4D3954C7.9060503%40enterprisedb.com
> 
> The patch published there appends the .gz within cfopen(), and the
> caller writes into the TOC the filename without ".gz".  It seems like
> maybe a few hours prior, Heikki may have been appending the ".gz" suffix
> in the caller, and then writing the TOC with filename.gz.
> 
> The only way I've been able to get a "filename.gz" passed to hasSuffix
> is to write a directory-format dump, with LOs, and without compression,
> and then compress the blobs with "gzip", and *also* edit the blobs.toc
> file to say ".gz" (which isn't necessary since, if the original file
> isn't found, the restore would search for files with compressed
> suffixes).
> 
> So .. it's not *technically* unreachable, but I can't see why it'd be
> useful to support editing the *content* of the blob TOC (other than
> compressing it).  I might give some weight to the idea if it were also
> possible to edit the non-blob TOC; but, it's a binary file, so no.
> 
> For now, I made the change to make zstd and lz4 behave the same here
> as .gz, unless Heikki has a memory or a git reflog going back far enough
> to further support the idea that the code path isn't useful.
> 

Makes sense. Let's keep the same behavior for all compression methods,
and if it's unreachable we shall remove it from all. It's a trivial
amount of code, we can live with that.

> I'm going to set the patch as RFC as a hint to anyone who would want to
> make a final review.
> 

OK.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From
Justin Pryzby
Date:
On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
> On 4/1/23 01:16, Justin Pryzby wrote:
> > On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
> >> On 3/27/23 19:28, Justin Pryzby wrote:
> >>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >>>> On 3/16/23 05:50, Justin Pryzby wrote:
> >>>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >>>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> >>>>>>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>>>>>> build against it, I had to construct an import library, and put that
> >>>>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>>>>>> which makes me wonder if I've chosen a harder way than necessary?
> >>>>>>
> >>>>>> It looks like pg_dump's meson.build is missing dependencies on zstd
> >>>>>> (meson couldn't find the headers in the subproject without them).
> >>>>>
> >>>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
> >>>>> I didn't run into an issue without it.  Could you check that what I've
> >>>>> added works for your case?
> >>>>>
> >>>>>>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>>>>>> pg_restore output is identical to uncompressed dumps and nothing
> >>>>>>> explodes. I haven't inspected the threading implementation for safety
> >>>>>>> yet, as you mentioned.
> >>>>>>
> >>>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >>>>>> handling safety for this, by isolating each thread's state. I don't feel
> >>>>>> comfortable pronouncing this new addition safe or not, because I'm not
> >>>>>> sure I understand what the comments in the format-specific _Clone()
> >>>>>> callbacks are saying yet.
> >>>>>
> >>>>> My line of reasoning for unix is that pg_dump forks before any calls to
> >>>>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> >>>>> doesn't apply to pg_dump under windows.  This is an open question.  If
> >>>>> there's no solid answer, I could disable/ignore the option (maybe only
> >>>>> under windows).
> >>>>
> >>>> I may be missing something, but why would the patch affect this? Why
> >>>> would it even affect safety of the parallel dump? And I don't see any
> >>>> changes to the clone stuff ...
> >>>
> >>> zstd supports using threads during compression, with -Z zstd:workers=N.
> >>> When unix forks, the child processes can't do anything to mess up the
> >>> state of the parent processes.  
> >>>
> >>> But windows pg_dump uses threads instead of forking, so it seems
> >>> possible that the pg_dump -j threads that then spawn zstd threads could
> >>> "leak threads" and break the main thread.  I suspect there's no issue,
> >>> but we still ought to verify that before declaring it safe.
> >>
> >> OK. I don't have access to a Windows machine so I can't test that. Is it
> >> possible to disable the zstd threading, until we figure this out?
> > 
> > I think that's what's best.  I made it issue a warning if "workers" was
> > specified.  It could also be an error, or just ignored.
> > 
> > I considered disabling workers only for windows, but realized that I
> > haven't tested with threads myself - my local zstd package is compiled
> > without threading, and I remember having some issue recompiling it with
> > threading.  Jacob's recipe for using meson wraps works well, but it
> > still seems better to leave it as a future feature.  I used that recipe
> > to enable zstd with threading on CI (except for linux/autoconf).
> 
> +1 to disable this if we're unsure it works correctly. I agree it's
> better to just error out if workers are requested - I rather dislike
> when a tool just ignores an explicit parameter. And AFAICS it's what
> zstd does too, when someone requests workers on an incompatible build.
> 
> FWIW I've been thinking about this a bit more and I don't quite see why
> the threading would cause issues (except for Windows). I forgot
> pg_basebackup already supports zstd, including the worker threading, so
> why would it work there and not in pg_dump? Sure, pg_basebackup is not
> parallel, but with separate pg_dump processes that shouldn't be an issue
> (although I'm not sure when zstd creates threads).

There's no concern at all except under windows (because on windows
pg_dump -j is implemented using threads rather than forking).
Especially since zstd:workers is already allowed in the basebackup
backend process.

> I'll try building zstd with threading enabled, and do some tests over
> the weekend.

Feel free to wait until v17 :)

I used "meson wraps" to get a local version with threading.  Note that
if you want to use a zstd subproject, you may have to specify -D
zstd=enabled, or else meson may not enable the library at all.

Also, in order to introspect its settings, I had to do something like this:

mkdir subprojects
meson wrap install zstd
meson subprojects download
mkdir build.meson
meson setup build.meson --force-fallback-for=zstd

-- 
Justin



Re: zstd compression for pg_dump

From
Tomas Vondra
Date:

On 4/1/23 02:28, Justin Pryzby wrote:
> On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
>> On 4/1/23 01:16, Justin Pryzby wrote:
>>> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>>>> On 3/27/23 19:28, Justin Pryzby wrote:
>>>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>>>>>> On 3/16/23 05:50, Justin Pryzby wrote:
>>>>>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>>>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
>>>>>>>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>>>>>>>> build against it, I had to construct an import library, and put that
>>>>>>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>>>>>>>> which makes me wonder if I've chosen a harder way than necessary?
>>>>>>>>
>>>>>>>> It looks like pg_dump's meson.build is missing dependencies on zstd
>>>>>>>> (meson couldn't find the headers in the subproject without them).
>>>>>>>
>>>>>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
>>>>>>> I didn't run into an issue without it.  Could you check that what I've
>>>>>>> added works for your case?
>>>>>>>
>>>>>>>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>>>>>>>> pg_restore output is identical to uncompressed dumps and nothing
>>>>>>>>> explodes. I haven't inspected the threading implementation for safety
>>>>>>>>> yet, as you mentioned.
>>>>>>>>
>>>>>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>>>>>>>> handling safety for this, by isolating each thread's state. I don't feel
>>>>>>>> comfortable pronouncing this new addition safe or not, because I'm not
>>>>>>>> sure I understand what the comments in the format-specific _Clone()
>>>>>>>> callbacks are saying yet.
>>>>>>>
>>>>>>> My line of reasoning for unix is that pg_dump forks before any calls to
>>>>>>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
>>>>>>> doesn't apply to pg_dump under windows.  This is an open question.  If
>>>>>>> there's no solid answer, I could disable/ignore the option (maybe only
>>>>>>> under windows).
>>>>>>
>>>>>> I may be missing something, but why would the patch affect this? Why
>>>>>> would it even affect safety of the parallel dump? And I don't see any
>>>>>> changes to the clone stuff ...
>>>>>
>>>>> zstd supports using threads during compression, with -Z zstd:workers=N.
>>>>> When unix forks, the child processes can't do anything to mess up the
>>>>> state of the parent processes.  
>>>>>
>>>>> But windows pg_dump uses threads instead of forking, so it seems
>>>>> possible that the pg_dump -j threads that then spawn zstd threads could
>>>>> "leak threads" and break the main thread.  I suspect there's no issue,
>>>>> but we still ought to verify that before declaring it safe.
>>>>
>>>> OK. I don't have access to a Windows machine so I can't test that. Is it
>>>> possible to disable the zstd threading, until we figure this out?
>>>
>>> I think that's what's best.  I made it issue a warning if "workers" was
>>> specified.  It could also be an error, or just ignored.
>>>
>>> I considered disabling workers only for windows, but realized that I
>>> haven't tested with threads myself - my local zstd package is compiled
>>> without threading, and I remember having some issue recompiling it with
>>> threading.  Jacob's recipe for using meson wraps works well, but it
>>> still seems better to leave it as a future feature.  I used that recipe
>>> to enable zstd with threading on CI (except for linux/autoconf).
>>
>> +1 to disable this if we're unsure it works correctly. I agree it's
>> better to just error out if workers are requested - I rather dislike
>> when a tool just ignores an explicit parameter. And AFAICS it's what
>> zstd does too, when someone requests workers on an incompatible build.
>>
>> FWIW I've been thinking about this a bit more and I don't quite see why
>> the threading would cause issues (except for Windows). I forgot
>> pg_basebackup already supports zstd, including the worker threading, so
>> why would it work there and not in pg_dump? Sure, pg_basebackup is not
>> parallel, but with separate pg_dump processes that shouldn't be an issue
>> (although I'm not sure when zstd creates threads).
> 
> There's no concern at all except under windows (because on windows
> pg_dump -j is implemented using threads rather than forking).
> Especially since zstd:workers is already allowed in the basebackup
> backend process.
> 

If there are no concerns, why disable it outside Windows? I don't have a
good idea how beneficial the multi-threaded compression is, so I can't
quite judge the risk/benefit tradeoff.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From
Justin Pryzby
Date:
On Sat, Apr 01, 2023 at 02:49:44PM +0200, Tomas Vondra wrote:
> On 4/1/23 02:28, Justin Pryzby wrote:
> > On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
> >> On 4/1/23 01:16, Justin Pryzby wrote:
> >>> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
> >>>> On 3/27/23 19:28, Justin Pryzby wrote:
> >>>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >>>>>> On 3/16/23 05:50, Justin Pryzby wrote:
> >>>>>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >>>>>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> >>>>>>>>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>>>>>>>> build against it, I had to construct an import library, and put that
> >>>>>>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>>>>>>>> which makes me wonder if I've chosen a harder way than necessary?
> >>>>>>>>
> >>>>>>>> It looks like pg_dump's meson.build is missing dependencies on zstd
> >>>>>>>> (meson couldn't find the headers in the subproject without them).
> >>>>>>>
> >>>>>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
> >>>>>>> I didn't run into an issue without it.  Could you check that what I've
> >>>>>>> added works for your case?
> >>>>>>>
> >>>>>>>>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>>>>>>>> pg_restore output is identical to uncompressed dumps and nothing
> >>>>>>>>> explodes. I haven't inspected the threading implementation for safety
> >>>>>>>>> yet, as you mentioned.
> >>>>>>>>
> >>>>>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >>>>>>>> handling safety for this, by isolating each thread's state. I don't feel
> >>>>>>>> comfortable pronouncing this new addition safe or not, because I'm not
> >>>>>>>> sure I understand what the comments in the format-specific _Clone()
> >>>>>>>> callbacks are saying yet.
> >>>>>>>
> >>>>>>> My line of reasoning for unix is that pg_dump forks before any calls to
> >>>>>>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> >>>>>>> doesn't apply to pg_dump under windows.  This is an open question.  If
> >>>>>>> there's no solid answer, I could disable/ignore the option (maybe only
> >>>>>>> under windows).
> >>>>>>
> >>>>>> I may be missing something, but why would the patch affect this? Why
> >>>>>> would it even affect safety of the parallel dump? And I don't see any
> >>>>>> changes to the clone stuff ...
> >>>>>
> >>>>> zstd supports using threads during compression, with -Z zstd:workers=N.
> >>>>> When unix forks, the child processes can't do anything to mess up the
> >>>>> state of the parent processes.  
> >>>>>
> >>>>> But windows pg_dump uses threads instead of forking, so it seems
> >>>>> possible that the pg_dump -j threads that then spawn zstd threads could
> >>>>> "leak threads" and break the main thread.  I suspect there's no issue,
> >>>>> but we still ought to verify that before declaring it safe.
> >>>>
> >>>> OK. I don't have access to a Windows machine so I can't test that. Is it
> >>>> possible to disable the zstd threading, until we figure this out?
> >>>
> >>> I think that's what's best.  I made it issue a warning if "workers" was
> >>> specified.  It could also be an error, or just ignored.
> >>>
> >>> I considered disabling workers only for windows, but realized that I
> >>> haven't tested with threads myself - my local zstd package is compiled
> >>> without threading, and I remember having some issue recompiling it with
> >>> threading.  Jacob's recipe for using meson wraps works well, but it
> >>> still seems better to leave it as a future feature.  I used that recipe
> >>> to enable zstd with threading on CI (except for linux/autoconf).
> >>
> >> +1 to disable this if we're unsure it works correctly. I agree it's
> >> better to just error out if workers are requested - I rather dislike
> >> when a tool just ignores an explicit parameter. And AFAICS it's what
> >> zstd does too, when someone requests workers on an incompatible build.
> >>
> >> FWIW I've been thinking about this a bit more and I don't quite see why
> >> the threading would cause issues (except for Windows). I forgot
> >> pg_basebackup already supports zstd, including the worker threading, so
> >> why would it work there and not in pg_dump? Sure, pg_basebackup is not
> >> parallel, but with separate pg_dump processes that shouldn't be an issue
> >> (although I'm not sure when zstd creates threads).
> > 
> > There's no concern at all except under windows (because on windows
> > pg_dump -j is implemented using threads rather than forking).
> > Especially since zstd:workers is already allowed in the basebackup
> > backend process.
> 
> If there are no concerns, why disable it outside Windows? I don't have a
> good idea how beneficial the multi-threaded compression is, so I can't
> quite judge the risk/benefit tradeoff.

Because it's a minor/fringe feature, and it's annoying to have platform
differences (would we plan on relaxing the restriction in v17, or is it
more likely we'd forget?).

I realized how little I've tested with zstd workers myself.  And I think
on cirrusci, the macos and freebsd tasks have zstd libraries with
threading support, but it wasn't being exercised (because using :workers
would cause the patch to fail unless it's supported everywhere).  So I
updated the "for CI only" patch to 1) use meson wraps to compile the zstd
library with threading on linux and windows; and, 2) use zstd:workers=3
"opportunistically" (but avoid failing if threads are not supported,
since the autoconf task still doesn't have access to a library with
thread support).  That's a great step, but it still seems bad that the
thread stuff has been little exercised until now.  (Also, the windows
task failed; I think that's due to a transient network issue).

Feel free to mess around with threads (but I'd much rather see the patch
progress for zstd:long).

-- 
Justin



Re: zstd compression for pg_dump

From
Tomas Vondra
Date:
On 4/1/23 15:36, Justin Pryzby wrote:
>
> ...
>
>> If there are no concerns, why disable it outside Windows? I don't have a
>> good idea how beneficial the multi-threaded compression is, so I can't
>> quite judge the risk/benefit tradeoff.
> 
> Because it's a minor/fringe feature, and it's annoying to have platform
> differences (would we plan on relaxing the restriction in v17, or is it
> more likely we'd forget?).
> 
> I realized how little I've tested with zstd workers myself.  And I think
> on cirrusci, the macos and freebsd tasks have zstd libraries with
> threading support, but it wasn't being exercised (because using :workers
> would cause the patch to fail unless it's supported everywhere).  So I
> updated the "for CI only" patch to 1) use meson wraps to compile the zstd
> library with threading on linux and windows; and, 2) use zstd:workers=3
> "opportunistically" (but avoid failing if threads are not supported,
> since the autoconf task still doesn't have access to a library with
> thread support).  That's a great step, but it still seems bad that the
> thread stuff has been little exercised until now.  (Also, the windows
> task failed; I think that's due to a transient network issue).
> 

Agreed, let's leave the threading for PG17, depending on how beneficial
it turns out to be for pg_dump.

> Feel free to mess around with threads (but I'd much rather see the patch
> progress for zstd:long).

OK, understood. The long mode patch is pretty simple. IIUC it does not
change the format, i.e. in the worst case we could leave it for PG17
too. Correct?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From
Justin Pryzby
Date:
On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
> > Feel free to mess around with threads (but I'd much rather see the patch
> > progress for zstd:long).
> 
> OK, understood. The long mode patch is pretty simple. IIUC it does not
> change the format, i.e. in the worst case we could leave it for PG17
> too. Correct?

Right, libzstd only has one "format", which is the same as what's used
by the commandline tool.  zstd:long doesn't change the format of the
output: the library just uses a larger memory buffer to allow better
compression.  There's no format change for zstd:workers, either.
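
To illustrate, here's a sketch against the stock libzstd API (the
function name and buffer handling are mine; error checks omitted):

    #include <zstd.h>

    /*
     * Both parameters below change only how the encoder searches for
     * matches, not the frame format, so the output remains an ordinary
     * zstd frame.
     */
    size_t
    compress_long(void *dst, size_t dstcap, const void *src, size_t srclen)
    {
        ZSTD_CCtx  *cctx = ZSTD_createCCtx();
        size_t      clen;

        ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
        /* fails with an error code on builds without thread support */
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 3);

        clen = ZSTD_compress2(cctx, dst, dstcap, src, srclen);
        ZSTD_freeCCtx(cctx);
        return clen;    /* ZSTD_decompress() or zstd(1) reads this back */
    }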

-- 
Justin



Re: zstd compression for pg_dump

From
Tomas Vondra
Date:

On 4/3/23 21:17, Justin Pryzby wrote:
> On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>>> Feel free to mess around with threads (but I'd much rather see the patch
>>> progress for zstd:long).
>>
>> OK, understood. The long mode patch is pretty simple. IIUC it does not
>> change the format, i.e. in the worst case we could leave it for PG17
>> too. Correct?
> 
> Right, libzstd only has one "format", which is the same as what's used
> by the commandline tool.  zstd:long doesn't change the format of the
> output: the library just uses a larger memory buffer to allow better
> compression.  There's no format change for zstd:workers, either.
> 

OK. I plan to do a bit more review/testing on this, and get it committed
over the next day or two, likely including the long mode. One thing I
noticed today is that maybe long_distance should be a bool, not int.
Yes, ZSTD_CCtx_setParameter() takes the ZSTD_c_enableLongDistanceMatching
value as an int, but it'd be cleaner to cast it at the call site and keep
it bool otherwise.
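
Something like this, that is - keep the flag a bool in the struct and
cast only where libzstd wants its int (a sketch; the field and variable
names are illustrative):

    /* parsed user option, kept as a bool */
    bool        long_distance = compression_spec.long_distance;

    /* cast only at the libzstd call, whose value parameter is an int */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching,
                           (int) long_distance);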

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From
Justin Pryzby
Date:
On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
> On 4/3/23 21:17, Justin Pryzby wrote:
> > On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
> >>> Feel free to mess around with threads (but I'd much rather see the patch
> >>> progress for zstd:long).
> >>
> >> OK, understood. The long mode patch is pretty simple. IIUC it does not
> >> change the format, i.e. in the worst case we could leave it for PG17
> >> too. Correct?
> > 
> > Right, libzstd only has one "format", which is the same as what's used
> > by the commandline tool.  zstd:long doesn't change the format of the
> > output: the library just uses a larger memory buffer to allow better
> > compression.  There's no format change for zstd:workers, either.
> 
> OK. I plan to do a bit more review/testing on this, and get it committed
> over the next day or two, likely including the long mode. One thing I
> noticed today is that maybe long_distance should be a bool, not int.
> Yes, ZSTD_CCtx_setParameter() takes the ZSTD_c_enableLongDistanceMatching
> value as an int, but it'd be cleaner to cast it at the call site and keep
> it bool otherwise.

Thanks for noticing.  Evidently I wrote it using "int" to get the
feature working, and then later wrote the bool parsing bits but never
changed the data structure.

This also updates a few comments, indentation, removes a useless
assertion, and updates the warning about zstd:workers.

-- 
Justin

Attachments

Re: zstd compression for pg_dump

From
Tomas Vondra
Date:
On 4/4/23 05:04, Justin Pryzby wrote:
> On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
>> On 4/3/23 21:17, Justin Pryzby wrote:
>>> On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>>>>> Feel free to mess around with threads (but I'd much rather see the patch
>>>>> progress for zstd:long).
>>>>
>>>> OK, understood. The long mode patch is pretty simple. IIUC it does not
>>>> change the format, i.e. in the worst case we could leave it for PG17
>>>> too. Correct?
>>>
>>> Right, libzstd only has one "format", which is the same as what's used
>>> by the commandline tool.  zstd:long doesn't change the format of the
>>> output: the library just uses a larger memory buffer to allow better
>>> compression.  There's no format change for zstd:workers, either.
>>
>> OK. I plan to do a bit more review/testing on this, and get it committed
>> over the next day or two, likely including the long mode. One thing I
>> noticed today is that maybe long_distance should be a bool, not int.
>> Yes, ZSTD_CCtx_setParameter() takes the ZSTD_c_enableLongDistanceMatching
>> value as an int, but it'd be cleaner to cast it at the call site and keep
>> it bool otherwise.
> 
> Thanks for noticing.  Evidently I wrote it using "int" to get the
> feature working, and then later wrote the bool parsing bits but never
> changed the data structure.
> 
> This also updates a few comments, indentation, removes a useless
> assertion, and updates the warning about zstd:workers.
> 

Thanks. I've cleaned up the 0001 a little bit (a couple comment
improvements), updated the commit message and pushed it. I plan to take
care of the 0002 (long distance mode) tomorrow, and that'll be it for
PG16 I think.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From
Tomas Vondra
Date:

On 4/5/23 21:42, Tomas Vondra wrote:
> On 4/4/23 05:04, Justin Pryzby wrote:
>> On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
>>> On 4/3/23 21:17, Justin Pryzby wrote:
>>>> On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>>>>>> Feel free to mess around with threads (but I'd much rather see the patch
>>>>>> progress for zstd:long).
>>>>>
>>>>> OK, understood. The long mode patch is pretty simple. IIUC it does not
>>>>> change the format, i.e. in the worst case we could leave it for PG17
>>>>> too. Correct?
>>>>
>>>> Right, libzstd only has one "format", which is the same as what's used
>>>> by the commandline tool.  zstd:long doesn't change the format of the
>>>> output: the library just uses a larger memory buffer to allow better
>>>> compression.  There's no format change for zstd:workers, either.
>>>
>>> OK. I plan to do a bit more review/testing on this, and get it committed
>>> over the next day or two, likely including the long mode. One thing I
>>> noticed today is that maybe long_distance should be a bool, not int.
>>> Yes, ZSTD_CCtx_setParameter() takes the ZSTD_c_enableLongDistanceMatching
>>> value as an int, but it'd be cleaner to cast it at the call site and keep
>>> it bool otherwise.
>>
>> Thanks for noticing.  Evidently I wrote it using "int" to get the
>> feature working, and then later wrote the bool parsing bits but never
>> changed the data structure.
>>
>> This also updates a few comments, indentation, removes a useless
>> assertion, and updates the warning about zstd:workers.
>>
> 
> Thanks. I've cleaned up the 0001 a little bit (a couple comment
> improvements), updated the commit message and pushed it. I plan to take
> care of the 0002 (long distance mode) tomorrow, and that'll be it for
> PG16 I think.

I looked at the long mode patch again, updated the commit message and
pushed it. I was wondering if long_mode should really be bool -
logically it is, but ZSTD_CCtx_setParameter() expects int. But I think
that's fine.

I think that's all for PG16 in this patch series. If there's more we
want to do, it'll have to wait for PG17 - Justin, can you update and
submit the patches that you think are relevant for the next CF?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: zstd compression for pg_dump

From
Justin Pryzby
Date:
On Thu, Apr 06, 2023 at 05:34:30PM +0200, Tomas Vondra wrote:
> I looked at the long mode patch again, updated the commit message and
> pushed it. I was wondering if long_mode should really be bool -
> logically it is, but ZSTD_CCtx_setParameter() expects int. But I think
> that's fine.

Thanks!

> I think that's all for PG16 in this patch series.  If there's more we want to
> do, it'll have to wait for PG17 - 

Yes

> Justin, can you update and submit the patches that you think are relevant for
> the next CF?

Yeah.

It sounds like a shiny new feature, but it's not totally clear if it's safe
here or even how useful it is.  (It might be like my patch for
wal_compression=zstd:level, and Michael's for toast_compression=zstd, neither
of which saw any support).

Last year's basebackup thread had some interesting comments about safety of
threads, although pg_dump's considerations may be different.

The patch itself is trivial, so it'd be fine to wait until PG16 is released to
get some experience.  If someone else wanted to do that, it'd be fine with me.

-- 
Justin



Re: zstd compression for pg_dump

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Thu, Apr 06, 2023 at 05:34:30PM +0200, Tomas Vondra wrote:
>> I think that's all for PG16 in this patch series.  If there's more we want to
>> do, it'll have to wait for PG17 -

> Yes

Shouldn't the CF entry be closed as committed?  It's certainly
making the cfbot unhappy because the patch-of-record doesn't apply.

            regards, tom lane