Обсуждение: zstd compression for pg_dump

Поиск

Список

Период

Сортировка

zstd compression for pg_dump

От

Justin Pryzby

Дата:

24 февраля 2023 г., 22:18:40

This is a draft patch - review is welcome and would help to get this
ready to be considererd for v16, if desired.

I'm going to add this thread to the old CF entry.
https://commitfest.postgresql.org/31/2888/

-- 
Justin

Вложения

Re: zstd compression for pg_dump

От

Michael Paquier

Дата:

25 февраля 2023 г., 07:44:36

On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considererd for v16, if desired.
>
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/

Patch 0003 adds support for the --long option of zstd, meaning that it
"enables long distance matching with #windowLog".  What's the benefit
of that when it is applied to dumps and base backup contents?
--
Michael

Вложения

signature.asc

Re: zstd compression for pg_dump

От

Justin Pryzby

Дата:

25 февраля 2023 г., 07:50:41

On Sat, Feb 25, 2023 at 01:44:36PM +0900, Michael Paquier wrote:
> On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> > This is a draft patch - review is welcome and would help to get this
> > ready to be considererd for v16, if desired.
> > 
> > I'm going to add this thread to the old CF entry.
> > https://commitfest.postgresql.org/31/2888/
> 
> Patch 0003 adds support for the --long option of zstd, meaning that it
> "enables long distance matching with #windowLog".  What's the benefit
> of that when it is applied to dumps and base backup contents?

It (can) makes it smaller.

+           The <literal>long</literal> keyword enables long-distance matching
+           mode, for improved compression ratio, at the expense of higher memory
+           use.  Long-distance mode is supported only for 

+        With zstd compression, <literal>long</literal> mode may allow dumps
+        to be significantly smaller, but it might not reduce the size of
+        custom or directory format dumps, whose fields are separately compressed.

Note that I included that here as 003, but I also have an pre-existing
patch for adding that just to basebackup.

-- 
Justin

Re: zstd compression for pg_dump

От

Tomas Vondra

Дата:

25 февраля 2023 г., 13:31:28

On 2/24/23 20:18, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considererd for v16, if desired.
> 
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/
> 

Thanks. Sadly cfbot is unhappy - the windows and cplusplus builds failed
because of some issue in pg_backup_archiver.h. But it's a bit bizarre
because the patch does not modify that file at all ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: zstd compression for pg_dump

От

"Euler Taveira"

Дата:

25 февраля 2023 г., 17:47:26

On Sat, Feb 25, 2023, at 7:31 AM, Tomas Vondra wrote:

On 2/24/23 20:18, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considererd for v16, if desired.
>
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/
>

Thanks. Sadly cfbot is unhappy - the windows and cplusplus builds failed
because of some issue in pg_backup_archiver.h. But it's a bit bizarre
because the patch does not modify that file at all ...

cpluspluscheck says

# pg_dump is not C++-clean because it uses "public" and "namespace"

# as field names, which is unfortunate but we won't change it now.

Hence, the patch should exclude the new header file from it.

--- a/src/tools/pginclude/cpluspluscheck

+++ b/src/tools/pginclude/cpluspluscheck

@@ -153,6 +153,7 @@ do

test "$f" = src/bin/pg_dump/compress_gzip.h && continue

test "$f" = src/bin/pg_dump/compress_io.h && continue

test "$f" = src/bin/pg_dump/compress_lz4.h && continue

+ test "$f" = src/bin/pg_dump/compress_zstd.h && continue

test "$f" = src/bin/pg_dump/compress_none.h && continue

test "$f" = src/bin/pg_dump/parallel.h && continue

test "$f" = src/bin/pg_dump/pg_backup_archiver.h && continue

Euler Taveira

EDB https://www.enterprisedb.com/

Re: zstd compression for pg_dump

От

Justin Pryzby

Дата:

26 февраля 2023 г., 04:22:27

On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considererd for v16, if desired.
> 
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/

This resolves cfbot warnings: windows and cppcheck.
And refactors zstd routines.
And updates docs.
And includes some fixes for earlier patches that these patches conflicts
with/depends on.

On Sat, Feb 25, 2023 at 07:22:27PM -0600, Justin Pryzby wrote:
> On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> > This is a draft patch - review is welcome and would help to get this
> > ready to be considererd for v16, if desired.
> > 
> > I'm going to add this thread to the old CF entry.
> > https://commitfest.postgresql.org/31/2888/
> 
> This resolves cfbot warnings: windows and cppcheck.
> And refactors zstd routines.
> And updates docs.
> And includes some fixes for earlier patches that these patches conflicts
> with/depends on.

This rebases over the TAP and doc fixes to LZ4.
And adds necessary ENV to makefile and meson.
And adds an annoying boilerplate header.
And removes supports_compression(), which is what I think Tomas meant
when referring to "annoying unsupported cases".
And updates zstd.c: fix an off-by-one, allocate in init depending on
readF/writeF, do not reset the input buffer on each iteration, and show
parameter name in errors.

I'd appreciate help checking that this is doing the right things and
works correctly with zstd threaded workers.  The zstd API says: "use one
different context per thread for parallel execution" and "For parallel
execution, use one separate ZSTD_CStream per thread".
https://github.com/facebook/zstd/blob/dev/lib/zstd.h

I understand that to mean that, if pg_dump *itself* were using threads,
then each thread would need to call ZSTD_createCStream().  pg_dump isn't
threaded, so there's nothing special needed, right?

Except that, under windows, pg_dump -Fd -j actually uses threads instead
of forking.  I *think* that's still safe, since the pgdump threads are
created *before* calling zstd functions (see _PrintTocData and
_StartData of the custom and directory formats), so it happens naturally
that there's a separate zstd stream for each thread of pgdump.

-- 
Justin

On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. To
> > build against it, I had to construct an import library, and put that
> > and the DLL into the `lib` folder expected by the MSVC scripts...
> > which makes me wonder if I've chosen a harder way than necessary?
> 
> It looks like pg_dump's meson.build is missing dependencies on zstd
> (meson couldn't find the headers in the subproject without them).

I saw that this was added for LZ4, but I hadn't added it for zstd since
I didn't run into an issue without it.  Could you check that what I've
added works for your case ?

> > Parallel zstd dumps seem to work as expected, in that the resulting
> > pg_restore output is identical to uncompressed dumps and nothing
> > explodes. I haven't inspected the threading implementation for safety
> > yet, as you mentioned.
> 
> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> handling safety for this, by isolating each thread's state. I don't feel
> comfortable pronouncing this new addition safe or not, because I'm not
> sure I understand what the comments in the format-specific _Clone()
> callbacks are saying yet.

My line of reasoning for unix is that pg_dump forks before any calls to
zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
doesn't apply to pg_dump under windows.  This is an opened question.  If
there's no solid answer, I could disable/ignore the option (maybe only
under windows).

> On to code (not a complete review):
> 
> >     if (hasSuffix(fname, ".gz"))
> >         compression_spec.algorithm = PG_COMPRESSION_GZIP;
> >     else
> >     {
> >         bool        exists;
> > 
> >         exists = (stat(path, &st) == 0);
> >         /* avoid unused warning if it is not built with compression */
> >         if (exists)
> >             compression_spec.algorithm = PG_COMPRESSION_NONE;
> > -#ifdef HAVE_LIBZ
> > -       if (!exists)
> > -       {
> > -           free_keep_errno(fname);
> > -           fname = psprintf("%s.gz", path);
> > -           exists = (stat(fname, &st) == 0);
> > -
> > -           if (exists)
> > -               compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > -       }
> > -#endif
> > -#ifdef USE_LZ4
> > -       if (!exists)
> > -       {
> > -           free_keep_errno(fname);
> > -           fname = psprintf("%s.lz4", path);
> > -           exists = (stat(fname, &st) == 0);
> > -
> > -           if (exists)
> > -               compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > -       }
> > -#endif
> > +       else if (check_compressed_file(path, &fname, "gz"))
> > +           compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > +       else if (check_compressed_file(path, &fname, "lz4"))
> > +           compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > +       else if (check_compressed_file(path, &fname, "zst"))
> > +           compression_spec.algorithm = PG_COMPRESSION_ZSTD;
> >     }
> 
> This function lost some coherence, I think. Should there be a hasSuffix
> check at the top for ".zstd" (and, for that matter, ".lz4")?

The function is first checking if it was passed a filename which already
has a suffix.  And if not, it searches through a list of suffixes,
testing for an existing file with each suffix.  The search with stat()
doesn't happen if it has a suffix.  I'm having trouble seeing how the
hasSuffix() branch isn't dead code.  Another opened question.

> I'm a little suspicious of the replacement of supports_compression()
> with parse_compress_specification(). For example:
> 
> > -   errmsg = supports_compression(AH->compression_spec);
> > -   if (errmsg)
> > +   parse_compress_specification(AH->compression_spec.algorithm,
> > +           NULL, &compress_spec);
> > +   if (compress_spec.parse_error != NULL)
> >     {
> >         pg_log_warning("archive is compressed, but this installation does not support compression (%s
> > -                       errmsg);
> > -       pg_free(errmsg);
> > +                       compress_spec.parse_error);
> > +       pg_free(compress_spec.parse_error);
> >     }
> 
> The top-level error here is "does not support compression", but wouldn't
> a bad specification option with a supported compression method trip this
> path too?

No - since the 2nd argument is passed as NULL, it just checks whether
the compression is supported.  Maybe there ought to be a more
direct/clean way to do it.  But up to now evidently nobody needed to do
that.

> > +static void
> > +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream,
> > +                         ZSTD_cParameter param, int value, char *paramname)
> 
> IMO we should avoid stepping on the ZSTD_ namespace with our own
> internal function names.

done

> > +   if (cs->readF != NULL)
> > +
> > +   if (cs->writeF != NULL)
> 
> This seems to suggest that both cs->readF and cs->writeF could be set,
> but in that case, the output buffer gets reallocated.

I put back an assertion that exactly one of them was set, since that's
true of how it currently works.

> I was curious about the extra byte allocated in the decompression case.
> I see that ReadDataFromArchiveZstd() is null-terminating the buffer
> before handing it to ahwrite(), but why does it need to do that?

I was trying to figure that out, too.  I think the unterminated case
might be for ExecuteSqlCommandBuf(), and that may only (have) been
needed to allow pg_restore to handle ancient/development versions of
pg_dump...  It's not currently hit.
https://coverage.postgresql.org/src/bin/pg_dump/pg_backup_db.c.gcov.html#470

I found that the terminator was added for the uncompressed case was
added at e8f69be05 and removed in bf9aa490d.

> > +Zstd_get_error(CompressFileHandle *CFH)
> 
> Seems like this should be using the zstderror stored in the handle?

Yes - I'd already addressed that locally.

> In ReadDataFromArchiveZstd():
> 
> > +        * Read compressed data.  Note that readF can resize the buffer; the
> > +        * new size is tracked and used for future loops.
> This is pretty complex for what it's doing. I'm a little worried that we
> let the reallocated buffer escape to the caller while losing track of
> how big it is. I think that works today, since there's only ever one
> call per handle, but any future refactoring that allowed cs->readData()
> to be called more than once would subtly break this code.

Note that nothing bad happens if we lose track of how big it is (well,
assuming that readF doesn't *shrink* the buffer).

The previous patch version didn't keep track of its new size, and the only
consequence is that readF() might re-resize it again on a future iteration,
even if it was already sufficiently large.

When I originally wrote it (and up until that patch version), I left
this as an XXX comment about reusing the resized buffer.  But it seemed
easy enough to fix so I did.

> In ZstdWriteCommon():
> 
> Elsewhere, output->pos is set to zero before compressing, but here we do
> it after, which I think leads to subtle differences in the function
> preconditions. If that's an intentional difference, can the reason be
> called out in a comment?

It's not deliberate.  I think it had no effect, but changed - thanks.

-- 
Justin

On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> On 3/16/23 05:50, Justin Pryzby wrote:
> > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> >>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>> build against it, I had to construct an import library, and put that
> >>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>> which makes me wonder if I've chosen a harder way than necessary?
> >>
> >> It looks like pg_dump's meson.build is missing dependencies on zstd
> >> (meson couldn't find the headers in the subproject without them).
> > 
> > I saw that this was added for LZ4, but I hadn't added it for zstd since
> > I didn't run into an issue without it.  Could you check that what I've
> > added works for your case ?
> > 
> >>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>> pg_restore output is identical to uncompressed dumps and nothing
> >>> explodes. I haven't inspected the threading implementation for safety
> >>> yet, as you mentioned.
> >>
> >> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >> handling safety for this, by isolating each thread's state. I don't feel
> >> comfortable pronouncing this new addition safe or not, because I'm not
> >> sure I understand what the comments in the format-specific _Clone()
> >> callbacks are saying yet.
> > 
> > My line of reasoning for unix is that pg_dump forks before any calls to
> > zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> > doesn't apply to pg_dump under windows.  This is an opened question.  If
> > there's no solid answer, I could disable/ignore the option (maybe only
> > under windows).
> 
> I may be missing something, but why would the patch affect this? Why
> would it even affect safety of the parallel dump? And I don't see any
> changes to the clone stuff ...

zstd supports using threads during compression, with -Z zstd:workers=N.
When unix forks, the child processes can't do anything to mess up the
state of the parent processes.  

But windows pg_dump uses threads instead of forking, so it seems
possible that the pg_dump -j threads that then spawn zstd threads could
"leak threads" and break the main thread.  I suspect there's no issue,
but we still ought to verify that before declaring it safe.

> > The function is first checking if it was passed a filename which already
> > has a suffix.  And if not, it searches through a list of suffixes,
> > testing for an existing file with each suffix.  The search with stat()
> > doesn't happen if it has a suffix.  I'm having trouble seeing how the
> > hasSuffix() branch isn't dead code.  Another opened question.
> 
> AFAICS it's done this way because of this comment in pg_backup_directory
> 
>  * ...
>  * ".gz" suffix is added to the filenames. The TOC files are never
>  * compressed by pg_dump, however they are accepted with the .gz suffix
>  * too, in case the user has manually compressed them with 'gzip'.
> 
> I haven't tried, but I believe that if you manually compress the
> directory, it may hit this branch.

That would make sense, but when I tried, it didn't work like that.
The filenames were all uncompressed names.  Maybe it worked differently
in an older release.  Or maybe it changed during development of the
parallel-directory-dump patch and it's actually dead code.

This is rebased over the updated compression API.

It seems like I misunderstood something you said before, so now I put
back "supports_compression()".

-- 
Justin

Вложения

Re: zstd compression for pg_dump

От

Tomas Vondra

Дата:

28 марта 2023 г., 19:23:26


On 3/27/23 19:28, Justin Pryzby wrote:
> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>> On 3/16/23 05:50, Justin Pryzby wrote:
>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
>>>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>>>> build against it, I had to construct an import library, and put that
>>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>>>> which makes me wonder if I've chosen a harder way than necessary?
>>>>
>>>> It looks like pg_dump's meson.build is missing dependencies on zstd
>>>> (meson couldn't find the headers in the subproject without them).
>>>
>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
>>> I didn't run into an issue without it.  Could you check that what I've
>>> added works for your case ?
>>>
>>>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>>>> pg_restore output is identical to uncompressed dumps and nothing
>>>>> explodes. I haven't inspected the threading implementation for safety
>>>>> yet, as you mentioned.
>>>>
>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>>>> handling safety for this, by isolating each thread's state. I don't feel
>>>> comfortable pronouncing this new addition safe or not, because I'm not
>>>> sure I understand what the comments in the format-specific _Clone()
>>>> callbacks are saying yet.
>>>
>>> My line of reasoning for unix is that pg_dump forks before any calls to
>>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
>>> doesn't apply to pg_dump under windows.  This is an opened question.  If
>>> there's no solid answer, I could disable/ignore the option (maybe only
>>> under windows).
>>
>> I may be missing something, but why would the patch affect this? Why
>> would it even affect safety of the parallel dump? And I don't see any
>> changes to the clone stuff ...
> 
> zstd supports using threads during compression, with -Z zstd:workers=N.
> When unix forks, the child processes can't do anything to mess up the
> state of the parent processes.  
> 
> But windows pg_dump uses threads instead of forking, so it seems
> possible that the pg_dump -j threads that then spawn zstd threads could
> "leak threads" and break the main thread.  I suspect there's no issue,
> but we still ought to verify that before declaring it safe.
> 

OK. I don't have access to a Windows machine so I can't test that. Is it
possible to disable the zstd threading, until we figure this out?

>>> The function is first checking if it was passed a filename which already
>>> has a suffix.  And if not, it searches through a list of suffixes,
>>> testing for an existing file with each suffix.  The search with stat()
>>> doesn't happen if it has a suffix.  I'm having trouble seeing how the
>>> hasSuffix() branch isn't dead code.  Another opened question.
>>
>> AFAICS it's done this way because of this comment in pg_backup_directory
>>
>>  * ...
>>  * ".gz" suffix is added to the filenames. The TOC files are never
>>  * compressed by pg_dump, however they are accepted with the .gz suffix
>>  * too, in case the user has manually compressed them with 'gzip'.
>>
>> I haven't tried, but I believe that if you manually compress the
>> directory, it may hit this branch.
> 
> That would make sense, but when I tried, it didn't work like that.
> The filenames were all uncompressed names.  Maybe it worked differently
> in an older release.  Or maybe it changed during development of the
> parallel-directory-dump patch and it's actually dead code.
> 

Interesting. Would be good to find out. I wonder if a little bit of
git-log digging could tell us more. Anyway, until we confirm it's dead
code, we should probably do what .gz does and have the same check for
.lz4 and .zst files.

> This is rebased over the updated compression API.
> 
> It seems like I misunderstood something you said before, so now I put
> back "supports_compression()".
> 

Thanks! I need to do a bit more testing and review, but it seems pretty
much RFC to me, assuming we can figure out what to do about threading.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: zstd compression for pg_dump

От

Kirk Wolak

Дата:

28 марта 2023 г., 21:03:49

On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 3/27/23 19:28, Justin Pryzby wrote:
> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>> On 3/16/23 05:50, Justin Pryzby wrote:
>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
>>>>> I did some smoke testing against zstd's GitHub release on Windows. To
...
OK. I don't have access to a Windows machine so I can't test that. Is it
possible to disable the zstd threading, until we figure this out?

Thomas since I appear to be one of the few windows users (I use both), can I help?
I can test pg_dump... for you, easy to do. I do about 5-10 pg_dumps a day on windows while developing.

Also, I have an AWS instance I created to build PG/Win with readline back in November.
I could give you access to that... (you are not the only person who has made this statement here).
I've made such instances available for other Open Source developers, to support them.

Obvi I would share connection credentials privately.

Regards, Kirk

Re: zstd compression for pg_dump

От

Tomas Vondra

Дата:

28 марта 2023 г., 21:45:33

On 3/28/23 20:03, Kirk Wolak wrote:
> On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com <mailto:tomas.vondra@enterprisedb.com>>
> wrote:
> 
>     On 3/27/23 19:28, Justin Pryzby wrote:
>     > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>     >> On 3/16/23 05:50, Justin Pryzby wrote:
>     >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>     >>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion
>     <jchampion@timescale.com <mailto:jchampion@timescale.com>> wrote:
>     >>>>> I did some smoke testing against zstd's GitHub release on
>     Windows. To
>     ...
>     OK. I don't have access to a Windows machine so I can't test that. Is it
>     possible to disable the zstd threading, until we figure this out?
> 
> Thomas since I appear to be one of the few windows users (I use both),
> can I help?
> I can test pg_dump... for you, easy to do.  I do about 5-10 pg_dumps a
> day on windows while developing.
> 

Perhaps. But I'll leave the details up to Justin - it's his patch, and
I'm not sure how to verify the threading is OK.

I'd try applying this patch, build with --with-zstd and then run the
pg_dump TAP tests, and perhaps do some manual tests.

And perhaps do the same for --with-lz4 - there's a thread [1] suggesting
we don't detect lz4 stuff on Windows, so the TAP tests do nothing.

https://www.postgresql.org/message-id/ZAjL96N9ZW84U59p@msg.df7cb.de

> Also, I have an AWS instance I created to build PG/Win with readline
> back in November.
> I could give you access to that...  (you are not the only person who has
> made this statement here).
> I've made such instances available for other Open Source developers, to
> support them.
> 
>  Obvi I would share connection credentials privately.
> 

I'd rather leave the Windows stuff up to someone with more experience
with that platform. I have plenty of other stuff on my plate atm.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: zstd compression for pg_dump

От

Jacob Champion

Дата:

29 марта 2023 г., 01:33:25

On Wed, Mar 15, 2023 at 9:50 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> > It looks like pg_dump's meson.build is missing dependencies on zstd
> > (meson couldn't find the headers in the subproject without them).
>
> I saw that this was added for LZ4, but I hadn't added it for zstd since
> I didn't run into an issue without it.  Could you check that what I've
> added works for your case ?

I thought I replied to this, sorry -- your newest patchset builds
correctly with subprojects, so the new dependency looks good to me.
Thanks!

> > Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> > handling safety for this, by isolating each thread's state. I don't feel
> > comfortable pronouncing this new addition safe or not, because I'm not
> > sure I understand what the comments in the format-specific _Clone()
> > callbacks are saying yet.
>
> My line of reasoning for unix is that pg_dump forks before any calls to
> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> doesn't apply to pg_dump under windows.  This is an opened question.  If
> there's no solid answer, I could disable/ignore the option (maybe only
> under windows).

To (maybe?) move this forward a bit, note that pg_backup_custom's
_Clone() function makes sure that there is no active compressor state
at the beginning of the new thread. pg_backup_directory's
implementation has no such provision. And I don't think it can,
because the parent thread might have concurrently set one up -- see
the directory-specific implementation of _CloseArchive(). Perhaps we
should just NULL out those fields after the copy, instead?

To illustrate why I think this is tough to characterize: if I've read
the code correctly, the _Clone() and CloneArchive() implementations
are running concurrently with code that is actively modifying the
ArchiveHandle and the lclContext. So safety is only ensured to the
extent that we keep track of which fields threads are allowed to
touch, and I don't have that mental model.

--Jacob

Re: zstd compression for pg_dump

От

Justin Pryzby

Дата:

29 марта 2023 г., 16:35:28

On Tue, Mar 28, 2023 at 02:03:49PM -0400, Kirk Wolak wrote:
> On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > On 3/27/23 19:28, Justin Pryzby wrote:
> > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> > >> On 3/16/23 05:50, Justin Pryzby wrote:
> > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> > >>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> > >>>>> I did some smoke testing against zstd's GitHub release on Windows. To
> > ...
> > OK. I don't have access to a Windows machine so I can't test that. Is it
> > possible to disable the zstd threading, until we figure this out?
>
> Thomas since I appear to be one of the few windows users (I use both), can I help?
> I can test pg_dump... for you, easy to do.  I do about 5-10 pg_dumps a day
> on windows while developing.

It'd be great if you'd exercise this and other changes to
pg_dump/restore.  Tomas just pushed a bugfix, so be sure to "git pull"
before testing, or else you might rediscover the bug.

If you have a zstd library with thread support, you could test with
-Z zstd:workers=3.  But I think threads aren't enabled in the common
libzstd packages.  Jacob figured out how to compile libzstd easily using
"meson wraps" - but I don't know the details.

-- 
Justin

Re: zstd compression for pg_dump

От

Jacob Champion

Дата:

29 марта 2023 г., 18:10:18

On Wed, Mar 29, 2023 at 6:35 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> If you have a zstd library with thread support, you could test with
> -Z zstd:workers=3.  But I think threads aren't enabled in the common
> libzstd packages.  Jacob figured out how to compile libzstd easily using
> "meson wraps" - but I don't know the details.

From the source root,

    $ mkdir subprojects
    $ meson wrap install zstd

From then on, Meson was pretty automagical about it during the ninja
build. The subproject's settings are themselves inspectable and
settable via `meson configure`:

    $ meson configure -Dzstd:<option>=<value>

--Jacob

Re: zstd compression for pg_dump

От

Justin Pryzby

Дата:

01 апреля 2023 г., 02:16:31

On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
> On 3/27/23 19:28, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >> On 3/16/23 05:50, Justin Pryzby wrote:
> >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <jchampion@timescale.com> wrote:
> >>>>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>>>> build against it, I had to construct an import library, and put that
> >>>>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>>>> which makes me wonder if I've chosen a harder way than necessary?
> >>>>
> >>>> It looks like pg_dump's meson.build is missing dependencies on zstd
> >>>> (meson couldn't find the headers in the subproject without them).
> >>>
> >>> I saw that this was added for LZ4, but I hadn't added it for zstd since
> >>> I didn't run into an issue without it.  Could you check that what I've
> >>> added works for your case ?
> >>>
> >>>>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>>>> pg_restore output is identical to uncompressed dumps and nothing
> >>>>> explodes. I haven't inspected the threading implementation for safety
> >>>>> yet, as you mentioned.
> >>>>
> >>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >>>> handling safety for this, by isolating each thread's state. I don't feel
> >>>> comfortable pronouncing this new addition safe or not, because I'm not
> >>>> sure I understand what the comments in the format-specific _Clone()
> >>>> callbacks are saying yet.
> >>>
> >>> My line of reasoning for unix is that pg_dump forks before any calls to
> >>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> >>> doesn't apply to pg_dump under windows.  This is an opened question.  If
> >>> there's no solid answer, I could disable/ignore the option (maybe only
> >>> under windows).
> >>
> >> I may be missing something, but why would the patch affect this? Why
> >> would it even affect safety of the parallel dump? And I don't see any
> >> changes to the clone stuff ...
> > 
> > zstd supports using threads during compression, with -Z zstd:workers=N.
> > When unix forks, the child processes can't do anything to mess up the
> > state of the parent processes.  
> > 
> > But windows pg_dump uses threads instead of forking, so it seems
> > possible that the pg_dump -j threads that then spawn zstd threads could
> > "leak threads" and break the main thread.  I suspect there's no issue,
> > but we still ought to verify that before declaring it safe.
> 
> OK. I don't have access to a Windows machine so I can't test that. Is it
> possible to disable the zstd threading, until we figure this out?

I think that's what's best.  I made it issue a warning if "workers" was
specified.  It could also be an error, or just ignored.

I considered disabling workers only for windows, but realized that I
haven't tested with threads myself - my local zstd package is compiled
without threading, and I remember having some issue recompiling it with
threading.  Jacob's recipe for using meson wraps works well, but it
still seems better to leave it as a future feature.  I used that recipe
to enabled zstd with threading on CI (except for linux/autoconf).

> >>> The function is first checking if it was passed a filename which already
> >>> has a suffix.  And if not, it searches through a list of suffixes,
> >>> testing for an existing file with each suffix.  The search with stat()
> >>> doesn't happen if it has a suffix.  I'm having trouble seeing how the
> >>> hasSuffix() branch isn't dead code.  Another opened question.
> >>
> >> AFAICS it's done this way because of this comment in pg_backup_directory
> >>
> >>  * ...
> >>  * ".gz" suffix is added to the filenames. The TOC files are never
> >>  * compressed by pg_dump, however they are accepted with the .gz suffix
> >>  * too, in case the user has manually compressed them with 'gzip'.
> >>
> >> I haven't tried, but I believe that if you manually compress the
> >> directory, it may hit this branch.
> > 
> > That would make sense, but when I tried, it didn't work like that.
> > The filenames were all uncompressed names.  Maybe it worked differently
> > in an older release.  Or maybe it changed during development of the
> > parallel-directory-dump patch and it's actually dead code.
> 
> Interesting. Would be good to find out. I wonder if a little bit of
> git-log digging could tell us more. Anyway, until we confirm it's dead
> code, we should probably do what .gz does and have the same check for
> .lz4 and .zst files.

I found that hasSuffix() and cfopen() originated in the refactored patch
Heikki's sent here; there's no history beyond that.

https://www.postgresql.org/message-id/4D3954C7.9060503%40enterprisedb.com

The patch published there appends the .gz within cfopen(), and the
caller writes into the TOC the filename without ".gz".  It seems like
maybe a few hours prior, Heikki may have been appending the ".gz" suffix
in the caller, and then writing the TOC with filename.gz.

The only way I've been able to get a "filename.gz" passed to hasSuffix
is to write a directory-format dump, with LOs, and without compression,
and then compress the blobs with "gzip", and *also* edit the blobs.toc
file to say ".gz" (which isn't necessary since, if the original file
isn't found, the restore would search for files with compressed
suffixes).

So .. it's not *technically* unreachable, but I can't see why it'd be
useful to support editing the *content* of the blob TOC (other than
compressing it).  I might give some weight to the idea if it were also
possible to edit the non-blob TOC; but, it's a binary file, so no.

For now, I made the change to make zstd and lz4 to behave the same here
as .gz, unless Heikki has a memory or a git reflog going back far enough
to further support the idea that the code path isn't useful.

I'm going to set the patch as RFC as a hint to anyone who would want to
make a final review.

-- 
Justin

On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
> On 4/3/23 21:17, Justin Pryzby wrote:
> > On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
> >>> Feel free to mess around with threads (but I'd much rather see the patch
> >>> progress for zstd:long).
> >>
> >> OK, understood. The long mode patch is pretty simple. IIUC it does not
> >> change the format, i.e. in the worst case we could leave it for PG17
> >> too. Correct?
> > 
> > Right, libzstd only has one "format", which is the same as what's used
> > by the commandline tool.  zstd:long doesn't change the format of the
> > output: the library just uses a larger memory buffer to allow better
> > compression.  There's no format change for zstd:workers, either.
> 
> OK. I plan to do a bit more review/testing on this, and get it committed
> over the next day or two, likely including the long mode. One thing I
> noticed today is that maybe long_distance should be a bool, not int.
> Yes, ZSTD_c_enableLongDistanceMatching() accepts int, but it'd be
> cleaner to cast the value during a call and keep it bool otherwise.

Thanks for noticing.  Evidently I wrote it using "int" to get the
feature working, and then later wrote the bool parsing bits but never
changed the data structure.

This also updates a few comments, indentation, removes a useless
assertion, and updates the warning about zstd:workers.

-- 
Justin

Justin Pryzby <pryzby@telsasoft.com> writes:
> On Thu, Apr 06, 2023 at 05:34:30PM +0200, Tomas Vondra wrote:
>> I think that's all for PG16 in this patch series.  If there's more we want to
>> do, it'll have to wait for PG17 -

> Yes

Shouldn't the CF entry be closed as committed?  It's certainly
making the cfbot unhappy because the patch-of-record doesn't apply.

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: zstd compression for pg_dump

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения