Thread: [HACKERS] pg_basebackup: Allow use of arbitrary compression program

[HACKERS] pg_basebackup: Allow use of arbitrary compression program

From: Michael Harris

Hello,

Back in pg 9.2, we hacked a copy of pg_basebackup to add a command
line option which would allow the user to specify an arbitrary
external program (potentially including arguments) to be used to
compress the tar backup.

Our motivation was to be able to use pigz (parallel gzip
implementation) to speed up the compression. It also allows using
tools like bzip2, xz, etc instead of the inbuilt zlib.

I never ended up submitting that upstream, but now it looks like I
will have to repeat the exercise for 9.6, so I was wondering if such a
feature would be welcomed.

I found one or two references to people asking for this, eg:
https://www.commandprompt.com/blog/a_pg_basebackup_wish_list/

To do it properly would require:

1) Adding command line option as follows:
  -C, --compressprog=PROG
                         Use supplied program for compression

2) The current logic either uses zlib if compiled in, or offers no
compression at all, controlled by a series of #ifdef/#endif. I would
prefer that the user can either use zlib or an external program
without having to recompile, so I would remove the #ifdefs and replace
them with run time branching.

3) When opening the output file, if the -C option was used, use popen
to open a child process and write to that.
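
The popen idea in step 3 can be sketched as follows. This is only an illustration in Python (pg_basebackup itself is written in C), and the function name `write_through_compressor` is hypothetical, not part of any patch. It assumes the external program reads stdin and writes stdout.

```python
import subprocess


def write_through_compressor(command, data_chunks, out_path):
    """Pipe data_chunks through a shell command; store its stdout in out_path."""
    with open(out_path, "wb") as out_file:
        # shell=True mirrors popen(), which hands the string to the shell,
        # so the command may carry its own arguments (e.g. "pigz -p 4").
        child = subprocess.Popen(command, shell=True,
                                 stdin=subprocess.PIPE, stdout=out_file)
        try:
            for chunk in data_chunks:
                child.stdin.write(chunk)
        finally:
            child.stdin.close()  # signal EOF so the compressor can finish
        status = child.wait()
    if status != 0:
        raise RuntimeError("compression program exited with status %d" % status)
```

Checking the exit status matters here: with a pipe, a failing compressor would otherwise leave a truncated file that looks like a successful backup.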

My questions are:
- Has anything like this already been discussed?
- Would this be a welcome contribution?
- Can anyone see any problems with the above approach?

Thanks!

Regards
Mike Harris



Re: [HACKERS] pg_basebackup: Allow use of arbitrary compression program

From: Jeff Janes

On Thu, Apr 6, 2017 at 7:04 PM, Michael Harris <harmic@gmail.com> wrote:
> Hello,
>
> Back in pg 9.2, we hacked a copy of pg_basebackup to add a command
> line option which would allow the user to specify an arbitrary
> external program (potentially including arguments) to be used to
> compress the tar backup.
>
> Our motivation was to be able to use pigz (parallel gzip
> implementation) to speed up the compression. It also allows using
> tools like bzip2, xz, etc instead of the inbuilt zlib.
>
> I never ended up submitting that upstream, but now it looks like I
> will have to repeat the exercise for 9.6, so I was wondering if such a
> feature would be welcomed.

I would welcome it.  I would really like to be able to use parallel pigz and pxz.

You can stream the data into a compression tool of your choice as long as you use tar mode and specify '-D -', but that is incompatible with tablespaces and with xlog streaming, so it is not a very good solution.
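
For reference, the '-D -' workaround amounts to a two-process pipeline. The sketch below is an illustration in Python with a hypothetical helper name `pipe_to_file`; the pg_basebackup invocation in the trailing comment is not executed here and assumes a reachable server and an installed pigz.

```python
import subprocess


def pipe_to_file(producer_argv, compressor_argv, out_path):
    """Run producer | compressor > out_path and return both exit statuses."""
    with open(out_path, "wb") as out_file:
        producer = subprocess.Popen(producer_argv, stdout=subprocess.PIPE)
        compressor = subprocess.Popen(compressor_argv,
                                      stdin=producer.stdout, stdout=out_file)
        # Drop our copy of the pipe so the producer notices if the
        # compressor exits early.
        producer.stdout.close()
        compressor_status = compressor.wait()
        producer_status = producer.wait()
    return producer_status, compressor_status


# Intended use (assumption: no extra tablespaces, no WAL streaming requested):
# pipe_to_file(["pg_basebackup", "-F", "t", "-D", "-"], ["pigz"], "base.tar.gz")
```

Because everything goes through a single stdout stream, there is nowhere to put a second tar file, which is why this route cannot handle additional tablespaces.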

Cheers,

Jeff

Re: [HACKERS] pg_basebackup: Allow use of arbitrary compression program

From: Magnus Hagander

On Fri, Apr 7, 2017 at 4:04 AM, Michael Harris <harmic@gmail.com> wrote:
> Hello,
>
> Back in pg 9.2, we hacked a copy of pg_basebackup to add a command
> line option which would allow the user to specify an arbitrary
> external program (potentially including arguments) to be used to
> compress the tar backup.
>
> Our motivation was to be able to use pigz (parallel gzip
> implementation) to speed up the compression. It also allows using
> tools like bzip2, xz, etc instead of the inbuilt zlib.
>
> I never ended up submitting that upstream, but now it looks like I
> will have to repeat the exercise for 9.6, so I was wondering if such a
> feature would be welcomed.
>
> I found one or two references to people asking for this, eg:
> https://www.commandprompt.com/blog/a_pg_basebackup_wish_list/
>
> To do it properly would require:
>
> 1) Adding command line option as follows:
>
>   -C, --compressprog=PROG
>                          Use supplied program for compression
>
> 2) The current logic either uses zlib if compiled in, or offers no
> compression at all, controlled by a series of #ifdef/#endif. I would
> prefer that the user can either use zlib or an external program
> without having to recompile, so I would remove the #ifdefs and replace
> them with run time branching.

Not sure how that would work or be needed. The reasonable thing would be if zlib is available when building the choices would be "no compression", "zlib compression" or "external compression". If there was no zlib available when building, the choices would be "no compression" or "external compression". 

Or maybe I'm misunderstanding what you're saying?
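
The rule described above can be written down as a tiny function. This is an illustrative Python sketch only; `compression_choices` and the labels it returns are hypothetical names, and in the real C code the zlib case would be decided at build time rather than by a runtime flag.

```python
def compression_choices(built_with_zlib):
    """Return the compression modes a given build could offer."""
    choices = ["none"]
    if built_with_zlib:
        choices.append("zlib")
    # External compression needs no library support, so it is always offered.
    choices.append("external")
    return choices
```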

 
> 3) When opening the output file, if the -C option was used, use popen
> to open a child process and write to that.
>
> My questions are:
> - Has anything like this already been discussed?

I think it has, but not in detail.

 
> - Would this be a welcome contribution?

Yes, I definitely think this would be useful.

 
> - Can anyone see any problems with the above approach?

One thing to consider is the work done recently to ensure that the output is properly synced when written to disk. I don't think it's reasonable to expect that from an external compression program, but if it can be made optional that'd be good. Or at least be careful not to break the current behavior.
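
One way the sync concern could be handled, offered purely as an assumption and not something proposed in this thread, is for the caller to fsync the finished file itself once the external program has exited. The Python sketch below uses a hypothetical helper `fsync_file_and_parent` and assumes a POSIX system where a directory can be opened and fsynced.

```python
import os


def fsync_file_and_parent(path):
    """Flush a finished output file, then its directory entry, to disk."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    # Syncing the parent directory makes the file's name durable as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This only works when the caller knows the final file name, which is not the case if the external command chooses its own output path.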

Re: [HACKERS] pg_basebackup: Allow use of arbitrary compression program

From: Michael Harris

Hi,

Thanks for the feedback!

>> 2) The current logic either uses zlib if compiled in, or offers no
>> compression at all, controlled by a series of #ifdef/#endif. I would
>> prefer that the user can either use zlib or an external program
>> without having to recompile, so I would remove the #ifdefs and replace
>> them with run time branching.
>
>
> Not sure how that would work or be needed. The reasonable thing would be if zlib
> is available when building the choices would be "no compression",
> "zlib compression" or "external compression". If there was no zlib available
> when building, the choices would be "no compression" or "external compression".

That's exactly how I intend it to work. I had thought that the current
structure of the code would not allow that, but looking at it more
closely I see that it does, so I don't have to re-organize the
#ifdefs.

Regards // Mike



Re: [HACKERS] pg_basebackup: Allow use of arbitrary compression program

From: Michael Harris

Hi All,

I have a working prototype now, but there is one aspect I haven't been
able to find the best solution for.

The CLI so far has the following new option:
   -C, --compressprog=PRG   use supplied external program for compression

An example usage would be:
   pg_basebackup -D /home/harmic/tmp/ -C bzip2 -F t

The command string supplied to -C should be a compression command that
reads from stdin and outputs to stdout.

The problem is: when constructing the output filename(s), how can we
give them the correct suffix (.gz / .bz2 / .xz / ...)?

The options I can think of are:
1. Add yet another command line option to specify a suffix
2. Some kind of heuristic to figure it out from the supplied command
   string (from known compression programs, but that will never be
   complete)
3. Don't worry about it, let the user rename them afterwards, in
   which case they would be named xxxx.tar
4. Make the compression command a template, eg. "bzip2 -c > %s.bz2",
   so that the template itself will add the suffix

#4 might also be more flexible for tools that don't support output to
stdout, but it is a bit more complex to use.
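
To make options 2 and 4 concrete, here is a rough Python sketch (the prototype itself is C). The names `KNOWN_SUFFIXES`, `guess_suffix` and `expand_template` are hypothetical, and the table is deliberately incomplete, which is exactly the weakness of the heuristic.

```python
import os
import shlex

# Assumed mapping from program name to conventional suffix; never complete.
KNOWN_SUFFIXES = {
    "gzip": ".gz", "pigz": ".gz",
    "bzip2": ".bz2",
    "xz": ".xz", "pxz": ".xz",
}


def guess_suffix(command, default=""):
    """Option 2: guess the suffix from the program name in the command."""
    argv = shlex.split(command)
    if not argv:
        return default
    program = os.path.basename(argv[0])
    return KNOWN_SUFFIXES.get(program, default)


def expand_template(template, tar_path):
    """Option 4: substitute the shell-quoted output path for %s."""
    return template.replace("%s", shlex.quote(tar_path))
```

With the template approach the user supplies the suffix, so no table is needed, at the cost of the tool no longer knowing the final file name.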

Any other ideas?

Regards // Mike

