Thread: [PING] fallocate() causes btrfs to never compress postgresql files

[PING] fallocate() causes btrfs to never compress postgresql files

From
Dimitrios Apostolou
Date:
Hello, sorry for mass sending this, but I didn't get any response to my
first email [1] so I'm now CC'ing the author of commit 4d330a6 [2] and the
reviewers. I think it's an important issue, because I need to
custom-compile postgresql to have what I had before: a transparently
compressed database.

[1] https://www.postgresql.org/message-id/d0f4fc11-969d-7b3a-aacf-00f86450e738@gmx.net
[2] https://github.com/postgres/postgres/commit/4d330a61bb1969df31f2cebfe1ba9d1d004346d8

My previous message follows:

Hi,

this is just a heads-up about files being generated by PostgreSQL 17 not
being compressed by Btrfs, even when mounted with the compress-force mount
option. I have this occurring aggressively when restoring a database via
pg_restore. I think this is caused by mdzeroextend() calling FileFallocate(),
which in turn invokes posix_fallocate().

I also verified that turning off the use of fallocate causes the database
to write compressed files again, like it did in older versions.
Unfortunately the only way I found was to configure with a "hack" so that
autoconf thinks the feature is not available:

    ./configure ac_cv_func_posix_fallocate=no

There have been discussions on the btrfs mailing list about why it does
that; the summary is that it is very difficult to guarantee that
compressed writes will not fail with ENOSPC on a CoW filesystem, thus
files with fallocate()d ranges are treated as being marked NOCOW,
effectively disabling compression.

Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
it the filesystem at fault for not returning EOPNOTSUPP, in which case
postgres would use its fallback code?

BTW even in the last case, PostgreSQL would not notice the lack of
fallocate() support as glibc implements a userspace fallback in
posix_fallocate(). That fallback has its own issues that hopefully will
not affect postgres (see CAVEATS in man 3 posix_fallocate).
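
For illustration, the "try fallocate, otherwise fall back to writing zeros" behaviour discussed here can be sketched roughly as follows. This is a minimal Python sketch, not PostgreSQL's code: `extend_with_fallback` and `write_zeros` are hypothetical names, and the 8192-byte block size is assumed to be the PostgreSQL default. As noted above, glibc's userspace emulation inside posix_fallocate() means the EOPNOTSUPP branch may never actually be reached.

```python
import errno
import os

BLOCK_SIZE = 8192  # assumed: PostgreSQL's default block size


def write_zeros(fd, offset, length):
    """Extend the file by explicitly writing zero-filled blocks."""
    zeros = bytes(BLOCK_SIZE)
    end = offset + length
    while offset < end:
        chunk = min(BLOCK_SIZE, end - offset)
        # os.pwrite() may write fewer bytes than requested, so advance
        # by the number of bytes actually written.
        offset += os.pwrite(fd, zeros[:chunk], offset)


def extend_with_fallback(fd, offset, length):
    """Try posix_fallocate(); fall back to zero writes if it is unsupported.

    Returns the name of the method that was used.
    """
    try:
        os.posix_fallocate(fd, offset, length)
        return "fallocate"
    except OSError as e:
        # Only "not supported" style errors trigger the fallback;
        # real failures such as ENOSPC are re-raised.
        if e.errno not in (errno.EOPNOTSUPP, errno.EINVAL):
            raise
    write_zeros(fd, offset, length)
    return "write"
```

On a filesystem that accepts the call, the function returns "fallocate", which is exactly the case where Btrfs stops compressing the file.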

Regards,
Dimitris



Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Tomas Vondra
Date:
On 5/28/25 16:22, Dimitrios Apostolou wrote:
> Hello, sorry for mass sending this, but I didn't get any response to my
> first email [1] so I'm now CC'ing the commit's 4d330a6 [2] author and
> the reviewers. I think it's an important issue, because I need to
> custom-compile postgresql to have what I had before: a transparently
> compressed database.
> 

That message arrived a couple days before the feature freeze, so
everyone was busy with getting PG18 patches over the line. I assume
that's why no one responded to a message about an issue that already
affects PG17. We're in the quieter part of the dev cycle, people are
recovering etc. Hence the delay.

> [1] https://www.postgresql.org/message-id/d0f4fc11-969d-7b3a-aacf-00f86450e738@gmx.net
> [2] https://github.com/postgres/postgres/commit/4d330a61bb1969df31f2cebfe1ba9d1d004346d8
> 
> My previous message follows:
> 
> Hi,
> 
> this is just a heads-up about files being generated by PostgreSQL 17 not
> being compressed by Btrfs, even when mounted with the compress-force mount
> option. I have this occurring aggressively when restoring a database via
> pg_restore. I think this is caused by mdzeroextend() calling FileFallocate(),
> which in turn invokes posix_fallocate().
> 

Right, I don't think we're really using posix_fallocate() in other
places, or at least not in places that would matter. And this code comes
from commit 4d330a61bb in PG17:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4d330a61bb1969df31f2cebfe1ba9d1d004346d8

The commit message explains why we do that - it has advantages when
allocating a large number of blocks. FWIW it's general code, used whenever
we need to add space to a relation, not just for pg_restore.


> I also verified that turning off the use of fallocate causes the database
> to write compressed files again, like it did in older versions.
> Unfortunately the only way I found was to configure with a "hack" so that
> autoconf thinks the feature is not available:
> 
>    ./configure ac_cv_func_posix_fallocate=no
> 

Unfortunately, that seems pretty heavy-handed, because it will affect
the whole build, no matter which filesystem it gets used with. And I
guess we don't want to disable posix_fallocate() just because one
filesystem does something ... strange.

> There have been discussions on the btrfs mailing list about why it does
> that; the summary is that it is very difficult to guarantee that
> compressed writes will not fail with ENOSPC on a CoW filesystem, thus
> files with fallocate()d ranges are treated as being marked NOCOW,
> effectively disabling compression.
> 

Isn't guaranteeing success of a write a general issue with compressed
filesystems? Why is posix_fallocate() special in this regard?
Shouldn't the filesystem be defensive and assume the data is not
compressible? Or maybe just return EOPNOTSUPP when in doubt.

> Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
> it the filesystem at fault for not returning EOPNOTSUPP, in which case
> postgres would use its fallback code?
> 

I don't have a clear opinion on whether it's a filesystem issue. Maybe
we should be handling this differently, not sure.

> BTW even in the last case, PostgreSQL would not notice the lack of
> fallocate() support as glibc implements a userspace fallback in
> posix_fallocate(). That fallback has its own issues that hopefully will
> not affect postgres (see CAVEATS in man 3 posix_fallocate).
> 

Well, if btrfs starts returning EOPNOTSUPP, and glibc switches to the
userspace fallback, we wouldn't notice. But that's up to the btrfs to
decide if they want to support fallocate. We still need our fallback
anyway, because of other OSes.


regards

-- 
Tomas Vondra




Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Dimitrios Apostolou
Date:
On Wed, 28 May 2025, Tomas Vondra wrote:
>
> Isn't guaranteeing success of a write a general issue with compressed
> filesystem? Why is posix_fallocate() any special in this regard?
> Shouldn't the filesystem be defensive and assume the data is not
> compressible? Or maybe just return EOPNOTSUPP when in doubt.

It's not simple for CoW filesystems, including Btrfs and ZFS. As far as I
know, the current design is a compromise; it's not that the developers
are happy with it. I can point you to some discussion, with pointers to
further discussions if you are interested:

https://marc.info/?l=linux-btrfs&m=174310663519516&w=2

>> BTW even in the last case, PostgreSQL would not notice the lack of
>> fallocate() support as glibc implements a userspace fallback in
>> posix_fallocate(). That fallback has its own issues that hopefully will
>> not affect postgres (see CAVEATS in man 3 posix_fallocate).
>>
>
> Well, if btrfs starts returning EOPNOTSUPP, and glibc switches to the
> userspace fallback, we wouldn't notice. But that's up to the btrfs to
> decide if they want to support fallocate. We still need our fallback
> anyway, because of other OSes.

Btrfs decided a few years back: they will "support" fallocate, but
because real support is very difficult, they disable compression (among
other things) for files with fallocate'd ranges. They can't change that and
return EOPNOTSUPP out of the blue now, but they are open to adding a mount
option to optionally do that:

https://marc.info/?l=linux-btrfs&m=174310663519516&w=2


>> Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
>> it the filesystem at fault for not returning EOPNOTSUPP, in which case
>> postgres would use its fallback code?
>>
>
> I don't have a clear opinion on whether it's a filesystem issue. Maybe
> we should be handling this differently, not sure.

All I'm saying is that this is a regression for PostgreSQL users that keep
tablespaces on compressed Btrfs. What could be done on the postgres side is
to provide a runtime setting for avoiding fallocate(), going instead through
the old code path. Ideally this would be an option per tablespace, but even
a global one is better than nothing.




Thanks,
Dimitris




Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Thomas Munro
Date:
Or for a completely different approach: I wonder if ftruncate() would
be more efficient on COW systems anyway.  The minimum thing we need is
for the file system to remember the new size, 'cause, erm, we don't.
All the rest is probably a waste of cycles, since they reserve real
space (or fail to) later in the checkpointer or whatever process
eventually writes the data out.



Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Tomas Vondra
Date:
On 5/31/25 16:00, Thomas Munro wrote:
> On Fri, May 30, 2025 at 3:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
>> All I'm saying is that this is a regression for PostgreSQL users that keep
>> tablespaces on compressed Btrfs. What could be done from postgres, is to
>> provide a runtime setting for avoiding fallocate(), going instead through
>> the old code path. Ideally this would be an option per tablespace, but even
>> a global one is better than nothing.
> 
> Here's an initial sketch of such a setting.  Better name, design,
> words welcome.  Would need a bit more work to cover temp tables too.
> It's slightly tricky to get smgr to behave differently because of the
> contents of a system catalogue!  I couldn't think of a better way than
> exposing it as a flag that the buffer manager layer has to know about
> and compute earlier, but that also seems a bit strange, as fallocate
> is a highly md.c specific concern.  Hmm.
> 

I find the definition of io_min_fallocate confusing, or rather that 0
means "never" instead of "always". It's described as a "threshold at
which to start using fallocate", so I'd expect 0 to mean "always"
because (len >= 0).

I suggest using "-1" to mean never and "0" always, as for other similar
settings (e.g. log_min_duration_statement or log_lock_waits).
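
The semantics suggested here can be sketched in a few lines. This is only an illustration of the proposal ("-1" never, "0" always, otherwise a block-count threshold); `use_fallocate` is a hypothetical helper, and it does not describe what the posted patch actually implements.

```python
def use_fallocate(nblocks, io_min_fallocate):
    """Decide whether an extension of `nblocks` blocks should use fallocate.

    Proposed semantics: -1 disables fallocate entirely, 0 means "always"
    (because nblocks >= 0 holds for every request), and any positive value
    is the minimum number of blocks at which fallocate starts being used.
    """
    if io_min_fallocate < 0:
        return False
    return nblocks >= io_min_fallocate
```

With this reading, the value 0 needs no special case: it falls out of the `>=` comparison, which is the point of the suggestion.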

> I suppose something like the 0001 part could be back-patched if this
> is considered a serious enough problem without other workarounds, so I
> did this in two steps.  I wonder if there are good reasons to want to
> change the number on other file systems.  I suppose it at least allows
> experimentation.

Maybe. It'd need to get some of the 0002 bits too, ofc.

I'm not sure we really want all these special GUCs tailored for different
filesystems. We already have a few such GUCs, it's getting tricky to
know which ones to set / not set, and it also changes with the
filesystem version ... I personally don't know which ones to set, a lot
of the knowledge is somewhat outdated I think.

Wouldn't it be better for btrfs to just start returning EOPNOTSUPP
(maybe with a mount option), in which case we already do the right thing
automatically? Sure, it means the admin needs to be aware of
this in both cases.


regards

-- 
Tomas Vondra




Thomas Munro <thomas.munro@gmail.com> writes:
> It's slightly tricky to get smgr to behave differently because of the
> contents of a system catalogue!

The mere thought makes me blanch.  I'm okay with the GUC part,
but I do not think we should put in 0002 --- the odds of
causing serious problems greatly outweigh the value, IMO.
Fundamental layering violations tend to bite you on tender
parts of your anatomy.

            regards, tom lane



Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Jakub Wartak
Date:
On Sat, May 31, 2025 at 4:33 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> On 5/31/25 16:00, Thomas Munro wrote:
> > On Fri, May 30, 2025 at 3:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
> >> All I'm saying is that this is a regression for PostgreSQL users that keep
> >> tablespaces on compressed Btrfs. What could be done from postgres, is to
> >> provide a runtime setting for avoiding fallocate(), going instead through
> >> the old code path. Ideally this would be an option per tablespace, but even
> >> a global one is better than nothing.
> >
> > Here's an initial sketch of such a setting.  Better name, design,
> > words welcome.  Would need a bit more work to cover temp tables too.
> > It's slightly tricky to get smgr to behave differently because of the
> > contents of a system catalogue!  I couldn't think of a better way than
> > exposing it as a flag that the buffer manager layer has to know about
> > and compute earlier, but that also seems a bit strange, as fallocate
> > is a highly md.c specific concern.  Hmm.
> >
>
> I find the definition of io_min_fallocate confusing, [..]

Thanks to Thomas for providing the patch, but, same here, my take is
that making this a GUC that takes a number, instead of a simple on/off
switch, makes it less understandable. I think io_fallocate=on/off would
be easier for users.

> > I suppose something like the 0001 part could be back-patched if this
> > is considered a serious enough problem without other workarounds, so I
> > did this in two steps.  I wonder if there are good reasons to want to
> > change the number on other file systems.  I suppose it at least allows
> > experimentation.
>
> Maybe. It'd need to get some of the 0002 bits too, ofc.
>
> I'm not sure we really want all these special GUC tailored for different
> filesystems. We already have a few such GUCs, it's getting tricky to
> know which ones to set / not set, and it also changes with the
> filesystem version ... I personally don't know which ones to set, a lot
> of the knowledge is somewhat outdated I think.

Well, XFS also got quite a few reports of regressions due to
fallocate() being used [1], but there you could at least try to
mitigate it. I don't think we'll be able to get away without it and
the genie is already out of the bottle as the kernels are already
widely used (well, in theory we could add a capability that would help
set some of those internal switches based on statfs(path).f_type,
but realistically we would still need to have the ability to override
anyway).
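
A rough sketch of what such filesystem detection could look like follows. Python's os.statvfs() does not report the filesystem type, so instead of statfs() this sketch assumes a Linux-style mount table (the format of /proc/self/mounts) and picks the longest mount point containing the path. `fs_type_for` is a hypothetical helper, it does not resolve symlinks, and it ignores the octal escapes that the kernel uses for spaces in mount points.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is expected in /proc/self/mounts format:
    "<device> <mount point> <fs type> <options> <dump> <pass>" per line.
    """
    path = os.path.abspath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with a trailing slash so that /data does not match /database.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # ">=" lets a later mount on the same mount point win.
            if len(mount_point) >= best_len:
                best_len, best_type = len(mount_point), fs_type
    return best_type


if __name__ == "__main__" and os.path.exists("/proc/self/mounts"):
    with open("/proc/self/mounts") as f:
        print(fs_type_for(".", f.read()))
```

Even with detection like this, an explicit override would still be needed, as noted above.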

-J.

[1] - https://www.postgresql.org/message-id/CADofcAV8xu3hCNHq7-7x56KrP9rD6%3DA04%3DqjTr3nETh-gptF8w%40mail.gmail.com



Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Dimitrios Apostolou
Date:
On Sun, 1 Jun 2025, Thomas Munro wrote:

> Or for a completely different approach: I wonder if ftruncate() would
> be more efficient on COW systems anyway.  The minimum thing we need is
> for the file system to remember the new size, 'cause, erm, we don't.
> All the rest is probably a waste of cycles, since they reserve real
> space (or fail to) later in the checkpointer or whatever process
> eventually writes the data out.

FWIW I asked the btrfs devs. From
https://github.com/kdave/btrfs-progs/pull/976
I quote Qu Wenruo:

> Only for falloc(), not ftruncate().
>
> The PREALLOC inode flag is added for any preallocated file extent,
> meanwhile truncate only creates holes.
>
> truncate is fast but it's really different from fallocate by there is
> nothing really allocated.
>
> This means the later writes will need to allocate their own data
> extents. This is fine and even preferred for btrfs, but may lead to
> performance drop for more traditional fses.
>
> We're in an era that fs features are not longer that generic, fallocate
> is just one example, in fact fallocate will cause more problems more
> than no compression.
>
> It's really a deep rabbit hole, and is not something simple true or
> false questions.


In other words, btrfs will not try to allocate anything with ftruncate(),
it will just mark the new space as a "hole". As such, the file is not
marked as "PREALLOC" which is what disables compression. Of course there
is no guarantee that further writes will succeed, and as quoted above,
other (non-COW) filesystems might be slower writing the
ftruncate()-allocated space.
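
The difference can be observed from userspace. The following is a minimal Python sketch (the helper names are made up for illustration) that extends an empty temporary file with each call and reports the logical size and the number of allocated blocks. The block counts depend on the filesystem, so no particular output is claimed here.

```python
import os
import tempfile


def allocated_after(extend, size=1 << 20):
    """Extend an empty temp file via `extend(fd, size)`.

    Returns (st_size, st_blocks): the logical length and the number of
    512-byte blocks the filesystem reports as allocated.
    """
    with tempfile.TemporaryFile() as f:
        extend(f.fileno(), size)
        st = os.fstat(f.fileno())
        return st.st_size, st.st_blocks


def by_ftruncate(fd, size):
    # Only records the new length; the new range is a hole.
    os.ftruncate(fd, size)


def by_fallocate(fd, size):
    # Asks the filesystem to reserve space for the whole range.
    os.posix_fallocate(fd, 0, size)


if __name__ == "__main__":
    for name, fn in (("ftruncate", by_ftruncate),
                     ("posix_fallocate", by_fallocate)):
        size, blocks = allocated_after(fn)
        print(f"{name}: st_size={size} st_blocks={blocks}")
```

Both calls produce the same logical size; what differs is whether anything is reserved on disk, which is the property that leads Btrfs to set the PREALLOC flag.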


Regards,
Dimitris




Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Thomas Munro
Date:
On Mon, Jun 2, 2025 at 10:14 PM Dimitrios Apostolou <jimis@gmx.net> wrote:
> On Sun, 1 Jun 2025, Thomas Munro wrote:
> > Or for a completely different approach: I wonder if ftruncate() would
> > be more efficient on COW systems anyway.  The minimum thing we need is
> > for the file system to remember the new size, 'cause, erm, we don't.
> > All the rest is probably a waste of cycles, since they reserve real
> > space (or fail to) later in the checkpointer or whatever process
> > eventually writes the data out.
>
> FWIW I asked the btrfs devs. From
> https://github.com/kdave/btrfs-progs/pull/976
> I quote Qu Wenruo:
>
> > Only for falloc(), not ftruncate().
> >
> > The PREALLOC inode flag is added for any preallocated file extent,
> > meanwhile truncate only creates holes.
> >
> > truncate is fast but it's really different from fallocate by there is
> > nothing really allocated.
> >
> > This means the later writes will need to allocate their own data
> > extents. This is fine and even preferred for btrfs, but may lead to
> > performance drop for more traditional fses.
> >
> > We're in an era that fs features are not longer that generic, fallocate
> > is just one example, in fact fallocate will cause more problems more
> > than no compression.
> >
> > It's really a deep rabbit hole, and is not something simple true or
> > false questions.
>
>
> In other words, btrfs will not try to allocate anything with ftruncate(),
> it will just mark the new space as a "hole". As such, the file is not
> marked as "PREALLOC" which is what disables compression. Of course there
> is no guarantee that further writes will succeed, and as quoted above,
> other (non-COW) filesystems might be slower writing the
> ftruncate()-allocated space.

Yeah, right, I know.  But PostgreSQL has at least two different goals
when extending a relation:

1.  Remember the new size of the relation somewhere*.
2.  Reserve space now, so that we can report ENOSPC and roll back the
transaction that wants to extend the relation when the disk is full,
instead of causing a checkpoint or buffer eviction to fail later (see
https://wiki.postgresql.org/wiki/ENOSPC for longer version).

But the second thing just can't work on a COW system by definition, so
the whole notion is bogus, which is why I wondered if ftruncate() is
actually a reasonable option to have, even though it just creates
holes (on Unixen).  I also know of another completely different reason
to want to use ftruncate(): NTFS, which *doesn't* create holes (NTFS
supports holes via other syscalls, but ftruncate() or rather
_chsize_s() as they spell it doesn't make them), making it more like
posix_fallocate() in this usage.  So I was beginning to wonder if we
might want to experiment with a patch that adds
file_extend_method=fallocate,ftruncate,write.  Perhaps accompanied by
a threshold setting below which it always writes.  Then we could
experiment with various COW file systems (zfs, btrfs, apfs, refs, ???)
and NTFS to see how that speculation works out in reality.

Wild speculation: To actually achieve the second thing on a COW file
system, you'd probably need some totally new kind of interface,
because that POSIX interface has the wrong shape.  I have wondered
about a new fcntl() or whatever that would let you reserve the right
to write N blocks (ie just once!) without ENOSPC on a given
descriptor, that a database could conceptually acquire when dirtying
buffers, since that's the point at which we know that a write must
eventually happen (then probably amortise that accounting a lot),
including but not limited to this relation-extension case, and that
way you could achieve goal #2, ie transferring ENOSPC errors to
transaction time.  But that's just a daydream about vapourware.  One
problem is that PostgreSQL has many processes with separate file
descriptors, so that'd make the bookkeeping trickier but not
impossible.

(*That has a few known issues...)



Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Bruce Momjian
Date:
On Sun, Jun  1, 2025 at 02:00:17AM +1200, Thomas Munro wrote:
> I suppose something like the 0001 part could be back-patched if this
> is considered a serious enough problem without other workarounds, so I
> did this in two steps.  I wonder if there are good reasons to want to
> change the number on other file systems.  I suppose it at least allows
> experimentation.

Consider that postgresql.conf is installed by initdb, so backpatching
this is not going to add the setting to postgresql.conf unless we do
some magic.  That will be confusing to users.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Dimitrios Apostolou
Date:
On Tue, 3 Jun 2025, Thomas Munro wrote:

> On Mon, Jun 2, 2025 at 10:14 PM Dimitrios Apostolou <jimis@gmx.net> wrote:
>> On Sun, 1 Jun 2025, Thomas Munro wrote:
>>> Or for a completely different approach: I wonder if ftruncate() would
>>> be more efficient on COW systems anyway.  The minimum thing we need is
>>> for the file system to remember the new size, 'cause, erm, we don't.
>>> All the rest is probably a waste of cycles, since they reserve real
>>> space (or fail to) later in the checkpointer or whatever process
>>> eventually writes the data out.
>>
>> FWIW I asked the btrfs devs. From
>> https://github.com/kdave/btrfs-progs/pull/976
>> I quote Qu Wenruo:
>>
>>> Only for falloc(), not ftruncate().
>>>
>>> The PREALLOC inode flag is added for any preallocated file extent,
>>> meanwhile truncate only creates holes.
>>>
>>> truncate is fast but it's really different from fallocate by there is
>>> nothing really allocated.
>>>
>>> This means the later writes will need to allocate their own data
>>> extents. This is fine and even preferred for btrfs, but may lead to
>>> performance drop for more traditional fses.
>>>
>>> We're in an era that fs features are not longer that generic, fallocate
>>> is just one example, in fact fallocate will cause more problems more
>>> than no compression.
>>>
>>> It's really a deep rabbit hole, and is not something simple true or
>>> false questions.
>>
>>
>> In other words, btrfs will not try to allocate anything with ftruncate(),
>> it will just mark the new space as a "hole". As such, the file is not
>> marked as "PREALLOC" which is what disables compression. Of course there
>> is no guarantee that further writes will succeed, and as quoted above,
>> other (non-COW) filesystems might be slower writing the
>> ftruncate()-allocated space.
>
> Yeah, right, I know.  But PostgreSQL has at least two different goals
> when extending a relation:
>
> 1.  Remember the new size of the relation somewhere*.
> 2.  Reserve space now, so that we can report ENOSPC and roll back the
> transaction that wants to extend the relation when the disk is full,
> instead of causing a checkpoint or buffer eviction to fail later (see
> https://wiki.postgresql.org/wiki/ENOSPC for longer version).
>
> But the second thing just can't work on a COW system by definition, so
> the whole notion is bogus, which is why I wondered if ftruncate() is
> actually a reasonable option to have, even though it just creates
> holes (on Unixen).  I also know of another completely different reason
> to want to use ftruncate(): NTFS, which *doesn't* create holes (NTFS
> supports holes via other syscalls, but ftruncate() or rather
> _chsize_s() as they spell it doesn't make them), making it more like
> posix_fallocate() in this usage.  So I was beginning to wonder if we
> might want to experiment with a patch that adds
> file_extend_method=fallocate,ftruncate,write.  Perhaps accompanied by
> a threshold setting below which it always writes.

This sounds like the best solution IMO. People can then experiment with
different settings and filesystems, and that way we also learn in the
process. Thank you for the effort and patches so far.

Dimitris

Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Thomas Munro
Date:
On Tue, Jun 3, 2025 at 1:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
> This sounds like the best solution IMO. People can then experiment with
> different settings and filesystems, and that way we also learn in the
> process. Thank you for the effort and patches so far.

OK, here's a basic patch to experiment with.  You can set:

file_extend_method = fallocate,ftruncate,write
file_extend_method_threshold = 8 # (below 8 always write, 0 means never write)
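
One possible reading of these two settings, sketched in Python rather than taken from the patch: requests smaller than the threshold are always written out as zeros, and larger ones use the configured method. `choose_method`, `extend_file` and the 8192-byte block size are assumptions made for this illustration only.

```python
import os

BLOCK_SIZE = 8192  # assumed: PostgreSQL's default block size
METHODS = ("fallocate", "ftruncate", "write")


def choose_method(nblocks, file_extend_method, threshold):
    """Pick the extension method for a request of `nblocks` blocks.

    Below the threshold we always write zeros. A threshold of 0 never
    forces "write", because nblocks < 0 is never true.
    """
    if file_extend_method not in METHODS:
        raise ValueError(f"unknown method: {file_extend_method}")
    if nblocks < threshold:
        return "write"
    return file_extend_method


def extend_file(fd, nblocks, file_extend_method="fallocate", threshold=8):
    """Append `nblocks` blocks to the file and return the method used."""
    offset = os.fstat(fd).st_size
    length = nblocks * BLOCK_SIZE
    method = choose_method(nblocks, file_extend_method, threshold)
    if method == "fallocate":
        os.posix_fallocate(fd, offset, length)
    elif method == "ftruncate":
        os.ftruncate(fd, offset + length)
    else:
        zeros = bytes(BLOCK_SIZE)
        end = offset + length
        while offset < end:
            chunk = min(BLOCK_SIZE, end - offset)
            # Advance by the bytes actually written (pwrite may be partial).
            offset += os.pwrite(fd, zeros[:chunk], offset)
    return method
```

For example, with file_extend_method = ftruncate and the threshold at 8, a 4-block extension is written as zeros while a 16-block extension only changes the file length.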

To really make COPY fly we also need to get write combining and AIO
going (we've had this working with various prototypes, but it all
missed the boat for v18 which can only do that stuff for reads).  Then
you'll have concurrent 128kB or up to 1MB writes trundling along in
the background which I guess should work pretty nicely for stuff like
BTRFS/ZFS and compression and all that jazz.

Attachments

Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Dimitrios Apostolou
Date:
On Mon, 9 Jun 2025, Thomas Munro wrote:

> On Tue, Jun 3, 2025 at 1:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
>> This sounds like the best solution IMO. People can then experiment with
>> different settings and filesystems, and that way we also learn in the
>> process. Thank you for the effort and patches so far.
>
> OK, here's a basic patch to experiment with.  You can set:
>
> file_extend_method = fallocate,ftruncate,write
> file_extend_method_threshold = 8 # (below 8 always write, 0 means never write)
>

I applied the patch on PostgreSQL v17 and am testing it now. I chose the
ftruncate method and I see ftruncate in action using strace while doing
pg_restore of a big database. Nothing unexpected has happened so far. I
also verified that files are being compressed, obeying Btrfs's mount
option compress=zstd.

Thanks for the patch! What are the odds of committing it to v17?

Dimitris


Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Dimitrios Apostolou
Date:
On Thu, 12 Jun 2025, Dimitrios Apostolou wrote:

> On Mon, 9 Jun 2025, Thomas Munro wrote:
>
>>  On Tue, Jun 3, 2025 at 1:58 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
>>>  This sounds like the best solution IMO. People can then experiment with
>>>  different settings and filesystems, and that way we also learn in the
>>>  process. Thank you for the effort and patches so far.
>>
>>  OK, here's a basic patch to experiment with.  You can set:
>>
>>  file_extend_method = fallocate,ftruncate,write
>>  file_extend_method_threshold = 8 # (below 8 always write, 0 means never
>>  write)
>>
>
> I applied the patch on PostgreSQL v17 and am testing it now. I chose
> ftruncate method and I see ftruncate in action using strace while doing
> pg_restore of a big database. Nothing unexpected has happened so far. I also
> verified that files are being compressed, obeying Btrfs's mount option
> compress=zstd.
>
> Thanks for the patch! What are the odds of committing it to v17?

Ping. :-)
The patch behaves well for me. Any chance of applying it and backporting it?

>
> Dimitris
>
>

Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Thomas Munro
Date:
On Fri, Jul 11, 2025 at 5:39 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
> > I applied the patch on PostgreSQL v17 and am testing it now. I chose
> > ftruncate method and I see ftruncate in action using strace while doing
> > pg_restore of a big database. Nothing unexpected has happened so far. I also
> > verified that files are being compressed, obeying Btrfs's mount option
> > compress=zstd.
> >
> > Thanks for the patch! What are the odds of committing it to v17?
>
> Ping. :-)
> Patch behaves good for me. Any chance of applying it and backporting it?

Yeah, this seems to make sense, as it is a pretty bad regression for
people who are counting on BTRFS compression for their large database.
Not so sure about the threshold bit -- I'd probably leave that out of
the backport in the interest of stable branch-minimalism.  Anyone have
any better ideas, better naming, or objections?



Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Dimitrios Apostolou
Date:
On Friday 2025-07-11 00:45, Thomas Munro wrote:

> On Fri, Jul 11, 2025 at 5:39 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
>>> I applied the patch on PostgreSQL v17 and am testing it now. I chose
>>> ftruncate method and I see ftruncate in action using strace while doing
>>> pg_restore of a big database. Nothing unexpected has happened so far. I also
>>> verified that files are being compressed, obeying Btrfs's mount option
>>> compress=zstd.
>>>
>>> Thanks for the patch! What are the odds of committing it to v17?
>>
>> Ping. :-)
>> Patch behaves good for me. Any chance of applying it and backporting it?
>
> Yeah, this seems to make sense, as it is a pretty bad regression for
> people who are counting on BTRFS compression for their large database.
> Not so sure about the threshold bit -- I'd probably leave that out of
> the backport in the interest of stable branch-minimalism.  Anyone have
> any better ideas, better naming, or objections?

What is the right process to not lose track of this? Should I create a
commitfest entry? Should I keep pinging every couple of weeks? Or is the
patch queued somewhere and I have to wait patiently? If July commitfest
passes, could it miss the next release?

Please forgive my ignorance, but I'm lost with respect to the postgresql
development process. I also have some patches or suggestions of my own
that struggle to get feedback, so I'd appreciate any tips regarding the
development process.


Thank you,
Dimitris

Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Magnus Hagander
Date:


On Fri, Jul 11, 2025 at 12:45 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jul 11, 2025 at 5:39 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
> > > I applied the patch on PostgreSQL v17 and am testing it now. I chose
> > > ftruncate method and I see ftruncate in action using strace while doing
> > > pg_restore of a big database. Nothing unexpected has happened so far. I also
> > > verified that files are being compressed, obeying Btrfs's mount option
> > > compress=zstd.
> > >
> > > Thanks for the patch! What are the odds of committing it to v17?
> >
> > Ping. :-)
> > Patch behaves good for me. Any chance of applying it and backporting it?
>
> Yeah, this seems to make sense, as it is a pretty bad regression for
> people who are counting on BTRFS compression for their large database.
> Not so sure about the threshold bit -- I'd probably leave that out of
> the backport in the interest of stable branch-minimalism.  Anyone have
> any better ideas, better naming, or objections?

Not just to throw a wrench in there, but... Should this perhaps be a tablespace option? ISTM having different filesystems for them is a good reason to use tablespaces in the first place, and then being able to pick different options...


Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Thomas Munro
Date:
On Tue, Jul 29, 2025 at 6:52 PM Magnus Hagander <magnus@hagander.net> wrote:
> Not just to throw a wrench in there, but... Should this perhaps be a tablespace option? ISTM having different
> filesystems for them is a good reason to use tablespaces in the first place, and then being able to pick different
> options...

We discussed that a bit earlier in the thread.  Some problems about
layering violations and general weirdness, I recall trying it even.
On the flip side, is it right to declare very local
filesystem-specific choices in a system catalogue that is replicated
and affects replicas?
What about a fancier GUC that can reference tablespaces?



Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Magnus Hagander
Date:


On Tue, Aug 5, 2025 at 3:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Jul 29, 2025 at 6:52 PM Magnus Hagander <magnus@hagander.net> wrote:
> > Not just to throw a wrench in there, but... Should this perhaps be a tablespace option? ISTM having different filesystems for them is a good reason to use tablespaces in the first place, and then being able to pick different options...
>
> We discussed that a bit earlier in the thread.  Some problems about
> layering violations and general weirdness, I recall trying it even.
> On the flip side, is it right to declare very local
> filesystem-specific choices in a system catalogue that is replicated
> and affects replicas?
> What about a fancier GUC that can reference tablespaces?

Wouldn't that be something that applies to *all* the tablespace configs then, that is, a proper movement of the goalposts? :) Such as being able to set random_page_cost per tablespace to different values on different machines. I agree that it would be useful though.  But it seems like a different patch, if useful, and one that should be generic?


Re: [PING] fallocate() causes btrfs to never compress postgresql files

From
Thomas Munro
Date:
On Fri, Aug 8, 2025 at 1:38 AM Magnus Hagander <magnus@hagander.net> wrote:
> On Tue, Aug 5, 2025 at 3:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> We discussed that a bit earlier in the thread.  Some problems about
>> layering violations and general weirdness, I recall trying it even.
>> On the flip side, is it right to declare very local
>> filesystem-specific choices in a system catalogue that is replicated
>> and affects replicas?
>> What about a fancier GUC that can reference tablespaces?
>
>
> Wouldn't that be something that applies to *all* the tablespace configs then, that is, a proper movement of the
> goalposts? :) Such as being able to set random_page_cost per tablespace to different values on different machines. I
> agree that it would be useful though.  But it seems like a different patch, if useful, and one that should be generic?

Yeah.  And while we're talking pie-in-the-sky future features,
full_page_writes is also describing a property of a particular
server's file system and/or hardware for a given tablespace.  Can't do
much about that today, as it can only be decided by the primary node
that must log full pages or not, but its potential replacement
"atomic_double_write" (as I call it) *can* be chosen on a per-server
basis in a replication chain.  We could probably have done that
independently, but it gets easier with new infrastructure for
streaming large asynchronous combined writes...

To solve Dimitrios's real production issue, I am planning to proceed
with the simple whole-system GUC(s) already posted, after I've done
some light testing on ZFS (which has similar design constraints though
makes different choices) and thought a bit harder about the
Windows/NTFS situation.  I'll post a new version before pushing
anything.  My plan is to have this in the next minor release, unless
the upcoming 18 release forces me to delay it until the one after.

Another thing I noticed is that macOS has its own funky way[1] of
preallocating disk space that looks plausibly relevant.  Not
investigated and not planning to work on that myself necessarily but
it might be worth thinking for a moment about the GUC future-proofing
implications.

[1] https://github.com/libgit2/libgit2/commit/bd132046b04875f928e52d16363fb73f8e85dded