Обсуждение: Regression tests fail with musl libc because libpq.so can't be loaded

Поиск

Список

Период

Сортировка

Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

16 марта, 15:38:48

Running the regression tests when building with musl libc fails, with
errors like the following:

ERROR: could not load library
"<builddir>/tmp_install/usr/lib/postgresql/libpqwalreceiver.so": Error
loading shared library libpq.so.5: No such file or directory (needed by
<builddir>/tmp_install/usr/lib/postgresql/libpqwalreceiver.so)

This was observed in Alpine Linux [1] and nixpkgs [2] a few years ago. I
now looked at this a bit and this is what happens:

- The temporary install location is set via LD_LIBRARY_PATH in the
regression tests, so that postgres can find those libs.

- All tests which load an extension / shared module via dlopen() fail,
when the loaded library in turn depends on another library in
tmp_install - I think in practice it's libpq.so all the time.

- LD_LIBRARY_PATH is used correctly to look for the direct dependency
loaded in dlopen(...), but is not taken into account anymore when trying
to locate libpq.so. This step only fails with musl, but works fine with
glibc.

I can reproduce this with a simple Dockerfile (attached), which uses the
library/postgres-alpine image, moves libpq.so to a different folder and
points LD_LIBRARY_PATH at it. Build and run the dockerfile like this:

docker build . -t pg-musl && docker run --rm pg-musl

This Dockerfile can easily be adjusted to work with the debian image -
which shows that doing the same with glibc works just fine.

Even though this originated in "just" the regression tests, I'm filing
this as a bug, because:
- The docs explicitly mention LD_LIBRARY_PATH support to point at a
different /lib folder in [3].
- This can clearly break outside the test-suite as shown with the
Dockerfile.

I tried a few more things:
- When I add an /etc/ld-musl-$(ARCH).path file and add the path to
libpq.so's libdir to it, libpq.so can be found.
- When I add the path to libpq.so as an rpath to the postgres binary,
libpq.so can be found.

Both is not surprising, but just confirms musl-ld actually works as
expected. It's just LD_LIBRARY_PATH that seems to not be passed on.

To rule out a musl bug, I also put together a very simple test-case of
an executable loading liba with dlopen(), which depends on libb and then
constructing the same scenario with LD_LIBRARY_PATH. This works fine
when compiled with glibc and musl, too. Thus, I believe the problem to
be somewhere in how postgres loads those libraries.

Best,

Wolfgang

[1]:

https://github.com/alpinelinux/aports/commit/d67ceb66a1ca9e1899071c9ef09fffba29fa0417#diff-2bd25b5172fc52319de1b09086ac0db6314d2e9fa73497979f5198f8caaec1b9

[2]:
https://github.com/NixOS/nixpkgs/commit/09ffd722072291f00f2a54d7404eb568a15e562a

[3]: https://www.postgresql.org/docs/current/install-post.html

Вложения

Dockerfile

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

16 марта, 17:24:43

Wolfgang Walther <walther@technowledgy.de> writes:
> - LD_LIBRARY_PATH is used correctly to look for the direct dependency 
> loaded in dlopen(...), but is not taken into account anymore when trying 
> to locate libpq.so. This step only fails with musl, but works fine with 
> glibc.

Why do you think this is our bug and not musl's?  We do not even have
any code that knows anything about indirect library dependencies.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

16 марта, 18:55:49

Tom Lane:
> Why do you think this is our bug and not musl's?

Because I tried to reproduce with musl directly with a very simple 
example as mentioned in:

> To rule out a musl bug, I also put together a very simple test-case of an executable loading liba with dlopen(),
whichdepends on libb and then constructing the same scenario with LD_LIBRARY_PATH. This works fine when compiled with
glibcand musl, too. Thus, I believe the problem to be somewhere in how postgres loads those libraries.
 

My test case looked like the attached. To compile it with musl via 
Dockerfile:

   docker build . -t musl-dlopen && docker run --rm musl-dlopen

a.c/a.h is equivalent to libpqwalreceiver and b.c/b.h to libpq.

This works fine with both musl and glibc.

(Note: I also tried putting liba.so and libb.so in different folders, 
adding both to LD_LIBRARY_PATH - but that worked fine as well.)

Now my very simple example probably does something different than 
postgres, so that the problem doesn't appear there. But since it seems 
possible to do this with musl in principle, it should be possible to do 
it differently in postgres to make it work, too.

Any ideas?

Best,
Wolfgang

Вложения

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Andrew Dunstan

Дата:

16 марта, 22:49:27

On Sat, Mar 16, 2024 at 11:55 AM Wolfgang Walther <walther@technowledgy.de> wrote:

Tom Lane:
> Why do you think this is our bug and not musl's?

Because I tried to reproduce with musl directly with a very simple
example as mentioned in:

> To rule out a musl bug, I also put together a very simple test-case of an executable loading liba with dlopen(), which depends on libb and then constructing the same scenario with LD_LIBRARY_PATH. This works fine when compiled with glibc and musl, too. Thus, I believe the problem to be somewhere in how postgres loads those libraries.

My test case looked like the attached. To compile it with musl via
Dockerfile:

docker build . -t musl-dlopen && docker run --rm musl-dlopen

a.c/a.h is equivalent to libpqwalreceiver and b.c/b.h to libpq.

This works fine with both musl and glibc.

(Note: I also tried putting liba.so and libb.so in different folders,
adding both to LD_LIBRARY_PATH - but that worked fine as well.)

Now my very simple example probably does something different than
postgres, so that the problem doesn't appear there. But since it seems
possible to do this with musl in principle, it should be possible to do
it differently in postgres to make it work, too.

Any ideas?

On Alpine Linux, which uses musl libc, you have to run `make install` before you can run `make check`. Have you tried that?

(Note to self: need a new Alpine buildfarm member)

cheers

andrew

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

16 марта, 23:00:28

Andrew Dunstan:
> On Alpine Linux, which uses musl libc, you have to run `make install` 
> before you can run `make check`. Have you tried that?

I can see how that could work around the problem, because the library 
would already be available in the default library path / rpath and 
LD_LIBRARY_PATH would not be needed.

However, this would only be a workaround for the specific case of 
running the regression tests, not a solution. Using LD_LIBRARY_PATH, as 
documented, would still not be possible.

In my case, I am just using docker images with Alpine to easily 
reproduce the problem. I am building with NixOS / nixpkgs' pkgsMusl. The 
order of check and install phases can't be changed there, AFAICT. The 
workaround I use right now is to temporarily patch rpath of the postgres 
binary - this will be reset during the install phase anyway. This works, 
but again is not a real solution.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

16 марта, 23:10:56

Andrew Dunstan <andrew@dunslane.net> writes:
> On Alpine Linux, which uses musl libc, you have to run `make install`
> before you can run `make check`. Have you tried that?

We have the same situation on macOS.  There, it seems to be the result
of a "security feature" that strips DYLD_LIBRARY_PATH from the process
environment when make executes a shell.  There's not much we can do
about that, and I suspect there is not much we can do about musl's
behavior either.  (I am not a fan of proposals to modify the binaries
under test, because then you are not testing what you intend to
install.)

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

16 марта, 23:19:34

On Sun, Mar 17, 2024 at 4:56 AM Wolfgang Walther
<walther@technowledgy.de> wrote:
> Any ideas?

I'd look into whether there is a difference in the rules it uses for
deciding not to trust LD_LIBRARY_PATH, which seems to be around here
somewhere:

https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1812

I wonder if you can break into an affected program and check out the
magic there.  FWIW on MacOS something equivalent happens at the moment
we execute a shell, because the system shell is 'code signed' and that
OS treats signed stuff similar to setuid binaries for this purpose
(IIRC setting SHELL to point to a suitable unsigned shell could work
around the problem there?)

Another interesting thing that came up when I googled musl/glibc
differences -- old but looks plausibly still true (not that I expect
our code to be modifying that stuff in place, just something to
check):

https://www.openwall.com/lists/musl/2014/08/31/14

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

16 марта, 23:21:55

Tom Lane:
> We have the same situation on macOS.  There, it seems to be the result
> of a "security feature" that strips DYLD_LIBRARY_PATH from the process
> environment when make executes a shell.

I'm not sure whether this explanation is sufficient for the musl case, 
because LD_LIBRARY_PATH does make a difference: The direct dependency 
(libpqwalreceiver.so) can still be found if it's moved elsewhere and 
LD_LIBRARY_PATH points at it. So clearly the LD_LIBRARY_PATH variable is 
still set after make executed the shell - it's just not in effect on the 
*indirect* dependency (libpq.so) anymore.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

16 марта, 23:31:10

Thomas Munro:
> I'd look into whether there is a difference in the rules it uses for
> deciding not to trust LD_LIBRARY_PATH, which seems to be around here
> somewhere:
> 
> https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1812

Yeah, I have been looking at that, too. I had also experimented a bit 
with setuid/setgid for that matter, but that didn't lead anywhere, yet. 
I'm not 100% sure, but I think this would also not match my other 
observation, that LD_LIBRARY_PATH does work for libpqwalreceiver (direct 
dep), but not libpq (indirect dep).

> Another interesting thing that came up when I googled musl/glibc
> differences -- old but looks plausibly still true (not that I expect
> our code to be modifying that stuff in place, just something to
> check):
> 
> https://www.openwall.com/lists/musl/2014/08/31/14

To me, this seems very much like what could happen - it matches all my 
observations, so far. But I can't tell how likely that is, not knowing 
much of the postgres code.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

16 марта, 23:54:52

On Sun, Mar 17, 2024 at 9:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Another interesting thing that came up when I googled musl/glibc
> differences -- old but looks plausibly still true (not that I expect
> our code to be modifying that stuff in place, just something to
> check):
>
> https://www.openwall.com/lists/musl/2014/08/31/14

Hmm, that does mention setproctitle, and our ps_status.c does indeed
clobber some stuff in that region (in fact our ps_status.c is likely
derived from the setproctitle() function from sendmail AFAICT).  But
that's in our "backend" server processes, unlike the problems we have
on Macs... oh but you're failing to load libpqwalreceiver.so which
makes some sense for the backend hypothesis.  What happens if you hack
ps_status.c to use PS_USE_NONE?

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

17 марта, 12:44:03

Thomas Munro:
> Hmm, that does mention setproctitle, and our ps_status.c does indeed
> clobber some stuff in that region (in fact our ps_status.c is likely
> derived from the setproctitle() function from sendmail AFAICT).  But
> that's in our "backend" server processes, unlike the problems we have
> on Macs... oh but you're failing to load libpqwalreceiver.so which
> makes some sense for the backend hypothesis.  What happens if you hack
> ps_status.c to use PS_USE_NONE?

Nailed it. PS_USE_NONE fixes it.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Christophe Pettus

Дата:

17 марта, 13:33:52

> On Mar 17, 2024, at 02:44, Wolfgang Walther <walther@technowledgy.de> wrote:
>
> Nailed it. PS_USE_NONE fixes it.

Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in
pg_status.cto catch this case?  It seems wrong that the current test passes a case that doesn't actually work.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

17 марта, 16:11:19

Christophe Pettus:
> Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in
pg_status.cto catch this case?  It seems wrong that the current test passes a case that doesn't actually work.
 

The missing macro is on purpose and unlikely to change: 
https://openwall.com/lists/musl/2013/03/29/13

I also found this thread, which discusses exactly our case: 
https://www.openwall.com/lists/musl/2022/08/17/1

Some quotes from that thread:

> I understand that what Postgres et al are doing is a nasty hack.

And:

> Applications that *really* want setproctitle type functionality can
> presumably do something like re-exec themselves with a suitably large
> argv[0] to give them safe space to overwrite with their preferred
> message, rather than UB trying to relocate the environment (and auxv?
> how? they can't tell libc they moved it) to some other location.

Could that be a more portable way of doing this?

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Christophe Pettus

Дата:

17 марта, 18:44:45

> On Mar 17, 2024, at 06:11, Wolfgang Walther <walther@technowledgy.de> wrote:
> The missing macro is on purpose and unlikely to change: https://openwall.com/lists/musl/2013/03/29/13

Indeed.

> I also found this thread, which discusses exactly our case: https://www.openwall.com/lists/musl/2022/08/17/1

While getting proper setproctitle functionality on musl would be great, my goal was more modest: Have it correctly set
PS_USE_NONEwhen compiling against musl.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Andrew Dunstan

Дата:

17 марта, 23:33:40

On Sun, Mar 17, 2024 at 11:45 AM Christophe Pettus <xof@thebuild.com> wrote:

> On Mar 17, 2024, at 06:11, Wolfgang Walther <walther@technowledgy.de> wrote:
> The missing macro is on purpose and unlikely to change: https://openwall.com/lists/musl/2013/03/29/13

Indeed.

That seems a little shortsighted. If other libc implementations find it appropriate to have similar macros why should they be different?

> I also found this thread, which discusses exactly our case: https://www.openwall.com/lists/musl/2022/08/17/1

While getting proper setproctitle functionality on musl would be great, my goal was more modest: Have it correctly set PS_USE_NONE when compiling against musl.

One simple thing might be for us to enclose the block in ps_status.c at lines 49-59 in #ifndef PS_USE_NONE/#endif. Then you could compile with -DPS_USE_NONE in your CPPFLAGS.

cheers

andrew

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Christophe Pettus

Дата:

18 марта, 00:05:54

> On Mar 17, 2024, at 13:33, Andrew Dunstan <andrew@dunslane.net> wrote:
>
> That seems a little shortsighted. If other libc implementations find it appropriate to have similar macros why should
theybe different? 

It's a philosophical argument against checking for particular libc implementations instead of particular features.  I'm
notunsympathetic to that argument, but AFAICT there's no clean way of checking for this by examining feature #defines.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

18 марта, 02:20:26

On Mon, Mar 18, 2024 at 10:06 AM Christophe Pettus <xof@thebuild.com> wrote:
> > On Mar 17, 2024, at 13:33, Andrew Dunstan <andrew@dunslane.net> wrote:
> > That seems a little shortsighted. If other libc implementations find it appropriate to have similar macros why
shouldthey be different? 
>
> It's a philosophical argument against checking for particular libc implementations instead of particular features.
I'mnot unsympathetic to that argument, but AFAICT there's no clean way of checking for this by examining feature
#defines.

I like their philosophy, and I like their FAQ.  Down with software
monocultures, up with standards and cooperation.  But anyway...

I wondered for a moment if there could be a runtime way to test if
we'd broken stuff, but it seems hard without a way to ask the runtime
linker for its search path to see if it has any pointers into the
environment.  We can't, that "env_path" variable in dynlink.c is not
accessible to us by any reasonable means.  And yeah, this whole thing
is a nasty invasive hack that harks back to the 80s I assume, before
many systems provided a clean way to do this (and some never did)...

Hmm, I can think of one dirty hack on top of our existing dirty hack
that might work.  I feel bad typing this out, but here goes nothing:
In save_ps_display_args(), we compute end_of_area, stepping past
contiguous arguments and environment variables.  But what if we
terminated that if we saw an environment entry beginning "LD_"?  We'd
still feel free to clobber the memory up to that point (rather than
limiting ourselves to the argv space, another more conservative choice
that might truncate a few PS display messages, or maybe not given the
typical postmaster arguments, maye that'd work out OK), and we'd still
copy the environment to somewhere new, but anything like "LD_XXX" that
the runtime linker might have stashed a pointer to would remain valid.
/me runs away and hides

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Christophe Pettus

Дата:

18 марта, 12:33:45

> On Mar 17, 2024, at 16:20, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> We'd
> still feel free to clobber the memory up to that point (rather than
> limiting ourselves to the argv space, another more conservative choice
> that might truncate a few PS display messages, or maybe not given the
> typical postmaster arguments, maye that'd work out OK), and we'd still
> copy the environment to somewhere new, but anything like "LD_XXX" that
> the runtime linker might have stashed a pointer to would remain valid.
> /me runs away and hides

It doesn't lack for bravery!  (And I have to just comment that the linker storing pointers into that space as a way of
findinglibraries... well, that doesn't get them the moral high ground for nasty hacks.) 

I'm comfortable with "if you are using musl, you don't get the ps messages" as a first solution, if we can find a way
ofdetecting a libc that passes the other tests but doesn't support any of the existing hacks.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

18 марта, 16:25:33

On Mon, Mar 18, 2024 at 10:34 PM Christophe Pettus <xof@thebuild.com> wrote:
> > On Mar 17, 2024, at 16:20, Thomas Munro <thomas.munro@gmail.com> wrote:
> > We'd
> > still feel free to clobber the memory up to that point (rather than
> > limiting ourselves to the argv space, another more conservative choice
> > that might truncate a few PS display messages, or maybe not given the
> > typical postmaster arguments, maye that'd work out OK), and we'd still
> > copy the environment to somewhere new, but anything like "LD_XXX" that
> > the runtime linker might have stashed a pointer to would remain valid.
> > /me runs away and hides
>
> It doesn't lack for bravery!  (And I have to just comment that the linker storing pointers into that space as a way
offinding libraries... well, that doesn't get them the moral high ground for nasty hacks.) 

FWIW here is a blind patch if someone wants to try it out... no musl here.

(Hmm, I think it's not that unreasonable on their part to assume the
initial environment is immutable if their implementation doesn't
mutate it, and our doing so is undeniably UB; surprising, maybe, given
that the technique works on that other popular brand of C library on
that kind of kernel, not to mention dozens of old Unixen of yore...
The real solution may turn out to be the prctl() described in that
thread, where you can tell the kernel where you're planning to move
your argv and it can find it to show ps/top, but I checked and you
still can't call that without special privileges, so maybe someone
should get onto complaining to kernel hackers about that?  That thread
is wrong about us clobbering auxv BTW, we're not animals!)

Вложения

0001-Don-t-clobber-LD_-environment-variables.patch

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

18 марта, 17:23:32

Thomas Munro <thomas.munro@gmail.com> writes:
> (Hmm, I think it's not that unreasonable on their part to assume the
> initial environment is immutable if their implementation doesn't
> mutate it, and our doing so is undeniably UB; surprising, maybe, given
> that the technique works on that other popular brand of C library on
> that kind of kernel, not to mention dozens of old Unixen of yore...

Does their implementation also ignore the effects of putenv() or
setenv() on LD_LIBRARY_PATH?  They have no moral high ground
whatsoever if that's the case.  But if it doesn't, an alternative
route to a solution could be to scan the original environment, strdup
and putenv each entry to move it to freshly malloc'd space, and
then reclaim the old environment area.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

19 марта, 00:17:36

On Tue, Mar 19, 2024 at 3:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > (Hmm, I think it's not that unreasonable on their part to assume the
> > initial environment is immutable if their implementation doesn't
> > mutate it, and our doing so is undeniably UB; surprising, maybe, given
> > that the technique works on that other popular brand of C library on
> > that kind of kernel, not to mention dozens of old Unixen of yore...
>
> Does their implementation also ignore the effects of putenv() or
> setenv() on LD_LIBRARY_PATH?  They have no moral high ground
> whatsoever if that's the case.  But if it doesn't, an alternative
> route to a solution could be to scan the original environment, strdup
> and putenv each entry to move it to freshly malloc'd space, and
> then reclaim the old environment area.

Yes, the musl linker/loader ignores putenv()/setenv() changes to
LD_LIBRARY_PATH after process start (that is, changes only effect the
search path when injected into a new program with exec*()).  As does
glibc, it's just that it captures by copy instead of reference
(according to one of the links above, I didn't check the source).  So
setenv() has no effect on dlopen() in *this* program, and using putenv
in that way won't help.  We simply can't move the value of
LD_LIBRARY_PATH (though my patch could be a little sneakier and steal
all the bytes right up to the = sign to get more space for our
message!).

One way to tell if a copy has been made is to trace a program that does:

        getenv("LD_LIBRARY_PATH")[2] = 'X';
        dlopen("foo.so", RTLD_NOW | RTLD_GLOBAL);

... when run with LD_LIBRARY_PATH set to /asdf.  On FreeBSD I see it
tries to open "/aXdf...", so now I know that FreeBSD also captures it
by reference like musl.  But we don't use the clobber trick on
FreeBSD, it has a proper setproctitle() function that knows how to
negotiate with the kernel, so it doesn't matter.  It also ignores
changes made with setent()/putenv(), because those create fresh
entries but leave the initial environment strings untouched.

Solaris also ignores changes made after startup (it's in the dlopen
man page), and from a very quick look at its ld_lib_setup() I think it
achieved that with a copy.  I believe its ancestor SunOS 4 invented
all of these conventions (and the mmap/virtual memory concepts they
rode in on), later nailed down to some degree in the System V ABI and
very widely adopted, but I don't see anything in the latter that
specifically addresses this point, eg LD_LIBRARY copy vs reference and
interaction with dlopen() (perhaps I didn't look hard enough).  I'm
not sure what else you can point to to make strong claims about this
stuff, but I bet every system ignores changes after startup, it's just
that they found two ways to achieve that.  POSIX says of dlopen that
the "file [argument] is used in an implementation-defined manner", and
of environ that we're welcome to swap a whole new environ, but doesn't
seem to tell us anything about the one that is replaced (who owns it?
is the initial one set up at execution time special? etc).  The line
banning manipulation of the pointers environ refers to doesn't exactly
describe what we're doing (we're manipulating the strings pointed to
by the *previous* environ).  UB.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

19 марта, 01:48:50

On Tue, Mar 19, 2024 at 10:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... (though my patch could be a little sneakier and steal
> all the bytes right up to the = sign to get more space for our
> message!).

Here's one like that.  No musl here -- does this work Wolfgang?  Do we
think it's generous enough with space in practice that we could just
always do this for __linux__ systems without anyone noticing (ie
including glibc users)?  Should we be more specific about which LD_*
variables?  Do people not doing hacking/testing ever really set those,
eg on production servers?  This code path was once used by up to a
dozen or so OSes but they're all dead, only Linux, Solaris and macOS
left, and I don't have any reason to think they suffer from this
problem and Macs don't even follow the SysV LD_ naming convention,
hence gating on Linux.

Вложения

v2-0001-Don-t-clobber-LD_-environment-variables.patch

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

20 марта, 02:54:49

On Tue, Mar 19, 2024 at 11:48:50AM +1300, Thomas Munro wrote:
> On Tue, Mar 19, 2024 at 10:17 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > ... (though my patch could be a little sneakier and steal
> > all the bytes right up to the = sign to get more space for our
> > message!).
> 
> Here's one like that.  No musl here -- does this work Wolfgang?  Do we
> think it's generous enough with space in practice that we could just
> always do this for __linux__ systems without anyone noticing (ie
> including glibc users)?  Should we be more specific about which LD_*
> variables?  Do people not doing hacking/testing ever really set those,
> eg on production servers?  This code path was once used by up to a
> dozen or so OSes but they're all dead, only Linux, Solaris and macOS
> left, and I don't have any reason to think they suffer from this
> problem and Macs don't even follow the SysV LD_ naming convention,
> hence gating on Linux.

So this would truncate the process title on all Linux that have an LD_
environment entry, even those without musl?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

20 марта, 04:12:54

On Wed, Mar 20, 2024 at 12:54 PM Bruce Momjian <bruce@momjian.us> wrote:
> So this would truncate the process title on all Linux that have an LD_
> environment entry, even those without musl?

Yep.  How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
postmaster near you?  You'd always get that much, plus as much of
/proc/XXX/environ as we can find before you reach LD_XXX=, which on a
typical system would, I guess, usually be never.  If it's a problem
you could try to arrange for LD_ XXX to come later in environ[].  What
I observe is that they seem to get copied in backwards, wrt the
environment exported by the parent, so if you set DUMMY=XXXXXXXX just
before starting the process it'll make sacrificial space in the right
place (but I'm not sure where that effect is coming from so don't
quote me).

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

20 марта, 04:27:23

On Wed, Mar 20, 2024 at 02:12:54PM +1300, Thomas Munro wrote:
> On Wed, Mar 20, 2024 at 12:54 PM Bruce Momjian <bruce@momjian.us> wrote:
> > So this would truncate the process title on all Linux that have an LD_
> > environment entry, even those without musl?
> 
> Yep.  How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
> postmaster near you?  You'd always get that much, plus as much of

    $ cat /proc/2000/cmdline |wc -c
    30

> /proc/XXX/environ as we can find before you reach LD_XXX=, which on a
> typical system would, I guess, usually be never.  If it's a problem
> you could try to arrange for LD_ XXX to come later in environ[].  What
> I observe is that they seem to get copied in backwards, wrt the
> environment exported by the parent, so if you set DUMMY=XXXXXXXX just
> before starting the process it'll make sacrificial space in the right
> place (but I'm not sure where that effect is coming from so don't
> quote me).

I am just cautious about changing behavior on our most common platform
for a libc library I have never heard of.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

20 марта, 04:53:08

Thomas Munro <thomas.munro@gmail.com> writes:
> Yep.  How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
> postmaster near you?  You'd always get that much, plus as much of
> /proc/XXX/environ as we can find before you reach LD_XXX=, which on a
> typical system would, I guess, usually be never.

I'd be happier about this if the target pattern were more restrictive.
Is there reason to think that musl keeps a pointer to anything besides
LD_LIBRARY_PATH?

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

20 марта, 05:01:09

On Wed, Mar 20, 2024 at 2:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Yep.  How long is /proc/XXX/cmdline (check with wc -c /proc/...) in a
> > postmaster near you?  You'd always get that much, plus as much of
> > /proc/XXX/environ as we can find before you reach LD_XXX=, which on a
> > typical system would, I guess, usually be never.
>
> I'd be happier about this if the target pattern were more restrictive.
> Is there reason to think that musl keeps a pointer to anything besides
> LD_LIBRARY_PATH?

Also LD_PRELOAD:

https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1824

Yeah we could do just those two.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

20 марта, 05:03:36

Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Mar 20, 2024 at 2:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd be happier about this if the target pattern were more restrictive.
>> Is there reason to think that musl keeps a pointer to anything besides
>> LD_LIBRARY_PATH?

> Also LD_PRELOAD:
> https://github.com/bminor/musl/blob/7ada6dde6f9dc6a2836c3d92c2f762d35fd229e0/ldso/dynlink.c#L1824
> Yeah we could do just those two.

+1 for stopping only at one of those two names.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

20 марта, 05:12:49

On Wed, Mar 20, 2024 at 2:27 PM Bruce Momjian <bruce@momjian.us> wrote:
> I am just cautious about changing behavior on our most common platform
> for a libc library I have never heard of.

Yeah, I hear you.  I don't have a dog in this race, I just like
retro-computing mysteries and arguments about the meaning of
standards...  That said I'm pretty sure no one should be running
production PostgreSQL systems held together by LD_LIBRARY_PATH, and if
they do, it looks like systemd/rc.d scripts set a bunch of PG_OOM_BLAH
stuff just before the start the cluster that would provide extra chaff
in front of LD_XXX stuff defined earlier, and then pg_ctl inserts even
more, and even if they don't use any of that stuff and just run the
postmaster directly with some other homegrown tooling, now we're down
to very niche/expert scenarios where it should be acceptable to point
to this thread that says "try setting an extra dummy variable after
you set LD_XXX!".

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

20 марта, 07:39:34

On Wed, Mar 20, 2024 at 3:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> +1 for stopping only at one of those two names.

Here's one like that for Wolfgang to test on musl.

Вложения

v3-0001-Don-t-clobber-LD_LIBRARY_PATH-or-LD_PRELOAD.patch

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Peter Eisentraut

Дата:

20 марта, 09:02:48

On 17.03.24 11:33, Christophe Pettus wrote:
>> On Mar 17, 2024, at 02:44, Wolfgang Walther <walther@technowledgy.de> wrote:
>>
>> Nailed it. PS_USE_NONE fixes it.
> 
> Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in
pg_status.cto catch this case?  It seems wrong that the current test passes a case that doesn't actually work.
 
> 

We could turn it around and do

#if defined(__linux__)
#if defined(__GLIBC__) || defined(__UCLIBC__ )
#define PS_USE_CLOBBER_ARGV
#else
#define PS_USE_NONE
#endif
#endif

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Andrew Dunstan

Дата:

20 марта, 10:16:14

On Wed, Mar 20, 2024 at 2:03 AM Peter Eisentraut <peter@eisentraut.org> wrote:

On 17.03.24 11:33, Christophe Pettus wrote:
>> On Mar 17, 2024, at 02:44, Wolfgang Walther <walther@technowledgy.de> wrote:
>>
>> Nailed it. PS_USE_NONE fixes it.
>
> Given the musl (still?) does not define a preprocessor macro specific to it, is there a way of improving the test in pg_status.c to catch this case? It seems wrong that the current test passes a case that doesn't actually work.
>

We could turn it around and do

#if defined(__linux__)
#if defined(__GLIBC__) || defined(__UCLIBC__ )
#define PS_USE_CLOBBER_ARGV
#else
#define PS_USE_NONE
#endif
#endif

I like it. Neat and minimal.

cheers

andrew

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

20 марта, 12:39:20

Thomas Munro:
> On Wed, Mar 20, 2024 at 3:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> +1 for stopping only at one of those two names.
> 
> Here's one like that for Wolfgang to test on musl.

Works fine.

Peter Eisentraut:
> We could turn it around and do
> 
> #if defined(__linux__)
> #if defined(__GLIBC__) || defined(__UCLIBC__ )
> #define PS_USE_CLOBBER_ARGV
> #else
> #define PS_USE_NONE
> #endif
> #endif

This works as well.

I also put together a PoC of what was mentioned in musl's mailing list: 
Instead of clobbering environ at all, exec yourself again with padded 
argv0. This works, too. Attached.

Best,

Wolfgang

Вложения

0001-ps-status-exec.patch

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

20 марта, 16:35:51

On Wed, Mar 20, 2024 at 10:39:20AM +0100, Wolfgang Walther wrote:
> Peter Eisentraut:
> > We could turn it around and do
> > 
> > #if defined(__linux__)
> > #if defined(__GLIBC__) || defined(__UCLIBC__ )
> > #define PS_USE_CLOBBER_ARGV
> > #else
> > #define PS_USE_NONE
> > #endif
> > #endif
> 
> This works as well.

Yes, I prefer this.  I am worried the environ hackery will bite us
someday and the cause will be hard to find.

> I also put together a PoC of what was mentioned in musl's mailing list:
> Instead of clobbering environ at all, exec yourself again with padded argv0.
> This works, too. Attached.

It is hard to imagine why we would add an extra exec on every Linux
server start for this.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

20 марта, 16:37:30

On Wed, Mar 20, 2024 at 09:35:51AM -0400, Bruce Momjian wrote:
> On Wed, Mar 20, 2024 at 10:39:20AM +0100, Wolfgang Walther wrote:
> > I also put together a PoC of what was mentioned in musl's mailing list:
> > Instead of clobbering environ at all, exec yourself again with padded argv0.
> > This works, too. Attached.
> 
> It is hard to imagine why we would add an extra exec on every Linux
> server start for this.

I guess we could conditionally exec only if we find we must, but then
such exec cases would be rare and rarely tested.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

20 марта, 17:24:58

Bruce Momjian:
> I guess we could conditionally exec only if we find we must, but then
> such exec cases would be rare and rarely tested.

I think you might be seriously underestimating how often musl is used. 
Alpine Linux uses musl and is very widespread in the container world 
because of smaller image size.

The library/postgres docker image has been pulled about 8 billion times 
since 2014 [1]. While we can't really tell how many of those pulled the 
alpine variant of the image, comparing the alpine [2] and ubuntu/debian 
[3,4] base images gives a rough estimate of >50% using alpine in general.

This is certainly not rare.

But yeah, buildfarm coverage for musl would be good, I agree. Maybe even 
directly in CI?

Best,

Wolfgang

[1]: https://hub.docker.com/v2/repositories/library/postgres
[2]: https://hub.docker.com/v2/repositories/library/alpine
[3]: https://hub.docker.com/v2/repositories/library/ubuntu
[4]: https://hub.docker.com/v2/repositories/library/debian

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Laurenz Albe

Дата:

20 марта, 17:28:53

On Wed, 2024-03-20 at 15:24 +0100, Wolfgang Walther wrote:
> I think you might be seriously underestimating how often musl is used.
> Alpine Linux uses musl and is very widespread in the container world
> because of smaller image size

The last time I looked, its collation support didn't work at all...

Yours,
Laurenz Albe

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

20 марта, 17:36:38

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Wed, 2024-03-20 at 15:24 +0100, Wolfgang Walther wrote:
>> I think you might be seriously underestimating how often musl is used. 
>> Alpine Linux uses musl and is very widespread in the container world 
>> because of smaller image size

> The last time I looked, its collation support didn't work at all...

I think the same is true of some of the BSDen, so that's not a
large impediment to us.  But in any case, if somebody wants
Alpine or musl to be considered a supported platform, they'd
best step up and run a buildfarm animal.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

20 марта, 17:37:07

Bruce Momjian:
> On Wed, Mar 20, 2024 at 10:39:20AM +0100, Wolfgang Walther wrote:
>> Peter Eisentraut:
>>> We could turn it around and do
>>>
>>> #if defined(__linux__)
>>> #if defined(__GLIBC__) || defined(__UCLIBC__ )
>>> #define PS_USE_CLOBBER_ARGV
>>> #else
>>> #define PS_USE_NONE
>>> #endif
>>> #endif
>>
>> This works as well.
> 
> Yes, I prefer this.  I am worried the environ hackery will bite us
> someday and the cause will be hard to find.

Well, the environ hackery already bit and it sure was hard to find. But 
this approach would still clobber environ happily... which is undefined 
behavior. But certainly the opt-in to known-to-be-good libc variants is 
a better approach than before.

Between this and "stop clobbering at LD_LIBRARY_PATH", I prefer the 
latter, though.

>> I also put together a PoC of what was mentioned in musl's mailing list:
>> Instead of clobbering environ at all, exec yourself again with padded argv0.
>> This works, too. Attached.
> 
> It is hard to imagine why we would add an extra exec on every Linux
> server start for this.

Would this be a problem? For a running server this would happen only 
once when the postmaster starts up, AFAICT.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

20 марта, 17:43:05

Laurenz Albe:
> On Wed, 2024-03-20 at 15:24 +0100, Wolfgang Walther wrote:
>> I think you might be seriously underestimating how often musl is used.
>> Alpine Linux uses musl and is very widespread in the container world
>> because of smaller image size
> 
> The last time I looked, its collation support didn't work at all...

IIUC, using icu collations should work. I didn't extensively try, 
though. But yeah - musl itself doesn't do it, knowingly so.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

20 марта, 17:46:39

Tom Lane:
> But in any case, if somebody wants
> Alpine or musl to be considered a supported platform, they'd
> best step up and run a buildfarm animal.

Yeah, I was already thinking about that. But I guess we'd need to first 
make the test suite pass on musl. i.e. $subject, but there are also some 
smaller issues after that, before the full test suite will pass.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Peter Eisentraut

Дата:

20 марта, 21:37:08

On 20.03.24 15:37, Wolfgang Walther wrote:
>> It is hard to imagine why we would add an extra exec on every Linux
>> server start for this.
> 
> Would this be a problem? For a running server this would happen only 
> once when the postmaster starts up, AFAICT.

I wonder if it would cause issues with systemd or similar, if the PID of 
the running process is not the one that systemd started.  If so, there 
is probably a workaround, but it would have to be analyzed.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

20 марта, 22:10:01

Peter Eisentraut:
>> Would this be a problem? For a running server this would happen only 
>> once when the postmaster starts up, AFAICT.
> 
> I wonder if it would cause issues with systemd or similar, if the PID of 
> the running process is not the one that systemd started.  If so, there 
> is probably a workaround, but it would have to be analyzed.

I don't think that exec even creates a new PID. The current process is 
replaced, so the PID stays the same.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

20 марта, 22:18:39

On Wed, Mar 20, 2024 at 03:24:58PM +0100, Wolfgang Walther wrote:
> Bruce Momjian:
> > I guess we could conditionally exec only if we find we must, but then
> > such exec cases would be rare and rarely tested.
> 
> I think you might be seriously underestimating how often musl is used.
> Alpine Linux uses musl and is very widespread in the container world because
> of smaller image size.
> 
> The library/postgres docker image has been pulled about 8 billion times
> since 2014 [1]. While we can't really tell how many of those pulled the
> alpine variant of the image, comparing the alpine [2] and ubuntu/debian
> [3,4] base images gives a rough estimate of >50% using alpine in general.

Uh, what is the current behavior of Postgres on musl?  It just fails if
the process title is longer than argv[0] plus the environment space to
the LD_ environment variable, and then linking fails for certain
extensions?  If there are many downloads, why would we only be getting
this report now?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

20 марта, 22:29:21

Bruce Momjian:
>> The library/postgres docker image has been pulled about 8 billion times
>> since 2014 [1]. While we can't really tell how many of those pulled the
>> alpine variant of the image, comparing the alpine [2] and ubuntu/debian
>> [3,4] base images gives a rough estimate of >50% using alpine in general.
> 
> Uh, what is the current behavior of Postgres on musl?  It just fails if
> the process title is longer than argv[0] plus the environment space to
> the LD_ environment variable, and then linking fails for certain
> extensions?  If there are many downloads, why would we only be getting
> this report now?

The process title works fine. It's just the way how space is cleared for 
the process title, that is causing problems elsewhere.

The thing that is broken when running postgres on alpine/musl is, to put 
libpq in a custom path and use LD_LIBRARY_PATH to find it when loading 
libpqwalreceiver (+ some contrib modules). Nobody does that, especially 
not in a container environment where postgres is likely the only thing 
running in that container, so there is no point in using any custom 
library paths or anything - the image is built once and made to work, 
and everybody else is just using that working image.

The much more practical problem is that the test suite doesn't run, 
because it makes use of LD_LIBRARY_PATH for that purpose. In the past, 
the packagers for alpine only disabled the failing tests, but IIRC they 
have given up on that and just disabled the whole test suite by now.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

20 марта, 22:42:16

On Wed, Mar 20, 2024 at 08:29:21PM +0100, Wolfgang Walther wrote:
> Bruce Momjian:
> > > The library/postgres docker image has been pulled about 8 billion times
> > > since 2014 [1]. While we can't really tell how many of those pulled the
> > > alpine variant of the image, comparing the alpine [2] and ubuntu/debian
> > > [3,4] base images gives a rough estimate of >50% using alpine in general.
> > 
> > Uh, what is the current behavior of Postgres on musl?  It just fails if
> > the process title is longer than argv[0] plus the environment space to
> > the LD_ environment variable, and then linking fails for certain
> > extensions?  If there are many downloads, why would we only be getting
> > this report now?
> 
> The process title works fine. It's just the way how space is cleared for the
> process title, that is causing problems elsewhere.
> 
> The thing that is broken when running postgres on alpine/musl is, to put
> libpq in a custom path and use LD_LIBRARY_PATH to find it when loading
> libpqwalreceiver (+ some contrib modules). Nobody does that, especially not
> in a container environment where postgres is likely the only thing running
> in that container, so there is no point in using any custom library paths or
> anything - the image is built once and made to work, and everybody else is
> just using that working image.
> 
> The much more practical problem is that the test suite doesn't run, because
> it makes use of LD_LIBRARY_PATH for that purpose. In the past, the packagers
> for alpine only disabled the failing tests, but IIRC they have given up on
> that and just disabled the whole test suite by now.

Thanks, that is very clear.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

21 марта, 01:26:26

On Thu, Mar 21, 2024 at 2:35 AM Bruce Momjian <bruce@momjian.us> wrote:
> Yes, I prefer this.  I am worried the environ hackery will bite us
> someday and the cause will be hard to find.

Some speculation on how widespread this trick is: as mentioned, it
seems to come from sendmail (I didn't spend the time to find a repo
that has ancient versions, so this is someone's random snapshot repo,
but it's referring to pretty old dead systems):

https://github.com/Distrotech/sendmail/blob/master/sendmail/conf.c#L2436

That was once ubiquitous, back in the day.  One of the most widespread
envon-clobberers these days must be openssh:

https://github.com/openssh/openssh-portable/blob/86bdd3853f4d32c85e295e6216a2fe0953ad93f0/openbsd-compat/setproctitle.c#L69

And funnily enough, googling LD_LIBRARY_PATH and openssh brings up a
few unsolved/unanswered questions about mysterious breakage (though I
didn't see any that mentioned musl by name and there could be other
explanations, *shrug*).

There is also Chromium/Chrome:

https://github.com/chromium/chromium/blob/main/base/process/set_process_title_linux.cc#L136

That code has some interesting commentary and points to a commit in
Linux which mentions setproctitle() and making sure it still works
(funny because setproctitle() isn't a function in any standard
userspace library AFAIK so I guess it just means the trick in
general), and also mentions the failure of attempts to get an official
way to do this negotiated between the relevant projects.

Of course we have to distinguish between the basic argv[] clobbering
trick which is barely even a trick, and the more advanced environ
stealing trick which confuses musl.  A very widespread user of the
basic trick would be systemd, which tries to use the prctl() if it can
to get a much bigger window of memory to write on, but otherwise falls
back to accepting a small one.  I guess we'd do the same if we could,
ie if a future Linux version didn't require CAP_SYS_RESOURCES to do
it:

https://github.com/systemd/systemd/blob/8810b782a17050d7f7a870b975f09e8a690b7bea/src/basic/argv-util.c

Anyway, it looks like there is plenty of will out there to keep this
working, albeit in a weird semi-supported state whose cruftiness is
undeniable.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

21 марта, 23:16:46

Thomas Munro:
> Of course we have to distinguish between the basic argv[] clobbering
> trick which is barely even a trick, and the more advanced environ
> stealing trick which confuses musl.

Right. The latter not only confuses musl, but also makes 
/proc/<pid>/environ return garbage. This is also mentioned at the bottom 
of main.c, which has a workaround for the specific case of UBSan 
depending on that. This is kind of funny: Because we are relying on 
undefined behavior regarding the modification of environ, we need a 
workaround for the "UndefinedBehaviorSanitizer" - I guess by failing 
without this workaround, it wanted to tell us something..

This happens on glibc, too.

So summarizing:

1. The simple approach is to use PS_USE_CLOBBER_ARGV on Linux only for 
glibc and other known-to-be-good-and-identifiable libc variants, 
otherwise default to PS_USE_NONE. This will not only keep the problem 
for /proc/../environ for glibc users, but also disable ps status for 
musl entirely. Considering that probably the biggest use-case for musl 
is to run postgres in containers, it's quite likely to actually run more 
than just one cluster on a single machine. In this case... ps status 
would be especially handy to identify which cluster a process belongs to.

2. The next proposal was to stop clobbering environ once LD_LIBRARY_PATH 
/ LD_PRELOAD is found to keep those intact. This will keep ps status 
support on musl, which is good. But the /proc/.../environ problem will 
still be there, unchanged.

Both of those approaches rely on the undefined behavior of clobbering 
environ.

3. The logical consequence of this is, to stop clobbering environ and 
use only the available argv space. However, this will quickly leave us 
with a very small ps status buffer to work with, making the feature less 
useful. Note, that this could happen theoretically by starting postgres 
with the fewest arguments and environment possible, too. Not sure what 
the minimal buffer size is that could be achieved that way. The point 
is: The buffer size is not guaranteed at all.

4. The upstream (musl) suggestion of which I sent a PoC was to "exec 
yourself with a bigger argv". This works. I chose to pad argv0 with 
trailing slashes. Those can safely be stripped away again, because any 
argv0 which would come with a trailing slash to start with, would not be 
the current executable, but a directory - so would fail exec immediately 
anyway. This keeps /proc/.../environ intact and does not rely on 
undefined behavior. Additionally, we get a guaranteed ps buffer size of 
256, which is what we use on BSDs and Windows, too.

I wonder why we actually fall back to PS_USE_NONE by default.. and how 
much of that is related to the environment clobbering to start with? 
Could we even use the exec-approach as the fallback in all other cases 
except BSDs and Windows and get rid of PS_USE_NONE? Clobbering only argv 
sure seems way safer to do than what we do right now.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

walther@technowledgy.de

Дата:

21 марта, 23:30:00

Here is what we could with this:

> 2. The next proposal was to stop clobbering environ once LD_LIBRARY_PATH 
> / LD_PRELOAD is found to keep those intact.

We could backpatch this down to v12. This would be one step to make the 
test suite pass on Alpine Linux with musl and ultimately allow setting 
up a buildfarm animal for that.

It does not solve the /proc/.../environ problem, but at least keeps ps 
status working on musl as it did before, so not a regression.

> 4. The upstream (musl) suggestion of which I sent a PoC was to "exec 
> yourself with a bigger argv". 

We could do this in HEAD now ...

> Could we even use the exec-approach as the fallback in all other cases 
> except BSDs and Windows and get rid of PS_USE_NONE?

... and then remove PS_USE_NONE at the beginning of the v18 cycle.

This would give a bit more time for those "other systems", which were 
previously falling back PS_USE_NONE and would then clobber argv, too.

Opinions?

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

22 марта, 01:42:52

On Fri, Mar 22, 2024 at 9:30 AM <walther@technowledgy.de> wrote:
> > 4. The upstream (musl) suggestion of which I sent a PoC was to "exec
> > yourself with a bigger argv".
>
> We could do this in HEAD now ...

Just a thought: if we want to go this way, do we need a new exec call?
 We already control the initial exec in pg_ctl.c.

> > Could we even use the exec-approach as the fallback in all other cases
> > except BSDs and Windows and get rid of PS_USE_NONE?
>
> ... and then remove PS_USE_NONE at the beginning of the v18 cycle.
>
> This would give a bit more time for those "other systems", which were
> previously falling back PS_USE_NONE and would then clobber argv, too.

RIght.  It's unspecified by POSIX whether ps shows changes to those
strings (and there are systems that don't), but it can't hurt to do so
anyway, and it'd be better than having a PS_USE_NONE code path that is
untested.  I dimly recall that it turned out that PS_USE_NONE was
actually broken for a while without anyone noticing.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

22 марта, 02:02:47

Thomas Munro <thomas.munro@gmail.com> writes:
> Just a thought: if we want to go this way, do we need a new exec call?
>  We already control the initial exec in pg_ctl.c.

I'm resistant to assuming the postmaster is launched through pg_ctl.
systemd, for example, might well prefer not to do that, not to
mention all the troglodytes still using 1990s launch scripts.

A question that seems worth debating in this thread is how much
updating the process title is even worth nowadays.  It feels like
a hangover from before we had pg_stat_activity and other monitoring
support.  So I don't feel a huge need to support it on musl.
The previously-suggested patch to whitelist glibc and variants,
and otherwise fall back to PS_USE_NONE, seems like it might be
the appropriate amount of effort.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

22 марта, 02:20:11

On Fri, Mar 22, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The previously-suggested patch to whitelist glibc and variants,
> and otherwise fall back to PS_USE_NONE, seems like it might be
> the appropriate amount of effort.

What about meeting musl halfway: clobber argv, but only clobber
environ for the libcs known to tolerate that?  Then musl might see
truncation at 30-60 characters or whatever it is, but that's probably
enough to see your cluster_name and backend type/user name which is
pretty useful information.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Andrew Dunstan

Дата:

22 марта, 02:23:45

On Thu, Mar 21, 2024 at 7:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:
> Just a thought: if we want to go this way, do we need a new exec call?
> We already control the initial exec in pg_ctl.c.

I'm resistant to assuming the postmaster is launched through pg_ctl.
systemd, for example, might well prefer not to do that, not to
mention all the troglodytes still using 1990s launch scripts.

A question that seems worth debating in this thread is how much
updating the process title is even worth nowadays. It feels like
a hangover from before we had pg_stat_activity and other monitoring
support. So I don't feel a huge need to support it on musl.
The previously-suggested patch to whitelist glibc and variants,
and otherwise fall back to PS_USE_NONE, seems like it might be
the appropriate amount of effort.

cheers

andrew

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

22 марта, 02:30:52

Thomas Munro <thomas.munro@gmail.com> writes:
> On Fri, Mar 22, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The previously-suggested patch to whitelist glibc and variants,
>> and otherwise fall back to PS_USE_NONE, seems like it might be
>> the appropriate amount of effort.

> What about meeting musl halfway: clobber argv, but only clobber
> environ for the libcs known to tolerate that?  Then musl might see
> truncation at 30-60 characters or whatever it is, but that's probably
> enough to see your cluster_name and backend type/user name which is
> pretty useful information.

No real objection here.  I do wonder about the point you (or somebody)
made upthread that we don't have any testing of the PS_USE_NONE case;
but that could be addressed some other way.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

22 марта, 03:07:14

On Fri, Mar 22, 2024 at 11:42:52AM +1300, Thomas Munro wrote:
> On Fri, Mar 22, 2024 at 9:30 AM <walther@technowledgy.de> wrote:
> > > 4. The upstream (musl) suggestion of which I sent a PoC was to "exec
> > > yourself with a bigger argv".
> >
> > We could do this in HEAD now ...
> 
> Just a thought: if we want to go this way, do we need a new exec call?
>  We already control the initial exec in pg_ctl.c.
> 
> > > Could we even use the exec-approach as the fallback in all other cases
> > > except BSDs and Windows and get rid of PS_USE_NONE?
> >
> > ... and then remove PS_USE_NONE at the beginning of the v18 cycle.
> >
> > This would give a bit more time for those "other systems", which were
> > previously falling back PS_USE_NONE and would then clobber argv, too.
> 
> RIght.  It's unspecified by POSIX whether ps shows changes to those
> strings (and there are systems that don't), but it can't hurt to do so
> anyway, and it'd be better than having a PS_USE_NONE code path that is
> untested.  I dimly recall that it turned out that PS_USE_NONE was
> actually broken for a while without anyone noticing.

Actually, I was thinking the opposite.  Since the musl libc is widely
used, it will be tested, and I don't want to disable process display
updates for such a common platform.

I suggest we use the #ifdef test to continue our existing behavior for
the libraries we know about, like glibc, and use the LD_* process title
truncation hack for libc's we don't recognize.

Attached is a prototype patch which implements this based on previous
patches.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Вложения

proctitle.diff

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

22 марта, 03:08:49

On Fri, Mar 22, 2024 at 12:20:11PM +1300, Thomas Munro wrote:
> On Fri, Mar 22, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The previously-suggested patch to whitelist glibc and variants,
> > and otherwise fall back to PS_USE_NONE, seems like it might be
> > the appropriate amount of effort.
> 
> What about meeting musl halfway: clobber argv, but only clobber
> environ for the libcs known to tolerate that?  Then musl might see
> truncation at 30-60 characters or whatever it is, but that's probably
> enough to see your cluster_name and backend type/user name which is
> pretty useful information.

I just posted a prototype patch to implement this.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Andres Freund

Дата:

22 марта, 03:19:48

Hi,

On 2024-03-21 21:16:46 +0100, Wolfgang Walther wrote:
> Right. The latter not only confuses musl, but also makes /proc/<pid>/environ
> return garbage. This is also mentioned at the bottom of main.c, which has a
> workaround for the specific case of UBSan depending on that. This is kind of
> funny: Because we are relying on undefined behavior regarding the
> modification of environ, we need a workaround for the
> "UndefinedBehaviorSanitizer" - I guess by failing without this workaround,
> it wanted to tell us something..

I don't think that's quite a fair description. Ubsan is basically doing
undefined things itself, so it's turtles all the way down.

> So summarizing:

FWIW, independent of which fix we go with, I think we need a buildfarm animal
using musl. Even better if one of the CI tasks can be made to use musl as
well.

Greetings,

Andres Freund

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Andrew Dunstan

Дата:

22 марта, 09:12:26

Sent from my iPad

> On Mar 22, 2024, at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
>> On 2024-03-21 21:16:46 +0100, Wolfgang Walther wrote:
>> Right. The latter not only confuses musl, but also makes /proc/<pid>/environ
>> return garbage. This is also mentioned at the bottom of main.c, which has a
>> workaround for the specific case of UBSan depending on that. This is kind of
>> funny: Because we are relying on undefined behavior regarding the
>> modification of environ, we need a workaround for the
>> "UndefinedBehaviorSanitizer" - I guess by failing without this workaround,
>> it wanted to tell us something..
>
> I don't think that's quite a fair description. Ubsan is basically doing
> undefined things itself, so it's turtles all the way down.
>
>
>> So summarizing:
>
> FWIW, independent of which fix we go with, I think we need a buildfarm animal
> using musl. Even better if one of the CI tasks can be made to use musl as
> well.


We had one till 3 months ago. It’s on my list to recreate.

Cheers

Andrew

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

22 марта, 09:15:27

Andrew Dunstan <andrew@dunslane.net> writes:
> On Mar 22, 2024, at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
>> FWIW, independent of which fix we go with, I think we need a buildfarm animal
>> using musl. Even better if one of the CI tasks can be made to use musl as
>> well.

> We had one till 3 months ago. It’s on my list to recreate.

How was it passing?  The issue discussed in this thread has surely
been there for a long time, and Wolfgang mentioned that he sees
others.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Andrew Dunstan

Дата:

22 марта, 10:02:19


> On Mar 22, 2024, at 4:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrew Dunstan <andrew@dunslane.net> writes:age
>>> On Mar 22, 2024, at 10:49 AM, Andres Freund <andres@anarazel.de> wrote:
>>> FWIW, independent of which fix we go with, I think we need a buildfarm animal
>>> using musl. Even better if one of the CI tasks can be made to use musl as
>>> well.
>
>> We had one till 3 months ago. It’s on my list to recreate.
>
> How was it passing?  The issue discussed in this thread has surely
> been there for a long time, and Wolfgang mentioned that he sees
> others.
>
>


The buildfarm client has a switch that delays running regression tests until after the install stages.

Cheers

Andrew

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

22 марта, 10:55:52

Andres Freund:
> FWIW, independent of which fix we go with, I think we need a buildfarm animal
> using musl. Even better if one of the CI tasks can be made to use musl as
> well.

I am already working with Andrew to set up a buildfarm animal to run 
Alpine Linux/musl. I can look into the CI task as well. Are you 
suggesting to change an existing task to run with Alpine/musl or to add 
a new task for it? It would be docker image based for sure.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Wolfgang Walther

Дата:

22 марта, 11:02:04

Andrew Dunstan:
>> On Mar 22, 2024, at 4:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> How was it passing?  The issue discussed in this thread has surely
>> been there for a long time, and Wolfgang mentioned that he sees
>> others.
> 
> The buildfarm client has a switch that delays running regression tests until after the install stages.

Hm. So while that switch makes the animal pass the build, it did hide 
exactly this problem. Not sure whether this switch should be used at 
all, then. Was this switch only implemented for the specific case of 
Alpine/musl or is there a different reason for it, as well?

The other issues I had been seeing were during make check-world, but not 
make check. Those were things around setlocale() / /bin/locale, IIRC. 
Not sure whether all of the tests are run by the buildfarm?

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

walther@technowledgy.de

Дата:

22 марта, 11:12:47

Tom Lane:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> Just a thought: if we want to go this way, do we need a new exec call?
>>   We already control the initial exec in pg_ctl.c.
> 
> I'm resistant to assuming the postmaster is launched through pg_ctl.
> systemd, for example, might well prefer not to do that, not to
> mention all the troglodytes still using 1990s launch scripts.

Right, the systemd example in the docs is not using pg_ctl.

But, it should be easily possible to have:
- pg_ctl call postgres with a padded argv0
- postgres call itself with padding, if it wasn't called with that already

This way, there would be no additional exec call when started through 
pg_ctl, but one more call when started directly.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

walther@technowledgy.de

Дата:

22 марта, 11:33:38

Bruce Momjian:
> I suggest we use the #ifdef test to continue our existing behavior for
> the libraries we know about, like glibc, and use the LD_* process title
> truncation hack for libc's we don't recognize.
> 
> Attached is a prototype patch which implements this based on previous
> patches.

The condition to check for linux/glibc in your patch is slightly off:

   #if ! defined(__linux__) || (! defined(__GLIBC__) && 
defined(__UCLIBC__ ))

should be

   #if defined(__linux__) && ! (defined(__GLIBC__) || defined(__UCLIBC__ ))

With the latter, it passes tests with musl.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

walther@technowledgy.de

Дата:

22 марта, 12:41:39

Wolfgang Walther:
> The other issues I had been seeing were during make check-world, but not 
> make check. Those were things around setlocale() / /bin/locale, IIRC. 
> Not sure whether all of the tests are run by the buildfarm?

Ah, those other tests only fail when building --with-icu, but Andrew's 
animal didn't do that.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Andrew Dunstan

Дата:

22 марта, 15:10:27

On Fri, Mar 22, 2024 at 4:02 AM Wolfgang Walther <walther@technowledgy.de> wrote:

Andrew Dunstan:
>> On Mar 22, 2024, at 4:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> How was it passing? The issue discussed in this thread has surely
>> been there for a long time, and Wolfgang mentioned that he sees
>> others.
>
> The buildfarm client has a switch that delays running regression tests until after the install stages.

Hm. So while that switch makes the animal pass the build, it did hide
exactly this problem. Not sure whether this switch should be used at
all, then. Was this switch only implemented for the specific case of
Alpine/musl or is there a different reason for it, as well?

Alpine was the main motivation, but it's also probably useful on Macs with SIP enabled.

ISTR raising the Alpine issue back then (2018) but I can't find a reference now.

cheers

andrew

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

22 марта, 16:36:19

On Fri, Mar 22, 2024 at 09:33:38AM +0100, walther@technowledgy.de wrote:
> Bruce Momjian:
> > I suggest we use the #ifdef test to continue our existing behavior for
> > the libraries we know about, like glibc, and use the LD_* process title
> > truncation hack for libc's we don't recognize.
> > 
> > Attached is a prototype patch which implements this based on previous
> > patches.
> 
> The condition to check for linux/glibc in your patch is slightly off:
> 
>   #if ! defined(__linux__) || (! defined(__GLIBC__) && defined(__UCLIBC__ ))
> 
> should be
> 
>   #if defined(__linux__) && ! (defined(__GLIBC__) || defined(__UCLIBC__ ))
> 
> With the latter, it passes tests with musl.

Yes, my logic was wrong. Not sure what I was thinking, frankly.

I am not a big fan of negating a complex conditional, but would rather
pass the negation into the conditional, new patch attached.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Вложения

proctitle.diff

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

22 марта, 22:44:28

On Fri, Mar 22, 2024 at 09:36:19AM -0400, Bruce Momjian wrote:
> On Fri, Mar 22, 2024 at 09:33:38AM +0100, walther@technowledgy.de wrote:
> > Bruce Momjian:
> > > I suggest we use the #ifdef test to continue our existing behavior for
> > > the libraries we know about, like glibc, and use the LD_* process title
> > > truncation hack for libc's we don't recognize.
> > > 
> > > Attached is a prototype patch which implements this based on previous
> > > patches.
> > 
> > The condition to check for linux/glibc in your patch is slightly off:
> > 
> >   #if ! defined(__linux__) || (! defined(__GLIBC__) && defined(__UCLIBC__ ))
> > 
> > should be
> > 
> >   #if defined(__linux__) && ! (defined(__GLIBC__) || defined(__UCLIBC__ ))
> > 
> > With the latter, it passes tests with musl.
> 
> Yes, my logic was wrong. Not sure what I was thinking, frankly.
> 
> I am not a big fan of negating a complex conditional, but would rather
> pass the negation into the conditional, new patch attached.

With no one "hoping this patch dies in a fire"*, I have updated it with
more details, which I now think is committable to master.  Is this
something to backpatch?  Seems too rare a bug to me.

* Robert Haas,
https://www.postgresql.org/message-id/CA%2BTgmoYsyrCNmg%2BYh6rgP7K8r-bYPjCeF1tPxENRFwD4VZAZvw%40mail.gmail.com

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Вложения

proctitle.diff

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

25 марта, 22:20:47

On Thu, Mar 21, 2024 at 11:26 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Some speculation on how widespread this trick is: as mentioned, it

Just one more, which I just ran into by accident while looking for
something else, which I'm posting just in case this thread ever gets
used to try to convince musl hackers to change their end to support
it:

https://github.com/openzfs/zfs/blob/master/lib/libzutil/os/linux/zutil_setproctitle.c#L158

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

walther@technowledgy.de

Дата:

26 марта, 00:58:30

Bruce Momjian:
> With no one "hoping this patch dies in a fire"*, I have updated it with
> more details, which I now think is committable to master.  Is this
> something to backpatch?  Seems too rare a bug to me.

I would like to see this backpatched to be able to run the regression 
tests easily on all stable branches.

I have taken your's/Thomas' patch and extended it with a few more taking 
in many of the ideas in this thread:

0001 Don't clobber LD_*
This is the patch you posted. This applies cleanly all the way down to 
v12. This fixes the bug and allows running most of the tests with musl - 
yeah!
I also confirmed, that this will not create practical problems with 
library/postgres docker image, where this is likely used the most. While 
"postgres" is called by default without any arguments here, plenty of 
environment variables are passed. The docker image does use LD_PRELOAD 
to trick initdb, but that's not set when running the postmaster, so not 
a problem here.
This use-case also shows why the proposed patch to still partially 
clobber environ at this stage is better than to not clobber environ at 
all - in this case, the docker image would essentially have no ps status 
at all by default.-

0002 Allow setting PS_USE_NONE via CPPFLAGS
This was proposed by Andrew and applies cleanly down to v12. Thus, it 
could be backpatched, too. First and foremost this would allow setting a 
buildfarm animal to use this flag to make sure this code path is 
actually build/tested at all. This is something that Thomas and Tom 
hinted at.

0003 Don't ever clobber environ again
This is the approach I had previously posted as a PoC. This would not be 
backpatched, but I suggested this could go into v17 now. This avoids the 
undefined behavior and sets the table to eventually set ps status via 
argv by default and remove PS_USE_NONE later.
Compared to the PoC patch, I decided not to pad argv[0], because this 
will break the ps status display for the postmaster itself. Instead exec 
is called with an additional argument consisting of exactly 255 spaces.
I also tried avoiding the additional exec-call if postgres was called 
via pg_ctl, as suggested by Peter. This quickly turned out to be more 
invasive than I would have liked this to be, though.
The current approach works very well, the environment doesn't need to be 
copied anymore and the workaround for /proc/<pid>/environ in main.c can 
go away, too.

0004 Default to PS_USE_CLOBBER_ARGV
This changes the default to display ps status on all other systems, too. 
This could potentially go in now as well, or be delayed to the beginning 
of the v18 cycle. In the unlikely event that this breaks something on a 
platform not considered here and we get a bug report, we can easily 
advise to compile with CPPFLAGS=-DPS_USE_NONE, which is still there at 
this stage.

0005 Remove PS_USE_NONE
However, if no reports come in and no problems are detected with 0004, 
then this can be entirely removed. This for "later", whenever that is.

Best,

Wolfgang

Вложения

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

26 марта, 01:35:06

On Mon, Mar 25, 2024 at 10:58:30PM +0100, walther@technowledgy.de wrote:
> Bruce Momjian:
> > With no one "hoping this patch dies in a fire"*, I have updated it with
> > more details, which I now think is committable to master.  Is this
> > something to backpatch?  Seems too rare a bug to me.
> 
> I would like to see this backpatched to be able to run the regression tests
> easily on all stable branches.

You want to risk destabilizing Postgres by backpatching something this
complex so the regression tests can be run on all stable branches?  I
think you are overestimating our desire to take on risk.

Also, in my patch, the parentheses here:

    #if defined(__linux__) && (! defined(__GLIBC__) && ! defined(__UCLIBC__))

are unnecessary so they should be removed:

    #if defined(__linux__) && ! defined(__GLIBC__) && ! defined(__UCLIBC__)

I am only willing to apply my patch, and only to master.  Other
committers might be more willing.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Peter Eisentraut

Дата:

26 марта, 01:46:09

On 22.03.24 20:44, Bruce Momjian wrote:
> +                 * linking (dlopen) might fail.  Here, we truncate the update
> +                 * of the process title when either of two important dynamic
> +                 * linking environment variables are set.  Musl does not
> +                 * define any compiler symbols, so we have to do this for
> +                 * any Linux libc we don't know is safe.
> +                 */
> +                if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
> +                    strstr(environ[i], "LD_PRELOAD=") == environ[i])

What determines which variables require this treatment?

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

26 марта, 01:51:09

On Mon, Mar 25, 2024 at 11:46:09PM +0100, Peter Eisentraut wrote:
> On 22.03.24 20:44, Bruce Momjian wrote:
> > +                 * linking (dlopen) might fail.  Here, we truncate the update
> > +                 * of the process title when either of two important dynamic
> > +                 * linking environment variables are set.  Musl does not
> > +                 * define any compiler symbols, so we have to do this for
> > +                 * any Linux libc we don't know is safe.
> > +                 */
> > +                if (strstr(environ[i], "LD_LIBRARY_PATH=") == environ[i] ||
> > +                    strstr(environ[i], "LD_PRELOAD=") == environ[i])
> 
> What determines which variables require this treatment?

Thomas Munro came up with that part of the patch.  I just combined his
patch with the macro test.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

26 марта, 01:54:59

On Tue, Mar 26, 2024 at 11:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> What determines which variables require this treatment?

That came from me peeking at their code:

https://www.postgresql.org/message-id/CA%2BhUKGKNK5V8XwwJJZm36s3EUy8V51xu4XiE8%3D26n%3DWq3OGd4A%40mail.gmail.com

I had originally proposed to avoid anything beginning "LD_" but Tom
suggested being more specific.  I doubt LD_PRELOAD can really hurt you
though (the linker probably only needs the value at the start by
definition, not at later dlopen() time (?)).  I dunno.  If you're
asking if there is any standard or whatever supplying these names, the
System V or at least ELF standards talk about LD_LIBRARY_PATH (though
those standards don't know/care what happens after userspace takes
over control of the environment concept, they just talk about how the
world is created when you exec a process, so they AFAICS they don't
address this clobbering stuff, and AFAIK other LD_XXX stuff is
probably implementation specific).

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

26 марта, 02:14:25

Bruce Momjian <bruce@momjian.us> writes:
> On Mon, Mar 25, 2024 at 10:58:30PM +0100, walther@technowledgy.de wrote:
>> I would like to see this backpatched to be able to run the regression tests
>> easily on all stable branches.

> You want to risk destabilizing Postgres by backpatching something this
> complex so the regression tests can be run on all stable branches?  I
> think you are overestimating our desire to take on risk.

If we want a buildfarm animal testing this platform, we kind of need
to support it on all branches.  Having said that, I agree with you
that we are looking for a minimalist fix not a maximalist one.
I think the 0001 patch is about right, but the rest seem to be solving
problems we don't have.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

26 марта, 02:15:52

Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Mar 26, 2024 at 11:46 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> What determines which variables require this treatment?

> I had originally proposed to avoid anything beginning "LD_" but Tom
> suggested being more specific.  I doubt LD_PRELOAD can really hurt you
> though (the linker probably only needs the value at the start by
> definition, not at later dlopen() time (?)).

Oh, good point.  So we could simplify the patch by only looking for
LD_LIBRARY_PATH.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

26 марта, 02:21:48

On Mon, Mar 25, 2024 at 07:14:25PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Mon, Mar 25, 2024 at 10:58:30PM +0100, walther@technowledgy.de wrote:
> >> I would like to see this backpatched to be able to run the regression tests
> >> easily on all stable branches.
> 
> > You want to risk destabilizing Postgres by backpatching something this
> > complex so the regression tests can be run on all stable branches?  I
> > think you are overestimating our desire to take on risk.
> 
> If we want a buildfarm animal testing this platform, we kind of need
> to support it on all branches.  Having said that, I agree with you
> that we are looking for a minimalist fix not a maximalist one.
> I think the 0001 patch is about right, but the rest seem to be solving
> problems we don't have.

I could support the minimalist patch applied to all branches.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

26 марта, 02:43:08

I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> I had originally proposed to avoid anything beginning "LD_" but Tom
>> suggested being more specific.  I doubt LD_PRELOAD can really hurt you
>> though (the linker probably only needs the value at the start by
>> definition, not at later dlopen() time (?)).

> Oh, good point.  So we could simplify the patch by only looking for
> LD_LIBRARY_PATH.

I looked at the musl source code you identified and confirmed that
only the LD_LIBRARY_PATH string is remembered in a static variable;
LD_PRELOAD is only accessed locally in that initialization function.
So we only need to do the attached.  (I failed to resist the
temptation to rewrite the comments.)

            regards, tom lane

diff --git a/src/backend/utils/misc/ps_status.c b/src/backend/utils/misc/ps_status.c
index 5d829e6e48..92cd2c7899 100644
--- a/src/backend/utils/misc/ps_status.c
+++ b/src/backend/utils/misc/ps_status.c
@@ -151,7 +151,33 @@ save_ps_display_args(int argc, char **argv)
         for (i = 0; environ[i] != NULL; i++)
         {
             if (end_of_area + 1 == environ[i])
-                end_of_area = environ[i] + strlen(environ[i]);
+            {
+                /*
+                 * The musl runtime linker keeps a static pointer to the
+                 * initial value of LD_LIBRARY_PATH, if that is defined in the
+                 * process's environment. Therefore, we must not overwrite the
+                 * value of that setting and thus cannot advance end_of_area
+                 * beyond it.  Musl does not define any identifying compiler
+                 * symbol, so we have to do this for any Linux libc we don't
+                 * know is safe.
+                 */
+#if defined(__linux__) && (!defined(__GLIBC__) && !defined(__UCLIBC__ ))
+                if (strncmp(environ[i], "LD_LIBRARY_PATH=", 16) == 0)
+                {
+                    /*
+                     * We can overwrite the name, but stop at the equals sign.
+                     * Future loop iterations will not find any more
+                     * contiguous space, but we don't break early because we
+                     * need to count the total number of environ[] entries.
+                     */
+                    end_of_area = environ[i] + 15;
+                }
+                else
+#endif
+                {
+                    end_of_area = environ[i] + strlen(environ[i]);
+                }
+            }
         }

         ps_buffer = argv[0];

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Thomas Munro

Дата:

26 марта, 02:49:55

On Tue, Mar 26, 2024 at 12:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I looked at the musl source code you identified and confirmed that
> only the LD_LIBRARY_PATH string is remembered in a static variable;
> LD_PRELOAD is only accessed locally in that initialization function.
> So we only need to do the attached.  (I failed to resist the
> temptation to rewrite the comments.)

LGTM.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Andres Freund

Дата:

26 марта, 03:14:47

Hi,

On 2024-03-22 08:55:52 +0100, Wolfgang Walther wrote:
> Andres Freund:
> > FWIW, independent of which fix we go with, I think we need a buildfarm animal
> > using musl. Even better if one of the CI tasks can be made to use musl as
> > well.
>
> I am already working with Andrew to set up a buildfarm animal to run Alpine
> Linux/musl. I can look into the CI task as well. Are you suggesting to
> change an existing task to run with Alpine/musl or to add a new task for it?
> It would be docker image based for sure.

I'd rather adapt one of the existing tasks, to avoid increasing CI costs
unduly.

The way we currently run CI for testing of not-yet-merged patches runs
all tasks other than macos as full VMs, that turned out to be faster &
cheaper.

FWIW, except for one small issue, building postgres against musl works on
debian and the tests pass if I install first.

The small problem mentioned above is that on debian linux/fs.h isn't available
when building with musl, which in turn causes src/bin/pg_upgrade/file.c to
fail to compile.  I assume that's not the case on "fully musl" distro?

Greetings,

Andres Freund

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Bruce Momjian

Дата:

26 марта, 03:53:05

On Tue, Mar 26, 2024 at 12:49:55PM +1300, Thomas Munro wrote:
> On Tue, Mar 26, 2024 at 12:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I looked at the musl source code you identified and confirmed that
> > only the LD_LIBRARY_PATH string is remembered in a static variable;
> > LD_PRELOAD is only accessed locally in that initialization function.
> > So we only need to do the attached.  (I failed to resist the
> > temptation to rewrite the comments.)
> 
> LGTM.

+1

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Peter Eisentraut

Дата:

26 марта, 10:19:09

On 26.03.24 00:43, Tom Lane wrote:
> I wrote:
>> Thomas Munro <thomas.munro@gmail.com> writes:
>>> I had originally proposed to avoid anything beginning "LD_" but Tom
>>> suggested being more specific.  I doubt LD_PRELOAD can really hurt you
>>> though (the linker probably only needs the value at the start by
>>> definition, not at later dlopen() time (?)).
> 
>> Oh, good point.  So we could simplify the patch by only looking for
>> LD_LIBRARY_PATH.
> 
> I looked at the musl source code you identified and confirmed that
> only the LD_LIBRARY_PATH string is remembered in a static variable;
> LD_PRELOAD is only accessed locally in that initialization function.
> So we only need to do the attached.  (I failed to resist the
> temptation to rewrite the comments.)

Yeah, I was more looking for a comment for posterity for *why* we need 
to preserve this variable in particular.  The updated comment looks 
reasonable.

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

walther@technowledgy.de

Дата:

26 марта, 10:36:40

Bruce Momjian:
> You want to risk destabilizing Postgres by backpatching something this
> complex so the regression tests can be run on all stable branches?  I
> think you are overestimating our desire to take on risk.

I specifically wrote about backpatching the first (and maybe second) 
patch. None of that is risky.

Patches 3-5 were not meant for backpatching at all.

Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

walther@technowledgy.de

Дата:

26 марта, 10:43:54

Tom Lane:
> If we want a buildfarm animal testing this platform, we kind of need
> to support it on all branches.  Having said that, I agree with you
> that we are looking for a minimalist fix not a maximalist one.
> I think the 0001 patch is about right, but the rest seem to be solving
> problems we don't have.

The second patch potentially solves the problem of PS_USE_NONE not being 
tested. Of course you could also set up a buildfarm animal on some other 
platform, which is sure to fall through to PS_USE_NONE, but that seems 
to have not worked in the past:

Thomas Munro:
> I dimly recall that it turned out that PS_USE_NONE was
> actually broken for a while without anyone noticing
Best,

Wolfgang

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

26 марта, 17:35:43

walther@technowledgy.de writes:
> The second patch potentially solves the problem of PS_USE_NONE not being 
> tested. Of course you could also set up a buildfarm animal on some other 
> platform, which is sure to fall through to PS_USE_NONE, but that seems 
> to have not worked in the past:

> Thomas Munro:
>> I dimly recall that it turned out that PS_USE_NONE was
>> actually broken for a while without anyone noticing

I think what Thomas is recollecting is this:

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_15_BR [0fb6954aa] 2022-03-27 12:57:46 -0400
Branch: REL_14_STABLE Release: REL_14_3 [3f7a59c59] 2022-03-27 12:57:52 -0400
Branch: REL_13_STABLE Release: REL_13_7 [9016a2a3d] 2022-03-27 12:57:57 -0400

    Fix breakage of get_ps_display() in the PS_USE_NONE case.
    
    Commit 8c6d30f21 caused this function to fail to set *displen
    in the PS_USE_NONE code path.  If the variable's previous value
    had been negative, that'd lead to a memory clobber at some call
    sites.  We'd managed not to notice due to very thin test coverage
    of such configurations, but this appears to explain buildfarm member
    lorikeet's recent struggles.
    
    Credit to Andrew Dunstan for spotting the problem.  Back-patch
    to v13 where the bug was introduced.
    
    Discussion: https://postgr.es/m/136102.1648320427@sss.pgh.pa.us


The problem wasn't lack of coverage, it was that the failure was
intermittent and erratic enough to be very hard to diagnose.
I think that's more bad luck than anything else.

            regards, tom lane

Re: Regression tests fail with musl libc because libpq.so can't be loaded

От

Tom Lane

Дата:

26 марта, 18:45:55

Peter Eisentraut <peter@eisentraut.org> writes:
> On 26.03.24 00:43, Tom Lane wrote:
>> I looked at the musl source code you identified and confirmed that
>> only the LD_LIBRARY_PATH string is remembered in a static variable;
>> LD_PRELOAD is only accessed locally in that initialization function.
>> So we only need to do the attached.  (I failed to resist the
>> temptation to rewrite the comments.)

> Yeah, I was more looking for a comment for posterity for *why* we need 
> to preserve this variable in particular.  The updated comment looks 
> reasonable.

OK, pushed after a bit more comment-fiddling.

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Regression tests fail with musl libc because libpq.so can't be loaded

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения