Обсуждение: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

Поиск
Список
Период
Сортировка

postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

I amd working on adding an "installcheck" equivalent mode to the meson
build. One invocation of something "installcheck-world" like lead to things
getting stuck.

Lots of log messages like:

2022-09-25 16:16:30.999 PDT [2705454][client backend][28/1112:41269][pg_regress] LOG:  still waiting for backend with
PID2705178 to accept ProcSignalBarrier
 
2022-09-25 16:16:30.999 PDT [2705454][client backend][28/1112:41269][pg_regress] STATEMENT:  DROP DATABASE IF EXISTS
"regression_test_parser_regress"
2022-09-25 16:16:31.006 PDT [2705472][client backend][22/3699:41294][pg_regress] LOG:  still waiting for backend with
PID2705178 to accept ProcSignalBarrier
 
2022-09-25 16:16:31.006 PDT [2705472][client backend][22/3699:41294][pg_regress] STATEMENT:  DROP DATABASE IF EXISTS
"regression_test_predtest_regress"

a stacktrace of 2705178 shows:

(gdb) bt
#0  0x00007f67d26fe1b3 in __GI___poll (fds=fds@entry=0x7ffebe187c88, nfds=nfds@entry=1, timeout=-1) at
../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f67cfd03c1c in pqSocketPoll (sock=<optimized out>, forRead=forRead@entry=1, forWrite=forWrite@entry=0,
end_time=end_time@entry=-1)
    at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1125
#2  0x00007f67cfd04310 in pqSocketCheck (conn=conn@entry=0x562f875a9b70, forRead=forRead@entry=1,
forWrite=forWrite@entry=0,end_time=end_time@entry=-1)
 
    at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:1066
#3  0x00007f67cfd043fd in pqWaitTimed (forRead=forRead@entry=1, forWrite=forWrite@entry=0,
conn=conn@entry=0x562f875a9b70,finish_time=finish_time@entry=-1)
 
    at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-misc.c:998
#4  0x00007f67cfcfc47b in connectDBComplete (conn=conn@entry=0x562f875a9b70) at
../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-connect.c:2166
#5  0x00007f67cfcfe248 in PQconnectdbParams (keywords=keywords@entry=0x562f87613d20,
values=values@entry=0x562f87613d70,expand_dbname=expand_dbname@entry=0)
 
    at ../../../../home/andres/src/postgresql/src/interfaces/libpq/fe-connect.c:659
#6  0x00007f67cfd29536 in connect_pg_server (server=server@entry=0x562f876139b0, user=user@entry=0x562f87613980)
    at ../../../../home/andres/src/postgresql/contrib/postgres_fdw/connection.c:474
#7  0x00007f67cfd29910 in make_new_connection (entry=entry@entry=0x562f8758b2c8, user=user@entry=0x562f87613980)
    at ../../../../home/andres/src/postgresql/contrib/postgres_fdw/connection.c:344
#8  0x00007f67cfd29da0 in GetConnection (user=0x562f87613980, will_prep_stmt=will_prep_stmt@entry=false,
state=state@entry=0x562f876136a0)
    at ../../../../home/andres/src/postgresql/contrib/postgres_fdw/connection.c:204
#9  0x00007f67cfd35294 in postgresBeginForeignScan (node=0x562f87612e70, eflags=<optimized out>)


and it turns out that backend can't be graciously be terminated.


Maybe I am missing something, but I don't think it's OK for
connect_pg_server() to connect in a blocking manner, without accepting
interrupts?

Greetings,

Andres Freund



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

I just re-discovered this issue, via
https://www.postgresql.org/message-id/20221209003607.bz2zdznvfnkq4zz3%40awork3.anarazel.de


On 2022-09-25 16:22:37 -0700, Andres Freund wrote:
> Maybe I am missing something, but I don't think it's OK for
> connect_pg_server() to connect in a blocking manner, without accepting
> interrupts?

It's definitely not. This basically means network issues or such can lead to
connections being unkillable...

We know how to do better, c.f. libpqrcv_connect(). I hacked that up for
postgres_fdw, and got through quite a few runs without related issues ([1]).

The same problem is present in two places in dblink.c. Obviously we can copy
and paste the code to dblink.c as well. But that means we have the same not
quite trivial code in three different c files. There's already a fair bit of
duplicated code around AcquireExternalFD().

It seems we should find a place to put backend specific libpq wrapper code
somewhere. Unless we want to relax the rule about not linking libpq into the
backend we can't just put it in the backend directly, though.

The only alternative way to provide a wrapper that I can think of are to
a) introduce a new static library that can be linked to by libpqwalreceiver,
   postgres_fdw, dblink
b) add a header with static inline functions implementing interrupt-processing
   connection establishment for libpq

Neither really has precedent.

The attached quick patch just adds and uses libpq_connect_interruptible() in
postgres_fdw. If we wanted to move this somewhere generic, at least part of
the external FD handling should also be moved into it.


dblink.c uses a lot of other blocking libpq functions, which obviously also
isn't ok.


Perhaps we could somehow make this easier from within libpq? My first thought
was a connection parameter that'd provide a different implementation of
pqSocketCheck() or such - but I don't think that'd work, because we might
throw errors when processing interrupts, and that'd not be ok from deep within
libpq.

Greetings,

Andres Freund


[1] I eventually encountered a deadlock in REINDEX, but it didn't involve
    postgres_fw / ProcSignalBarrier

Вложения

Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

On 2022-12-08 18:08:15 -0800, Andres Freund wrote:
> I just re-discovered this issue, via
> https://www.postgresql.org/message-id/20221209003607.bz2zdznvfnkq4zz3%40awork3.anarazel.de
> 
> 
> On 2022-09-25 16:22:37 -0700, Andres Freund wrote:
> > Maybe I am missing something, but I don't think it's OK for
> > connect_pg_server() to connect in a blocking manner, without accepting
> > interrupts?
> 
> It's definitely not. This basically means network issues or such can lead to
> connections being unkillable...
> 
> We know how to do better, c.f. libpqrcv_connect(). I hacked that up for
> postgres_fdw, and got through quite a few runs without related issues ([1]).
> 
> The same problem is present in two places in dblink.c. Obviously we can copy
> and paste the code to dblink.c as well. But that means we have the same not
> quite trivial code in three different c files. There's already a fair bit of
> duplicated code around AcquireExternalFD().
> 
> It seems we should find a place to put backend specific libpq wrapper code
> somewhere. Unless we want to relax the rule about not linking libpq into the
> backend we can't just put it in the backend directly, though.
> 
> The only alternative way to provide a wrapper that I can think of are to
> a) introduce a new static library that can be linked to by libpqwalreceiver,
>    postgres_fdw, dblink
> b) add a header with static inline functions implementing interrupt-processing
>    connection establishment for libpq
> 
> Neither really has precedent.
> 
> The attached quick patch just adds and uses libpq_connect_interruptible() in
> postgres_fdw. If we wanted to move this somewhere generic, at least part of
> the external FD handling should also be moved into it.
> 
> 
> dblink.c uses a lot of other blocking libpq functions, which obviously also
> isn't ok.
> 
> 
> Perhaps we could somehow make this easier from within libpq? My first thought
> was a connection parameter that'd provide a different implementation of
> pqSocketCheck() or such - but I don't think that'd work, because we might
> throw errors when processing interrupts, and that'd not be ok from deep within
> libpq.

Any opinions?  Due to the simplicity I'm currently leaning to a header-only
helper, but I don't feel confident about it.

The rate of cfbot failures is high enough that it'd be good to do something
about this.

Greetings,

Andres Freund



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Thomas Munro
Дата:
On Fri, Dec 30, 2022 at 6:23 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-12-08 18:08:15 -0800, Andres Freund wrote:
> > On 2022-09-25 16:22:37 -0700, Andres Freund wrote:
> > The only alternative way to provide a wrapper that I can think of are to
> > a) introduce a new static library that can be linked to by libpqwalreceiver,
> >    postgres_fdw, dblink
> > b) add a header with static inline functions implementing interrupt-processing
> >    connection establishment for libpq
> >
> > Neither really has precedent.

> Any opinions?  Due to the simplicity I'm currently leaning to a header-only
> helper, but I don't feel confident about it.

The header idea is a little bit sneaky (IIUC: a header that is part of
the core tree, but can't be used by core and possibly needs special
treatment in 'headercheck' to get the right include search path, can
only be used by libpqwalreceiver et al which are allowed to link to
libpq), but I think it is compatible with other goals we have
discussed in other threads.  I think in the near future we'll probably
remove the concept of non-threaded server builds (as proposed before
in the post HP-UX 10 cleanup thread, with patches, but not quite over
the line yet).  Then I think the server could be allowed to link libpq
directly?  And at that point this code wouldn't be sneaky anymore and
could optionally move into a .c.  Does that makes sense?



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Tom Lane
Дата:
Thomas Munro <thomas.munro@gmail.com> writes:
> The header idea is a little bit sneaky (IIUC: a header that is part of
> the core tree, but can't be used by core and possibly needs special
> treatment in 'headercheck' to get the right include search path, can
> only be used by libpqwalreceiver et al which are allowed to link to
> libpq), but I think it is compatible with other goals we have
> discussed in other threads.  I think in the near future we'll probably
> remove the concept of non-threaded server builds (as proposed before
> in the post HP-UX 10 cleanup thread, with patches, but not quite over
> the line yet).  Then I think the server could be allowed to link libpq
> directly?  And at that point this code wouldn't be sneaky anymore and
> could optionally move into a .c.  Does that makes sense?

I don't like the idea of linking libpq directly into the backend.
It should remain a dynamically-loaded library to avoid problems 
during software updates.

            regards, tom lane



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

On 2022-12-30 10:31:22 +1300, Thomas Munro wrote:
> On Fri, Dec 30, 2022 at 6:23 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-12-08 18:08:15 -0800, Andres Freund wrote:
> > > On 2022-09-25 16:22:37 -0700, Andres Freund wrote:
> > > The only alternative way to provide a wrapper that I can think of are to
> > > a) introduce a new static library that can be linked to by libpqwalreceiver,
> > >    postgres_fdw, dblink
> > > b) add a header with static inline functions implementing interrupt-processing
> > >    connection establishment for libpq
> > >
> > > Neither really has precedent.
> 
> > Any opinions?  Due to the simplicity I'm currently leaning to a header-only
> > helper, but I don't feel confident about it.
> 
> The header idea is a little bit sneaky (IIUC: a header that is part of
> the core tree, but can't be used by core and possibly needs special
> treatment in 'headercheck' to get the right include search path, can
> only be used by libpqwalreceiver et al which are allowed to link to
> libpq), but I think it is compatible with other goals we have
> discussed in other threads.

Hm, what special search path / headerscheck magic are you thinking of? I think
something like src/include/libpq/libpq-be-fe-helpers.h defining a bunch of
static inlines should "just" work?


We likely could guard against that header being included from code ending up
in the postgres binary directly by #error'ing if BUILDING_DLL is
defined. That's a very badly named define, but it IIRC has to be iff building
code ending up in postgres directly.


> I think in the near future we'll probably remove the concept of non-threaded
> server builds (as proposed before in the post HP-UX 10 cleanup thread, with
> patches, but not quite over the line yet).  Then I think the server could be
> allowed to link libpq directly?  And at that point this code wouldn't be
> sneaky anymore and could optionally move into a .c.  Does that makes sense?

I was wondering about linking in libpq directly as well. But I am not sure
it's a good idea. I suspect we'd run into some issues around libraries
(including extensions) linking to different versions of libpq etc - if we
directly link to libpq that'd end up in tears.

It might be a different story if we had a version of libpq built with
different symbol names etc. But that's not exactly trivial either.

Greetings,

Andres Freund



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Thomas Munro
Дата:
On Fri, Dec 30, 2022 at 10:54 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-12-30 10:31:22 +1300, Thomas Munro wrote:
> > On Fri, Dec 30, 2022 at 6:23 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2022-12-08 18:08:15 -0800, Andres Freund wrote:
> > > > On 2022-09-25 16:22:37 -0700, Andres Freund wrote:
> > > > The only alternative way to provide a wrapper that I can think of are to
> > > > a) introduce a new static library that can be linked to by libpqwalreceiver,
> > > >    postgres_fdw, dblink
> > > > b) add a header with static inline functions implementing interrupt-processing
> > > >    connection establishment for libpq
> > > >
> > > > Neither really has precedent.
> >
> > > Any opinions?  Due to the simplicity I'm currently leaning to a header-only
> > > helper, but I don't feel confident about it.
> >
> > The header idea is a little bit sneaky (IIUC: a header that is part of
> > the core tree, but can't be used by core and possibly needs special
> > treatment in 'headercheck' to get the right include search path, can
> > only be used by libpqwalreceiver et al which are allowed to link to
> > libpq), but I think it is compatible with other goals we have
> > discussed in other threads.
>
> Hm, what special search path / headerscheck magic are you thinking of? I think
> something like src/include/libpq/libpq-be-fe-helpers.h defining a bunch of
> static inlines should "just" work?

Oh, I was imagining something slightly different.  Not something under
src/include/libpq, but conceptually a separate header-only library
that is above both the backend and libpq.  Maybe something like
src/include/febe_util/libpq_connect_interruptible.h.  In other words,
I thought your idea b was a header-only version of your idea a.  I
think that might be a bit nicer than putting it under libpq?
Superficial difference, perhaps...

And then I assumed that headerscheck would need to be told to add
libpq's header location in -I for that header, but on closer
inspection it already adds that unconditionally so I retract that
comment.

> > I think in the near future we'll probably remove the concept of non-threaded
> > server builds (as proposed before in the post HP-UX 10 cleanup thread, with
> > patches, but not quite over the line yet).  Then I think the server could be
> > allowed to link libpq directly?  And at that point this code wouldn't be
> > sneaky anymore and could optionally move into a .c.  Does that makes sense?
>
> I was wondering about linking in libpq directly as well. But I am not sure
> it's a good idea. I suspect we'd run into some issues around libraries
> (including extensions) linking to different versions of libpq etc - if we
> directly link to libpq that'd end up in tears.
>
> It might be a different story if we had a version of libpq built with
> different symbol names etc. But that's not exactly trivial either.

Hmm, yeah.  Some interesting things to think about.  Whether it's a
feature or an accident that new backends can pick up new libpq minor
updates without restarting the postmaster, and how we'd manage a
future libpq major version/ABI break.  Getting a bit off topic for
this thread I suppose.



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

On 2022-12-30 14:14:55 +1300, Thomas Munro wrote:
> Oh, I was imagining something slightly different.  Not something under
> src/include/libpq, but conceptually a separate header-only library
> that is above both the backend and libpq.  Maybe something like
> src/include/febe_util/libpq_connect_interruptible.h.  In other words,
> I thought your idea b was a header-only version of your idea a.  I
> think that might be a bit nicer than putting it under libpq?
> Superficial difference, perhaps...

It doesn't seem entirely right to introduce a top-level "module" for
libpq-in-extensions to me - we don't do that for other APIs used for
extensions. But a header only library also doesn't quite seem right. So ...

Looking at turning my patch upthread into something slightly less prototype-y,
I noticed that libpqwalreceiver doesn't do AcquireExternalFD(), added to other
backend uses of libpq in 3d475515a15. It's unlikely to matter for
walreceiver.c itelf, but it seems problematic for the logical replication
cases?

It's annoying to introduce PG_TRY/CATCH blocks, just to deal with the
potential for errors inside WaitLatchOrSocket(), which should never happen. I
wonder if we should consider making latch.c error out fatally, instead of
elog(ERROR). If latches are broken, things are bad.


The PG_CATCH() logic in postgres_fdw's GetConnection() looks quite suspicious
to me. It looks like 32a9c0bdf493 took entirely the wrong path. Instead of
optionally not throwing or directly re-establishing connections in
begin_remote_xact(), the PG_CATCH() hammer was applied.



The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver,
dblink, postgres_fdw.

As I made libpq-be-fe-helpers.h handle reserving external fds,
libpqwalreceiver now does so. I briefly looked through its users without
seeing cases of leaking in case of errors - which would already have been bad,
since we'd already have leaked a libpq connection/socket.


Given the lack of field complaints and the size of the required changes, I
don't think we should backpatch this, even though it's pretty clearly buggy
as-is.


Some time back Thomas had a patch to introduce a wrapper around
libpq-in-extensions that fixed issues cause by some events being
edge-triggered on windows. It's possible that combining these two efforts
would yield something better. I resisted the urge to create a wrapper around
each connection in this patch, as it'd have ended up being a whole lot more
invasive. But... Thomas, do you have a reference to that code handy?

Greetings,

Andres Freund

Вложения

Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Robert Haas
Дата:
On Sun, Sep 25, 2022 at 7:22 PM Andres Freund <andres@anarazel.de> wrote:
> Maybe I am missing something, but I don't think it's OK for
> connect_pg_server() to connect in a blocking manner, without accepting
> interrupts?

I remember noticing various problems in this area years ago. I'm not
sure whether I noticed this particular one, but the commit message for
ae9bfc5d65123aaa0d1cca9988037489760bdeae mentions a few others that I
did notice.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

On 2023-01-03 12:05:20 -0800, Andres Freund wrote:
> The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver,
> dblink, postgres_fdw.

> As I made libpq-be-fe-helpers.h handle reserving external fds,
> libpqwalreceiver now does so. I briefly looked through its users without
> seeing cases of leaking in case of errors - which would already have been bad,
> since we'd already have leaked a libpq connection/socket.
> 
> 
> Given the lack of field complaints and the size of the required changes, I
> don't think we should backpatch this, even though it's pretty clearly buggy
> as-is.

Any comments on this? Otherwise I think I'll go with this approach.

Greetings,

Andres Freund



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Thomas Munro
Дата:
On Tue, Jan 10, 2023 at 3:05 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-01-03 12:05:20 -0800, Andres Freund wrote:
> > The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver,
> > dblink, postgres_fdw.
>
> > As I made libpq-be-fe-helpers.h handle reserving external fds,
> > libpqwalreceiver now does so. I briefly looked through its users without
> > seeing cases of leaking in case of errors - which would already have been bad,
> > since we'd already have leaked a libpq connection/socket.
> >
> >
> > Given the lack of field complaints and the size of the required changes, I
> > don't think we should backpatch this, even though it's pretty clearly buggy
> > as-is.
>
> Any comments on this? Otherwise I think I'll go with this approach.

+1.  Not totally convinced about the location but we are free to
re-organise it any time, and the random CI failures are bad.



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

On 2023-01-14 14:45:06 +1300, Thomas Munro wrote:
> On Tue, Jan 10, 2023 at 3:05 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2023-01-03 12:05:20 -0800, Andres Freund wrote:
> > > The attached patch adds libpq-be-fe-helpers.h and uses it in libpqwalreceiver,
> > > dblink, postgres_fdw.
> >
> > > As I made libpq-be-fe-helpers.h handle reserving external fds,
> > > libpqwalreceiver now does so. I briefly looked through its users without
> > > seeing cases of leaking in case of errors - which would already have been bad,
> > > since we'd already have leaked a libpq connection/socket.
> > >
> > >
> > > Given the lack of field complaints and the size of the required changes, I
> > > don't think we should backpatch this, even though it's pretty clearly buggy
> > > as-is.
> >
> > Any comments on this? Otherwise I think I'll go with this approach.
> 
> +1.  Not totally convinced about the location but we are free to
> re-organise it any time, and the random CI failures are bad.

Cool.

Updated patch attached. I split it into multiple pieces.
1) A fix for [1], included here because I encountered it while testing
2) Introduction of libpq-be-fe-helpers.h
3) Convert dblink and postgres_fdw to the helper
4) Convert libpqwalreceiver.c to the helper

Even if we eventually decide to backpatch 3), we'd likely not backpatch 4), as
there's no bug (although perhaps the lack of FD handling could be called a
bug?).

There's also some light polishing, improving commit message, comments and
moving some internal helper functions to later in the file.


Greetings,

Andres Freund

[1] https://postgr.es/m/20230121011237.q52apbvlarfv6jm6%40awork3.anarazel.de

Вложения

Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

On 2023-01-20 19:00:08 -0800, Andres Freund wrote:
> Updated patch attached. I split it into multiple pieces.
> 1) A fix for [1], included here because I encountered it while testing
> 2) Introduction of libpq-be-fe-helpers.h
> 3) Convert dblink and postgres_fdw to the helper
> 4) Convert libpqwalreceiver.c to the helper
> 
> Even if we eventually decide to backpatch 3), we'd likely not backpatch 4), as
> there's no bug (although perhaps the lack of FD handling could be called a
> bug?).
> 
> There's also some light polishing, improving commit message, comments and
> moving some internal helper functions to later in the file.

After a tiny bit further polishing, and after separately pushing a resource
leak fix for walrcv_connect(), I pushed this.

Greetings,

Andres Freund



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Nathan Bossart
Дата:
On Mon, Jan 23, 2023 at 07:28:06PM -0800, Andres Freund wrote:
> After a tiny bit further polishing, and after separately pushing a resource
> leak fix for walrcv_connect(), I pushed this.

My colleague Robins Tharakan (CC'd) noticed crashes when testing recent
commits, and he traced it back to e460248.  From this information, I
discovered the following typo:

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 8982d623d3..78a8bcee6e 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS)
     else
     {
         if (pconn->conn)
-            libpqsrv_disconnect(conn);
+            libpqsrv_disconnect(pconn->conn);
         pconn->conn = conn;
     }

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

On 2023-01-30 11:30:08 -0800, Nathan Bossart wrote:
> On Mon, Jan 23, 2023 at 07:28:06PM -0800, Andres Freund wrote:
> > After a tiny bit further polishing, and after separately pushing a resource
> > leak fix for walrcv_connect(), I pushed this.
> 
> My colleague Robins Tharakan (CC'd) noticed crashes when testing recent
> commits, and he traced it back to e460248.  From this information, I
> discovered the following typo:
> 
> diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
> index 8982d623d3..78a8bcee6e 100644
> --- a/contrib/dblink/dblink.c
> +++ b/contrib/dblink/dblink.c
> @@ -321,7 +321,7 @@ dblink_connect(PG_FUNCTION_ARGS)
>      else
>      {
>          if (pconn->conn)
> -            libpqsrv_disconnect(conn);
> +            libpqsrv_disconnect(pconn->conn);
>          pconn->conn = conn;
>      }

Ugh. Good catch.

Why don't the dblink tests catch this?  Any chance you or Robins could prepare
a patch with fix and test, given that you know how to trigger this?

Greetings,

Andres Freund



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Nathan Bossart
Дата:
On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote:
> Why don't the dblink tests catch this?  Any chance you or Robins could prepare
> a patch with fix and test, given that you know how to trigger this?

It's trivially reproducible by calling 1-argument dblink_connect() multiple
times and then calling dblink_disconnect().  Here's a patch.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Вложения

Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Robins Tharakan
Дата:
Hi Andres,

On Tue, 31 Jan 2023 at 06:31, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote:
> > Why don't the dblink tests catch this?  Any chance you or Robins could prepare
> > a patch with fix and test, given that you know how to trigger this?
>
> It's trivially reproducible by calling 1-argument dblink_connect() multiple
> times and then calling dblink_disconnect().  Here's a patch.
>

My test instance has been running Nathan's patch for the past
few hours, and it looks like this should resolve the issue.

A little bit of background, since the past few days I was
noticing frequent crashes on a test instance. They're not simple
to reproduce manually, but if left running I can reliably see
crashes on an ~hourly basis.

In trying to trace the origin, I ran a multiple-hour test for
each commit going back a few commits and noticed that the
crashes stopped prior to commit e4602483e9, which is when the
issue became visible.

The point is now moot, but let me know if full backtraces still
help (I was concerned looking at the excerpts from some
of the crashes):

"double free or corruption (out)"
"munmap_chunk(): invalid pointer"
"free(): invalid pointer"
str->data[0] = '\0';


Some backtraces
###############

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 6c6d6ba3ee@master@sqith: u21 postgres
127.0.0.1(45334) SELECT       '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102
#0  __GI___libc_free (mem=0x312f352f65736162) at malloc.c:3102
#1  0x00007fc0062dfefd in pqDropConnection (conn=0x564bb3e1a080,
    flushInput=true) at fe-connect.c:495
#2  0x00007fc0062e5cb3 in closePGconn (conn=0x564bb3e1a080)
    at fe-connect.c:4112
#3  0x00007fc0062e5d55 in PQfinish (conn=0x564bb3e1a080) at fe-connect.c:4134
#4  0x00007fc0061a442b in libpqsrv_disconnect (conn=0x564bb3e1a080)
    at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#5  0x00007fc0061a4df1 in dblink_disconnect (fcinfo=0x564bb3d980f0)
    at dblink.c:357
#6  0x0000564bb0e70aa7 in ExecInterpExpr (state=0x564bb3d98018,
    econtext=0x564bb3d979a0, isnull=0x7ffd60824b0f) at execExprInterp.c:728
#7  0x0000564bb0e72f36 in ExecInterpExprStillValid (state=0x564bb3d98018,
    econtext=0x564bb3d979a0, isNull=0x7ffd60824b0f) at execExprInterp.c:1838

============


Core was generated by `postgres: 6c6d6ba3ee@master@sqith: u52 postgres
127.0.0.1(58778) SELECT       '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fc021792859 in __GI_abort () at abort.c:79
#2  0x00007fc0217fd26e in __libc_message (action=action@entry=do_abort,
    fmt=fmt@entry=0x7fc021927298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007fc0218052fc in malloc_printerr (
    str=str@entry=0x7fc021929670 "double free or corruption (out)")
    at malloc.c:5347
#4  0x00007fc021806fa0 in _int_free (av=0x7fc02195cb80 <main_arena>,
    p=0x7fc02195cbf0 <main_arena+112>, have_lock=<optimized out>)
    at malloc.c:4314
#5  0x00007fc0062e16ed in freePGconn (conn=0x564bb6e36b80)
    at fe-connect.c:3977
#6  0x00007fc0062e1d61 in PQfinish (conn=0x564bb6e36b80) at fe-connect.c:4135
#7  0x00007fc0062a142b in libpqsrv_disconnect (conn=0x564bb6e36b80)
    at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#8  0x00007fc0062a1df1 in dblink_disconnect (fcinfo=0x564bb5647b58)
    at dblink.c:357

============

Program terminated with signal SIGSEGV, Segmentation fault.
#0  resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153
153                             str->data[0] = '\0';
#0  resetPQExpBuffer (str=0x559d3af0e838) at pqexpbuffer.c:153
#1  0x00007f2240b0a876 in PQexecStart (conn=0x559d3af0e410) at fe-exec.c:2320
#2  0x00007f2240b0a688 in PQexec (conn=0x559d3af0e410,
    query=0x559d56fb8ee8 "p3") at fe-exec.c:2227
#3  0x00007f223ba8c7e4 in dblink_exec (fcinfo=0x559d3b101f58) at dblink.c:1432
#4  0x0000559d2f003c82 in ExecInterpExpr (state=0x559d3b101ac0,
    econtext=0x559d34e76578, isnull=0x7ffe3d590fdf) at execExprInterp.c:752


============

Core was generated by `postgres: 728f86fec6@(HEAD detached at
728f86fec6)@sqith: u19 postgres 127.0.0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f96f5fc6e99 in SSL_shutdown ()
   from /lib/x86_64-linux-gnu/libssl.so.1.1
#0  0x00007f96f5fc6e99 in SSL_shutdown ()
   from /lib/x86_64-linux-gnu/libssl.so.1.1
#1  0x00007f96da56027a in pgtls_close (conn=0x55d919752fb0)
    at fe-secure-openssl.c:1555
#2  0x00007f96da558e41 in pqsecure_close (conn=0x55d919752fb0)
    at fe-secure.c:192
#3  0x00007f96da53dd12 in pqDropConnection (conn=0x55d919752fb0,
    flushInput=true) at fe-connect.c:449
#4  0x00007f96da543cb3 in closePGconn (conn=0x55d919752fb0)
    at fe-connect.c:4112
#5  0x00007f96da543d55 in PQfinish (conn=0x55d919752fb0) at fe-connect.c:4134
#6  0x00007f96d9ebd42b in libpqsrv_disconnect (conn=0x55d919752fb0)
    at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7  0x00007f96d9ebddf1 in dblink_disconnect (fcinfo=0x55d91f2692a8)
    at dblink.c:357



============

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f5f6b632859 in __GI_abort () at abort.c:79
#2  0x00007f5f6b69d26e in __libc_message (action=action@entry=do_abort,
    fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f5f6b6a52fc in malloc_printerr (
    str=str@entry=0x7f5f6b7c91e0 "munmap_chunk(): invalid pointer")
    at malloc.c:5347
#4  0x00007f5f6b6a554c in munmap_chunk (p=<optimized out>) at malloc.c:2830
#5  0x00007f5f50085efd in pqDropConnection (conn=0x55d12ebcd100,
    flushInput=true) at fe-connect.c:495
#6  0x00007f5f5008bcb3 in closePGconn (conn=0x55d12ebcd100)
    at fe-connect.c:4112
#7  0x00007f5f5008bd55 in PQfinish (conn=0x55d12ebcd100) at fe-connect.c:4134
#8  0x00007f5f5006c42b in libpqsrv_disconnect (conn=0x55d12ebcd100)
    at ../../src/include/libpq/libpq-be-fe-helpers.h:117


============

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f5f6b632859 in __GI_abort () at abort.c:79
#2  0x00007f5f6b69d26e in __libc_message (action=action@entry=do_abort,
    fmt=fmt@entry=0x7f5f6b7c7298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007f5f6b6a52fc in malloc_printerr (
    str=str@entry=0x7f5f6b7c54c1 "free(): invalid pointer") at malloc.c:5347
#4  0x00007f5f6b6a6b2c in _int_free (av=<optimized out>, p=<optimized out>,
    have_lock=0) at malloc.c:4173
#5  0x00007f5f500fe6ed in freePGconn (conn=0x55d142273000)
    at fe-connect.c:3977
#6  0x00007f5f500fed61 in PQfinish (conn=0x55d142273000) at fe-connect.c:4135
#7  0x00007f5f501de42b in libpqsrv_disconnect (conn=0x55d142273000)
    at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#8  0x00007f5f501dedf1 in dblink_disconnect (fcinfo=0x55d1527998f8)
    at dblink.c:357


============


Core was generated by `postgres: e4602483e9@(HEAD detached at
e4602483e9)@sqith: u73 postgres 127.0.0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335)
    at malloc.c:3154
#0  __GI___libc_realloc (oldmem=0x7f7f7f7f7f7f7f7f, bytes=2139070335)
    at malloc.c:3154
#1  0x00007fb7bc0a580a in pqCheckOutBufferSpace (bytes_needed=2139062148,
    conn=0x55b191aa9380) at fe-misc.c:329
#2  0x00007fb7bc0a5b1c in pqPutMsgStart (msg_type=88 'X', conn=0x55b191aa9380)
    at fe-misc.c:476
#3  0x00007fb7bc097c60 in sendTerminateConn (conn=0x55b191aa9380)
    at fe-connect.c:4076
#4  0x00007fb7bc097c97 in closePGconn (conn=0x55b191aa9380)
    at fe-connect.c:4096
#5  0x00007fb7bc097d55 in PQfinish (conn=0x55b191aa9380) at fe-connect.c:4134
#6  0x00007fb7bc14a42b in libpqsrv_disconnect (conn=0x55b191aa9380)
    at ../../src/include/libpq/libpq-be-fe-helpers.h:117
#7  0x00007fb7bc14adf1 in dblink_disconnect (fcinfo=0x55b193894f00)
    at dblink.c:357

============

Thanks to SQLSmith for helping with this find.

-
Robins Tharakan
Amazon Web Services



Re: postgres_fdw uninterruptible during connection establishment / ProcSignalBarrier

От
Andres Freund
Дата:
Hi,

On 2023-01-30 12:00:55 -0800, Nathan Bossart wrote:
> On Mon, Jan 30, 2023 at 11:49:37AM -0800, Andres Freund wrote:
> > Why don't the dblink tests catch this?  Any chance you or Robins could prepare
> > a patch with fix and test, given that you know how to trigger this?
> 
> It's trivially reproducible by calling 1-argument dblink_connect() multiple
> times and then calling dblink_disconnect().  Here's a patch.

Thanks for the quick patch and for the find. Pushed.

Greetings,

Andres