Обсуждение: PostgreSQL 16.6 , query stuck with STAT Ssl, wait_event_type : IPC , wait_event : ParallelFinish

Поиск
Список
Период
Сортировка

Hi

a query is stuck with the above, it seems it waits for parallel worker to finish, however , there are no parallel works running :

postgres@[local]/dynacom=# SELECT application_name, backend_type, backend_start,xact_start,query_start,wait_event_type, wait_event ,state FROM pg_stat_activity;
application_name |         backend_type         |         backend_start         |          xact_start           |          query_start          | wait_event_type |     wait_event      | state   
------------------+------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+--------
                 | autovacuum launcher          | 2024-11-29 17:48:50.92935+02  |                               |                               | Activity        | AutoVacuumMain      |  
                 | logical replication launcher | 2024-11-29 17:48:50.929496+02 |                               |                               | Activity        | LogicalLauncherMain |  
DBMIRROR         | client backend               | 2025-05-31 19:04:16.724305+03 | 2025-05-31 19:05:21.686093+03 | 2025-05-31 19:05:21.909936+03 | IPC             | ParallelFinish      | active
                 | client backend               | 2025-05-31 23:31:30.030806+03 |                               | 2025-05-31 23:35:05.045573+03 | Client          | ClientRead          | idle
psql             | client backend               | 2025-05-31 23:29:33.863485+03 | 2025-05-31 23:35:09.322972+03 | 2025-05-31 23:35:09.322972+03 |                 |                     | active
RXMLFVSLS        | client backend               | 2025-05-31 23:32:37.351131+03 |                               | 2025-05-31 23:35:09.295221+03 | Client          | ClientRead          | idle
psql             | client backend               | 2025-04-28 16:59:55.968442+03 |                               | 2025-05-27 16:43:56.338228+03 | Client          | ClientRead          | idle
                 | background writer            | 2024-11-29 17:48:50.916876+02 |                               |                               | Activity        | BgWriterMain        |  
                 | archiver                     | 2024-12-03 18:57:36.447067+02 |                               |                               | Activity        | ArchiverMain        |  
                 | checkpointer                 | 2024-11-29 17:48:50.916648+02 |                               |                               | Activity        | CheckpointerMain    |  
                 | walwriter                    | 2024-11-29 17:48:50.928789+02 |                               |                               | Activity        | WalWriterMain       |  
(11 rows)

postgres@[local]/dynacom=#

So, I will terminate this backend now to get the system working again, we are curious why this happened, our system serves daily 22M+ transactions, this is Saturday night hence the low traffic.

postgres@smadb:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm
postgres@smadb:~$ psql -Aqt -c 'select version()'
PostgreSQL 16.6 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
postgres@smadb:~$


On 31/5/25 23:43, Achilleas Mantzios wrote:

Hi

a query is stuck with the above, it seems it waits for parallel worker to finish, however , there are no parallel works running :

postgres@[local]/dynacom=# SELECT application_name, backend_type, backend_start,xact_start,query_start,wait_event_type, wait_event ,state FROM pg_stat_activity;
application_name |         backend_type         |         backend_start         |          xact_start           |          query_start          | wait_event_type |     wait_event      | state   
------------------+------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+---------------------+--------
                 | autovacuum launcher          | 2024-11-29 17:48:50.92935+02  |                               |                               | Activity        | AutoVacuumMain      |  
                 | logical replication launcher | 2024-11-29 17:48:50.929496+02 |                               |                               | Activity        | LogicalLauncherMain |  
DBMIRROR         | client backend               | 2025-05-31 19:04:16.724305+03 | 2025-05-31 19:05:21.686093+03 | 2025-05-31 19:05:21.909936+03 | IPC             | ParallelFinish      | active
                 | client backend               | 2025-05-31 23:31:30.030806+03 |                               | 2025-05-31 23:35:05.045573+03 | Client          | ClientRead          | idle
psql             | client backend               | 2025-05-31 23:29:33.863485+03 | 2025-05-31 23:35:09.322972+03 | 2025-05-31 23:35:09.322972+03 |                 |                     | active
RXMLFVSLS        | client backend               | 2025-05-31 23:32:37.351131+03 |                               | 2025-05-31 23:35:09.295221+03 | Client          | ClientRead          | idle
psql             | client backend               | 2025-04-28 16:59:55.968442+03 |                               | 2025-05-27 16:43:56.338228+03 | Client          | ClientRead          | idle
                 | background writer            | 2024-11-29 17:48:50.916876+02 |                               |                               | Activity        | BgWriterMain        |  
                 | archiver                     | 2024-12-03 18:57:36.447067+02 |                               |                               | Activity        | ArchiverMain        |  
                 | checkpointer                 | 2024-11-29 17:48:50.916648+02 |                               |                               | Activity        | CheckpointerMain    |  
                 | walwriter                    | 2024-11-29 17:48:50.928789+02 |                               |                               | Activity        | WalWriterMain       |  
(11 rows)

postgres@[local]/dynacom=#

So, I will terminate this backend now to get the system working again, we are curious why this happened, our system serves daily 22M+ transactions, this is Saturday night hence the low traffic.

postgres@smadb:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm
postgres@smadb:~$ psql -Aqt -c 'select version()'
PostgreSQL 16.6 on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
postgres@smadb:~$


Some additional info :

1690535 : the pid in question

we found no trace or indication for OOM killer ,

root@smadb:~# strace -p 1690535      
strace: Process 1690535 attached

epoll_wait(12,

also lsof showed :

postgres  1690535 1690649 Sweeper           postgres    0r      CHR                1,3        0t0          4 /dev/null
postgres  1690535 1690649 Sweeper           postgres    1w     FIFO               0,14        0t0  213624178 pipe
postgres  1690535 1690649 Sweeper           postgres    2w     FIFO               0,14        0t0  213624178 pipe
postgres  1690535 1690649 Sweeper           postgres    3u  a_inode               0,15          0       1059 [signalfd]
postgres  1690535 1690649 Sweeper           postgres    4r     FIFO               0,14        0t0  213624177 pipe
postgres  1690535 1690649 Sweeper           postgres    5u  a_inode               0,15          0       1059 [eventpoll:3,4,11]
postgres  1690535 1690649 Sweeper           postgres    6u      REG               8,32       8192  157352475 /raid4/pgsql/data/PG_16_202
307071/207491653/2601
postgres  1690535 1690649 Sweeper           postgres    7u      REG               8,32     450560  157352631 /raid4/pgsql/data/PG_16_202
307071/207491653/207493206
postgres  1690535 1690649 Sweeper           postgres    8u      REG               8,32      40960  157356847 /raid4/pgsql/data/PG_16_202
307071/207491653/207503536
postgres  1690535 1690649 Sweeper           postgres    9u      REG               8,32      40960  157356848 /raid4/pgsql/data/PG_16_202
307071/207491653/207503538
postgres  1690535 1690649 Sweeper           postgres   10u      REG               8,32      40960  157357848 /raid4/pgsql/data/PG_16_202
307071/207491653/207504627
postgres  1690535 1690649 Sweeper           postgres   11u     IPv4         1241927029        0t0        TCP smadb.internal.net:postgres
ql->sma.internal.net:42615 (ESTABLISHED)

postgres  1690535 1690649 Sweeper           postgres   12u  a_inode               0,15          0       1059 [eventpoll:3,4]

So , as far as we understand, it waited for an inode ?

I tried to pg_terminate_backend(1690535); it did nothing to the process, then pg_cancel_backend(1690535);

I went to the shell and did a normal 

kill 1690535

it did nothing, and before i kill -9 it, I just tried kill -HUP, and it woke up with

10.9.0.10(42615) [1690535] 683b2880.19cba7 2025-05-31 23:53:48.231 EEST DBMIRROR postgres@dynacom line:4 FATAL:  terminating connection due to administrator command
10.9.0.10(42615) [1690535] 683b2880.19cba7 2025-05-31 23:53:48.231 EEST DBMIRROR postgres@dynacom line:5 STATEMENT:  SELECT pd.XID,MAX(SeqId) FROM dbmirror_Pending
pd LEFT JOIN dbmirror_MirroredTransaction mt INNER JOIN dbmirror_MirrorHost mh ON mt.MirrorHostId =  mh.MirrorHostId AND mh.HostName= '192.168.211.1'  ON pd.XID =
mt.XID WHERE mt.XID is null and (pd.slaveid is null or pd.slaveid = '579')  GROUP BY pd.XID  ORDER BY MAX(pd.SeqId)
10.9.0.10(42615) [1690535] 683b2880.19cba7 2025-05-31 23:53:48.234 EEST DBMIRROR postgres@dynacom line:6 LOG:  disconnection: session time: 4:49:31.510 user=postgr
es database=dynacom host=10.9.0.10 port=42615


Achilleas Mantzios <a.mantzios@cloud.gatewaynet.com> writes:
>> a query is stuck with the above, it seems it waits for parallel worker 
>> to finish, however , there are no parallel works running :

You didn't explain the subject about "STAT Ssl", but if you mean
that that was what ps was showing for the backend process, there's
something very wrong there.  According to "man ps", the "l" means

               l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads
                    do)

which is something that a Postgres backend should never be
(in existing releases anyway).  So I'm speculating that
the process somehow became multi-threaded and then some
wakeup signal went to the wrong thread.

We've had issues with perl or python introducing multi-threading
because of plperl or plpython functions doing things they
probably shouldn't.  Do you have any of those in your system?

            regards, tom lane



On 1/6/25 07:19, Tom Lane wrote:

Achilleas Mantzios <a.mantzios@cloud.gatewaynet.com> writes:
a query is stuck with the above, it seems it waits for parallel worker 
to finish, however , there are no parallel works running :
You didn't explain the subject about "STAT Ssl", but if you mean
that that was what ps was showing for the backend process, there's
something very wrong there.  According to "man ps", the "l" means
               l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads                    do)
Yes. sorry, I didn't include this info, you are spot on, yes this the output of ps aux .
which is something that a Postgres backend should never be
(in existing releases anyway).  So I'm speculating that
the process somehow became multi-threaded and then some
wakeup signal went to the wrong thread.

We've had issues with perl or python introducing multi-threading
because of plperl or plpython functions doing things they
probably shouldn't.  Do you have any of those in your system?

Yes we have two perl functions only that I'd be happy to get rid off :

postgres@[local]/dynacom=# select p.proname, l.lanname from pg_language l, pg_proc p where p.prolang=l.oid and l.lanname ~* '.*perl.*';
proname  | lanname  
----------+---------
basename | plperlu
filetype | plperlu
(2 rows)

Nothing used in the app, just some two utility functions to help us batch insert some attachments, guess mimetype etc. However the calling client is Perl , based on libpg-perl (not DBI), basically this is a descendant of DBMirror.pl (we are still using it).

The strange thing is that we run pgsql 16.* since November, also we run our version of DBMirror since 2005 (and PostgreSQL since 2001) and we never had this problem before (at least from what I know).

			regards, tom lane

Hi ,

we had the same problem today again.

postgres@[local]/dynacom=# select * from pg_stat_activity where application_name~*'dbmirr';
-[ RECORD 1 ]----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------
datid            | 207491653
datname          | dynacom
pid              | 1821681
leader_pid       |  
usesysid         | 10
usename          | postgres
application_name | DBMIRROR
client_addr      | 10.9.0.10
client_hostname  |  
client_port      | 45051
backend_start    | 2025-08-22 03:58:32.321683+03
xact_start       | 2025-08-22 04:06:08.897252+03
query_start      | 2025-08-22 04:06:09.254048+03
state_change     | 2025-08-22 04:06:09.254049+03
wait_event_type  | IPC
wait_event       | ParallelFinish
state            | active
backend_xid      |  
backend_xmin     | 222705697
query_id         | -3929522546936394707
query            | SELECT pd.XID,MAX(SeqId) FROM dbmirror_Pending pd LEFT JOIN dbmirror_MirroredTransaction mt INNER JOIN dbmirror_MirrorHost mh ON mt.MirrorHostId =  mh.MirrorHostId AND m
h.HostName= '192.168.145.1'  ON pd.XID = mt.XID WHERE mt.XID is null and (pd.slaveid is null or pd.slaveid = '4826')  GROUP BY pd.XID  ORDER BY MAX(pd.SeqId)
backend_type     | client backend

postgres@smadb:~$ ps -u -p 1821681
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres 1821681  0.5  4.8 37111844 3177260 ?    Ssl  03:58   2:25 postgres: postgres dynacom 10.9.0.10(45051) SELECT
postgres@smadb:~$

Again the process is stuck with this Ssl state.

Again,
postgres@smadb:~$ strace -p 1821681
strace: Process 1821681 attached
epoll_wait(12,

postgres@smadb:~$ sudo lsof -p 1821681 | egrep -e ' 12u| 3. | 4. '  
postgres 1821681 postgres  DEL       REG                0,1                    33 /SYSV00041c60
postgres 1821681 postgres    3u  a_inode               0,15          0       1059 [signalfd]
postgres 1821681 postgres    4r     FIFO               0,14        0t0 1292545666 pipe
postgres 1821681 postgres   12u  a_inode               0,15          0       1059 [eventpoll:3,4]

Again, pg_cancel_backend, pg_terminate_backend, kill pid (-TERM) did nothing. Only -HUP killed it .

10.9.0.10(45051) [1821681] 68a7c0b8.1bcbf1 2025-08-22 11:09:32.517 EEST DBMIRROR postgres@dynacom line:26 FATAL:  terminating connection due to administrator command
10.9.0.10(45051) [1821681] 68a7c0b8.1bcbf1 2025-08-22 11:09:32.517 EEST DBMIRROR postgres@dynacom line:27 STATEMENT:  SELECT pd.XID,MAX(SeqId) FROM dbmirror_Pending pd LEFT JOIN dbmirror
_MirroredTransaction mt INNER JOIN dbmirror_MirrorHost mh ON mt.MirrorHostId =  mh.MirrorHostId AND mh.HostName= '192.168.145.1'  ON pd.XID = mt.XID WHERE mt.XID is null and (pd.slaveid
is null or pd.slaveid = '4826')  GROUP BY pd.XID  ORDER BY MAX(pd.SeqId)
10.9.0.10(45051) [1821681] 68a7c0b8.1bcbf1 2025-08-22 11:09:32.519 EEST DBMIRROR postgres@dynacom line:28 LOG:  disconnection: session time: 7:11:00.197 user=postgres database=dynacom ho
st=10.9.0.10 port=45051



The client is written in PERL (a variant of the old DBMIRROR perl), and calls nothing but pure SQL, no perl or python functions.

This program is called 38592 times per day and the previous such incident was on 1st June, so 2,392,704 times between the two incidents.


On 6/1/25 06:32, Achilleas Mantzios wrote:

On 1/6/25 07:19, Tom Lane wrote:

Achilleas Mantzios <a.mantzios@cloud.gatewaynet.com> writes:
a query is stuck with the above, it seems it waits for parallel worker 
to finish, however , there are no parallel works running :
You didn't explain the subject about "STAT Ssl", but if you mean
that that was what ps was showing for the backend process, there's
something very wrong there.  According to "man ps", the "l" means
               l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads                    do)
Yes. sorry, I didn't include this info, you are spot on, yes this the output of ps aux .
which is something that a Postgres backend should never be
(in existing releases anyway).  So I'm speculating that
the process somehow became multi-threaded and then some
wakeup signal went to the wrong thread.

We've had issues with perl or python introducing multi-threading
because of plperl or plpython functions doing things they
probably shouldn't.  Do you have any of those in your system?

Yes we have two perl functions only that I'd be happy to get rid off :

postgres@[local]/dynacom=# select p.proname, l.lanname from pg_language l, pg_proc p where p.prolang=l.oid and l.lanname ~* '.*perl.*';
proname  | lanname  
----------+---------
basename | plperlu
filetype | plperlu
(2 rows)

Nothing used in the app, just some two utility functions to help us batch insert some attachments, guess mimetype etc. However the calling client is Perl , based on libpg-perl (not DBI), basically this is a descendant of DBMirror.pl (we are still using it).

The strange thing is that we run pgsql 16.* since November, also we run our version of DBMirror since 2005 (and PostgreSQL since 2001) and we never had this problem before (at least from what I know).

			regards, tom lane
On Fri, 2025-08-22 at 09:16 +0100, Achilleas Mantzios wrote:
> we had the same problem today again.
>
> postgres@[local]/dynacom=# select * from pg_stat_activity where application_name~*'dbmirr';
>  -[ RECORD 1 ]----+-----------------------------------------------------------------------------------------------
> [...]
>  pid              | 1821681
> [...]
>  wait_event_type  | IPC
>  wait_event       | ParallelFinish
>  state            | active
>
> postgres@smadb:~$ ps -u -p 1821681
>  USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>  postgres 1821681  0.5  4.8 37111844 3177260 ?    Ssl  03:58   2:25 postgres: postgres dynacom 10.9.0.10(45051)
SELECT 

Did you check whether the Perl function you mentioned before starts threads,
like Tom suggested?  That would be the probable cause, and the solution is
not to start any threads in a PostgreSQL function.

Yours,
Laurenz Albe




On 8/22/25 09:29, Laurenz Albe wrote:
On Fri, 2025-08-22 at 09:16 +0100, Achilleas Mantzios wrote:
we had the same problem today again.

postgres@[local]/dynacom=# select * from pg_stat_activity where application_name~*'dbmirr'; 
 -[ RECORD 1 ]----+-----------------------------------------------------------------------------------------------
[...]
 pid              | 1821681 
[...]
 wait_event_type  | IPC 
 wait_event       | ParallelFinish 
 state            | active 

postgres@smadb:~$ ps -u -p 1821681 
 USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND 
 postgres 1821681  0.5  4.8 37111844 3177260 ?    Ssl  03:58   2:25 postgres: postgres dynacom 10.9.0.10(45051) SELECT 
Did you check whether the Perl function you mentioned before starts threads,
like Tom suggested?  That would be the probable cause, and the solution is
not to start any threads in a PostgreSQL function.

Hi not att all , here is the code :

Main();

sub Main() {
 my $batchTxMode = 0;
  
#run the configuration file.
 #if ($#ARGV != 0) {
 if ($#ARGV < 0) {
   die "usage: DBMirror.pl configFile\n";
 }
 elsif ($#ARGV == 1) {
   if ($ARGV[1] eq "batch") { ##run scp at the end of dbmirror.pl main loop
     $batchTxMode = 1;
   }
   elsif ($ARGV[1] eq "batchall") { ##run scp at the end of dbmirror.sh wrapper script
     $batchTxMode = 2;
   }
   else {
     die "usage: DBMirror.pl configFile [batch|batchall]\n";
   }
 }
 if( ! defined do $ARGV[0]) {
   logErrorMessage("Invalid Configuration file $ARGV[0]");
   die;
 }

 
 #my $connectString = "host=$::masterHost dbname=$::masterDb user=$::masterUser password=$::masterPassword";
 my $connectString = "host=localhost port=6432 dbname=$::masterDb user=$::masterUser password=$::masterPassword";
  
 $masterConn = Pg::connectdb($connectString);
  
 unless($masterConn->status == PGRES_CONNECTION_OK) {
   logErrorMessage("Can't connect to master database\n" .
                   $masterConn->errorMessage);
   die;
 }
    
 my $setQuery;
 $setQuery = "SET search_path = public; SET application_name = 'DBMIRROR'";
 my $setResult = $masterConn->exec($setQuery);
 if($setResult->resultStatus!=PGRES_COMMAND_OK) {  
   logErrorMessage($masterConn->errorMessage . "\n" .  
                   $setQuery);
   die;
 }
    
 my $setQuery2;
 $setQuery2 = "BEGIN";
 my $setResult2 = $masterConn->exec($setQuery2);
 if($setResult2->resultStatus!=PGRES_COMMAND_OK) {  
   logErrorMessage($masterConn->errorMessage . "\n" .  
                   $setQuery2);
   die;
 }
    
    
   setupSlave($::slaveInfo);
#print $::slaveInfo->{"uucpnode"} . "\n";
#LOCK CODE!!!!
   my $pendingLockQuery = "SELECT 1 FROM dbmirror_Pending pd";
   $pendingLockQuery .= " LEFT JOIN dbmirror_MirroredTransaction mt INNER JOIN";
   $pendingLockQuery .= " dbmirror_MirrorHost mh ON mt.MirrorHostId = ";
   $pendingLockQuery .= " mh.MirrorHostId AND mh.HostName=";
   $pendingLockQuery .= " '$::slaveInfo->{\"slaveHost\"}' ";  
   $pendingLockQuery .= " ON pd.XID";
   $pendingLockQuery .= " = mt.XID WHERE mt.XID is null and (pd.slaveid is null or pd.slaveid = '$::slaveInfo->{\"MirrorHostId\"}') ";
   $pendingLockQuery .= " FOR UPDATE OF pd ";
    
   my $pendingLockResults = $masterConn->exec($pendingLockQuery);
   unless($pendingLockResults->resultStatus==PGRES_TUPLES_OK) {
     logErrorMessage("Can't query pending table\n" . $masterConn->errorMessage);
     die;
   }
#END LOCK CODE!!!!
   
   #Obtain a list of pending transactions using ordering by our approximation
   #to the commit time.  The commit time approximation is taken to be the
   #SeqId of the last row edit in the transaction.
   my $pendingTransQuery = "SELECT pd.XID,MAX(SeqId) FROM dbmirror_Pending pd";
   $pendingTransQuery .= " LEFT JOIN dbmirror_MirroredTransaction mt INNER JOIN";
   $pendingTransQuery .= " dbmirror_MirrorHost mh ON mt.MirrorHostId = ";
   $pendingTransQuery .= " mh.MirrorHostId AND mh.HostName=";
   $pendingTransQuery .= " '$::slaveInfo->{\"slaveHost\"}' ";  
   $pendingTransQuery .= " ON pd.XID";
   $pendingTransQuery .= " = mt.XID WHERE mt.XID is null and (pd.slaveid is null or pd.slaveid = '$::slaveInfo->{\"MirrorHostId\"}') ";
   $pendingTransQuery .= " GROUP BY pd.XID ";
   $pendingTransQuery .= " ORDER BY MAX(pd.SeqId)";
   
It got stuck inside the second query, after the FOR UPDATE locking  .

I attach the client program, just for completeness. It just queries the three tables :

- dbmirror_pending

- dbmirror_mirroredtransaction

- dbmirror_mirrorhost

first it tries do lock via FOR UPDATE , then queries the tables again.


Yours,
Laurenz Albe
Вложения