Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
From | hubert depesz lubaczewski |
---|---|
Subject | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date | |
Msg-id | Y+ZVPHHcYirQDgJF@depesz.com |
In reply to | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-bugs |
Hi,

so, we have another bit of interesting information. Maybe related, maybe not.

We noticed a weird situation on two clusters we're trying to upgrade. In both cases the situation looked the same:

1. there was another process (debezium) connected to the source (pg12) using logical replication
2. pg12 -> pg14 replication failed with the message 'ERROR: requested WAL segment ... has already been '
3. some time afterwards (most likely a couple of hours) the process that is/was responsible for debezium replication (pg process) stopped handling WAL, and instead is eating 100% of cpu.

When this situation happens, we can't pg_cancel_backend(pid) the "broken" wal sender, and it also can't be stopped with pg_terminate_backend()!

strace of the process doesn't show anything.

When I tried to get a backtrace from gdb, all I got was:

    (gdb) bt
    #0 0x0000aaaad270521c in hash_seq_search ()
    #1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
    #2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    #3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    #4 0x0000aaaad257764c in ReorderBufferCommit ()
    #5 0x0000aaaad256c804 in ?? ()
    #6 0x0000aaaaf303d280 in ?? ()

If I quit gdb, restart it, and redo bt, I get:

    #0 0x0000ffff806c81a8 in hash_seq_search@plt () from /usr/lib/postgresql/12/lib/pgoutput.so
    #1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
    #2 0x0000aaaad291ae58 in ?? ()

or

    #0 0x0000aaaad2705244 in hash_seq_search ()
    #1 0x0000ffff806c86cc in ?? () from /usr/lib/postgresql/12/lib/pgoutput.so
    #2 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    #3 0x0000aaaad26e3644 in CallSyscacheCallbacks ()
    #4 0x0000aaaad257764c in ReorderBufferCommit ()
    #5 0x0000aaaad256c804 in ?? ()
    #6 0x0000aaaaf303d280 in ?? ()

At this moment, the only thing that we can do is kill -9 the process (or restart pg).

I don't know if it's relevant, but I have this case *right now*, and if it's helpful I can provide more information before we have to kill it.

Best regards,

depesz