Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint
From | Achilleas Mantzios
---|---
Subject | Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint
Date |
Msg-id | 91736c68-3837-cf51-19cb-11fe0f096cef@matrix.gatewaynet.com
In reply to | Re: PostgreSQL 10.5 : Strange pg_wal fill-up, solved with the shutdown checkpoint (Tom Lane <tgl@sss.pgh.pa.us>)
List | pgsql-admin
On 6/11/18 4:58 p.m., Tom Lane wrote:
> Achilleas Mantzios <achill@matrix.gatewaynet.com> writes:
>> Remember: the postgresql checkpointer decided to remove 5000+ files before shutdown. If any conditions were keeping those files afloat, they should also hold at this point, right?
>> The question is why PostgreSQL didn't remove them earlier.
> WAL files get removed/recycled after completion of a checkpoint.  So
> apparently, checkpoints were not finishing during normal operation,
> but the shutdown checkpoint managed to terminate normally.  That
> eliminates a lot of the usual theories about why checkpoints might
> not be succeeding (like a dirty buffer that always fails to be
> written, say as a result of broken permissions on its file).
>
> The only theory that comes to mind is that the checkpointer process
> was stuck somehow, but just "soft" stuck, in a way that allowed the
> postmaster's time-to-shut-down-please signal to unstick it.  No,
> I have no idea how that could happen exactly.  If it happens again,
> it'd be really interesting to attach to the checkpointer with a
> debugger and collect a stack trace.

Hello Tom, I enabled debugging as you suggested.
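For the record, the debugger approach Tom suggests can be sketched roughly as follows (a minimal sketch, not a tested procedure; it assumes gdb is installed, the postgres binary has debug symbols, and the user running it is allowed to ptrace the backend processes):

```shell
# Find the checkpointer's PID by its process title
# ("postgres: checkpointer ..." in ps output).
CKPT_PID=$(pgrep -f 'checkpointer')

# Attach non-interactively, dump a full backtrace of every frame,
# then detach. Keep the output for the list.
gdb --batch -ex 'bt full' -p "$CKPT_PID" > checkpointer_bt.txt
```

Attaching briefly pauses the checkpointer, so this is best done only while the symptom is actually present.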
In the meantime, yesterday (also a Sunday) we had an identical incident as far as symptoms are concerned, only in this (yesterday's) case I fully explained it, and it is reported here: https://www.postgresql.org/message-id/ae8812c3-d138-73b7-537a-a273e15ef6e1%40matrix.gatewaynet.com

What happened is that the walsender died on Friday, the replication slot got stuck, the server kept WAL files, the pg_wal disk filled up (on Sunday morning), and the system PANIC'ed:

10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG: terminating walsender process due to replication timeout
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.052 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT: slot "data_for_testsmadb_pub", output plugin "pgoutput", in the change callback, associated LSN 13DF/393BF7F0
10.9.0.77(48650) [65253] 5be2ca1d.fee5 2018-11-09 15:06:11.066 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG: disconnection: session time: 49:47:17.937 user=repmgr database=dynacom host=10.9.0.77 port=48650

(so this problem appeared 2 days after the walsender died)

So, I tried to find evidence that would connect the two incidents (this Sunday and the previous one), but could not find anything in either the publisher's (primary's) or the subscriber's logs matching a walsender or logical replication worker terminating or exiting.
*but* I found this in the primary's log, back from 2018-11-01:

10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.424 EET data_for_testsmadb_pub repmgr@dynacom line:9 LOG: could not send data to client: Connection reset by peer
10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.424 EET data_for_testsmadb_pub repmgr@dynacom line:10 CONTEXT: slot "data_for_testsmadb_pub", output plugin "pgoutput", in the commit callback, associated LSN 1290/6ADB518
10.9.0.77(47489) [73151] 5bd96f82.11dbf 2018-11-01 16:51:58.434 EET data_for_testsmadb_pub repmgr@dynacom line:11 LOG: disconnection: session time: 29:50:03.592 user=repmgr database=dynacom host=10.9.0.77 port=47489
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.476 EET [unknown] [unknown]@[unknown] line:1 LOG: connection received: host=10.9.0.77 port=47612
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.477 EET [unknown] repmgr@dynacom line:2 LOG: replication connection authorized: user=repmgr
10.9.0.77(47612) [126667] 5bdb130e.1eecb 2018-11-01 16:51:58.478 EET data_for_testsmadb_pub repmgr@dynacom line:3 LOG: received replication command: IDENTIFY_SYSTEM

...... the above pattern repeats 9 more times

(so this problem appeared 3 days after this connection reset)

In this case the walsender did not terminate, but I wonder whether it caused this "soft" stuck state in the replication slot? On this primary we have only one replication slot (the one to the logical subscriber node), and archiving has worked without problems for years. So I cannot think of another reason why WAL files were accumulating in pg_wal.

> regards, tom lane

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt
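To catch a stuck slot before pg_wal fills the disk, one could monitor how much WAL each slot forces the server to retain (a hedged sketch; `dynacom` is the database name taken from the logs above, and `pg_current_wal_lsn`/`pg_wal_lsn_diff` are the PostgreSQL 10 names of these functions):

```shell
# For every replication slot, show whether it is active and how much WAL
# the server must keep in pg_wal because of that slot's restart_lsn.
psql -d dynacom -c "
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
         AS retained_wal
FROM pg_replication_slots;"
```

An inactive slot whose retained_wal keeps growing is exactly the situation described above; alerting on that number would have flagged the problem on Friday instead of Sunday.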
In the pgsql-admin list by send date:

Previous
From: Achilleas Mantzios
Date:
Message: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"