Discussion: Standby got invalid primary checkpoint after crashed right after promoted.
Hi, pgsql-hackers,

I think I found a case where the database is not recoverable. Would you please take a look?

Here is how it happens:

- Set up a primary and a standby.
- Do a lot of INSERTs on the primary.
- Create a checkpoint on the primary.
- Wait until the standby starts a restart point; it takes about 3 minutes of syncing buffers to complete.
- Before the restart point updates ControlFile, promote the standby. That changes ControlFile->state to DB_IN_PRODUCTION, which causes the update to ControlFile to be skipped, leaving ControlFile->checkPoint pointing to a removed file.
- Before the promoted standby requests the post-recovery checkpoint (fast promotion), one backend crashes. That kills the other server processes, so the post-recovery checkpoint is skipped.
- The database restarts the startup process, which reports: "could not locate a valid checkpoint record"

I attached a test to reproduce it. It does not fail every time; for me it fails about once in every 10 runs. To increase the chance that CreateRestartPoint skips the ControlFile update, and to simulate a crash, patch 0001 is needed.

Best regards,
Harry Hao
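The sequence in the report can be modelled with a toy sketch. This is not PostgreSQL code: `ToyStandby`, `create_restart_point`, `crash_recovery`, `run`, and the integer "segments" are hypothetical names invented for illustration. The `IN_ARCHIVE_RECOVERY` state is an assumption about the pre-promotion state, and the only behaviour modelled is what the report describes: the control file update is skipped once the state is DB_IN_PRODUCTION, while old WAL is still removed.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class DBState(Enum):
    IN_ARCHIVE_RECOVERY = auto()  # assumed state of a standby before promotion
    IN_PRODUCTION = auto()        # corresponds to DB_IN_PRODUCTION in the report


@dataclass
class ControlFile:
    state: DBState
    checkpoint_segment: int  # toy stand-in for ControlFile->checkPoint


@dataclass
class ToyStandby:
    control: ControlFile
    wal_segments: set = field(default_factory=set)

    def promote(self):
        self.control.state = DBState.IN_PRODUCTION

    def create_restart_point(self, new_segment, promote_midway=False):
        # The slow buffer-sync phase happens here; promotion may land inside it.
        if promote_midway:
            self.promote()
        # The control file is only advanced while still in recovery.
        if self.control.state is DBState.IN_ARCHIVE_RECOVERY:
            self.control.checkpoint_segment = new_segment
        # Old WAL is removed either way, as described in the report.
        self.wal_segments = {s for s in self.wal_segments if s >= new_segment}

    def crash_recovery(self):
        # Recovery needs the segment the control file points at.
        if self.control.checkpoint_segment not in self.wal_segments:
            raise RuntimeError("could not locate a valid checkpoint record")
        return self.control.checkpoint_segment


def run(promote_midway):
    standby = ToyStandby(ControlFile(DBState.IN_ARCHIVE_RECOVERY, 1), {1, 2, 3})
    standby.create_restart_point(new_segment=3, promote_midway=promote_midway)
    # A backend crash at this point means no post-recovery checkpoint runs.
    return standby.crash_recovery()
```

Calling `run(False)` recovers from the new restart point, while `run(True)` raises the error because the control file still points at a segment that was removed.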
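I found that this issue is a duplicate of [1]. After applying that patch, I cannot reproduce it anymore.

[1] https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com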
On Mar 16, 2022, at 3:16 PM, hao harry <harry-hao@outlook.com> wrote:

Hi, pgsql-hackers,
I think I found a case that database is not recoverable, would you please give a look?
Here is how it happens:
- setup primary/standby
- do a lots INSERT at primary
- create a checkpoint at primary
- wait until standby start doing restart point, it take about 3mins syncing buffers to complete
- before the restart point update ControlFile, promote the standby, that changed ControlFile
->state to DB_IN_PRODUCTION, this will skip update to ControlFile, leaving the ControlFile
->checkPoint pointing to a removed file
- before the promoted standby request the post-recovery checkpoint (fast promoted),
one backend crashed, it will kill other server process, so the post-recovery checkpoint skipped
- the database restart startup process, which report: "could not locate a valid checkpoint record"
I attached a test to reproduce it, it does not fail every time, it fails every 10 times to me.
To increase the chance CreateRestartPoint skip update ControlFile and to simulate a crash,
the patch 0001 is needed.
Best Regard.
Harry Hao
<0001-Patched-CreateRestartPoint-to-reproduce-invalid-chec.patch><reprod_crash_right_after_promoted.pl>
Re: Standby got invalid primary checkpoint after crashed right after promoted.
From:
Kyotaro Horiguchi
Date:
At Wed, 16 Mar 2022 07:16:16 +0000, hao harry <harry-hao@outlook.com> wrote in
> Hi, pgsql-hackers,
>
> I think I found a case that database is not recoverable, would you please give a look?
>
> Here is how it happens:
>
> - setup primary/standby
> - do a lots INSERT at primary
> - create a checkpoint at primary
> - wait until standby start doing restart point, it take about 3mins syncing buffers to complete
> - before the restart point update ControlFile, promote the standby, that changed ControlFile
> ->state to DB_IN_PRODUCTION, this will skip update to ControlFile, leaving the ControlFile
> ->checkPoint pointing to a removed file

Yeah, it seems like exactly the same issue pointed in [1]. A fix is proposed in [1]. Maybe I can remove "possible" from the mail subject:p

[1] https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com
[2] https://www.postgresql.org/message-id/20220316.102444.2193181487576617583.horikyota.ntt@gmail.com

> - before the promoted standby request the post-recovery checkpoint (fast promoted),
> one backend crashed, it will kill other server process, so the post-recovery checkpoint skipped
> - the database restart startup process, which report: "could not locate a valid checkpoint record"
>
> I attached a test to reproduce it, it does not fail every time, it fails every 10 times to me.
> To increase the chance CreateRestartPoint skip update ControlFile and to simulate a crash,
> the patch 0001 is needed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Standby got invalid primary checkpoint after crashed right after promoted.
From:
Kyotaro Horiguchi
Date:
(My previous mail has crossed with this one)

At Wed, 16 Mar 2022 08:21:46 +0000, hao harry <harry-hao@outlook.com> wrote in
> Found this issue is duplicated to [1], after applied that patch, I cannot reproduce it anymore.
>
> [1] https://www.postgresql.org/message-id/flat/20220316.102444.2193181487576617583.horikyota.ntt%40gmail.com

Glad to hear that. Thanks for checking it!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center