BUG #18025: Probably we need to change behaviour of the checkpoint failures in PG

Поиск

Список

Период

Сортировка

От	PG Bug reporting form
Тема	BUG #18025: Probably we need to change behaviour of the checkpoint failures in PG
Дата	17 июля 2023 г. 05:03:17
Msg-id	18025-85e7e63f3ff93d40@postgresql.org обсуждение исходный текст
Ответы	Re: BUG #18025: Probably we need to change behaviour of the checkpoint failures in PG
Список	pgsql-bugs

Дерево обсуждения

The following bug has been logged on the website:

Bug reference:      18025
Logged by:          Kishor Hargude
Email address:      hargudekishor@gmail.com
PostgreSQL version: 14.8
Operating system:   rhel
Description:

Hello , 

I have just witnessed the data loss scenario. 

Scenario is like, there was checkpoint operation failures going on the DB
server since last 8 hours which means no successful checkpoint happened in
the DB server since last 8 hours. Then DB server went into the crash mode
due to the exhausted disk space and did not came up as part of crash
recovery. Actually the victim had moved few WALs from the pg_wal to other
location and reimporting those wal on original location also did not solved
the problem. DB server was not able to find out the valid checkpoint record.
The victim was not having the backup which he could use that backup to
recover the data with the help of available archived WALs. So , the victim
had only one option left in his hand that is pg_resetwal.  We have tried
every possible solution but did not worked so we did not left with more
choices other than pg_Resetwal

Now, The victim has to opt for pg_resetwal but the cost for this option is 8
hours of data loss as last successful checkpoint was happened 8 hours back.
There is not other option available other than pg_resetwal. 

Above mentioned scenario can occurred with many of the PostgreSQL customers
as the customer does not bother about to monitor the DB logs for finding out
the errors/fatals/PANICs on daily basis or hourly basis (for the events like
checkpoint failures//archiving failures). Due the ignorance of monitoring
the DB logs or improper backup recovery strategies ,any customer can meet
with the above mentioned scenario which costs to data loss. 

In my opinion, to limit the data loss or damage to the data in above
scenario, we should think on changing the behaviour of the checkpoint
failure operations. right now , we are just erroring out the checkpoint
failures and allowing write transctions on DB server which kept generating
the WALs. Now as checkpoints are unable to recycle or remove the unwanted
WALs from pg_wal , the WALs keeps accumulating and eventually leads to disk
full issue and server gets crashed.And then above scenario may happens. 

I am sharing my thought to limit the data loss in above situation:

I learned that failure of the checkpoint operation can lead to data loss .
The dataloss of several hours/days etc. 

Customers will not accept their mistakes but blame on PostgreSQL.

I think we can improve the behaviour of the PostgreSQL when it comes to
constantly failures of the checkpoints operation. We should stop accepting
the write transactions and allow only read only transactions as soon as
postgres detects the failure of the checkpoint operation (which can not be
resolved) . This will give opportunity customer to import the databases from
problematic cluster and build new cluster. But this will definitely limit
the loss of the data if any customer suffer through the scenario mentioned
as above. This approach also will give immediate alarm to the customer to
fix the issue if its fixable otherwise do the recovery steps. Having
downtime is always always better than loosing data. 

Thank You.

В списке pgsql-bugs по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

BUG #18025: Probably we need to change behaviour of the checkpoint failures in PG