BUG #18025: Probably we need to change behaviour of the checkpoint failures in PG
От | PG Bug reporting form |
---|---|
Тема | BUG #18025: Probably we need to change behaviour of the checkpoint failures in PG |
Дата | |
Msg-id | 18025-85e7e63f3ff93d40@postgresql.org обсуждение исходный текст |
Ответы |
Re: BUG #18025: Probably we need to change behaviour of the checkpoint failures in PG
(Laurenz Albe <laurenz.albe@cybertec.at>)
|
Список | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 18025 Logged by: Kishor Hargude Email address: hargudekishor@gmail.com PostgreSQL version: 14.8 Operating system: rhel Description: Hello , I have just witnessed the data loss scenario. Scenario is like, there was checkpoint operation failures going on the DB server since last 8 hours which means no successful checkpoint happened in the DB server since last 8 hours. Then DB server went into the crash mode due to the exhausted disk space and did not came up as part of crash recovery. Actually the victim had moved few WALs from the pg_wal to other location and reimporting those wal on original location also did not solved the problem. DB server was not able to find out the valid checkpoint record. The victim was not having the backup which he could use that backup to recover the data with the help of available archived WALs. So , the victim had only one option left in his hand that is pg_resetwal. We have tried every possible solution but did not worked so we did not left with more choices other than pg_Resetwal Now, The victim has to opt for pg_resetwal but the cost for this option is 8 hours of data loss as last successful checkpoint was happened 8 hours back. There is not other option available other than pg_resetwal. Above mentioned scenario can occurred with many of the PostgreSQL customers as the customer does not bother about to monitor the DB logs for finding out the errors/fatals/PANICs on daily basis or hourly basis (for the events like checkpoint failures//archiving failures). Due the ignorance of monitoring the DB logs or improper backup recovery strategies ,any customer can meet with the above mentioned scenario which costs to data loss. In my opinion, to limit the data loss or damage to the data in above scenario, we should think on changing the behaviour of the checkpoint failure operations. right now , we are just erroring out the checkpoint failures and allowing write transctions on DB server which kept generating the WALs. Now as checkpoints are unable to recycle or remove the unwanted WALs from pg_wal , the WALs keeps accumulating and eventually leads to disk full issue and server gets crashed.And then above scenario may happens. I am sharing my thought to limit the data loss in above situation: I learned that failure of the checkpoint operation can lead to data loss . The dataloss of several hours/days etc. Customers will not accept their mistakes but blame on PostgreSQL. I think we can improve the behaviour of the PostgreSQL when it comes to constantly failures of the checkpoints operation. We should stop accepting the write transactions and allow only read only transactions as soon as postgres detects the failure of the checkpoint operation (which can not be resolved) . This will give opportunity customer to import the databases from problematic cluster and build new cluster. But this will definitely limit the loss of the data if any customer suffer through the scenario mentioned as above. This approach also will give immediate alarm to the customer to fix the issue if its fixable otherwise do the recovery steps. Having downtime is always always better than loosing data. Thank You.
В списке pgsql-bugs по дате отправления:
Предыдущее
От: Paul De AudneyДата:
Сообщение: Query returns error "there is no parameter $1" but server logs that there are two parameters supplied