Re: How to solve the problem of one backend process crashing and causing other processes to restart?
From | Joe Conway |
---|---|
Subject | Re: How to solve the problem of one backend process crashing and causing other processes to restart? |
Date | |
Msg-id | d13cb984-1fa4-4d4b-8c87-b7f2227f4999@joeconway.com |
In reply to | Re: How to solve the problem of one backend process crashing and causing other processes to restart? (Laurenz Albe <laurenz.albe@cybertec.at>) |
List | pgsql-hackers |
On 11/13/23 00:53, Laurenz Albe wrote:
> On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:
>> yuansong <yyuansong@126.com> writes:
>> > In PostgreSQL, when a backend process crashes, it can cause other backend
>> > processes to also require a restart, primarily to ensure data consistency.
>> > I understand that the correct approach is to analyze and identify the
>> > cause of the crash and resolve it. However, it is also important to be
>> > able to handle a backend process crash without affecting the operation of
>> > other processes, thus minimizing the scope of negative impact and
>> > improving availability. To achieve this goal, could we mimic the Oracle
>> > process by introducing a "pmon" process dedicated to rolling back crashed
>> > process transactions and performing resource cleanup? I wonder if anyone
>> > has attempted such a strategy or if there have been previous discussions
>> > on this topic.
>>
>> The reason we force a database-wide restart is that there's no way to
>> be certain that the crashed process didn't corrupt anything in shared
>> memory. (Even with the forced restart, there's a window where bad
>> data could reach disk before we kill off the other processes that
>> might write it. But at least it's a short window.) "Corruption"
>> here doesn't just involve bad data placed into disk buffers; more
>> often it's things like unreleased locks, which would block other
>> processes indefinitely.
>>
>> I seriously doubt that anything like what you're describing
>> could be made reliable enough to be acceptable. "Oracle does
>> it like this" isn't a counter-argument: they have a much different
>> (and non-extensible) architecture, and they also have an army of
>> programmers to deal with minutiae like undoing resource acquisition.
>> Even with that, you'd have to wonder about the number of bugs
>> existing in such necessarily-poorly-tested code paths.
>
> Yes.
> I think that PostgreSQL's approach is superior: rather than investing in
> code to mitigate the impact of data corruption caused by a crash, invest
> in quality code that doesn't crash in the first place.

While true, this does nothing to prevent OOM kills, which are becoming
more prevalent as, for example, running Postgres in a container (or
otherwise) with a cgroup memory limit becomes more popular.

And in any case, there are enterprise use cases that necessarily avoid
Postgres due to this behavior, which is a shame.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com