Re: strange parallel query behavior after OOM crashes
От | Tomas Vondra |
---|---|
Тема | Re: strange parallel query behavior after OOM crashes |
Дата | |
Msg-id | da03d726-e87d-ef5e-396d-188549e6cc23@2ndquadrant.com обсуждение исходный текст |
Ответ на | Re: strange parallel query behavior after OOM crashes (Kuntal Ghosh <kuntalghosh.2007@gmail.com>) |
Ответы |
Re: strange parallel query behavior after OOM crashes
(Kuntal Ghosh <kuntalghosh.2007@gmail.com>)
|
Список | pgsql-hackers |
On 04/05/2017 12:36 PM, Kuntal Ghosh wrote: > On Wed, Apr 5, 2017 at 3:07 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> >> >> On 04/05/2017 09:05 AM, Kuntal Ghosh wrote: >>> >>> AFAICU, during crash recovery, we wait for all non-syslogger children >>> to exit, then reset shmem(call BackgroundWorkerShmemInit) and perform >>> StartupDataBase. While starting the startup process we check if any >>> bgworker is scheduled for a restart. If a bgworker is crashed and not >>> meant for a restart(parallel worker in our case), we call >>> ForgetBackgroundWorker() to discard it. >>> >> >> OK, so essentially we reset the counters, getting >> >> parallel_register_count = 0 >> parallel_terminate_count = 0 >> >> and then ForgetBackgroundWworker() runs and increments the terminate_count, >> breaking the logic? >> >> And you're using (parallel_register_count > 0) to detect whether we're still >> in the init phase or not. Hmmm. >> > Yes. But, as Robert suggested up in the thread, we should not use > (parallel_register_count = 0) as an alternative to define a bgworker > crash. Hence, I've added an argument named 'wasCrashed' in > ForgetBackgroundWorker to indicate a bgworker crash. > Sure, and I agree that having an explicit flag is the right solution. I'm just trying to make sure I understand what's happening. >> >> The comment says that the counters are allowed to overflow, i.e. after a >> long uptime you might get these values >> >> parallel_register_count = UINT_MAX + 1 = 1 >> parallel_terminate_count = UINT_MAX >> >> which is fine, because the C handles the overflow during subtraction and so >> >> parallel_register_count - parallel_terminate_count = 1 >> >> But the assert is not doing subtraction, it's comparing the values directly: >> >> Assert(parallel_register_count >= parallel_terminate_count); >> >> and the (perfectly valid) values trivially violate this comparison. >> > Thanks for the explanation. So, we can't use the above assert > statement. Even the following assert statement will not be helpful: > Assert(parallel_register_count - parallel_terminate_count >= 0); > Because, it'll fail to track the scenario when parallel_register_count > is not overflowed, still less than parallel_terminate_count. :( > Actually, that assert would work, because C does handle overflows on uint values during subtraction just fine. That is, (UINT_MAX+x) - UINT_MAX = x Also, the comment about overflows before BackgroundWorkerArray claims this is the case. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
В списке pgsql-hackers по дате отправления: