Re: Problem with synchronous replication
From | Kyotaro Horiguchi |
---|---|
Subject | Re: Problem with synchronous replication |
Date | |
Msg-id | 20191029.195001.642314195780172818.horikyota.ntt@gmail.com |
In response to | Problem with synchronous replication ("Dongming Liu" <lingce.ldm@alibaba-inc.com>) |
Responses |
Re: Problem with synchronous replication
Re: Problem with synchronous replication |
List | pgsql-hackers |
Hello.

At Fri, 25 Oct 2019 15:18:34 +0800, "Dongming Liu" <lingce.ldm@alibaba-inc.com> wrote in
>
> Hi,
>
> I recently discovered two possible bugs about synchronous replication.
>
> 1. SyncRepCleanupAtProcExit may delete an element that has been deleted
> SyncRepCleanupAtProcExit first checks whether the queue is detached, if it is not detached,
> acquires the SyncRepLock lock and deletes it. If this element has been deleted by walsender,
> it will be deleted repeatedly, SHMQueueDelete will core with a segment fault.
>
> IMO, like SyncRepCancelWait, we should lock the SyncRepLock first and then check
> whether the queue is detached or not.

I think you're right here.

> 2. SyncRepWaitForLSN may not call SyncRepCancelWait if ereport check one interrupt.
> For SyncRepWaitForLSN, if a query cancel interrupt arrives, we just terminate the wait
> with suitable warning. As follows:
>
> a. set QueryCancelPending to false
> b. ereport outputs one warning
> c. calls SyncRepCancelWait to delete one element from the queue
>
> If another cancel interrupt arrives when we are outputting warning at step b, the errfinish
> will call CHECK_FOR_INTERRUPTS that will output an ERROR, such as "canceling autovacuum
> task", then the process will jump to the sigsetjmp. Unfortunately, the step c will be skipped
> and the element that should be deleted by SyncRepCancelWait is remained.
>
> The easiest way to fix this is to swap the order of step b and step c. On the other hand,
> let sigsetjmp clean up the queue may also be a good choice. What do you think?
>
> Attached the patch, any feedback is greatly appreciated.

This is not right. The wait happens during transaction commit, so it is inside a
HOLD_INTERRUPTS section. ProcessInterrupts does not respond to cancel/die interrupts
there, so the ereport should return.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
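
To illustrate the HOLD_INTERRUPTS argument, here is a minimal standalone sketch in C. It is a toy model, not PostgreSQL source: every `toy_*` name is hypothetical, and the only behaviour assumed is the one described above, namely that an interrupt check does nothing while a holdoff counter is nonzero. It replays steps a, b and c with a second cancel request arriving during step b.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for backend state. All names are hypothetical. */
static int  toy_holdoff_count = 0;       /* > 0 means interrupts are held */
static bool toy_cancel_pending = false;  /* set when a cancel request arrives */
static int  toy_cancels_serviced = 0;    /* cancel requests actually acted on */

/* Enter a section in which cancel/die requests are deferred. */
static void toy_hold_interrupts(void)
{
    toy_holdoff_count++;
}

/* Leave the section; holds nest, so this only decrements the counter. */
static void toy_resume_interrupts(void)
{
    assert(toy_holdoff_count > 0);
    toy_holdoff_count--;
}

/* Models a cancel signal arriving: it only records the request. */
static void toy_request_cancel(void)
{
    toy_cancel_pending = true;
}

/*
 * Models an interrupt check. Returns true when a pending cancel is
 * serviced (a real backend would raise an ERROR and jump away here),
 * and false when the caller simply continues.
 */
static bool toy_check_for_interrupts(void)
{
    if (toy_holdoff_count > 0)
        return false;               /* held: the request stays pending */
    if (!toy_cancel_pending)
        return false;               /* nothing to do */
    toy_cancel_pending = false;
    toy_cancels_serviced++;
    return true;
}

/*
 * Replays steps a, b, c from the report, with a second cancel request
 * arriving during step b. Returns true if step c (the queue cleanup)
 * was reached.
 */
static bool toy_cancel_wait_sequence(bool held)
{
    bool cleaned_up = false;

    if (held)
        toy_hold_interrupts();

    toy_cancel_pending = false;         /* step a: clear the first cancel */
    toy_request_cancel();               /* second cancel arrives during b */
    if (!toy_check_for_interrupts())    /* step b: check made while warning */
        cleaned_up = true;              /* step c: runs only if b returned */

    if (held)
        toy_resume_interrupts();
    return cleaned_up;
}
```

In this model, `toy_cancel_wait_sequence(true)` reaches step c and leaves the second cancel pending for later, while `toy_cancel_wait_sequence(false)` skips step c, which is the failure the report was worried about.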