Assuming that TAS() will succeed the first time is verboten
От | Tom Lane |
---|---|
Тема | Assuming that TAS() will succeed the first time is verboten |
Дата | |
Msg-id | 5926.978036890@sss.pgh.pa.us обсуждение исходный текст |
Ответы |
Re: Assuming that TAS() will succeed the first time is verboten
|
Список | pgsql-hackers |
I have been digging into the observed failure FATAL: Checkpoint lock is busy while data base is shutting down on some Alpha machines. It apparently doesn't happen on all Alphas, but it's quite reproducible on some of them. The bottom line turns out to be that on the Alpha hardware, it is possible for TAS() to fail even when the lock is initially zero, because that hardware's locking protocol will fail to acquire the lock if the ldq_l/stq_c sequence is interrupted. TAS() *must* be called in a retry loop on Alphas. Thus, the coding presently in xlog.c, while (TAS(&(XLogCtl->chkp_lck))) { struct timeval delay = {2, 0}; if (shutdown) elog(STOP, "Checkpoint lock is busy while data base is shutting down"); (void) select(0,NULL, NULL, NULL, &delay); } is no good because it does not allow for multiple retries. Offhand I see no good reason why the above-quoted code isn't just S_LOCK(&(XLogCtl->chkp_lck)); and propose to fix this problem by reducing it to that. If the lock is held when it shouldn't be, we'll fail with a stuck-spinlock error. It also bothers me that xlog.c contains several places where there is a potentially infinite wait for a lock. It seems to me that these should time out with stuck-spinlock messages. Do you object to such a change? regards, tom lane
В списке pgsql-hackers по дате отправления: