Re: the s_lock_stuck on perform_spin_delay

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: the s_lock_stuck on perform_spin_delay
Дата
Msg-id 20240105192713.nrszmgxk3xgvdauq@awork3.anarazel.de
обсуждение исходный текст
Ответ на Re: the s_lock_stuck on perform_spin_delay  (Andy Fan <zhihuifan1213@163.com>)
Список pgsql-hackers
Hi,

On 2024-01-05 10:20:39 +0800, Andy Fan wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2024-01-04 14:59:06 +0800, Andy Fan wrote:
> >> My question is if someone doesn't obey the rule by mistake (everyone
> >> can make mistake), shall we PANIC on a production environment? IMO I
> >> think it can be a WARNING on a production environment and be a stuck
> >> when 'ifdef USE_ASSERT_CHECKING'.
> >>
> >> [...]
> >>
> >> I notice this issue actually because of the patch "Cache relation
> >> sizes?" from Thomas Munro [1], where the latest patch[2] still have the
> >> following code.
> >> +        sr = smgr_alloc_sr();  <-- HERE a spin lock is hold
> >> +
> >> +        /* Upgrade to exclusive lock so we can create a mapping. */
> >> +        LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
> >>   operation is needed. it may take a long time.
> >>
> >> Our internal testing system found more misuses on our own PG version.
> >
> >> I think a experienced engineer like Thomas can make this mistake and the
> >> patch was reviewed by 3 peoples, the bug is still there. It is not easy
> >> to say just don't do it.
> >
> > I don't follow this argument - just ignoring the problem,
>
> I agree with you but I'm feeling you ignored my post at [1], where I
> said for the known issue, it should be fixed at the first chance.

With "ignoring the problem" I was referencing emitting a WARNING instead of
crash-restart.

IME stuck spinlocks are caused by issues like not releasing a spinlock,
possibly due to returning early due to an error or such, having lock-nesting
issues leading to deadlocks, acquiring spinlocks or lwlocks in signal
handlers, blocking in signal handlers while holding a spinlock outside of the
signal handers and many variations of those. The common theme of these causes
is that they don't resolve after some time. The only way out of the situation
is to crash-restart, either "automatically" or by operator intervention.


> > which emitting a
> > WARNING basically is, doesn't reduce the impact of the bug, it *increases* the
> > impact, because now the system will not recover from the bug without explicit
> > operator intervention.  During that time the system might be essentially
> > unresponsive, because all backends end up contending for some spinlock, which
> > makes investigating such issues very hard.
>
> Acutally they are doing pg_usleep at the most time.

Sure - but that changes nothing about the problem. The concern isn't CPU
usage, the concern is that there's often no possible forward progress. To take
a recent-ish production issue I looked at, a buggy signal handler lead to a
backend sleeping while holding a spinlock. Soon after the entire system got
stuck, because they also acquired the spinlock. The person initially
investigating the issue at first contacted me because they couldn't even log
into the system, because connection establishment also acquired the spinlock
(I'm not sure anymore which spinlock it was, possibly xlog.c's info_lck?).


> > but not PANICing when this happens in production seems completely
> > non-viable.
> >
>
> Not sure what does *this* exactly means. If it means the bug in Thomas's
> patch, I absoluately agree with you(since it is a known bug and it
> should be fixed). If it means the general *unknown* case, it's something
> we talked above.

I mean that I think that not PANICing anymore would be a seriously bad idea
and cause way more problems than the PANIC.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: the s_lock_stuck on perform_spin_delay
Следующее
От: Andres Freund
Дата:
Сообщение: Re: the s_lock_stuck on perform_spin_delay