Discussion: [MASSMAIL]Issue with the PRNG used by Postgres


[MASSMAIL]Issue with the PRNG used by Postgres

From
Parag Paul
Date:
hi all, 
 We have an interesting problem where PG went to PANIC due to a stuck-spinlock case.
After careful analysis and hours of trying to reproduce this (something that showed up in production after almost 2 weeks of a stress run), I did some statistical analysis on the RNG that PG uses to create the backoff for the spinlocks.

 My expectation is that this RNG will be fair to all backends.
After studying the underlying generator that PG uses, "xoroshiro", which is from the LFSR family, it seems that it has failed a bunch of statistical tests and has been discussed in various blogs.
The one that caught my attention was: https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html
This mentions the zeroland problem and the large low-quality belt when the state has a low Hamming weight. What that means is that when the state reaches a low-Hamming-weight condition (a state that is mostly 0 bits with few 1 bits), it takes a while to get back to regular entropy.

Here is the code from s_lock.c that shows how we add delay for the backoff, before the TAS is tried again.
Also, there is in all likelihood a bug here, as the author wanted to round off the value coming out of the expression by adding 0.5, but integer casting in C/C++ drops the fractional part ((int) 12.9999 == 12).
And if the RNG keeps producing values close to 0, the delay will never increase (the increment keeps integer-casting to 0), and we end up burning through the 1000 allowed delays within about one second. That is one of the reasons why we could reproduce this by forcing low RNG numbers in our test cases.
So, the steps that I would recommend:
1. Firstly, use ceiling for the round-off
2. cur_delay should increase by max(1000, RNG-generated value)
3. Change the hashing function
4. Capture the new 0 state, and then try to reseed.

        /* increase delay by a random fraction between 1X and 2X */
        status->cur_delay += (int) (status->cur_delay *
                                    pg_prng_double(&pg_global_prng_state) + 0.5);
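
For reference, a tiny standalone C sketch (not from the thread and not PostgreSQL code; the starting delay and the RNG outputs are made-up illustration values) of how this increment expression behaves for various RNG outputs:

    #include <stdio.h>

    int
    main(void)
    {
        int     cur_delay = 1000;   /* illustrative starting delay, in usec */
        double  rng_values[] = {0.0001, 0.4, 0.6, 0.9999};

        for (int i = 0; i < 4; i++)
        {
            /* same expression shape as the s_lock.c increment quoted above */
            int     increment = (int) (cur_delay * rng_values[i] + 0.5);

            printf("r = %.4f -> increment = %d\n", rng_values[i], increment);
        }

        /*
         * Prints 0, 400, 600 and 1000: the +0.5 makes the cast round to the
         * nearest integer, and for cur_delay = 1000 only an RNG output below
         * 0.0005 yields an increment of zero.
         */
        return 0;
    }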

Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-08 22:52:09 -0700, Parag Paul wrote:
>  We have an interesting problem, where PG went to PANIC due to stuck
> spinlock case.
> On careful analysis and hours of trying to reproduce this(something that
> showed up in production after almost 2 weeks of stress run), I did some
> statistical analysis on the RNG generator that PG uses to create the
> backoff for the spin locks.

ISTM that the fix here is to not use a spinlock for whatever the contention is
on, rather than improve the RNG.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Robert Haas
Date:
On Tue, Apr 9, 2024 at 5:05 PM Andres Freund <andres@anarazel.de> wrote:
> ISTM that the fix here is to not use a spinlock for whatever the contention is
> on, rather than improve the RNG.

I'm not convinced that we should try to improve the RNG, but surely we
need to put parentheses around pg_prng_double(&pg_global_prng_state) +
0.5. IIUC, the current logic is making us multiply the spin delay by a
value between 0 and 1 when what was intended was that it should be
multiplied by a value between 0.5 and 1.5.

If I'm reading this correctly, this was introduced here:

commit 59bb147353ba274e0836d06f429176d4be47452c
Author: Bruce Momjian <bruce@momjian.us>
Date:   Fri Feb 3 12:45:47 2006 +0000

    Update random() usage so ranges are inclusive/exclusive as required.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Issue with the PRNG used by Postgres

From
Parag Paul
Date:
Hi Andres,
This is a little more complex than that. The spinlock is taken in the LWLock (mutex) code when the lock is not available right away.
It is taken to attach the current backend to the wait list of the LWLock, so this cannot be controlled at the call site.
When the repro happens, it affects any mutex or LWLock code path, since the low-Hamming-weight state can remove fairness from the system.

Also, I believe the rounding-off issue in the backoff calculation still remains. I will send a patch today.

Thanks for the response.
-Parag

On Tue, Apr 9, 2024 at 2:05 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2024-04-08 22:52:09 -0700, Parag Paul wrote:
>  We have an interesting problem, where PG went to PANIC due to stuck
> spinlock case.
> On careful analysis and hours of trying to reproduce this(something that
> showed up in production after almost 2 weeks of stress run), I did some
> statistical analysis on the RNG generator that PG uses to create the
> backoff for the spin locks.

ISTM that the fix here is to not use a spinlock for whatever the contention is
on, rather than improve the RNG.

Greetings,

Andres Freund

Re: Issue with the PRNG used by Postgres

From
Parag Paul
Date:
Thank you Robert.
I am in the process of patching this.
-Parag

On Wed, Apr 10, 2024 at 7:43 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Apr 9, 2024 at 5:05 PM Andres Freund <andres@anarazel.de> wrote:
> ISTM that the fix here is to not use a spinlock for whatever the contention is
> on, rather than improve the RNG.

I'm not convinced that we should try to improve the RNG, but surely we
need to put parentheses around pg_prng_double(&pg_global_prng_state) +
0.5. IIUC, the current logic is making us multiply the spin delay by a
value between 0 and 1 when what was intended was that it should be
multiplied by a value between 0.5 and 1.5.

If I'm reading this correctly, this was introduced here:

commit 59bb147353ba274e0836d06f429176d4be47452c
Author: Bruce Momjian <bruce@momjian.us>
Date:   Fri Feb 3 12:45:47 2006 +0000

    Update random() usage so ranges are inclusive/exclusive as required.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm not convinced that we should try to improve the RNG, but surely we
> need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> 0.5. IIUC, the current logic is making us multiply the spin delay by a
> value between 0 and 1 when what was intended was that it should be
> multiplied by a value between 0.5 and 1.5.

No, I think you are misreading it, because the assignment is += not =.
The present coding is

        /* increase delay by a random fraction between 1X and 2X */
        status->cur_delay += (int) (status->cur_delay *
                                    pg_prng_double(&pg_global_prng_state) + 0.5);

which looks fine to me.  The +0.5 is so that the conversion to integer
rounds rather than truncating.

In any case, I concur with Andres: if this behavior is anywhere near
critical then the right fix is to not be using spinlocks.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Parag Paul
Date:
hi Tom, 
 First of all, thanks for your response. I did not misread it. The 0.5 is added to the result of the multiplication, and that sum then goes through C integer casting, which does not round off but simply drops the fractional portion.


status->cur_delay += (int) (status->cur_delay *
                                    pg_prng_double(&pg_global_prng_state) + 0.5);


So, if the RNG generated 0.0001 and cur_delay = 1000, the result will be
1000 + (int)(1000 * 0.0001 + 0.5) = 1000 + (int)(0.1 + 0.5) = 1000 + 0 = 1000  <--  back to the same value
and there is no change once that starts happening, while the RNG stays in the low-Hamming-weight state. If the avalanche happens soon, it will correct itself, but in the meantime we have a stuck-spinlock PANIC that could bring down a production server.
-Parag

On Wed, Apr 10, 2024 at 8:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm not convinced that we should try to improve the RNG, but surely we
> need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> 0.5. IIUC, the current logic is making us multiply the spin delay by a
> value between 0 and 1 when what was intended was that it should be
> multiplied by a value between 0.5 and 1.5.

No, I think you are misreading it, because the assignment is += not =.
The present coding is

        /* increase delay by a random fraction between 1X and 2X */
        status->cur_delay += (int) (status->cur_delay *
                                    pg_prng_double(&pg_global_prng_state) + 0.5);

which looks fine to me.  The +0.5 is so that the conversion to integer
rounds rather than truncating.

In any case, I concur with Andres: if this behavior is anywhere near
critical then the right fix is to not be using spinlocks.

                        regards, tom lane

Re: Issue with the PRNG used by Postgres

From
Robert Haas
Date:
On Wed, Apr 10, 2024 at 11:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> No, I think you are misreading it, because the assignment is += not =.
> The present coding is
>
>         /* increase delay by a random fraction between 1X and 2X */
>         status->cur_delay += (int) (status->cur_delay *
>                                     pg_prng_double(&pg_global_prng_state) + 0.5);
>
> which looks fine to me.  The +0.5 is so that the conversion to integer
> rounds rather than truncating.

Oh, yeah ... right. But then why does the comment say that it's
increasing the delay by a random fraction between 1X and 2X?
Isn't this a random fraction between 0X and 1X? Or am I still full of
garbage?

> In any case, I concur with Andres: if this behavior is anywhere near
> critical then the right fix is to not be using spinlocks.

It would certainly be interesting to know which spinlocks were at
issue, here. But it also seems to me that it's reasonable to be
unhappy about the possibility of this increasing cur_delay by exactly
0. Storms of spinlock contention can and do happen on real-world
production servers, and trying to guess which spinlocks might be
affected before it happens is difficult. Until we fully eliminate all
spinlocks from our code base, the theoretical possibility of such
storms remains, and if making sure that the increment here is >0
mitigates that, then I think we should. Mind you, I don't think that
the original poster has necessarily provided convincing or complete
evidence to justify a change here, but I find it odd that you and
Andres seem to want to dismiss it out of hand.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Parag Paul <parag.paul@gmail.com> writes:
> So, if the RNG generated 0.0001 and cur_delay = 1000, the result will be
> 1000 + (int)(1000 * 0.0001 + 0.5) = 1000 + (int)(0.1 + 0.5) = 1000 + 0 = 1000
> <--  back to the same value

Yes, with a sufficiently small RNG result, the sleep delay will not
increase that time through the loop.  So what?  It's supposed to be
a random amount of backoff, and I don't see why "random" shouldn't
occasionally include "zero".  Why would requiring the delay to
increase improve matters at all?

Now, if the xoroshiro RNG is so bad that there's a strong likelihood
that multiple backends will sit on cur_delay = MIN_DELAY_USEC for
a long time, then yeah we could have a thundering-herd problem with
contention for the spinlock.  You've provided no reason to think
that the probability of that is more than microscopic, though.
(If it is more than microscopic, then we are going to have
nonrandomness problems in a lot of other places, so I'd lean
towards revising the RNG not band-aiding s_lock.  We've seen
nothing to suggest such problems, though.)

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Oh, yeah ... right. But then why does the comment say that it's
> increasing the delay by a random fraction between 1X and 2X?

I think the comment is meant to say that the new delay value will be
1X to 2X the old value.  If you want to suggest different phrasing,
feel free.

> It would certainly be interesting to know which spinlocks were at
> issue, here. But it also seems to me that it's reasonable to be
> unhappy about the possibility of this increasing cur_delay by exactly
> 0.

As I said to Parag, I see exactly no reason to believe that that's a
problem, unless it happens *a lot*, like hundreds of times in a row.
If it does, that's an RNG problem not s_lock's fault.  Now, I'm not
going to say that xoroshiro can't potentially do that, because I've
not studied it.  But if it is likely to do that, I'd think we'd
have noticed the nonrandomness in other ways.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-10 07:55:16 -0700, Parag Paul wrote:
> This is a little bit more complex than that. The spinlocks are taken in the
> LWLock(Mutex) code, when the lock is not available right away.
> The spinlock is taken to attach the current backend to the wait list of the
> LWLock. This means, that this cannot be controlled.
> The repro when it happens, it affects any mutex or LWLock code path, since
> the low hamming index can cause problems by removing fairness from the
> system.

Please provide a profile of a problematic workload.  I'm fairly suspicious
of the claim that RNG unfairness is a real part of the problem here.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Actually ... Parag mentioned that this was specifically about
lwlock.c's usage of spinlocks.  It doesn't really use a spinlock,
but it does use s_lock.c's delay logic, and I think it's got the
usage pattern wrong:

    while (true)
    {
        /* always try once to acquire lock directly */
        old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
        if (!(old_state & LW_FLAG_LOCKED))
            break;                /* got lock */

        /* and then spin without atomic operations until lock is released */
        {
            SpinDelayStatus delayStatus;

            init_local_spin_delay(&delayStatus);

            while (old_state & LW_FLAG_LOCKED)
            {
                perform_spin_delay(&delayStatus);
                old_state = pg_atomic_read_u32(&lock->state);
            }
#ifdef LWLOCK_STATS
            delays += delayStatus.delays;
#endif
            finish_spin_delay(&delayStatus);
        }

        /*
         * Retry. The lock might obviously already be re-acquired by the time
         * we're attempting to get it again.
         */
    }

I don't think it's correct to re-initialize the SpinDelayStatus each
time around the outer loop.  That state should persist through the
entire acquire operation, as it does in a regular spinlock acquire.
As this stands, it resets the delay to minimum each time around the
outer loop, and I bet it is that behavior not the RNG that's to blame
for what he's seeing.

(One should still wonder what is the LWLock usage pattern that is
causing this spot to become so heavily contended.)

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Robert Haas
Date:
On Wed, Apr 10, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As I said to Parag, I see exactly no reason to believe that that's a
> problem, unless it happens *a lot*, like hundreds of times in a row.
> If it does, that's an RNG problem not s_lock's fault.  Now, I'm not
> going to say that xoroshiro can't potentially do that, because I've
> not studied it.  But if it is likely to do that, I'd think we'd
> have noticed the nonrandomness in other ways.

The blog post to which Parag linked includes this histogram as an
example of a low-Hamming-weight situation:

Reps | Value
-----+--------
  84 | 0x00
  10 | 0x2d
   6 | 0xa0
   6 | 0x68
   6 | 0x40
   6 | 0x02
   5 | 0x0b
   4 | 0x05
   4 | 0x01
   3 | 0xe1
   3 | 0x5a
   3 | 0x41
   3 | 0x20
   3 | 0x16

That's certainly a lot more zeroes than you'd like to see in 192 bytes
of output, but it's not hundreds in a row either.

Also, the blog post says "On the one hand, from a practical
perspective, having vastly, vastly more close repeats than it should
isn't likely to be an issue that users will ever detect in practice.
Xoshiro256's large state space means it is too big to fail any
statistical tests that look for close repeats." So your theory that we
have a bug elsewhere in the code seems like it might be the real
answer.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Issue with the PRNG used by Postgres

From
Parag Paul
Date:
hi Tom, 
 Sorry for the delayed response. I was collecting the data from my production servers.

The reason why this could be a problem is a flaw in the RNG with the enlarged low-Hamming-weight belt.
I attached an image here with the RNG outputs from 2 backends. I ran our code for weeks and collected the values generated by the RNG over many backends. The one in green (say backend id 600) stopped varying and only produced low (near 0) values for half an hour, whereas the blue one (say backend 700) kept generating good values and had a range over [0, 1).
During this period, backend 600 suffered and ended up in the stuck-spinlock condition.

-Parag


On Wed, Apr 10, 2024 at 9:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Actually ... Parag mentioned that this was specifically about
lwlock.c's usage of spinlocks.  It doesn't really use a spinlock,
but it does use s_lock.c's delay logic, and I think it's got the
usage pattern wrong:

    while (true)
    {
        /* always try once to acquire lock directly */
        old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
        if (!(old_state & LW_FLAG_LOCKED))
            break;                /* got lock */

        /* and then spin without atomic operations until lock is released */
        {
            SpinDelayStatus delayStatus;

            init_local_spin_delay(&delayStatus);

            while (old_state & LW_FLAG_LOCKED)
            {
                perform_spin_delay(&delayStatus);
                old_state = pg_atomic_read_u32(&lock->state);
            }
#ifdef LWLOCK_STATS
            delays += delayStatus.delays;
#endif
            finish_spin_delay(&delayStatus);
        }

        /*
         * Retry. The lock might obviously already be re-acquired by the time
         * we're attempting to get it again.
         */
    }

I don't think it's correct to re-initialize the SpinDelayStatus each
time around the outer loop.  That state should persist through the
entire acquire operation, as it does in a regular spinlock acquire.
As this stands, it resets the delay to minimum each time around the
outer loop, and I bet it is that behavior not the RNG that's to blame
for what he's seeing.

(One should still wonder what is the LWLock usage pattern that is
causing this spot to become so heavily contended.)

                        regards, tom lane
Attachments

Re: Issue with the PRNG used by Postgres

From
Parag Paul
Date:
hi Robert,
We are using xoroshiro128 and have not moved to the newer state of the art.
We did see a lot of low values, as shown in my last message.
-Parag

On Wed, Apr 10, 2024 at 9:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Apr 10, 2024 at 12:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As I said to Parag, I see exactly no reason to believe that that's a
> problem, unless it happens *a lot*, like hundreds of times in a row.
> If it does, that's an RNG problem not s_lock's fault.  Now, I'm not
> going to say that xoroshiro can't potentially do that, because I've
> not studied it.  But if it is likely to do that, I'd think we'd
> have noticed the nonrandomness in other ways.

The blog post to which Parag linked includes this histogram as an
example of a low-Hamming-weight situation:

Reps | Value
-----+--------
  84 | 0x00
  10 | 0x2d
   6 | 0xa0
   6 | 0x68
   6 | 0x40
   6 | 0x02
   5 | 0x0b
   4 | 0x05
   4 | 0x01
   3 | 0xe1
   3 | 0x5a
   3 | 0x41
   3 | 0x20
   3 | 0x16

That's certainly a lot more zeroes than you'd like to see in 192 bytes
of output, but it's not hundreds in a row either.

Also, the blog post says "On the one hand, from a practical
perspective, having vastly, vastly more close repeats than it should
isn't likely to be an issue that users will ever detect in practice.
Xoshiro256's large state space means it is too big to fail any
statistical tests that look for close repeats." So your theory that we
have a bug elsewhere in the code seems like it might be the real
answer.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Issue with the PRNG used by Postgres

From
Robert Haas
Date:
On Wed, Apr 10, 2024 at 12:40 PM Parag Paul <parag.paul@gmail.com> wrote:
> The reason why this could be a problem is a flaw in the RNG with the enlarged Hamming belt.
> I attached an image here, with the RNG outputs from 2 backends. I ran our code for weeks, and collected ther
> values generated by the RNG over many backends. The one in Green (say backend id 600), stopped flapping values and
> only produced low (near 0 ) values for half an hour, whereas the Blue(say backend 700), kept generating good values and had
> a range between [0-1)
> During this period, the backed 600 suffered and ended up with spinlock stuck condition.

This is a very vague description of a test procedure. If you provide a
reproducible series of steps that causes a stuck spinlock, I imagine
everyone will be on board with doing something about it. But this
graph is not going to convince anyone of anything.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Issue with the PRNG used by Postgres

From
Parag Paul
Date:
Yes, the probability of this happening is astronomical, but in production, with 128-core servers, 7000 max_connections, and petabyte-scale data, this did repro 2 times in the last month. We had to move to a local approach to manage our rate-limiting counters.
This is not reproducible very easily. I feel that we should at least shield ourselves with the following change, so that we increase the delay by at least 1000 us every time. We would follow a linear backoff, which is better than no backoff.
      status->cur_delay += max(1000, (int) (status->cur_delay *
                                    pg_prng_double(&pg_global_prng_state) + 0.5));

On Wed, Apr 10, 2024 at 9:43 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Apr 10, 2024 at 12:40 PM Parag Paul <parag.paul@gmail.com> wrote:
> The reason why this could be a problem is a flaw in the RNG with the enlarged Hamming belt.
> I attached an image here, with the RNG outputs from 2 backends. I ran our code for weeks, and collected ther
> values generated by the RNG over many backends. The one in Green (say backend id 600), stopped flapping values and
> only produced low (near 0 ) values for half an hour, whereas the Blue(say backend 700), kept generating good values and had
> a range between [0-1)
> During this period, the backed 600 suffered and ended up with spinlock stuck condition.

This is a very vague description of a test procedure. If you provide a
reproducible series of steps that causes a stuck spinlock, I imagine
everyone will be on board with doing something about it. But this
graph is not going to convince anyone of anything.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
I wrote:
> I don't think it's correct to re-initialize the SpinDelayStatus each
> time around the outer loop.  That state should persist through the
> entire acquire operation, as it does in a regular spinlock acquire.
> As this stands, it resets the delay to minimum each time around the
> outer loop, and I bet it is that behavior not the RNG that's to blame
> for what he's seeing.

After thinking about this some more, it is fairly clear that that *is*
a mistake that can cause a thundering-herd problem.  Assume we have
two or more backends waiting in perform_spin_delay, and for whatever
reason the scheduler wakes them up simultaneously.  They see the
LW_FLAG_LOCKED bit clear (because whoever had the lock when they went
to sleep is long gone) and iterate the outer loop.  One gets the lock;
the rest go back to sleep.  And they will all sleep exactly
MIN_DELAY_USEC, because they all reset their SpinDelayStatus.
Lather, rinse, repeat.  If the s_lock code were being used as
intended, they would soon acquire different cur_delay settings;
but that never happens.  That is not the RNG's fault.

So I think we need something like the attached.

            regards, tom lane

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index b1e388dc7c..54c84b9d67 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -857,46 +857,49 @@ static void
 LWLockWaitListLock(LWLock *lock)
 {
     uint32        old_state;
-#ifdef LWLOCK_STATS
-    lwlock_stats *lwstats;
-    uint32        delays = 0;

-    lwstats = get_lwlock_stats_entry(lock);
-#endif
+    /*
+     * Try once to acquire the lock, before we go to the trouble of setting up
+     * state for the spin loop.
+     */
+    old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
+    if (!(old_state & LW_FLAG_LOCKED))
+        return;                    /* got lock */

-    while (true)
     {
-        /* always try once to acquire lock directly */
-        old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
-        if (!(old_state & LW_FLAG_LOCKED))
-            break;                /* got lock */
+        SpinDelayStatus delayStatus;
+#ifdef LWLOCK_STATS
+        lwlock_stats *lwstats = get_lwlock_stats_entry(lock);
+#endif

-        /* and then spin without atomic operations until lock is released */
-        {
-            SpinDelayStatus delayStatus;
+        init_local_spin_delay(&delayStatus);

-            init_local_spin_delay(&delayStatus);
+        while (true)
+        {
+            /* try again to acquire lock directly */
+            old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
+            if (!(old_state & LW_FLAG_LOCKED))
+                break;            /* got lock */

+            /* spin without atomic operations until lock is released */
             while (old_state & LW_FLAG_LOCKED)
             {
                 perform_spin_delay(&delayStatus);
                 old_state = pg_atomic_read_u32(&lock->state);
             }
-#ifdef LWLOCK_STATS
-            delays += delayStatus.delays;
-#endif
-            finish_spin_delay(&delayStatus);
+
+            /*
+             * Retry. The lock might obviously already be re-acquired by the
+             * time we're attempting to get it again.
+             */
         }

-        /*
-         * Retry. The lock might obviously already be re-acquired by the time
-         * we're attempting to get it again.
-         */
-    }
+        finish_spin_delay(&delayStatus);

 #ifdef LWLOCK_STATS
-    lwstats->spin_delay_count += delays;
+        lwstats->spin_delay_count += delayStatus.delays;
 #endif
+    }
 }

 /*

Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-10 09:48:42 -0700, Parag Paul wrote:
> Yes, the probability of this happening is astronomical, but in production
> with 128 core servers with 7000 max_connections, with petabyte scale data,
> this did repro 2 times in the last month. We had to move to a local
> approach to manager our ratelimiting counters.

What version of PG was this?  I think it's much more likely that you're
hitting a bug that caused a lot more contention inside lwlocks. That was fixed
for 16+ in a4adc31f690 on 2022-11-20, but only backpatched to 12-15 on
2024-01-18.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-10 12:28:10 -0400, Tom Lane wrote:
> Actually ... Parag mentioned that this was specifically about
> lwlock.c's usage of spinlocks.  It doesn't really use a spinlock,
> but it does use s_lock.c's delay logic, and I think it's got the
> usage pattern wrong:
> 
>     while (true)
>     {
>         /* always try once to acquire lock directly */
>         old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
>         if (!(old_state & LW_FLAG_LOCKED))
>             break;                /* got lock */
> 
>         /* and then spin without atomic operations until lock is released */
>         {
>             SpinDelayStatus delayStatus;
> 
>             init_local_spin_delay(&delayStatus);
> 
>             while (old_state & LW_FLAG_LOCKED)
>             {
>                 perform_spin_delay(&delayStatus);
>                 old_state = pg_atomic_read_u32(&lock->state);
>             }
> #ifdef LWLOCK_STATS
>             delays += delayStatus.delays;
> #endif
>             finish_spin_delay(&delayStatus);
>         }
> 
>         /*
>          * Retry. The lock might obviously already be re-acquired by the time
>          * we're attempting to get it again.
>          */
>     }
> 
> I don't think it's correct to re-initialize the SpinDelayStatus each
> time around the outer loop.  That state should persist through the
> entire acquire operation, as it does in a regular spinlock acquire.
> As this stands, it resets the delay to minimum each time around the
> outer loop, and I bet it is that behavior not the RNG that's to blame
> for what he's seeing.

Hm, yea, that's not right. Normally this shouldn't be heavily contended enough
to matter.  I don't think just pulling out the spin delay would be the right
thing though, because that'd mean we'd initialize it even in the much much
more common case of there not being any contention.  I think we'll have to add
a separate fetch_or before the outer loop.


> (One should still wonder what is the LWLock usage pattern that is
> causing this spot to become so heavily contended.)

My suspicion is that it's a4adc31f690 which we only recently backpatched to
< 16.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Parag Paul <parag.paul@gmail.com> writes:
> Yes, the probability of this happening is astronomical, but in production
> with 128 core servers with 7000 max_connections, with petabyte scale data,
> this did repro 2 times in the last month. We had to move to a local
> approach to manager our ratelimiting counters.
> This is not reproducible very easily. I feel that we should at least shield
> ourselves with the following change, so that we at least increase the delay
> by 1000us every time. We will follow a linear back off, but better than no
> backoff.

I still say you are proposing to band-aid the wrong thing.  Moreover:

* the proposed patch will cause the first few cur_delay values to grow
much faster than before, with direct performance impact to everyone,
whether they are on 128-core servers or not;

* if we are in a regime where xoroshiro repeatedly returns zero
across multiple backends, your patch doesn't improve the situation
AFAICS, because the backends will still choose the same series
of cur_delay values and thus continue to exhibit thundering-herd
behavior.  Indeed, as coded I think the patch makes it *more*
likely that the same series of cur_delay values would be chosen
by multiple backends.
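
A toy standalone simulation (not from the thread and not PostgreSQL code; the near-zero RNG value is an assumption chosen to illustrate the point) of two backends whose generators both return near-zero values under the proposed max(1000, ...) rule:

    #include <stdio.h>

    int
    main(void)
    {
        int     cur_delay_a = 1000;
        int     cur_delay_b = 1000;

        for (int i = 0; i < 5; i++)
        {
            double  r = 0.0001;     /* both PRNGs stuck near zero */
            int     inc_a = (int) (cur_delay_a * r + 0.5);
            int     inc_b = (int) (cur_delay_b * r + 0.5);

            /* proposed rule: never increase by less than 1000 */
            cur_delay_a += (inc_a > 1000) ? inc_a : 1000;
            cur_delay_b += (inc_b > 1000) ? inc_b : 1000;

            printf("step %d: A = %d, B = %d\n", i, cur_delay_a, cur_delay_b);
        }

        /* Both backends step through the identical 2000, 3000, ... sequence. */
        return 0;
    }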

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Hi,
> On 2024-04-10 12:28:10 -0400, Tom Lane wrote:
>> I don't think it's correct to re-initialize the SpinDelayStatus each
>> time around the outer loop.  That state should persist through the
>> entire acquire operation, as it does in a regular spinlock acquire.
>> As this stands, it resets the delay to minimum each time around the
>> outer loop, and I bet it is that behavior not the RNG that's to blame
>> for what he's seeing.

> Hm, yea, that's not right. Normally this shouldn't be heavily contended enough
> to matter.  I don't think just pulling out the spin delay would be the right
> thing though, because that'd mean we'd initialize it even in the much much
> more common case of there not being any contention.  I think we'll have to add
> a separate fetch_or before the outer loop.

Agreed, and I did that in my draft patch.  AFAICS we can also avoid
the LWLOCK_STATS overhead if the initial attempt succeeds, not that
that is something you'd really be worried about.

>> (One should still wonder what is the LWLock usage pattern that is
>> causing this spot to become so heavily contended.)

> My suspicion is that it's a4adc31f690 which we only recently backpatched to
> < 16.

Seems like a plausible theory.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-10 13:03:05 -0400, Tom Lane wrote:
> After thinking about this some more, it is fairly clear that that *is*
> a mistake that can cause a thundering-herd problem.

> Assume we have two or more backends waiting in perform_spin_delay, and for
> whatever reason the scheduler wakes them up simultaneously.

That's not really possible, at least not repeatably. Multiple processes
obviously can't be scheduled concurrently on one CPU, and scheduling something
on another core entails interrupting that CPU with an inter-processor
interrupt, or that other CPU scheduling it on its own, without coordination.

That obviously isn't a reason to not fix the delay logic in lwlock.c.


Looks like the wrong logic was introduced by me in

commit 008608b9d51061b1f598c197477b3dc7be9c4a64
Author: Andres Freund <andres@anarazel.de>
Date:   2016-04-10 20:12:32 -0700

    Avoid the use of a separate spinlock to protect a LWLock's wait queue.

Likely because I was trying to avoid the overhead of init_local_spin_delay(),
without duplicating the few lines to acquire the "spinlock".


> So I think we need something like the attached.

LGTM.

I think it might be worth breaking LWLockWaitListLock() into two pieces, a
fastpath to be inlined into a caller, and a slowpath, but that's separate work
from a bugfix.


I looked around and the other uses of init_local_spin_delay() look correct
from this angle. However LockBufHdr() is more expensive than it needs to be,
because it always initializes SpinDelayStatus. IIRC I've seen that show up in
profiles before, but never got around to writing a nice-enough patch.  But
that's also something separate from a bugfix.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2024-04-10 13:03:05 -0400, Tom Lane wrote:
>> So I think we need something like the attached.

> LGTM.

On third thought ... while I still think this is a misuse of
perform_spin_delay and we should change it, I'm not sure it'll do
anything to address Parag's problem, because IIUC he's seeing actual
"stuck spinlock" reports.  That implies that the inner loop of
LWLockWaitListLock slept NUM_DELAYS times without ever seeing
LW_FLAG_LOCKED clear.  What I'm suggesting would change the triggering
condition to "NUM_DELAYS sleeps without acquiring the lock", which is
strictly more likely to happen, so it's not going to help him.  It's
certainly still well out in we-shouldn't-get-there territory, though.

Also, fooling around with the cur_delay adjustment doesn't affect
this at all: "stuck spinlock" is still going to be raised after
NUM_DELAYS failures to observe the lock clear or obtain the lock.
Increasing cur_delay won't change that, it'll just spread the
fixed number of attempts over a longer period; and there's no
reason to believe that does anything except make it take longer
to fail.  Per the header comment for s_lock.c:

 * We time out and declare error after NUM_DELAYS delays (thus, exactly
 * that many tries).  With the given settings, this will usually take 2 or
 * so minutes.  It seems better to fix the total number of tries (and thus
 * the probability of unintended failure) than to fix the total time
 * spent.

If you believe that waiting processes can be awakened close enough to
simultaneously to hit the behavior I posited earlier, then encouraging
them to have different cur_delay values will help; but Andres doesn't
believe it and I concede it seems like a stretch.

So I think fooling with the details in s_lock.c is pretty much beside
the point.  The most likely bet is that Parag's getting bit by the
bug fixed in a4adc31f690.  It's possible he's seeing the effect of
some different issue that causes lwlock.c to hold that lock a long
time at scale, but that's where I'd look first.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> The blog post to which Parag linked includes this histogram as an
> example of a low-Hamming-weight situation:

That's an interesting post indeed, but I'm not sure how relevant
it is to us, because it is about Xoshiro not Xoroshiro, and the
latter is what we use.  The last sentence of the blog post is

    In this case, a mix of practice and theory has shown that the
    structure of Xoshiro's state space is poorer than that of many
    competing generation schemes, including Blackman's gjrand and
    perhaps even Vigna and Blackman's earlier Xoroshiro scheme (which
    has smaller zeroland expanses and does not appear to have similar
    close-repeat problems), and its output functions are unable to
    completely conceal these issues.

So while pg_prng.c might have the issue posited here, this blog
post is not evidence for that, and indeed might be evidence
against it.  Someone would have to do similar analysis on the
code we *actually* use to convince me that we need to worry.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-10 14:02:20 -0400, Tom Lane wrote:
> On third thought ... while I still think this is a misuse of
> perform_spin_delay and we should change it, I'm not sure it'll do
> anything to address Parag's problem, because IIUC he's seeing actual
> "stuck spinlock" reports.  That implies that the inner loop of
> LWLockWaitListLock slept NUM_DELAYS times without ever seeing
> LW_FLAG_LOCKED clear.  What I'm suggesting would change the triggering
> condition to "NUM_DELAYS sleeps without acquiring the lock", which is
> strictly more likely to happen, so it's not going to help him.  It's
> certainly still well out in we-shouldn't-get-there territory, though.

I think it could exacerbate the issue. Parag reported ~7k connections on a
128 core machine. The buffer replacement logic in < 16 tries to lock the old
and new lock partitions at once. That can lead to quite bad "chains" of
dependent lwlocks, occasionally putting all the pressure on a single lwlock.
With 7k waiters on a single spinlock, higher frequency of wakeups will make it
much more likely that the process holding the spinlock will be put to sleep.

This is greatly exacerbated by the issue fixed in a4adc31f690, once the
waitqueue is long, the spinlock will be held for an extended amount of time.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I think it could exascerbate the issue. Parag reported ~7k connections on a
> 128 core machine. The buffer replacement logic in < 16 tries to lock the old
> and new lock partitions at once. That can lead to quite bad "chains" of
> dependent lwlocks, occasionally putting all the pressure on a single lwlock.
> With 7k waiters on a single spinlock, higher frequency of wakeups will make it
> much more likely that the process holding the spinlock will be put to sleep.
> This is greatly exacerbated by the issue fixed in a4adc31f690, once the
> waitqueue is long, the spinlock will be held for an extended amount of time.

Yeah.  So what's the conclusion?  Leave it alone?  Commit to
HEAD only?

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-10 16:05:21 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I think it could exascerbate the issue. Parag reported ~7k connections on a
> > 128 core machine. The buffer replacement logic in < 16 tries to lock the old
> > and new lock partitions at once. That can lead to quite bad "chains" of
> > dependent lwlocks, occasionally putting all the pressure on a single lwlock.
> > With 7k waiters on a single spinlock, higher frequency of wakeups will make it
> > much more likely that the process holding the spinlock will be put to sleep.
> > This is greatly exacerbated by the issue fixed in a4adc31f690, once the
> > waitqueue is long, the spinlock will be held for an extended amount of time.
> 
> Yeah.  So what's the conclusion?  Leave it alone?  Commit to
> HEAD only?

I think we should certainly fix it. I don't really have an opinion about
backpatching, it's just on the line between the two for me.

Hm. The next set of releases is still a bit away, and this is one of the
periods where HEAD is hopefully going to be more tested than usual, so I'd
perhaps very softly lean towards backpatching. There'd have to be some very
odd compiler behaviour to make it slower than before anyway.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2024-04-10 16:05:21 -0400, Tom Lane wrote:
>> Yeah.  So what's the conclusion?  Leave it alone?  Commit to
>> HEAD only?

> I think we should certainly fix it. I don't really have an opinion about
> backpatching, it's just on the line between the two for me.
> Hm. The next set of releases is still a bit away, and this is one of the
> period where HEAD is hopefully going to be more tested than usual, so I'd
> perhaps very softly lean towards backpatching. There'd have to be some very
> odd compiler behaviour to make it slower than before anyway.

I'm not worried about it being slower, but about whether it could
report "stuck spinlock" in cases where the existing code succeeds.
While that seems at least theoretically possible, it seems like
if you hit it you have got problems that need to be fixed anyway.
Nonetheless, I'm kind of leaning to not back-patching.  I do agree
on getting it into HEAD sooner not later though.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Robert Haas
Date:
On Wed, Apr 10, 2024 at 4:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not worried about it being slower, but about whether it could
> report "stuck spinlock" in cases where the existing code succeeds.
> While that seems at least theoretically possible, it seems like
> if you hit it you have got problems that need to be fixed anyway.
> Nonetheless, I'm kind of leaning to not back-patching.  I do agree
> on getting it into HEAD sooner not later though.

I just want to mention that I have heard of "stuck spinlock" happening
in production just because the server was busy. And I think that's not
intended. The timeout is supposed to be high enough that you only hit
it if there's a bug in the code. At least AIUI. But it isn't.

I know that's a separate issue, but I think it's an important one. It
shouldn't happen that a system which was installed to defend against
bugs in the code causes more problems than the bugs themselves.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I just want to mention that I have heard of "stuck spinlock" happening
> in production just because the server was busy. And I think that's not
> intended. The timeout is supposed to be high enough that you only hit
> it if there's a bug in the code. At least AIUI. But it isn't.

Well, certainly that's not supposed to happen, but anecdotal reports
like this provide little basis for discussing what we ought to do
about it.  It seems likely that raising the timeout would do nothing
except allow the stuck state to persist even longer before failing.
(Keep in mind that the existing timeout of ~2 min is already several
geological epochs to a computer, so arguing that spinning another
couple minutes would have resulted in success seems more like wishful
thinking than reality.)

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
I wrote:
> I'm not worried about it being slower, but about whether it could
> report "stuck spinlock" in cases where the existing code succeeds.

On fourth thought ... the number of tries to acquire the lock, or
in this case number of tries to observe the lock free, is not
NUM_DELAYS but NUM_DELAYS * spins_per_delay.  Decreasing
spins_per_delay should therefore increase the risk of unexpected
"stuck spinlock" failures.  And finish_spin_delay will decrement
spins_per_delay in any cycle where we slept at least once.
It's plausible therefore that this coding with finish_spin_delay
inside the main wait loop puts more downward pressure on
spins_per_delay than the algorithm is intended to cause.

I kind of wonder whether the premises finish_spin_delay is written
on even apply anymore, given that nobody except some buildfarm
dinosaurs runs Postgres on single-processor hardware anymore.
Maybe we should rip out the whole mechanism and hard-wire
spins_per_delay at 1000 or so.

Less drastically, I wonder if we should call finish_spin_delay
at all in these off-label uses of perform_spin_delay.  What
we're trying to measure there is the behavior of TAS() spin loops,
and I'm not sure that what LWLockWaitListLock and the bufmgr
callers are doing should be assumed to have timing behavior
identical to that.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
"Andrey M. Borodin"
Date:

> On 10 Apr 2024, at 21:48, Parag Paul <parag.paul@gmail.com> wrote:
>
> Yes, the probability of this happening is astronomical, but in production with 128 core servers with 7000 max_connections, with petabyte scale data, this did repro 2 times in the last month. We had to move to a local approach to manage our ratelimiting counters.

FWIW we observed such failure on this [0] LWLock two times too. Both cases were recent (February).
We have ~15k clusters with 8MTPS, so it’s kind of infrequent, but not astronomic. We decided to remove that lock.


Best regards, Andrey Borodin.

[0]
https://github.com/munakoiso/logerrors/pull/25/files#diff-f8903c463a191f399b3e84c815ed6dc60adbbfc0fb0b2db490be1e58dc692146L85


Re: Issue with the PRNG used by Postgres

From
Robert Haas
Date:
On Wed, Apr 10, 2024 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe we should rip out the whole mechanism and hard-wire
> spins_per_delay at 1000 or so.

Or, rip out the whole, whole mechanism and just don't PANIC.

I'm not 100% sure that's better, but I think it's worth considering.
The thing is, if you're panicking regularly, that's almost as bad as
being down, because you're going to lose all of your connections and
have to run recovery before they can be reestablished. And if you're
panicking once in a blue moon, the PANIC actually prevents you from
easily getting a stack trace that might help you figure out what's
happening. And of course if it's not really a stuck spinlock but just
a very slow system, which I think is an extremely high percentage of
real-world cases, then it's completely worthless at best.

To believe that the PANIC is the right idea, we have to suppose that
we have stuck-spinlock bugs that people actually hit, but that those
people don't hit them often enough to care, as long as the system
resets when the spinlock gets stuck, instead of hanging. I can't
completely rule out the existence of either such bugs or such people,
but I'm not aware of having encountered them.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-11 15:24:28 -0400, Robert Haas wrote:
> On Wed, Apr 10, 2024 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Maybe we should rip out the whole mechanism and hard-wire
> > spins_per_delay at 1000 or so.
> 
> Or, rip out the whole, whole mechanism and just don't PANIC.

I continue to believe that that'd be a quite bad idea.


My suspicion is that most of the false positives are caused by lots of signals
interrupting the pg_usleep()s. Because we measure the number of delays, not
the actual time since we've been waiting for the spinlock, signals that
interrupt pg_usleep() can very significantly shorten the amount of
time until we consider a spinlock stuck.  We should fix that.
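
A minimal sketch (an illustration only, not a patch and not PostgreSQL code; the helper names are made up, and the 120-second figure comes from the s_lock.c comment quoted earlier in the thread) of what deciding "stuck" by elapsed time rather than by the number of sleeps could look like, taking the first timestamp only once we actually start sleeping:

    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    #define STUCK_TIMEOUT_USEC  (120LL * 1000 * 1000)   /* roughly 2 minutes */

    typedef struct
    {
        int64_t     first_sleep_usec;   /* 0 until we have slept at least once */
    } TimedDelayState;

    static int64_t
    monotonic_usec(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
    }

    /*
     * Called each time we are about to sleep.  Returns true once the elapsed
     * time (not the count of sleeps) exceeds the timeout, so signals that cut
     * individual sleeps short no longer shorten the effective limit.
     */
    static bool
    spin_timed_out(TimedDelayState *st)
    {
        int64_t     now = monotonic_usec();

        if (st->first_sleep_usec == 0)
        {
            st->first_sleep_usec = now;
            return false;
        }
        return (now - st->first_sleep_usec) > STUCK_TIMEOUT_USEC;
    }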



> To believe that the PANIC is the right idea, we have to suppose that
> we have stuck-spinlock bugs that people actually hit, but that those
> people don't hit them often enough to care, as long as the system
> resets when the spinlock gets stuck, instead of hanging. I can't
> completely rule out the existence of either such bugs or such people,
> but I'm not aware of having encountered them.

I don't think that's a fair description of the situation. It supposes that the
alternative to the PANIC is that the problem is detected and resolved some
other way. But, depending on the spinlock, the problem will not be detected by
automated checks for the system being up. IME you end up with a system that's
degraded in a complicated hard to understand way, rather than one that's just
down.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Apr 10, 2024 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe we should rip out the whole mechanism and hard-wire
>> spins_per_delay at 1000 or so.

> Or, rip out the whole, whole mechanism and just don't PANIC.

By that you mean "remove the NUM_DELAYS limit", right?  We still ought
to sleep sometimes if we don't get a spinlock promptly, so that seems
fairly orthogonal to the other points raised in this thread.

Having said that, it's not an insane suggestion.  Our coding rules
for spinlocks are tight enough that a truly stuck spinlock should
basically never happen, and certainly it basically never happens in
developer testing (IME anyway, maybe other people break things at
that level more often).  Besides, if it does happen it wouldn't be
different in a user's eyes from other infinite or near-infinite loops.
I don't think we could risk putting in a CHECK_FOR_INTERRUPTS, so
getting out of it would require "kill -9" or so; but it's hardly
unheard-of for other problematic loops to not have such a check.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-10 21:52:59 -0400, Tom Lane wrote:
> Less drastically, I wonder if we should call finish_spin_delay
> at all in these off-label uses of perform_spin_delay.  What
> we're trying to measure there is the behavior of TAS() spin loops,
> and I'm not sure that what LWLockWaitListLock and the bufmgr
> callers are doing should be assumed to have timing behavior
> identical to that.

I think the difference between individual spinlocks is bigger than between
spinlocks and lwlock/buffer header locks.


I think we probably should move away from having any spinlocks.  I tried to
just replace them with lwlocks, but today the overhead is too big. The issue
isn't primarily the error handling or such, but the fact that rwlocks are more
expensive than simple exclusive locks. The main differences are:

1) On x86 a spinlock release just needs a compiler barrier, but an rwlock
   needs an atomic op.

2) Simple mutex acquisition is easily done with an atomic-exchange, which is
   much harder for an rwlock (as the lock has more states, so just setting to
   "locked" and checking the return value isn't sufficient). Once a lock is
   contended, an atomic compare-exchange ends up with lots of retries due to
   concurrent changes.

It might be that we can still get away with just removing spinlocks - to my
knowledge we have one very heavily contended performance critical spinlock,
XLogCtl->Insert->insertpos_lck.  I think Thomas and I have come up with a way
to do away with that spinlock.


OTOH, there are plenty other lwlocks where we pay the price of lwlocks being
an rwlock, but just use the exclusive mode. So perhaps we should just add a
exclusive-only lwlock variant.
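
A minimal C11 sketch (an illustration, not PostgreSQL code) of the exclusive-only case described above: acquisition is a single atomic exchange, and the release is a release-ordered store, which on x86 amounts to a plain store plus a compiler barrier:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct
    {
        atomic_flag locked;     /* initialize with ATOMIC_FLAG_INIT */
    } simple_exclusive_lock;

    static inline bool
    simple_lock_try_acquire(simple_exclusive_lock *l)
    {
        /* one atomic exchange; true if we flipped the flag from clear to set */
        return !atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire);
    }

    static inline void
    simple_lock_acquire(simple_exclusive_lock *l)
    {
        /* spin until the exchange succeeds; a real lock would back off here */
        while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
            ;
    }

    static inline void
    simple_lock_release(simple_exclusive_lock *l)
    {
        /* release ordering: on x86 just a plain store plus a compiler barrier */
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }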


Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2024-04-11 15:24:28 -0400, Robert Haas wrote:
>> Or, rip out the whole, whole mechanism and just don't PANIC.

> I continue believe that that'd be a quite bad idea.

I'm warming to it myself.

> My suspicion is that most of the false positives are caused by lots of signals
> interrupting the pg_usleep()s. Because we measure the number of delays, not
> the actual time since we've been waiting for the spinlock, signals
> interrupting pg_usleep() trigger can very significantly shorten the amount of
> time until we consider a spinlock stuck.  We should fix that.

We wouldn't need to fix it, if we simply removed the NUM_DELAYS
limit.  Whatever kicked us off the sleep doesn't matter, we might
as well go check the spinlock.

Also, you propose in your other message replacing spinlocks with
lwlocks.  Whatever the other merits of that, I notice that we have
no timeout or "stuck lwlock" detection.  So that would basically
remove the stuck-spinlock behavior in an indirect way, without
adding any safety measures that would justify thinking that it's
less likely we needed stuck-lock detection than before.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Andres Freund
Date:
Hi,

On 2024-04-11 16:11:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2024-04-11 15:24:28 -0400, Robert Haas wrote:
> >> Or, rip out the whole, whole mechanism and just don't PANIC.
>
> > I continue believe that that'd be a quite bad idea.
>
> I'm warming to it myself.
>
> > My suspicion is that most of the false positives are caused by lots of signals
> > interrupting the pg_usleep()s. Because we measure the number of delays, not
> > the actual time since we've been waiting for the spinlock, signals
> > interrupting pg_usleep() trigger can very significantly shorten the amount of
> > time until we consider a spinlock stuck.  We should fix that.
>
> We wouldn't need to fix it, if we simply removed the NUM_DELAYS
> limit.  Whatever kicked us off the sleep doesn't matter, we might
> as well go check the spinlock.

I suspect we should fix it regardless of whether we keep NUM_DELAYS. We
shouldn't increase cur_delay faster just because a lot of signals are coming
in.  If it were just user triggered signals it'd probably not be worth
worrying about, but we do sometimes send a lot of signals ourselves...


> Also, you propose in your other message replacing spinlocks with lwlocks.
> Whatever the other merits of that, I notice that we have no timeout or
> "stuck lwlock" detection.

True. And that's not great. But at least lwlocks can be identified in
pg_stat_activity, which does help some.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2024-04-11 16:11:40 -0400, Tom Lane wrote:
>> We wouldn't need to fix it, if we simply removed the NUM_DELAYS
>> limit.  Whatever kicked us off the sleep doesn't matter, we might
>> as well go check the spinlock.

> I suspect we should fix it regardless of whether we keep NUM_DELAYS. We
> shouldn't increase cur_delay faster just because a lot of signals are coming
> in.

I'm unconvinced there's a problem there.  Also, what would you do
about this that wouldn't involve adding kernel calls for gettimeofday?
Admittedly, if we only do that when we're about to sleep, maybe it's
not so awful; but it's still adding complexity that I'm unconvinced
is warranted.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

From
Robert Haas
Date:
On Thu, Apr 11, 2024 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> My suspicion is that most of the false positives are caused by lots of signals
> interrupting the pg_usleep()s. Because we measure the number of delays, not
> the actual time since we've been waiting for the spinlock, signals
> interrupting pg_usleep() trigger can very significantly shorten the amount of
> time until we consider a spinlock stuck.  We should fix that.

I mean, go nuts. But <dons asbestos underpants, asbestos regular
pants, 2 pair of asbestos socks, 3 asbestos shirts, 2 asbestos
jackets, and then hides inside of a flame-proof capsule at the bottom
of the Pacific ocean> this is just another thing like query hints,
where everybody says "oh, the right thing to do is fix X or Y or Z and
then you won't need it". But of course it never actually gets fixed
well enough that people stop having problems in the real world. And
eventually we look like a developer community that cares more about
our own opinion about what is right than what the experience of real
users actually is.

> I don't think that's a fair description of the situation. It supposes that the
> alternative to the PANIC is that the problem is detected and resolved some
> other way. But, depending on the spinlock, the problem will not be detected by
> automated checks for the system being up. IME you end up with a system that's
> degraded in a complicated, hard-to-understand way, rather than one that's just
> down.

I'm interested to read that you've seen this actually happen and that
you got that result. What I would have thought would happen is that,
within a relatively short period of time, every backend in the system
would pile up waiting for that spinlock and the whole system would
become completely unresponsive. I mean, I know it depends on exactly
which spinlock it is. But, I would have thought that if this was
happening, it would be happening because some regular backend died in
a weird way, and if that is indeed what happened, then it's likely
that the other backends are doing similar kinds of work, because
that's how application workloads typically behave, so they'll probably
all hit the part of the code where they need that spinlock too, and
now everybody's just spinning.

If it's something like a WAL receiver mutex or the checkpointer mutex
or even a parallel query mutex, then I guess it would look different.
But even then, what I'd expect to see is all backends of type X pile
up on the stuck mutex, and when you check 'ps' or 'top',  you go "oh
hmm, all my WAL receivers are at 100% CPU" and you get a backtrace or
an strace and you go "hmm". Now, I agree that in this kind of scenario
where only some backends lock up, automated checks are not necessarily
going to notice the problem - but a PANIC is hardly better. Now you
just have a system that keeps PANICing, which liveness checks aren't
necessarily going to notice either.

In all seriousness, I'd really like to understand what experience
you've had that makes this check seem useful. Because I think all of
my experiences with it have been bad. If they weren't, the last good
one was a very long time ago.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Issue with the PRNG used by Postgres

От
Andres Freund
Дата:
Hi,

On 2024-04-11 16:46:10 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2024-04-11 16:11:40 -0400, Tom Lane wrote:
> >> We wouldn't need to fix it, if we simply removed the NUM_DELAYS
> >> limit.  Whatever kicked us off the sleep doesn't matter, we might
> >> as well go check the spinlock.
> 
> > I suspect we should fix it regardless of whether we keep NUM_DELAYS. We
> > shouldn't increase cur_delay faster just because a lot of signals are coming
> > in.
> 
> I'm unconvinced there's a problem there.

Obviously that's a different aspect than efficiency, but in local, admittedly
extreme, testing I've seen stuck spinlocks being detected in a fraction of the
normally expected time. A spinlock that ends up sleeping for close to a second
after a relatively short amount of time surely isn't good for predictable
performance.

IIRC the bad case was on a hot standby, with some recovery conflict causing
the startup process to send a lot of signals.


> Also, what would you do about this that wouldn't involve adding kernel calls
> for gettimeofday?  Admittedly, if we only do that when we're about to sleep,
> maybe it's not so awful; but it's still adding complexity that I'm
> unconvinced is warranted.

At least on !windows, pg_usleep() uses nanosleep(), which, when interrupted by
a signal, can return the remaining time until the expiration of the timer.
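
For illustration, a minimal standalone sketch of that idea (the function and
variable names here are made up; this is not code from the thread's patches):

#include <errno.h>
#include <time.h>

/*
 * Sketch only: sleep for roughly "microsec" microseconds and add the time
 * actually slept to *total_slept_us, using the remaining-time output that
 * nanosleep() provides when a signal cuts the sleep short.
 */
static void
sleep_and_account(long microsec, long *total_slept_us)
{
    struct timespec delay;
    struct timespec remain;

    delay.tv_sec = microsec / 1000000L;
    delay.tv_nsec = (microsec % 1000000L) * 1000;

    if (nanosleep(&delay, &remain) == 0)
        *total_slept_us += microsec;
    else if (errno == EINTR)
        *total_slept_us += microsec -
            (remain.tv_sec * 1000000L + remain.tv_nsec / 1000);
}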

I suspect that on windows computing the time when a signal arrived wouldn't be
expensive, compared to all the other overhead implied by our signal handling
emulation.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

От
Andres Freund
Дата:
Hi,

On 2024-04-11 16:46:23 -0400, Robert Haas wrote:
> On Thu, Apr 11, 2024 at 3:52 PM Andres Freund <andres@anarazel.de> wrote:
> > My suspicion is that most of the false positives are caused by lots of signals
> > interrupting the pg_usleep()s. Because we measure the number of delays, not
> > the actual time since we've been waiting for the spinlock, signals
> > interrupting pg_usleep() can very significantly shorten the amount of
> > time until we consider a spinlock stuck.  We should fix that.
> 
> I mean, go nuts. But <dons asbestos underpants, asbestos regular
> pants, 2 pair of asbestos socks, 3 asbestos shirts, 2 asbestos
> jackets, and then hides inside of a flame-proof capsule at the bottom
> of the Pacific ocean> this is just another thing like query hints,
> where everybody says "oh, the right thing to do is fix X or Y or Z and
> then you won't need it". But of course it never actually gets fixed
> well enough that people stop having problems in the real world. And
> eventually we look like a developer community that cares more about
> our own opinion about what is right than what the experience of real
> users actually is.

I don't think that's a particularly apt comparison. If you have spinlocks that
cannot be acquired within tens of seconds, you're in a really bad situation,
regardless of whether you crash-restart or not.

Whereas with hints, you might actually be operating perfectly normally when
using hints.  Never using the wrong plan is also just an order of magnitude
harder and fuzzier problem than ensuring we don't wait for spinlocks for a
long time.


> In all seriousness, I'd really like to understand what experience
> you've had that makes this check seem useful. Because I think all of
> my experiences with it have been bad. If they weren't, the last good
> one was a very long time ago.

By far most of the stuck spinlocks I've seen were due to bugs in
out-of-core extensions. Absurdly enough, the next most common cause is probably
people using gdb to make an uninterruptible process break out of some code,
without a crash-restart, accidentally doing so while a spinlock is held.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

От
Robert Haas
Дата:
On Thu, Apr 11, 2024 at 5:30 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think that's a particularly apt comparison. If you have spinlocks that
> cannot be acquired within tens of seconds, you're in a really bad situation,
> regardless of whether you crash-restart or not.

I agree with that. I just don't think panicking makes it better.

> > In all seriousness, I'd really like to understand what experience
> > you've had that makes this check seem useful. Because I think all of
> > my experiences with it have been bad. If they weren't, the last good
> > one was a very long time ago.
>
> By far most of the stuck spinlocks I've seen were due to bugs in
> out-of-core extensions. Absurdly enough, the next most common cause is probably
> people using gdb to make an uninterruptible process break out of some code,
> without a crash-restart, accidentally doing so while a spinlock is held.

Hmm, interesting. I'm glad I haven't seen those extensions. But I
think I have seen cases of people attaching gdb to grab a backtrace to
debug some problem in production, and failing to detach it within 60
seconds.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Issue with the PRNG used by Postgres

От
Tom Lane
Дата:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Apr 11, 2024 at 5:30 PM Andres Freund <andres@anarazel.de> wrote:
>> By far most of the stuck spinlocks I've seen were due to bugs in
>> out-of-core extensions. Absurdly enough, the next most common cause is probably
>> people using gdb to make an uninterruptible process break out of some code,
>> without a crash-restart, accidentally doing so while a spinlock is held.

> Hmm, interesting. I'm glad I haven't seen those extensions. But I
> think I have seen cases of people attaching gdb to grab a backtrace to
> debug some problem in production, and failing to detach it within 60
> seconds.

I don't doubt that there are extensions with bugs of this ilk
(and I wouldn't bet an appendage that there aren't such bugs
in core, either).  But Robert's question remains: how does
PANIC'ing after a while make anything better?  I flat out don't
believe the idea that having a backend stuck on a spinlock
would otherwise go undetected.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

От
Tom Lane
Дата:
I wrote:
> ... But Robert's question remains: how does
> PANIC'ing after a while make anything better?  I flat out don't
> believe the idea that having a backend stuck on a spinlock
> would otherwise go undetected.

Oh, wait.  After thinking a bit longer I believe I recall the argument
for this behavior: it automates recovery from a genuinely stuck
spinlock.  If we waited forever, the only way out of that is for a
DBA to kill -9 the stuck process, which has exactly the same end
result as a PANIC, except that it takes a lot longer to put the system
back in service and perhaps rousts somebody, or several somebodies,
out of their warm beds to fix it.  If you don't have a DBA on-call
24x7 then that answer looks even worse.

So there's that.  But that's not an argument that we need to be in a
hurry to timeout; if the built-in reaction time is less than perhaps
10 minutes you're still miles ahead of the manual solution.

On the third hand, it's still true that we have no comparable
behavior for any other source of system lockups, and it's difficult
to make a case that stuck spinlocks really need more concern than
other kinds of bugs.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

От
Andres Freund
Дата:
Hi,

On 2024-04-11 23:15:38 -0400, Tom Lane wrote:
> I wrote:
> > ... But Robert's question remains: how does
> > PANIC'ing after awhile make anything better?  I flat out don't
> > believe the idea that having a backend stuck on a spinlock
> > would otherwise go undetected.
> 
> Oh, wait.  After thinking a bit longer I believe I recall the argument
> for this behavior: it automates recovery from a genuinely stuck
> spinlock.  If we waited forever, the only way out of that is for a
> DBA to kill -9 the stuck process, which has exactly the same end
> result as a PANIC, except that it takes a lot longer to put the system
> back in service and perhaps rousts somebody, or several somebodies,
> out of their warm beds to fix it.  If you don't have a DBA on-call
> 24x7 then that answer looks even worse.

Precisely.  And even if you have a DBA on call 24x7, they need to know that
they need to react to something.

Today you can automate getting notified of crash-restarts, by using
  restart_after_crash = false
and restarting somewhere outside of postgres.

Imo that's the only sensible setting for larger production environments,
although I'm sure not everyone agrees with that.


> So there's that.  But that's not an argument that we need to be in a
> hurry to timeout; if the built-in reaction time is less than perhaps
> 10 minutes you're still miles ahead of the manual solution.

The current timeout adds up to a hard-to-determine total time, because the
wait times increase and wrap around, but it's normally longer than 60s,
unless you're interrupted by a lot of signals: 1000 sleeps of between 1000 and
1000000 us each.

I think we should make the timeout something predictable and probably somewhat
longer.


> On the third hand, it's still true that we have no comparable
> behavior for any other source of system lockups, and it's difficult
> to make a case that stuck spinlocks really need more concern than
> other kinds of bugs.

Spinlocks are somewhat more finicky though, compared to e.g. lwlocks, which are
released on error. Lwlocks also, for example, take care to hold interrupts so code
doesn't just jump out of a section with lwlocks held.
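
To make that contrast concrete, here is a purely illustrative pair of critical
sections (the "shared" struct and the callee are placeholders, not code from
this thread):

/*
 * Spinlock section: short, straight-line code only, nothing that can
 * elog() or block, because nothing releases the lock for us on error.
 */
SpinLockAcquire(&shared->mutex);
shared->counter++;
SpinLockRelease(&shared->mutex);

/*
 * Lwlock section: interrupts are held while the lock is taken, and error
 * recovery releases the lock if the callee throws.
 */
LWLockAcquire(shared->lock, LW_EXCLUSIVE);
update_shared_state(shared);    /* hypothetical function that may error out */
LWLockRelease(shared->lock);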

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2024-04-11 23:15:38 -0400, Tom Lane wrote:
>> On the third hand, it's still true that we have no comparable
>> behavior for any other source of system lockups, and it's difficult
>> to make a case that stuck spinlocks really need more concern than
>> other kinds of bugs.

> Spinlocks are somewhat more finicky though, compared to e.g. lwlocks, which are
> released on error. Lwlocks also, for example, take care to hold interrupts so code
> doesn't just jump out of a section with lwlocks held.

Yeah.  I don't think that unifying spinlocks with lwlocks is a great
idea, precisely because lwlocks have these other small overheads
in the name of bug detection/prevention.  It seems to me that the
division between spinlocks and lwlocks and heavyweight locks is
basically a good idea that matches up well with the actual
requirements of some different parts of our code.  The underlying
implementations can change (and have), but the idea of successively
increasing amounts of protection against caller error seems sound
to me.

If you grant that concept, then the idea that spinlocks need more
protection against stuck-ness than the higher-overhead levels do
seems mighty odd.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

От
Andres Freund
Дата:
Hi,

On 2024-04-11 20:47:37 -0700, Andres Freund wrote:
> > So there's that.  But that's not an argument that we need to be in a
> > hurry to timeout; if the built-in reaction time is less than perhaps
> > 10 minutes you're still miles ahead of the manual solution.
> 
> The current timeout adds up to a hard-to-determine total time, because the
> wait times increase and wrap around, but it's normally longer than 60s,
> unless you're interrupted by a lot of signals: 1000 sleeps of between 1000 and
> 1000000 us each.
> 
> I think we should make the timeout something predictable and probably somewhat
> longer.

FWIW, I just reproduced the scenario with signals. I added tracking of the
total time actually slept and lost to SpinDelayStatus, and added a function to
trigger a wait on a spinlock.

To wait less, I set max_standby_streaming_delay=0.1, but that's just for
easier testing in isolation. In reality that could have been reached before
the spinlock is even acquired.

On a standby, while a recovery conflict is happening:
PANIC:  XX000: stuck spinlock detected at crashme, path/to/file:line, after 4.38s, lost 127.96s


So right now it's really not hard to trigger the stuck-spinlock logic
completely spuriously.  This doesn't just happen with hot standby, there are
plenty of other sources of lots of signals being sent.


Tracking the total amount of time spent sleeping doesn't require any
additional syscalls, due to the nanosleep()...

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

От
Alexander Lakhin
Дата:
Hi Andres,

12.04.2024 07:41, Andres Freund wrote:
>
> FWIW, I just reproduced the scenario with signals. I added tracking of the
> total time actually slept and lost to SpinDelayStatus, and added a function to
> trigger a wait on a spinlock.
>
> To wait less, I set max_standby_streaming_delay=0.1, but that's just for
> easier testing in isolation. In reality that could have been reached before
> the spinlock is even acquired.
>
> On a standby, while a recovery conflict is happening:
> PANIC:  XX000: stuck spinlock detected at crashme, path/to/file:line, after 4.38s, lost 127.96s
>
>
> So right now it's really not hard to trigger the stuck-spinlock logic
> completely spuriously.  This doesn't just happen with hot standby, there are
> plenty of other sources of lots of signals being sent.

I managed to trigger that logic when trying to construct a reproducer
for bug #18426.

With the following delays added:
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1776,6 +1776,7 @@ retry:
       */
      if (BUF_STATE_GET_REFCOUNT(buf_state) != 0)
      {
+pg_usleep(300000L);
          UnlockBufHdr(buf, buf_state);
          LWLockRelease(oldPartitionLock);
          /* safety check: should definitely not be our *own* pin */
@@ -5549,6 +5550,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits,

      Assert(buf_state & BM_IO_IN_PROGRESS);

+pg_usleep(300);
      buf_state &= ~(BM_IO_IN_PROGRESS | BM_IO_ERROR);
      if (clear_dirty && !(buf_state & BM_JUST_DIRTIED))
          buf_state &= ~(BM_DIRTY | BM_CHECKPOINT_NEEDED);

and /tmp/temp.config:
bgwriter_delay = 10

TEMP_CONFIG=/tmp/temp.config make -s check -C src/test/recovery PROVE_TESTS="t/032*"
fails for me on iterations 22, 23, 37:
2024-04-12 05:00:17.981 UTC [762336] PANIC:  stuck spinlock detected at WaitBufHdrUnlocked, bufmgr.c:5726

I haven't investigated this case yet.

Best regards,
Alexander



Re: Issue with the PRNG used by Postgres

От
Andres Freund
Дата:
Hi,

On 2024-04-11 21:41:39 -0700, Andres Freund wrote:
> FWIW, I just reproduced the scenario with signals. I added tracking of the
> total time actually slept and lost to SpinDelayStatus, and added a function to
> trigger a wait on a spinlock.
>
> To wait less, I set max_standby_streaming_delay=0.1, but that's just for
> easier testing in isolation. In reality that could have been reached before
> the spinlock is even acquired.
>
> On a standby, while a recovery conflict is happening:
> PANIC:  XX000: stuck spinlock detected at crashme, path/to/file:line, after 4.38s, lost 127.96s
>
>
> So right now it's really not hard to trigger the stuck-spinlock logic
> completely spuriously.  This doesn't just happen with hot standby, there are
> plenty of other sources of lots of signals being sent.

Oh my. There's a workload that completely trivially hits this, without even
trying hard. LISTEN/NOTIFY.

PANIC:  XX000: stuck spinlock detected at crashme, file:line, after 0.000072s, lost 133.027159s

Yes, it really triggered in less than 1ms. That was with just one session
doing NOTIFYs from a client. There are plenty of users that send NOTIFY from
triggers, which afaics will result in much higher rates of signals being sent.


Even with a bit less NOTIFY traffic, this very obviously gets into the
territory where plain scheduling delays will trigger the stuck spinlock logic.

Greetings,

Andres



Re: Issue with the PRNG used by Postgres

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> Oh my. There's a workload that completely trivially hits this, without even
> trying hard. LISTEN/NOTIFY.

Hm.  Bug in the NOTIFY logic perhaps?  Sending that many signals
can't be good from a performance POV, whether or not it triggers
spinlock issues.

            regards, tom lane



Re: Issue with the PRNG used by Postgres

От
Andres Freund
Дата:
Hi,

On 2024-04-12 09:43:46 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Oh my. There's a workload that completely trivially hits this, without even
> > trying hard. LISTEN/NOTIFY.
> 
> Hm.  Bug in the NOTIFY logic perhaps?

I don't think it's a bug; async.c:SignalBackends() will signal once for every
NOTIFY, regardless of whether the target has already been signaled. You're
certainly right that:

> Sending that many signals can't be good from a performance POV, whether or
> not it triggers spinlock issues.

I wonder if we could switch this to latches, because with latches we'd only
re-send a signal if the receiving end has already processed the signal. Or
alternatively, make procsignal.c do something similar, although that might
take a bit of work to be done race-free.

Greetings,

Andres Freund



Re: Issue with the PRNG used by Postgres

От
Alexander Lakhin
Дата:
12.04.2024 08:05, Alexander Lakhin wrote:
> 2024-04-12 05:00:17.981 UTC [762336] PANIC:  stuck spinlock detected at WaitBufHdrUnlocked, bufmgr.c:5726
>

It looks like that spinlock issue is caused by a race condition/deadlock.
What I see when the test fails is:
A client backend executing "DROP DATABASE conflict_db" performs
dropdb() -> DropDatabaseBuffers() -> InvalidateBuffer()
At the same time, bgwriter performs (for the same buffer):
BgBufferSync() -> SyncOneBuffer()

When InvalidateBuffer() is called, the buffer refcount is zero;
then bgwriter pins the buffer, thus increasing the refcount;
InvalidateBuffer() gets into the retry loop;
bgwriter calls UnpinBuffer() -> UnpinBufferNoOwner() ->
   WaitBufHdrUnlocked(), which waits for the !BM_LOCKED state,
while InvalidateBuffer() waits for the buffer refcount to decrease.

As it turns out, it's not related to spinlocks' specifics or PRNG, just a
serendipitous find.

Best regards,
Alexander



Re: Issue with the PRNG used by Postgres

От
Andres Freund
Дата:
Hi,

Given that I found several ways to spuriously trigger the stuck spinlock
logic, I think we need to backpatch a fix.


On 2024-04-11 21:41:39 -0700, Andres Freund wrote:
> Tracking the total amount of time spent sleeping doesn't require any
> additional syscalls, due to the nanosleep()...

I did quickly implement this, and it works nicely on unix. Unfortunately I
couldn't find something similar for windows, which would mean we'd need to
record the time before every sleep.

Another issue with that approach is that we'd need to add a field to
SpinDelayStatus, which'd be an ABI break. Not sure there's any external code
using it, but if there were, it'd be a fairly nasty, rarely hit, path.

Afaict there's no padding that could be reused on 32 bit platforms [1].


I wonder if, for easy backpatching, the easiest solution is to just reset
errno before calling pg_usleep(), and only increment status->delays if
errno != EINTR. Given our pg_usleep() implementations, that'd prevent the stuck
spinlock logic from triggering too fast.
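
In terms of perform_spin_delay(), the idea is roughly the following sketch
(a sketch of the approach as described, not the actual patch):

errno = 0;
pg_usleep(status->cur_delay);

/*
 * Our pg_usleep() implementations leave errno set to EINTR when the sleep
 * is cut short by a signal; don't count such iterations toward the
 * NUM_DELAYS stuck-spinlock limit.
 */
if (errno != EINTR)
    status->delays++;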


If the ABI issue weren't there, we could record the current time the first
time we sleep during a spinlock acquisition (there's an existing branch for
that) and then only trigger the stuck spinlock detection when enough time has
passed.  That'd be fine from an efficiency POV.


Even if we decide to do so, I don't think it'd be a good idea to remove the
stuck logic altogether in the backbranches. That might take out services that
have been running ok-ish for a long time.

Greetings,

Andres Freund


[1] But we are wasting a bunch of space on 64bit platforms due to switching
between pointers and 32bit integers, should probably fix that.



Re: Issue with the PRNG used by Postgres

От
Andres Freund
Дата:
Hi,

On 2024-04-12 11:33:17 -0700, Andres Freund wrote:
> I wonder if, for easy backpatching, the easiest solution is to just reset
> errno before calling pg_usleep(), and only increment status->delays if
> errno != EINTR. Given our pg_usleep() implementations, that'd prevent the stuck
> spinlock logic from triggering too fast.

Here's a patch implementing this approach. I confirmed that before the patch
we trigger the stuck spinlock logic very quickly, and that after it we don't.
However, if most sleeps are interrupted, it can delay the stuck spinlock
detection a good bit. But that seems much better than triggering it too quickly.

Greetings,

Andres Freund

Вложения

Re: Issue with the PRNG used by Postgres

От
Robert Haas
Дата:
On Fri, Apr 12, 2024 at 3:33 PM Andres Freund <andres@anarazel.de> wrote:
> Here's a patch implementing this approach. I confirmed that before the patch
> we trigger the stuck spinlock logic very quickly, and that after it we don't.
> However, if most sleeps are interrupted, it can delay the stuck spinlock
> detection a good bit. But that seems much better than triggering it too quickly.

+1 for doing something about this. I'm not sure if it goes far enough,
but it definitely seems much better than doing nothing. Given your
findings, I'm honestly kind of surprised that I haven't seen problems
of this type more frequently. And I think the general idea of not
counting the waits if they're interrupted makes sense. Sure, it's not
going to be 100% accurate, but it's got to be way better for the timer
to trigger too slowly than too quickly. Perhaps that's too glib of me,
given that I'm not sure we should even have a timer, but even if we
stipulate that the panic is useful in some cases, spurious panics are
still really bad.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Issue with the PRNG used by Postgres

От
Andres Freund
Дата:
Hi,

On 2024-04-15 10:54:16 -0400, Robert Haas wrote:
> On Fri, Apr 12, 2024 at 3:33 PM Andres Freund <andres@anarazel.de> wrote:
> > Here's a patch implementing this approach. I confirmed that before the patch
> > we trigger the stuck spinlock logic very quickly, and that after it we don't.
> > However, if most sleeps are interrupted, it can delay the stuck spinlock
> > detection a good bit. But that seems much better than triggering it too quickly.
>
> +1 for doing something about this. I'm not sure if it goes far enough,
> but it definitely seems much better than doing nothing.

One thing I started to worry about is whether a patch ought to prevent
the delay used by perform_spin_delay() from increasing when
interrupted. Otherwise a few signals can trigger quite long waits.
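
As a rough sketch of that variant (illustrative only; increase_delay() is a
hypothetical stand-in for the existing backoff step in perform_spin_delay()):

if (errno != EINTR)
{
    /* only count, and only grow the backoff for, sleeps that completed */
    status->delays++;
    increase_delay(status);     /* hypothetical helper, not an existing function */
}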

But as I can't quite see a way to make this accurate in the backbranches, I
suspect something like what I posted is still a good first version.


> Given your findings, I'm honestly kind of surprised that I haven't seen
> problems of this type more frequently.

Same. I did a bunch of searches for the error, but found surprisingly
little.

I think in practice most spinlocks just aren't contended enough to reach
perform_spin_delay(). And we have improved some on that over time.

Greetings,

Andres Freund