Обсуждение: error "can only drop stats once" brings down database

Поиск
Список
Период
Сортировка

error "can only drop stats once" brings down database

От
Floris Van Nee
Дата:

Hi,

 

On a database we have we’ve recently seen a fatal error occur twice. The error happened on two different physical replicas (of the same cluster) during a WAL redo action in the recovery process. They’re running Postgres 15.5.

 

Occurrence 1:

2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01; dropped stats: 2/16405/2991517839 2/16405/2991517838 2/16405/2991517835; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 snapshot 2608 relcache 2991517835 snapshot 1214 relcache 2991517838 relcache 2991517839 relcache 2991517838 snapshot 2608 relcache 2991517838 relcache 2991517839 relcache 2991517835 snapshot 2608 relcache 2991517839 relcache 2991517838 snapshot 2608 relcache 2991517838 snapshot 2608 snapshot 2608 snapshot 2608 relcache 2991517835 snapshot 2608 snapshot 1214",,,,"","startup",,0

2024-02-01 06:55:56.793 CET,,,70282,,65a29b5f.1128a,8,,2024-01-13 15:17:03 CET,,0,LOG,00000,"startup process (PID 70290) exited with exit code 1",,,,,,,,,"","postmaster",,0 2024-02-01 06:55:56.793 CET,,,70282,,65a29b5f.1128a,9,,2024-01-13 15:17:03 CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"","postmaster",,0 2024-02-01 06:55:57.145 CET,,,26624,"100.104.20.59:37478",65bb326d.6800,1,"",2024-02-01 06:55:57

 

Occurrence 2:

2024-02-09 19:15:41.353 CET,,,88714,,65a29b5e.15a8a,6,,2024-01-13 15:17:02 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A95F1/3C9D3D88 for Transaction/COMMIT: 2024-02-09 19:15:41.33755+01; dropped stats: 2/16405/3843195112 2/16405/3843195111 2/16405/3843195107; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 snapshot 2608 relcache 3843195107 snapshot 1214 relcache 3843195111 relcache 3843195112 relcache 3843195111 snapshot 2608 relcache 3843195111 relcache 3843195112 relcache 3843195107 snapshot 2608 relcache 3843195107 snapshot 2608 relcache 3843195107 snapshot 2608 relcache 3843195112 relcache 3843195111 snapshot 2608 relcache 3843195111 snapshot 2608 snapshot 2608 snapshot 2608 relcache 3843195107 snapshot 2608 snapshot 1214",,,,"","startup",,0

 

Googling the error only shows one conversation from last year: https://www.postgresql.org/message-id/flat/17947-b9554521ad963c9c%40postgresql.org

However, in that thread it is caused by logical replication slots, which this cluster does not have.

 

Does anyone have an idea what may be causing this?

 

Separately, it seems like Postgres did not restart by itself when this error occurred (normally if a backend crashes Postgres would kill all connections and initiate a restart), but in this case because it’s the recovery process that stops, the whole database just shuts down and needs to be manually restarted. Is that intended?

 

After manually starting the database again, recovery would continue as normal. The database would be in normal operation again.

 

-Floris

 

Re: error "can only drop stats once" brings down database

От
Tom Lane
Дата:
Floris Van Nee <florisvannee@Optiver.com> writes:
> Hi,
> On a database we have we've recently seen a fatal error occur twice. The error happened on two different physical
replicas(of the same cluster) during a WAL redo action in the recovery process. They're running Postgres 15.5. 

> Occurrence 1:
> 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04 CET,1/0,0,FATAL,XX000,"can only drop stats
once",,,,,"WALredo at A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01; ... 

Hmm.  This must be coming from pgstat_drop_entry_internal.
I suspect the correct fix is in pgstat_drop_entry, along
the lines of

-    if (shent)
+    if (shent && !shent->dropped)

but it's not clear to me how the already-dropped case ought to affect
the function's bool result.  Also, how are we getting into a
concurrent-drop situation in recovery?

            regards, tom lane



RE: error "can only drop stats once" brings down database

От
Floris Van Nee
Дата:
> Floris Van Nee <florisvannee@Optiver.com> writes:
> > Hi,
> > On a database we have we've recently seen a fatal error occur twice. The
> error happened on two different physical replicas (of the same cluster)
> during a WAL redo action in the recovery process. They're running Postgres
> 15.5.
>
> > Occurrence 1:
> > 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04
> CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at
> A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01;
> ...
>
> Hmm.  This must be coming from pgstat_drop_entry_internal.
> I suspect the correct fix is in pgstat_drop_entry, along the lines of
>
> -    if (shent)
> +    if (shent && !shent->dropped)
>
> but it's not clear to me how the already-dropped case ought to affect the
> function's bool result.  Also, how are we getting into a concurrent-drop
> situation in recovery?

Anyone has further thoughts on this? This still happens occasionally.

-Floris




Re: error "can only drop stats once" brings down database

От
Tom Lane
Дата:
Floris Van Nee <florisvannee@Optiver.com> writes:
>> Hmm.  This must be coming from pgstat_drop_entry_internal.
>> I suspect the correct fix is in pgstat_drop_entry, along the lines of
>> 
>> -    if (shent)
>> +    if (shent && !shent->dropped)
>> 
>> but it's not clear to me how the already-dropped case ought to affect the
>> function's bool result.  Also, how are we getting into a concurrent-drop
>> situation in recovery?

> Anyone has further thoughts on this? This still happens occasionally.

I was expecting Andres to comment on it eventually.  I don't know
that code well enough to want to modify it on my own authority,
especially not just before a release.

One thing that would be safe enough and perhaps useful is to
modify the "can only drop stats once" message to provide a
little more detail, like the hash key of the problematic
entry.  That might help us understand what's triggering this.
The OIDs would be opaque perhaps, but the PgStat_Kind value
seems useful to know.

            regards, tom lane



Re: error "can only drop stats once" brings down database

От
Alexander Lakhin
Дата:
Hello Floris,

03.05.2024 21:10, Floris Van Nee wrote:
>> Floris Van Nee <florisvannee@Optiver.com> writes:
>>> Hi,
>>> On a database we have we've recently seen a fatal error occur twice. The
>> error happened on two different physical replicas (of the same cluster)
>> during a WAL redo action in the recovery process. They're running Postgres
>> 15.5.
>>
>>> Occurrence 1:
>>> 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04
>> CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at
>> A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01;
>> ...
>>
>> Hmm.  This must be coming from pgstat_drop_entry_internal.
>> I suspect the correct fix is in pgstat_drop_entry, along the lines of
>>
>> -    if (shent)
>> +    if (shent && !shent->dropped)
>>
>> but it's not clear to me how the already-dropped case ought to affect the
>> function's bool result.  Also, how are we getting into a concurrent-drop
>> situation in recovery?
> Anyone has further thoughts on this? This still happens occasionally.
>

Please take a look at bug #17947 (maybe you encounter the same?):
https://www.postgresql.org/message-id/17947-b9554521ad963c9c%40postgresql.org

Best regards,
Alexander



Re: error "can only drop stats once" brings down database

От
Andres Freund
Дата:
Hi,

On 2024-05-03 18:10:05 +0000, Floris Van Nee wrote:
> > Floris Van Nee <florisvannee@Optiver.com> writes:
> > > Hi,
> > > On a database we have we've recently seen a fatal error occur twice. The
> > error happened on two different physical replicas (of the same cluster)
> > during a WAL redo action in the recovery process. They're running Postgres
> > 15.5.
> >
> > > Occurrence 1:
> > > 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04
> > CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at
> > A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01;
> > ...
> >
> > Hmm.  This must be coming from pgstat_drop_entry_internal.
> > I suspect the correct fix is in pgstat_drop_entry, along the lines of
> >
> > -    if (shent)
> > +    if (shent && !shent->dropped)
> >
> > but it's not clear to me how the already-dropped case ought to affect the
> > function's bool result.

I don't think that'd be quite right - just ignoring that we're confused about
tracking "stats object" liveliness seems likely to hide bugs.

Elsewhere in this thread you suggested adding more details about the error -
let's do that. Something like the attached might already be an improvement?


> Also, how are we getting into a concurrent-drop situation in recovery?

I'd like to know how we get into the situation too. It's perhaps worth noting
that stats can be generated on a standby, albeit not by the replay
process. But locking should prevent active use of the stats entry when it's
being dropped...


> Anyone has further thoughts on this? This still happens occasionally.

Do you have any more details about the workload leading to this issue? Is the
standby used for queries?  Given the "high value" your oids/relfilenodes have,
I assume there are a lot of created/dropped/truncated relations?

Greetings,

Andres Freund

Вложения

Re: error "can only drop stats once" brings down database

От
Andres Freund
Дата:
Hi,

On 2024-05-05 09:09:15 -0700, Andres Freund wrote:
> Do you have any more details about the workload leading to this issue? Is the
> standby used for queries?  Given the "high value" your oids/relfilenodes have,
> I assume there are a lot of created/dropped/truncated relations?

I suspect that the "high value" oids are related to the issue. I bet you're
running into OID wraparound for objects at a much higher frequency than most
users.  I suspect the reason this causes problems is the problem I just found
and described here:
https://www.postgresql.org/message-id/20240505183741.5tefii3razzefvtc%40awork3.anarazel.de

Could you confirm that you have a) a lot of oid assignments b) your startup
process was running for a long time by the time of the crash?

The fact that we're not triggering "stats object garbage collection" for most
drops also explains why you're much more likely to see this on a standby than
on a primary. An a primary it's going to be rare to have a single backend live
long enough to observe an oid wraparound leading to one backend accessing
stats for the same object type with the same oid after that object previously
having been dropped. But most stats accesses on a standby are going to be by
the same process, the startup process.  You'd still need some other accesses
to prevent the object from being dropped "immediately", but that could be due
to shorter lived processes.

Greetings,

Andres Freund



RE: error "can only drop stats once" brings down database

От
Floris Van Nee
Дата:
>
> Could you confirm that you have a) a lot of oid assignments b) your startup
> process was running for a long time by the time of the crash?
>

Thanks Andres. Both higher than average I guess, although it depends on what is considered 'a lot' and 'a long time'.
Thestartup process was running for a few months. There are ~500k entries in pg_class, of which most are (Timescale)
partitions.However, even with this number of items in pg_class I would't expect wraparound to happen frequently? These
arenot dropped/recreated. I've monitored "select count(*) from pg_class" for a while to see if it changes often, and
whilethere are changes during the day (likely temporary tables being created), it also doesn't nearly happen at a
frequencythat will get us to a wraparound quickly. 

Oids aren't just consumed by pg_class though. And I do see system oid growing quickly (when doing CREATE TABLE twice
witha minute in between, then checking the oid difference of the table). I don't know how to investigate the cause of
this.What would be the best way to check what could be consuming these oids so quickly? 

-Floris