Discussion: error "can only drop stats once" brings down database
Hi,
On one of our databases we've recently seen a fatal error occur twice. The error happened on two different physical replicas (of the same cluster) during WAL redo in the recovery process. They're running Postgres 15.5.
Occurrence 1:
2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01; dropped stats: 2/16405/2991517839 2/16405/2991517838 2/16405/2991517835; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 snapshot 2608 relcache 2991517835 snapshot 1214 relcache 2991517838 relcache 2991517839 relcache 2991517838 snapshot 2608 relcache 
2991517838 relcache 2991517839 relcache 2991517835 snapshot 2608 relcache 2991517839 relcache 2991517838 snapshot 2608 relcache 2991517838 snapshot 2608 snapshot 2608 snapshot 2608 relcache 2991517835 snapshot 2608 snapshot 1214",,,,"","startup",,0
2024-02-01 06:55:56.793 CET,,,70282,,65a29b5f.1128a,8,,2024-01-13 15:17:03 CET,,0,LOG,00000,"startup process (PID 70290) exited with exit code 1",,,,,,,,,"","postmaster",,0
2024-02-01 06:55:56.793 CET,,,70282,,65a29b5f.1128a,9,,2024-01-13 15:17:03 CET,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,"","postmaster",,0
2024-02-01 06:55:57.145 CET,,,26624,"100.104.20.59:37478",65bb326d.6800,1,"",2024-02-01 06:55:57
Occurrence 2:
2024-02-09 19:15:41.353 CET,,,88714,,65a29b5e.15a8a,6,,2024-01-13 15:17:02 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A95F1/3C9D3D88 for Transaction/COMMIT: 2024-02-09 19:15:41.33755+01; dropped stats: 2/16405/3843195112 2/16405/3843195111 2/16405/3843195107; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 catcache 55 catcache 54 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 32 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 catcache 80 catcache 79 catcache 80 catcache 79 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 
catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 55 catcache 54 snapshot 2608 relcache 3843195107 snapshot 1214 relcache 3843195111 relcache 3843195112 relcache 3843195111 snapshot 2608 relcache 3843195111 relcache 3843195112 relcache 3843195107 snapshot 2608 relcache 3843195107 snapshot 2608 relcache 3843195107 snapshot 2608 relcache 3843195112 relcache 3843195111 snapshot 2608 relcache 3843195111 snapshot 2608 snapshot 2608 snapshot 2608 relcache 3843195107 snapshot 2608 snapshot 1214",,,,"","startup",,0
Googling the error turns up only one conversation, from last year: https://www.postgresql.org/message-id/flat/17947-b9554521ad963c9c%40postgresql.org
However, in that thread the error was caused by logical replication slots, which this cluster does not have.
Does anyone have an idea what may be causing this?
Separately, it seems Postgres did not restart by itself when this error occurred. Normally, if a backend crashes, Postgres kills all connections and initiates a restart; but because it's the recovery process that stops here, the whole database simply shuts down and needs to be started manually. Is that intended?
After manually starting the database again, recovery continued as normal and the database returned to normal operation.
-Floris
Floris Van Nee <florisvannee@Optiver.com> writes:
> Hi,
> On a database we have we've recently seen a fatal error occur twice. The
> error happened on two different physical replicas (of the same cluster)
> during a WAL redo action in the recovery process. They're running Postgres 15.5.

> Occurrence 1:
> 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04 CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01;
> ...

Hmm.  This must be coming from pgstat_drop_entry_internal.
I suspect the correct fix is in pgstat_drop_entry, along the lines of

-    if (shent)
+    if (shent && !shent->dropped)

but it's not clear to me how the already-dropped case ought to affect
the function's bool result.  Also, how are we getting into a
concurrent-drop situation in recovery?

			regards, tom lane
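To make the invariant under discussion concrete, here is a toy model of the drop-once check and the suggested `!dropped` guard. Python is used purely for illustration (the real code is C), and all names here are invented; in PostgreSQL the entry lives in dynamic shared memory and "still referenced" corresponds to a nonzero reference count held by other backends.

```python
class DropStatsTwiceError(Exception):
    """Models the FATAL 'can only drop stats once' error."""


class StatsEntry:
    """Stand-in for a shared stats hash entry keyed by (kind, dboid, objoid)."""

    def __init__(self, key):
        self.key = key
        self.dropped = False
        self.refcount = 1  # creator's reference


def drop_entry(entry, tolerate_double_drop=False):
    """Mark an entry dropped; report True if it could be freed immediately.

    With tolerate_double_drop=True this behaves like the suggested
    `if (shent && !shent->dropped)` guard: a second drop is silently
    ignored instead of raising.
    """
    if entry.dropped:
        if tolerate_double_drop:
            return False  # per the thread, the right bool result is unclear
        raise DropStatsTwiceError(f"can only drop stats once: {entry.key}")
    entry.dropped = True
    entry.refcount -= 1
    # If someone still references the entry, it lingers (dropped but present)
    # until the last reference goes away.
    return entry.refcount == 0
```

With an extra reference pinning the entry, the first drop leaves it lingering in the "dropped" state, and a second drop trips the strict check — which is the state the startup process apparently ended up in.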
> Floris Van Nee <florisvannee@Optiver.com> writes:
> > Hi,
> > On a database we have we've recently seen a fatal error occur twice. The
> > error happened on two different physical replicas (of the same cluster)
> > during a WAL redo action in the recovery process. They're running Postgres
> > 15.5.
>
> > Occurrence 1:
> > 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04
> > CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at
> > A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01;
> > ...
>
> Hmm.  This must be coming from pgstat_drop_entry_internal.
> I suspect the correct fix is in pgstat_drop_entry, along the lines of
>
> -    if (shent)
> +    if (shent && !shent->dropped)
>
> but it's not clear to me how the already-dropped case ought to affect the
> function's bool result.  Also, how are we getting into a concurrent-drop
> situation in recovery?

Anyone has further thoughts on this? This still happens occasionally.

-Floris
Floris Van Nee <florisvannee@Optiver.com> writes:
>> Hmm.  This must be coming from pgstat_drop_entry_internal.
>> I suspect the correct fix is in pgstat_drop_entry, along the lines of
>>
>> -    if (shent)
>> +    if (shent && !shent->dropped)
>>
>> but it's not clear to me how the already-dropped case ought to affect the
>> function's bool result.  Also, how are we getting into a concurrent-drop
>> situation in recovery?

> Anyone has further thoughts on this? This still happens occasionally.

I was expecting Andres to comment on it eventually.  I don't know that
code well enough to want to modify it on my own authority, especially
not just before a release.

One thing that would be safe enough and perhaps useful is to modify the
"can only drop stats once" message to provide a little more detail, like
the hash key of the problematic entry.  That might help us understand
what's triggering this.  The OIDs would be opaque perhaps, but the
PgStat_Kind value seems useful to know.

			regards, tom lane
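As a sketch of the extra detail being proposed: the server log above already prints dropped stats as kind/dboid/objoid triples (e.g. "2/16405/2991517839"), so an augmented message could spell that key out. The numeric kind-to-name mapping below is an assumption for illustration, not the actual PgStat_Kind naming from the source; Python again stands in for the C `errmsg` formatting.

```python
# Assumed mapping for illustration only; the real names would come from
# PostgreSQL's PgStat_Kind infrastructure.
KIND_NAMES = {1: "database", 2: "relation", 3: "function"}


def drop_stats_error(kind, dboid, objoid):
    """Format a 'can only drop stats once' message carrying the hash key."""
    kind_name = KIND_NAMES.get(kind, f"kind {kind}")
    return (f"can only drop stats once: kind={kind_name}, "
            f"dboid={dboid}, objoid={objoid}")
```

For the first occurrence's key this would render the kind in readable form while keeping the (admittedly opaque) OIDs available for correlation with the WAL record.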
Hello Floris,

03.05.2024 21:10, Floris Van Nee wrote:
>> Floris Van Nee <florisvannee@Optiver.com> writes:
>>> Hi,
>>> On a database we have we've recently seen a fatal error occur twice. The
>> error happened on two different physical replicas (of the same cluster)
>> during a WAL redo action in the recovery process. They're running Postgres
>> 15.5.
>>
>>> Occurrence 1:
>>> 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04
>> CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at
>> A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01;
>> ...
>>
>> Hmm.  This must be coming from pgstat_drop_entry_internal.
>> I suspect the correct fix is in pgstat_drop_entry, along the lines of
>>
>> -    if (shent)
>> +    if (shent && !shent->dropped)
>>
>> but it's not clear to me how the already-dropped case ought to affect the
>> function's bool result.  Also, how are we getting into a concurrent-drop
>> situation in recovery?
> Anyone has further thoughts on this? This still happens occasionally.

Please take a look at bug #17947 (maybe you encounter the same?):
https://www.postgresql.org/message-id/17947-b9554521ad963c9c%40postgresql.org

Best regards,
Alexander
Hi,

On 2024-05-03 18:10:05 +0000, Floris Van Nee wrote:
> > Floris Van Nee <florisvannee@Optiver.com> writes:
> > > Hi,
> > > On a database we have we've recently seen a fatal error occur twice. The
> > error happened on two different physical replicas (of the same cluster)
> > during a WAL redo action in the recovery process. They're running Postgres
> > 15.5.
> >
> > > Occurrence 1:
> > > 2024-02-01 06:55:54.476 CET,,,70290,,65a29b60.11292,6,,2024-01-13 15:17:04
> > CET,1/0,0,FATAL,XX000,"can only drop stats once",,,,,"WAL redo at
> > A7BD1/D6F9B6C0 for Transaction/COMMIT: 2024-02-01 06:55:54.395851+01;
> > ...
> >
> > Hmm.  This must be coming from pgstat_drop_entry_internal.
> > I suspect the correct fix is in pgstat_drop_entry, along the lines of
> >
> > -    if (shent)
> > +    if (shent && !shent->dropped)
> >
> > but it's not clear to me how the already-dropped case ought to affect the
> > function's bool result.

I don't think that'd be quite right - just ignoring that we're confused
about tracking "stats object" liveliness seems likely to hide bugs.

Elsewhere in this thread you suggested adding more details about the
error - let's do that. Something like the attached might already be an
improvement?

> Also, how are we getting into a concurrent-drop situation in recovery?

I'd like to know how we get into the situation too. It's perhaps worth
noting that stats can be generated on a standby, albeit not by the
replay process. But locking should prevent active use of the stats entry
when it's being dropped...

> Anyone has further thoughts on this? This still happens occasionally.

Do you have any more details about the workload leading to this issue?
Is the standby used for queries? Given the "high value" your
oids/relfilenodes have, I assume there are a lot of
created/dropped/truncated relations?

Greetings,

Andres Freund
Attachments
Hi,

On 2024-05-05 09:09:15 -0700, Andres Freund wrote:
> Do you have any more details about the workload leading to this issue? Is the
> standby used for queries? Given the "high value" your oids/relfilenodes have,
> I assume there are a lot of created/dropped/truncated relations?

I suspect that the "high value" oids are related to the issue. I bet
you're running into OID wraparound for objects at a much higher
frequency than most users.

I suspect the reason this causes problems is the problem I just found
and described here:
https://www.postgresql.org/message-id/20240505183741.5tefii3razzefvtc%40awork3.anarazel.de

Could you confirm that you have a) a lot of oid assignments b) your
startup process was running for a long time by the time of the crash?

The fact that we're not triggering "stats object garbage collection" for
most drops also explains why you're much more likely to see this on a
standby than on a primary. On a primary it's going to be rare to have a
single backend live long enough to observe an oid wraparound leading to
one backend accessing stats for the same object type with the same oid
after that object previously having been dropped. But most stats
accesses on a standby are going to be by the same process, the startup
process. You'd still need some other accesses to prevent the object from
being dropped "immediately", but that could be due to shorter lived
processes.

Greetings,

Andres Freund
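The wraparound scenario described here can be modeled with a toy shared hash. Python is used only for illustration, and `redo_create`/`redo_drop` are invented names standing in for WAL replay of stats create/drop; the key point is that a dropped-but-still-referenced entry lingers, so when the OID counter wraps and hands out the same objoid again, the next drop for that (kind, dboid, objoid) key finds the stale entry already marked dropped.

```python
# Toy shared stats hash, keyed by (kind, dboid, objoid).
shared_hash = {}


def redo_create(key):
    """Replay of a stats create.

    Simplified: setdefault means a lingering stale entry for a reused key
    is found and left as-is, which is the failure mode after wraparound.
    refcount=2 models a second process still referencing the entry.
    """
    shared_hash.setdefault(key, {"dropped": False, "refcount": 2})


def redo_drop(key):
    """Replay of a stats drop; trips the drop-once check on a stale entry."""
    entry = shared_hash[key]
    if entry["dropped"]:
        raise RuntimeError(f"can only drop stats once: {key}")
    entry["dropped"] = True
    entry["refcount"] -= 1
    # refcount is still nonzero: the entry lingers instead of being freed,
    # because garbage collection never ran for this drop.
```

A run of the scenario: `redo_create(key)` then `redo_drop(key)` leaves the entry lingering; after the OID counter wraps and the same key comes around again, `redo_create(key)` finds (and keeps) the stale entry, and the second `redo_drop(key)` raises "can only drop stats once" — matching the long-lived startup process on a standby hitting this while backends on a primary rarely live long enough to see a wraparound.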
>
> Could you confirm that you have a) a lot of oid assignments b) your startup
> process was running for a long time by the time of the crash?
>

Thanks Andres. Both higher than average, I guess, although it depends on
what is considered 'a lot' and 'a long time'. The startup process was
running for a few months. There are ~500k entries in pg_class, of which
most are (Timescale) partitions. However, even with this number of items
in pg_class I wouldn't expect wraparound to happen frequently? These are
not dropped/recreated. I've monitored "select count(*) from pg_class"
for a while to see if it changes often, and while there are changes
during the day (likely temporary tables being created), it also doesn't
happen at nearly the frequency that would get us to a wraparound quickly.

OIDs aren't just consumed by pg_class though. And I do see the system
OID counter growing quickly (when doing CREATE TABLE twice with a minute
in between, then checking the OID difference of the tables). I don't
know how to investigate the cause of this. What would be the best way to
check what could be consuming these OIDs so quickly?

-Floris