Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure
От | Andres Freund |
---|---|
Тема | Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure |
Дата | |
Msg-id | 20240505183741.5tefii3razzefvtc@awork3.anarazel.de обсуждение исходный текст |
Ответ на | Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Ответы |
RE: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure
|
Список | pgsql-bugs |
Hi, On 2023-06-01 08:42:45 +0900, Kyotaro Horiguchi wrote: > It looks like the function pgstat_get_entry_ref_cached returned a > faulty reference, which is directing us to a shared entry which is > already reinited for another replication slot. In the problem > scenario, the first backend successfully reuses the entry intended to > be dropped, which is pointed to by the cached entry, then the backend > re-drops it again. When the second backend obtains a cached entry for > another replication slot, the function returns an entry that points to > the same shared entry with the first backend. It's not really a different replication slot, right? At least if you consider the replication slot to be identified by its index / the stats entry to be identified by its key. I think the problem might actually be much simpler - when pgstat_drop_entry() can't free the entry, we don't call pgstat_request_entry_refs_gc(), in contrast to pgstat_drop_database_and_contents(), pgstat_drop_all_entries(). Because of that, entry refs to dropped stats entries can survive an effectively unbound amount of time. If pgstat_request_entry_refs_gc() had been called, the local entry would have been released, before we reused the slot's stats entry. There's no race around the entry being dropped concurrently with acquiring an entry references, as external locking needs to prevent an entry from being dropped while new references are being established. I can't explain why pgstat_drop_entry() doesn't call pgstat_request_entry_refs_gc(). It seems fairly obvious that it ought to? I think there would be a leftover edge-case, namely that pgstat_gc_entry_refs() doesn't gc entries with pending data. But we could just do that - and it should be correct, afaict. I wonder if the reason we didn't is that we introduced PgStat_KindInfo.delete_pending_cb later? We probably should introduce a crosscheck against acquiring an new entry ref for a dropped entries. A different approach could be to mirror the reinitialization logic from the shared case to the cached-local-ref case. But I'm somewhat doubtful that's the right direction. We can't guarantee that every other existing backend has gotten around to releasing their reference to a dropped shared entry when "creating" a new entry, but we *can* guarantee that our own backend doesn't have one. Greetings, Andres Freund
В списке pgsql-bugs по дате отправления: