Thread: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: PG Bug reporting form
Date: Fri, 26 May 2023 11:00:01 +0000
The following bug has been logged on the website:

Bug reference:      17947
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 15.3
Operating system:   Ubuntu 22.04
Description:

The following script:
numclients=10
for ((c=1;c<=numclients;c++)); do
(
echo "
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding', false);
SELECT pg_sleep(0.05);
SELECT pg_drop_replication_slot('regression_slot');

SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot',
'test_decoding', false);
SELECT pg_sleep(0.05);
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot2',
'test_decoding', true);
SELECT pg_drop_replication_slot('regression_slot');

CREATE TABLE tbl_$c(id int);
" | psql >psql-$c.log
) &
done
wait

(with
wal_level = logical
in postgresql.conf)

leads to an assertion failure (up to 10 runs required in my environment):
2023-05-26 13:08:09.243 MSK [2787598] ERROR:  can only drop stats once
TRAP:
FailedAssertion("!LWLockAnyHeldByMe(&(hash_table)->control->partitions[0].lock,
DSHASH_NUM_PARTITIONS, sizeof(dshash_partition))", File: "dshash.c", Line:
400, PID: 2787598)

Core was generated by `postgres: law regression [local] CREATE TABLE'.
Program terminated with signal SIGABRT, Aborted.

#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=140618617759552) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=140618617759552) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140618617759552) at
./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140618617759552, signo=signo@entry=6) at
./nptl/pthread_kill.c:89
#3  0x00007fe45210e476 in __GI_raise (sig=sig@entry=6) at
../sysdeps/posix/raise.c:26
#4  0x00007fe4520f47f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000555f13dad3e8 in ExceptionalCondition (
    conditionName=conditionName@entry=0x555f13f06090
"!LWLockAnyHeldByMe(&(hash_table)->control->partitions[0].lock,
DSHASH_NUM_PARTITIONS, sizeof(dshash_partition))",
errorType=errorType@entry=0x555f13e0b00b "FailedAssertion", 
    fileName=fileName@entry=0x555f13f05f6c "dshash.c",
lineNumber=lineNumber@entry=400) at assert.c:69
#6  0x0000555f13b25896 in dshash_find (hash_table=0x555f15b6b668,
key=key@entry=0x7fff8481e240, 
    exclusive=exclusive@entry=false) at dshash.c:400
#7  0x0000555f13c93ea6 in pgstat_get_entry_ref
(kind=kind@entry=PGSTAT_KIND_RELATION, dboid=dboid@entry=16384, 
    objoid=objoid@entry=3467, create=create@entry=true,
created_entry=created_entry@entry=0x0) at pgstat_shmem.c:439
#8  0x0000555f13c8edd9 in pgstat_prep_pending_entry
(kind=kind@entry=PGSTAT_KIND_RELATION, dboid=16384, 
    objoid=objoid@entry=3467, created_entry=created_entry@entry=0x0) at
pgstat.c:1094
#9  0x0000555f13c91599 in pgstat_prep_relation_pending (isshared=false,
rel_id=3467) at pgstat_relation.c:855
#10 pgstat_assoc_relation (rel=rel@entry=0x7fe446392740) at
pgstat_relation.c:138
#11 0x0000555f1398ff8e in _bt_first (scan=scan@entry=0x555f15c3b5e0,
dir=dir@entry=ForwardScanDirection)
    at nbtsearch.c:882
#12 0x0000555f13989e46 in btgettuple (scan=0x555f15c3b5e0,
dir=ForwardScanDirection) at nbtree.c:242
#13 0x0000555f1397c014 in index_getnext_tid (scan=0x555f15c3b5e0,
direction=<optimized out>) at indexam.c:533
#14 0x0000555f1397c28b in index_getnext_slot (scan=0x555f15c3b5e0,
direction=direction@entry=ForwardScanDirection, 
    slot=0x555f15c3b4c8) at indexam.c:625
#15 0x0000555f1397b426 in systable_getnext_ordered
(sysscan=sysscan@entry=0x555f15c3b470, 
    direction=direction@entry=ForwardScanDirection) at genam.c:715
#16 0x0000555f13d95e25 in BuildEventTriggerCache () at evtcache.c:151
#17 EventCacheLookup (event=<optimized out>, event@entry=EVT_SQLDrop) at
evtcache.c:69
#18 0x0000555f13a72e32 in trackDroppedObjectsNeeded () at
event_trigger.c:1147
#19 0x0000555f13a72ea0 in EventTriggerBeginCompleteQuery () at
event_trigger.c:1089
#20 0x0000555f13c7c510 in ProcessUtilitySlow (pstate=0x555f15b91540,
pstmt=0x555f15b70440, 
    queryString=0x555f15b6f410 "CREATE TABLE tbl_4(id int);",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 
    queryEnv=0x0, qc=0x7fff8481fe10, dest=<optimized out>) at
utility.c:1114
#21 0x0000555f13c7bb53 in standard_ProcessUtility (pstmt=0x555f15b70440, 
    queryString=0x555f15b6f410 "CREATE TABLE tbl_4(id int);",
readOnlyTree=<optimized out>, 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x555f15b70530, qc=0x7fff8481fe10)
    at utility.c:1074
#22 0x0000555f13c7a201 in PortalRunUtility
(portal=portal@entry=0x555f15bdbbc0, pstmt=pstmt@entry=0x555f15b70440, 
    isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x555f15b70530, 
    qc=qc@entry=0x7fff8481fe10) at pquery.c:1158
#23 0x0000555f13c7a33d in PortalRunMulti
(portal=portal@entry=0x555f15bdbbc0, isTopLevel=isTopLevel@entry=true, 
    setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x555f15b70530, 
    altdest=altdest@entry=0x555f15b70530, qc=qc@entry=0x7fff8481fe10) at
pquery.c:1315
#24 0x0000555f13c7a9e1 in PortalRun (portal=portal@entry=0x555f15bdbbc0,
count=count@entry=9223372036854775807, 
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x555f15b70530, 
    altdest=altdest@entry=0x555f15b70530, qc=0x7fff8481fe10) at
pquery.c:791
#25 0x0000555f13c766bd in exec_simple_query (query_string=0x555f15b6f410
"CREATE TABLE tbl_4(id int);")
    at postgres.c:1250
#26 0x0000555f13c78282 in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at postgres.c:4593
#27 0x0000555f13be2b42 in BackendRun (port=0x555f15b930a0,
port=0x555f15b930a0) at postmaster.c:4511
#28 BackendStartup (port=0x555f15b930a0) at postmaster.c:4239
#29 ServerLoop () at postmaster.c:1806
#30 0x0000555f13be3beb in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x555f15b69670) at postmaster.c:1478
#31 0x0000555f13908cff in main (argc=3, argv=0x555f15b69670) at main.c:202

(Initially, I got this failure when playing with a concurrent
installcheck for contrib/test_decoding [1].)

With some debug logging added, I see that this failure is a consequence of
the following combination of circumstances/issues:
First, when a replication slot is created, it can simply reuse a
pre-existing pgstat hash entry, incrementing its refcount:
ReplicationSlotCreate() -> pgstat_create_replslot() ->
pgstat_get_entry_ref_locked() -> pgstat_get_entry_ref() ->
pgstat_reinit_entry():
    /* mark as not dropped anymore */
    pg_atomic_fetch_add_u32(&shhashent->refcount, 1);
    shhashent->dropped = false;
Later, when that replication slot is dropped, the hash entry stays alive
(with refcount > 0), but gets dropped = true.
(So pg_create_logical_replication_slot()/pg_drop_replication_slot() can
leave the pgstat hash entry with dropped = true, but refcount > 1.)

Second, when another replication slot is created, it can reuse the
defective entry ref from the local cache:
ReplicationSlotCreate() -> pgstat_create_replslot() ->
pgstat_get_entry_ref_locked() -> pgstat_get_entry_ref():
    /*
     * First check the lookup cache hashtable in local memory. If we find a
     * match here we can avoid taking locks / causing contention.
     */
    if (pgstat_get_entry_ref_cached(key, &entry_ref))
        return entry_ref;
(Thus, a backend can get a new replication slot whose existing pgstat
hash entry still has dropped = true.)

Third, when the replication slot whose pgstat hash entry has dropped =
true is dropped again, the error "can only drop stats once" is raised in
pgstat_drop_entry_internal():
    /*
     * Signal that the entry is dropped - this will eventually cause other
     * backends to release their references.
     */
    if (shent->dropped)
        elog(ERROR, "can only drop stats once; hstat: %p", hstat);
but on this error path dshash_release_lock() is not called, which leads
to the assertion failure when the pgstat shared hash is accessed again
later, here while creating a relation.

[1]
https://www.postgresql.org/message-id/7e4d4a80-3e3c-231f-f886-6cada2aa582b%40gmail.com


Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: Kyotaro Horiguchi
Date: Thu, 01 Jun 2023 08:42:45 +0900 (JST)
At Fri, 26 May 2023 11:00:01 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> The following bug has been logged on the website:
> 
> Bug reference:      17947
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 15.3
> Operating system:   Ubuntu 22.04
> Description:        
> 
> The following script:

I can reproduce it here.

It looks like pgstat_get_entry_ref_cached() returned a stale reference,
pointing to a shared entry that had already been reinitialized for
another replication slot. In the problem scenario, the first backend
successfully reuses the entry that was intended to be dropped (the one
its cached ref points to), then drops it again. When the second backend
obtains a cached entry ref for another replication slot, it gets one
that points to the same shared entry as the first backend's.
Consequently, the two backends end up sharing the same shared stats
entry, but for different slots.

The attached ad-hoc patch appears to work for this specific scenario.
(It may contain defects, including possible shared-entry leaks.) We need
a better approach to prevent the reuse of an already-reinitialized
entry.  I believe it can be fixed by adding a reuse count to both the
cached entry and the shared entry; comparing the two numbers would
verify the cached entry. However, I can't think of a solution that
wouldn't require additional struct members for now, so I'm not sure how
to fix this for older versions without them.
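
To sketch the idea (struct and field names below are only illustrative,
not taken from the attached patch):

    /* shared side: bump the counter whenever the entry is reinited */
    typedef struct PgStatShared_HashEntry
    {
        /* ... existing members ... */
        pg_atomic_uint32 reuse_count;   /* bumped by pgstat_reinit_entry() */
    } PgStatShared_HashEntry;

    /* local side: validate a cached ref against the shared counter */
    if (pgstat_get_entry_ref_cached(key, &entry_ref))
    {
        if (entry_ref->reuse_count ==
            pg_atomic_read_u32(&entry_ref->shared_entry->reuse_count))
            return entry_ref;   /* still the same incarnation */
        /* otherwise discard the cached ref and redo the shared lookup */
    }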

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: Kyotaro Horiguchi
Date:
At Thu, 01 Jun 2023 08:42:45 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> The attached ad-hoc patch appears to be working somehow for this
> specific scenario. (It can contain any defects including possible
> shared entry leaks.) We need to find a better approach to prevent the
> reuse of an already-reinited entry.  I believe it can be fixed by
> adding a reuse count to both the cached entry and shared entry, then
> we could compare these numbers to verify the cached entry. However, I
> can't think of a solution that wouldn't require additional struct
> members for now.  Thus I'm not sure how to fix this for older versions
> without them..

Here's an alpha version of that. This doesn't modify the on-disk data
format, and I don't think anyone outside core relies on the internal
data structure, so it should be back-patchable.

This problem raises another concern about how concurrent reuse and
refetching of existing entries can work safely.  At the least, it is not
great to have to check two variables. It would be better to combine
"dropped" with the reuse count; in other words, to check whether an
entry is in the dropped state by examining the reuse-count value alone.
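
Concretely, one illustrative encoding (not necessarily what the attached
patch does) would be to bump the counter on both drop and reinit, so
that an odd value means "dropped":

    /* hypothetical encoding: even = live, odd = dropped */
    #define REUSE_COUNT_IS_DROPPED(c)   (((c) & 1) != 0)

    /* pgstat_drop_entry_internal(): live -> dropped (count becomes odd) */
    pg_atomic_fetch_add_u32(&shent->reuse_count, 1);

    /* pgstat_reinit_entry(): dropped -> live (count becomes even) */
    pg_atomic_fetch_add_u32(&shhashent->reuse_count, 1);

That way a single atomic read tells both the incarnation and the
dropped state.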

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments
Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: Andres Freund
Date:

Hi,

On 2023-06-01 08:42:45 +0900, Kyotaro Horiguchi wrote:
> It looks like the function pgstat_get_entry_ref_cached returned a
> faulty reference, which is directing us to a shared entry which is
> already reinited for another replication slot. In the problem
> scenario, the first backend successfully reuses the entry intended to
> be dropped, which is pointed to by the cached entry, then the backend
> re-drops it again. When the second backend obtains a cached entry for
> another replication slot, the function returns an entry that points to
> the same shared entry with the first backend.

It's not really a different replication slot, right? At least if you consider
the replication slot to be identified by its index / the stats entry to be
identified by its key.


I think the problem might actually be much simpler - when pgstat_drop_entry()
can't free the entry, we don't call pgstat_request_entry_refs_gc(), in
contrast to pgstat_drop_database_and_contents() and pgstat_drop_all_entries().
Because of that, entry refs to dropped stats entries can survive an
effectively unbounded amount of time. If pgstat_request_entry_refs_gc() had
been called, the local entry would have been released before we reused the
slot's stats entry.  There's no race around the entry being dropped
concurrently with acquiring an entry reference, as external locking needs to
prevent an entry from being dropped while new references are being
established.


I can't explain why pgstat_drop_entry() doesn't call
pgstat_request_entry_refs_gc(). It seems fairly obvious that it ought to?
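
I.e. something along these lines (untested sketch; the real function
does more than this):

    bool
    pgstat_drop_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
    {
        bool    freed = true;

        /* ... existing lookup and drop of the shared entry,
         * setting freed = false when the refcount doesn't reach 0 ... */

        /* mirror pgstat_drop_database_and_contents() and
         * pgstat_drop_all_entries(): if the entry couldn't be freed,
         * ask other backends to release their now-stale refs */
        if (!freed)
            pgstat_request_entry_refs_gc();

        return freed;
    }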


I think there would be a leftover edge-case, namely that
pgstat_gc_entry_refs() doesn't gc entries with pending data. But we could just
do that - and it should be correct, afaict. I wonder if the reason we didn't
is that we introduced PgStat_KindInfo.delete_pending_cb later?
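
Concretely, that'd mean something like this in pgstat_gc_entry_refs()
(a sketch from memory, the surrounding loop may differ):

    /* today, roughly: skip dropped entries that still have pending data */
    if (entry_ref->shared_entry->dropped && entry_ref->pending == NULL)
        pgstat_release_entry_ref(ent->key, entry_ref, false);

    /* could become: discard the pending data and release anyway */
    if (entry_ref->shared_entry->dropped)
        pgstat_release_entry_ref(ent->key, entry_ref, true /* discard_pending */);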


We probably should introduce a crosscheck against acquiring a new entry ref
for a dropped entry.


A different approach could be to mirror the reinitialization logic from the
shared case to the cached-local-ref case. But I'm somewhat doubtful that's the
right direction. We can't guarantee that every other existing backend has
gotten around to releasing their reference to a dropped shared entry when
"creating" a new entry, but we *can* guarantee that our own backend doesn't
have one.

Greetings,

Andres Freund



Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: Floris Van Nee
Date: Sun, 02 Jun 2024 21:31:56 +0000

>
> I think the problem might actually be much simpler - when
> pgstat_drop_entry() can't free the entry, we don't call
> pgstat_request_entry_refs_gc(), in contrast to
> pgstat_drop_database_and_contents() and pgstat_drop_all_entries().
> Because of that, entry refs to dropped stats entries can survive an
> effectively unbounded amount of time. If pgstat_request_entry_refs_gc() had
> been called, the local entry would have been released before we reused the
> slot's stats entry.  There's no race around the entry being dropped
> concurrently with acquiring an entry reference, as external locking needs
> to prevent an entry from being dropped while new references are being
> established.

This sounds like a reasonable root cause for this thread.

> I can't explain why pgstat_drop_entry() doesn't call
> pgstat_request_entry_refs_gc(). It seems fairly obvious that it ought to?

It looks like pgstat_drop_entry expects callers to take care of this. All
callers in pgstat_xact.c (AtEOXact_PgStat_DroppedStats, for example) are
dropping stats for several relations at once, and then at the end they call
pgstat_request_entry_refs_gc once if needed. I'm guessing this was an
optimization.

However, pgstat_drop_replslot didn't get that memo. The attached patch adds a
call to gc in pgstat_drop_replslot to bring it in line with the usage in
pgstat_xact.c, as sketched below. On my setup this fixes the issue that
occurs in this thread.
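
For reference, the change amounts to this (a sketch of the attached
patch, modulo comments):

    void
    pgstat_drop_replslot(ReplicationSlot *slot)
    {
        if (!pgstat_drop_entry(PGSTAT_KIND_REPLSLOT, InvalidOid,
                               ReplicationSlotIndex(slot)))
            pgstat_request_entry_refs_gc();
    }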

>
> I think there would be a leftover edge-case, namely that
> pgstat_gc_entry_refs() doesn't gc entries with pending data. But we could
> just do that - and it should be correct, afaict. I wonder if the reason we didn't
> is that we introduced PgStat_KindInfo.delete_pending_cb later?

Can you clarify why this would be safe to do? Looking at the commit history,
it looks like they were added in the same commit 5891c7a8e which introduced
the shared memory stats. I can reason why it *should* be safe, but I can't
find a discussion around why every single function except pgstat_drop_entry
uses discard_pending=false if it would be safe to pass true everywhere. It
seems strange that it'd have been chosen without reason.

This does mean it's unfortunately unrelated to the issue I reported at
https://www.postgresql.org/message-id/flat/20240505223546.6yvjzgqifuoiii3e%40awork3.anarazel.de#1dfe3ecf6a75ace833444bdc0d268f4a
because the issue in the current thread is fixed by changing a
replication-slot call (which is not used in the database for which I
reported it).

-Floris


Attachments
Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: Michael Paquier
Date: Tue, 04 Jun 2024 14:25:23 +0900

On Sun, Jun 02, 2024 at 09:31:56PM +0000, Floris Van Nee wrote:

Your recent reply has attracted me here.

> It looks like pgstat_drop_entry expects callers to take care of
> this. All callers in pgstat_xact.c (AtEOXact_PgStat_DroppedStats for
> example) are dropping stats for several relations at once, and then
> at the end they call pgstat_request_entry_refs_gc once if
> needed. I'm guessing this was an optimization.

Note an extra edge case in pgstat_init_function_usage(), which should
do the same, I guess?

> However, pgstat_drop_replslot didn't get that memo. Attached patch
> adds a call to gc in pgstat_drop_replslot to bring it in line with
> usage in pgstat_xact.c. On my setup this fixes the issue that occurs
> on this thread.

I find this omission for replication slots quite confusing as well, as it
is inconsistent with what's documented in pgstat_internal.h.  I can see
the point that it is important to document that on top of the drop
routine, too, as this can be easily missed, though your suggested comment
could perhaps be simplified a bit more.

>> I think there would be a leftover edge-case, namely that
>> pgstat_gc_entry_refs() doesn't gc entries with pending data. But we could
>> just do that - and it should be correct, afaict. I wonder if the reason we didn't
>> is that we introduced PgStat_KindInfo.delete_pending_cb later?
>
> Can you clarify why this would be safe to do? Looking at commit
> history, it looks like they were added in the same commit 5891c7a8e
> which introduced the shared memory stats. I can reason why it
> *should* be safe, but I can't find a discussion around why every
> single function except pgstat_drop_entry uses discard_pending=false
> if it would be safe to pass true everywhere. It seems strange that
> it'd have been chosen without reason.
>
> This does mean it's unfortunately unrelated to the issue I reported at
> https://www.postgresql.org/message-id/flat/20240505223546.6yvjzgqifuoiii3e%40awork3.anarazel.de#1dfe3ecf6a75ace833444bdc0d268f4a
> because the issue in the current thread is fixed by changing a
> replication-slot call (which is not used in the database for which I
> reported it).

Yeah, it's clear that we don't have enough information to proceed
here.  It seems to me that there is a good argument for addressing
this one first, separately from the other one.

PS: be careful with the indentation of your messages.  You may want to
limit the line length to 78 characters or so.
--
Michael

Attachments
Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: Michael Paquier
Date:

On Tue, Jun 04, 2024 at 02:25:23PM +0900, Michael Paquier wrote:
>> This does mean it's unfortunately unrelated to the issue I reported at
>> https://www.postgresql.org/message-id/flat/20240505223546.6yvjzgqifuoiii3e%40awork3.anarazel.de#1dfe3ecf6a75ace833444bdc0d268f4a
>> because the issue in the current thread is fixed by changing a
>> replication-slot call (which is not used in the database for which I
>> reported it).
>
> Yeah, It's clear that we don't have enough information to proceed
> here.  It seems to me that there is a good argument for addressing
> this one first, separately of the other one.

I have been playing with this patch and scenarios of replslot
create/drop similar to what's been sent at the top of this thread, and
the proposed solution is proving stable enough to not show a problem in
the pgstats machinery after letting the test run for 30 minutes or so
(with switched sleep intervals, for one).

Attached is an updated patch with refined comments at the top of
pgstat_drop_entry(), to be applied to 15~.

I'm kind of feeling that this is not the end of it, even if the use of
the drop routine is inconsistent in this code path.  Perhaps that's
just my spidey sense being incorrect here.
--
Michael

Attachments
Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: Floris Van Nee
Date: Wed, 05 Jun 2024 10:24:43 +0000

>
> Note an extra edge case in pgstat_init_function_usage(), which should do
> the same, I guess?
>

I noticed it, but wasn't sure if it was necessary due to the exceptional
case there. It raises an ereport ERROR right after, and it can only reach
this point if the entry was created in the call to
pgstat_prep_pending_entry before it, so it seems unlikely that another
backend holds a ref to it. I might be wrong though.

> Attached is an updated patch with refined comments at the top of
> pgstat_drop_entry(), to be applied to 15~.

Thanks for testing the patch. Updated patch looks good to me.

-Floris




Re: BUG #17947: Combination of replslots pgstat issues causes error/assertion failure

From: Michael Paquier
Date:

On Wed, Jun 05, 2024 at 10:24:43AM +0000, Floris Van Nee wrote:
> I noticed it, but wasn't sure if it was necessary due to the exceptional
> case there. It raises an ereport ERROR right after, and it can only reach
> this point if the entry was created in the call to
> pgstat_prep_pending_entry before it, so it seems unlikely that another
> backend holds a ref to it. I might be wrong though.

It does not prevent addressing the original issue, so pushed down to
15 after running concurrent tests for a few extra hours.

>> Attached is an updated patch with refined comments at the top of
>> pgstat_drop_entry(), to be applied to 15~.
>
> Thanks for testing the patch. Updated patch looks good to me.

I'll try to look at this one separately.
--
Michael

Attachments