RE: PG16 walsender hangs in ResourceArrayEnlarge using pgoutput
From: Zhijie Hou (Fujitsu)
Subject: RE: PG16 walsender hangs in ResourceArrayEnlarge using pgoutput
Msg-id: OS0PR01MB5716794ED2BE87C1CF74DBB094D52@OS0PR01MB5716.jpnprd01.prod.outlook.com
In reply to: PG16 walsender hangs in ResourceArrayEnlarge using pgoutput (Bowen Shi <zxwsbg12138@gmail.com>)
Responses: Re: PG16 walsender hangs in ResourceArrayEnlarge using pgoutput
           Re: PG16 walsender hangs in ResourceArrayEnlarge using pgoutput
List: pgsql-bugs
On Tuesday, June 25, 2024 3:53 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
> Hi,
>
> 1. create a partition table;
> 2. create a publication & subscription;
> 3. alter the partition table with 'add column col varchar(10) NOT NULL DEFAULT ''::character varying';
> 4. insert/update 10000000 rows into the partition table;
>
> Then we can find the walsender process hanging in the loop with this stack:
>
> #0 ResourceArrayAdd (resarr=0x2dfc048, value=140175392354560)
> #1 ResourceArrayEnlarge (resarr=0x2dfc048)
> #2 ResourceOwnerEnlargeTupleDescs (owner=0x2dfbf68)
> #3 IncrTupleDescRefCount (tupdesc=0x7f7d2077b100)
> #4 MakeTupleTableSlot (tupleDesc=0x7f7d2077b100, tts_ops=0xd237a0 <TTSOpsVirtual>)
> #5 pgoutput_change (ctx=0x2eb1e68, txn=0x2edfec0, relation=0x7f7d2077e120, change=0x2f35138)
> #6 change_cb_wrapper (cache=0x2eb3e78, txn=0x2edfec0, relation=0x7f7d2077e120, change=0x2f35138)
> #7 ReorderBufferApplyChange (rb=0x2eb3e78, txn=0x2edfec0, relation=0x7f7d2077e120, change=0x2f35138, streaming=false)
> #8 ReorderBufferProcessTXN (rb=0x2eb3e78, txn=0x2edfec0, commit_lsn=4675920936, snapshot_now=0x2ec82d0, command_id=0, streaming=false)
> #9 ReorderBufferReplay (txn=0x2edfec0, rb=0x2eb3e78, xid=758, commit_lsn=4675920936, end_lsn=4675920984, commit_time=772615467516504, origin_id=0, origin_lsn=0)
> #10 ReorderBufferCommit (rb=0x2eb3e78, xid=758, commit_lsn=4675920936, end_lsn=4675920984, commit_time=772615467516504, origin_id=0, origin_lsn=0)
> #11 DecodeCommit (ctx=0x2eb1e68, buf=0x7ffe83c408f0, parsed=0x7ffe83c406f0, xid=758, two_phase=false)
> #12 xact_decode (ctx=0x2eb1e68, buf=0x7ffe83c408f0)
> #13 LogicalDecodingProcessRecord (ctx=0x2eb1e68, record=0x2eb2200)
> #14 XLogSendLogical ()
> #15 WalSndLoop (send_data=0x95b183 <XLogSendLogical>)
> #16 StartLogicalReplication (cmd=0x2e8fcd0)
> #17 exec_replication_command (cmd_string=0x2dc3dc8 "START_REPLICATION SLOT \"sub_t\" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names '\"pub_t\"')")
> #18 PostgresMain (dbname=0x2dfa7c8 "tdts", username=0x2dfa7a8 "postgres")
> #19 BackendRun (port=0x2dedd60)
> #20 BackendStartup (port=0x2dedd60)
> #21 ServerLoop ()
> #22 PostmasterMain (argc=3, argv=0x2dbe920)
>
> In pgoutput.c, I found some reasons:
> 1. In the init_tuple_slot() function, we build an attrmap when we publish a partition table (entry->publish_as_relid != RelationGetRelid(relation)) and a missing column has been added to the partition table with an ALTER TABLE statement;
> 2. In pgoutput_change(), we convert the tuple if needed using the MakeTupleTableSlot() function;
> 3. When we update 30000000 rows, the path MakeTupleTableSlot -> PinTupleDesc -> IncrTupleDescRefCount -> ResourceOwnerEnlargeTupleDescs -> ResourceArrayEnlarge calls the ResourceArrayAdd function far too many times and causes the problem.

Thanks for reporting and analyzing the issue! Your analysis looks good to me; I think I missed dropping the newly created slot when changing the logic to support row filters. Here is a small patch that drops the slots in each cycle, and it fixes the issue on my machine.

IIUC, the issue exists from PG15 through HEAD. The current patch applies to PG16 through HEAD; if it looks OK, I will test and prepare the patch for the other branches.

Best Regards,
Hou zj
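For readers following along, below is a minimal C sketch of the leak described above and of the drop-the-slot-per-cycle approach the reply mentions. It is not the attached patch: send_converted_change() and its arguments are hypothetical stand-ins for the relevant portion of pgoutput_change(); only the executor APIs named in the stack trace (MakeTupleTableSlot, ExecDropSingleTupleTableSlot, execute_attr_map_slot) are real.

#include "postgres.h"

#include "access/attmap.h"
#include "access/tupconvert.h"
#include "executor/tuptable.h"
#include "utils/rel.h"

/*
 * Hypothetical stand-in for the part of pgoutput_change() that converts a
 * leaf-partition tuple to the published ancestor's row type when
 * publish_via_partition_root is in effect and the descriptors differ.
 */
static void
send_converted_change(Relation ancestor, AttrMap *attrmap,
					  TupleTableSlot *leaf_slot)
{
	TupleDesc	tupdesc = RelationGetDescr(ancestor);
	TupleTableSlot *conv_slot;

	/*
	 * MakeTupleTableSlot() pins tupdesc via IncrTupleDescRefCount(), which
	 * records the pin in the walsender's ResourceOwner.  Done once per
	 * decoded row without a matching release, this is what keeps growing
	 * the resource array in ResourceArrayAdd().
	 */
	conv_slot = MakeTupleTableSlot(tupdesc, &TTSOpsVirtual);
	conv_slot = execute_attr_map_slot(attrmap, leaf_slot, conv_slot);

	/* ... the converted change would be written to the output stream here ... */

	/*
	 * The fix sketched here: drop the slot at the end of each cycle, which
	 * releases its TupleDesc pin instead of letting the pins pile up for
	 * the whole transaction.
	 */
	ExecDropSingleTupleTableSlot(conv_slot);
}

Dropping the slot per change keeps the number of ResourceOwner entries bounded no matter how many rows the transaction touches, which matches the behaviour described in point 3 of the analysis.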
Attachments