Discussion: BUG #16226: background worker "logical replication worker" (PID ) was terminated by signal 11: Segmentation fault


BUG #16226: background worker "logical replication worker" (PID ) was terminated by signal 11: Segmentation fault

From: PG Bug reporting form
Date: 2020-01-22 08:43
The following bug has been logged on the website:

Bug reference:      16226
Logged by:          Vadim Yatsenko
Email address:      vadim@postgrespro.co.il
PostgreSQL version: 12.1
Operating system:   CentOS Linux 7 (Core) Kernel: Linux 3.10.0-1062.9.
Description:

Hi!
We have 2 PostgreSQL servers with logical replication between Postgres 11.6
(Primary) and 12.1 (Logical). Some time ago, we changed the column type in 2
big tables from integer to text:

    oid     | nspname |     table      |   rows    |  pages  | size  | index_size | total_size
------------+---------+----------------+-----------+---------+-------+------------+------------
  573803236 | public  | acc_rep        | 283359648 | 9276153 | 71 GB | 50 GB      | 121 GB
  586884963 | public  | acc_rep_actual | 283280448 | 8855209 | 68 GB | 27 GB      | 95 GB

This, of course, led to a full rewrite of both tables. We repeated this
operation on both servers. After that, we started to get errors like
"background worker "logical replication worker" (PID <pid>) was terminated
by signal 11: Segmentation fault", and the server goes into recovery mode.
This error repeated several times, every 10-15 minutes. When we attached
to the logical replication worker process with gdb, we received the
following:

Program received signal SIGSEGV, Segmentation fault.
0x00007f04e5060426 in __memcpy_ssse3_back () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f04e5060426 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x000000000061e1cb in memcpy (__len=224588909, __src=<optimized out>, __dest=0x101b870) at /usr/include/bits/string3.h:51
#2  tts_virtual_materialize (slot=0x101b0b0) at execTuples.c:235
#3  0x000000000061f594 in ExecFetchSlotHeapTuple (slot=0x101b0b0, materialize=<optimized out>, shouldFree=0x7ffc331ae187) at execTuples.c:1624
#4  0x00000000004c868c in heapam_tuple_update (relation=0x7f04e77d8470, otid=0x101b240, slot=0x101b0b0, cid=19778, snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7ffc331ae210, lockmode=0x7ffc331ae20c, update_indexes=0x7ffc331ae277) at heapam_handler.c:325
#5  0x00000000004f5524 in table_tuple_update (update_indexes=0x7ffc331ae277, lockmode=0x7ffc331ae20c, tmfd=0x7ffc331ae210, wait=true, crosscheck=0x0, snapshot=0x0, cid=<optimized out>, slot=0x101b0b0, otid=0x101b240, rel=0x7f04e77d8470) at ../../../../src/include/access/tableam.h:1261
#6  simple_table_tuple_update (rel=0x7f04e77d8470, otid=otid@entry=0x101b240, slot=slot@entry=0x101b0b0, snapshot=0x0, update_indexes=update_indexes@entry=0x7ffc331ae277) at tableam.c:325
#7  0x000000000061bbc0 in ExecSimpleRelationUpdate (estate=estate@entry=0x1031660, epqstate=epqstate@entry=0x7ffc331ae330, searchslot=searchslot@entry=0x101b210, slot=slot@entry=0x101b0b0) at execReplication.c:493
#8  0x000000000071100d in apply_handle_update (s=s@entry=0x7ffc331b5910) at worker.c:750
#9  0x0000000000711ca7 in apply_dispatch (s=0x7ffc331b5910) at worker.c:968
#10 LogicalRepApplyLoop (last_received=31760851165536) at worker.c:1176
#11 ApplyWorkerMain (main_arg=<optimized out>) at worker.c:1734
#12 0x00000000006e2d4a in StartBackgroundWorker () at bgworker.c:834
#13 0x00000000006ef1d3 in do_start_bgworker (rw=0xf71970) at postmaster.c:5770
#14 maybe_start_bgworkers () at postmaster.c:5996
#15 0x00000000006efde5 in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5167
#16 <signal handler called>
#17 0x00007f04e4fff933 in __select_nocancel () from /lib64/libc.so.6
#18 0x00000000004833d4 in ServerLoop () at postmaster.c:1668
#19 0x00000000006f106f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45300) at postmaster.c:1377
#20 0x0000000000484cd3 in main (argc=3, argv=0xf45300) at main.c:228

Error from /var/log/messages:

Jan 21 02:01:59 dbmr01 kernel: postmaster[50093]: segfault at e76f13d ip 00007fa8f701ffa0 sp 00007ffc8ae23498 error 4 in libc-2.17.so[7fa8f6ed4000+1c2000]
Jan 21 02:01:59 dbmr01 abrt-hook-ccpp: Process 50093 (postgres) of user 26 killed by SIGSEGV - dumping core
Jan 21 02:02:15 dbmr01 abrt-server: Package 'postgresql12-server' isn't signed with proper key
Jan 21 02:02:15 dbmr01 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2020-01-21-02:01:59-50093' exited with 1
Jan 21 02:02:15 dbmr01 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2020-01-21-02:01:59-50093'

The issue was solved by removing these tables from the publication on the
Primary. Can you advise? Is this a bug, or did I fail to take something into
account when working with logical replication?
Thank you!


On 2020-01-22 08:43, PG Bug reporting form wrote:
> We have 2 PostgreSQL servers with logical replication between Postgres 11.6
> (Primary) and 12.1 (Logical). Some time ago, we changed the column type in 2
> big tables from integer to text:

> This, of course, led to a full rewrite of both tables. We repeated this
> operation on both servers. After that, we started to get errors like
> "background worker "logical replication worker" (PID <pid>) was terminated
> by signal 11: Segmentation fault", and the server goes into recovery mode.
> This error repeated several times, every 10-15 minutes. When we attached
> to the logical replication worker process with gdb, we received the
> following:
> 
> Program received signal SIGSEGV, Segmentation fault.

Could you narrow it down to a reproducible test case?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Wed, Jan 22, 2020 at 10:22:55AM +0100, Peter Eisentraut wrote:
> On 2020-01-22 08:43, PG Bug reporting form wrote:
>> This, of course, led to a full rewrite of both tables. We repeated this
>> operation on both servers. After that, we started to get errors like
>> "background worker "logical replication worker" (PID <pid>) was terminated
>> by signal 11: Segmentation fault", and the server goes into recovery mode.
>> This error repeated several times, every 10-15 minutes. When we attached
>> to the logical replication worker process with gdb, we received the
>> following:
>>
>> Program received signal SIGSEGV, Segmentation fault.
>
> Could you narrow it down to a reproducible test case?

I suspect that this is the same problem as this report:
https://www.postgresql.org/message-id/16129-a0c0f48e71741e5f@postgresql.org

And it has been fixed here, after 12.1; the fix will be available in 12.2:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a2aa224e
--
Michael

> We have 2 PostgreSQL servers with logical replication between Postgres 11.6
> (Primary) and 12.1 (Logical). Some time ago, we changed the column type in 2
> big tables from integer to text:
> ...
> This, of course, led to a full rewrite of both tables. We repeated this
> operation on both servers. After that, we started to get errors like
> "background worker "logical replication worker" (PID <pid>) was terminated
> by signal 11: Segmentation fault", and the server goes into recovery mode.

Not sure, but this seems like it might be explained by this recent
bug fix:


Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master [4d9ceb001] 2019-11-22 11:31:19 -0500
Branch: REL_12_STABLE [a2aa224e0] 2019-11-22 11:31:19 -0500
Branch: REL_11_STABLE [b72a44c51] 2019-11-22 11:31:19 -0500
Branch: REL_10_STABLE [5d3fcb53a] 2019-11-22 11:31:19 -0500

    Fix bogus tuple-slot management in logical replication UPDATE handling.
    
    slot_modify_cstrings seriously abused the TupleTableSlot API by relying
    on a slot's underlying data to stay valid across ExecClearTuple.  Since
    this abuse was also quite undocumented, it's little surprise that the
    case got broken during the v12 slot rewrites.  As reported in bug #16129
    from Ondřej Jirman, this could lead to crashes or data corruption when
    a logical replication subscriber processes a row update.  Problems would
    only arise if the subscriber's table contained columns of pass-by-ref
    types that were not being copied from the publisher.
    
    Fix by explicitly copying the datum/isnull arrays from the source slot
    that the old row was in already.  This ends up being about the same
    thing that happened pre-v12, but hopefully in a less opaque and
    fragile way.
    
    We might've caught the problem sooner if there were any test cases
    dealing with updates involving non-replicated or dropped columns.
    Now there are.
    
    Back-patch to v10 where this code came in.  Even though the failure
    does not manifest before v12, IMO this code is too fragile to leave
    as-is.  In any case we certainly want the additional test coverage.
    
    Patch by me; thanks to Tomas Vondra for initial investigation.
    
    Discussion: https://postgr.es/m/16129-a0c0f48e71741e5f@postgresql.org

            regards, tom lane
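
For illustration, the "copy the datum/isnull arrays from the source slot"
approach that the commit message describes boils down to roughly the pattern
below. This is a simplified sketch only, not the actual patch from a2aa224e0:
the helper name build_new_row_from_old is made up, and the per-column type
conversion and error-context handling done by the real worker.c code are
omitted.

#include "postgres.h"

#include "executor/tuptable.h"

/*
 * Hypothetical helper sketching the fix described above: build the new row
 * as a virtual tuple that starts from a copy of the old row's datum/isnull
 * arrays (taken from a separate source slot), instead of relying on data
 * fetched from "slot" staying valid across ExecClearTuple().
 */
void
build_new_row_from_old(TupleTableSlot *slot, TupleTableSlot *srcslot)
{
    int         natts = slot->tts_tupleDescriptor->natts;

    /* The slot is about to be refilled as a virtual tuple. */
    ExecClearTuple(slot);

    /* Deconstruct the old row so its datum/isnull arrays are populated. */
    slot_getallattrs(srcslot);

    /*
     * Copy every column from the source slot, so columns the publisher did
     * not send keep valid values, including pass-by-reference types such
     * as text.
     */
    memcpy(slot->tts_values, srcslot->tts_values, natts * sizeof(Datum));
    memcpy(slot->tts_isnull, srcslot->tts_isnull, natts * sizeof(bool));

    /* ... the replaced columns would be overwritten with new values here ... */

    ExecStoreVirtualTuple(slot);
}

With this pattern the new virtual tuple no longer depends on memory that may
have been released by ExecClearTuple(), which is the fragile assumption the
commit message says got broken during the v12 slot rewrites.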



Tom,

Thank you! We'll wait for the patch to update our servers.

Best Regards,
Vadim Yatsenko

On Wed, 22 Jan 2020 at 18:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Not sure, but this seems like it might be explained by this recent
> bug fix:

