RE: Parallel scan with SubTransGetTopmostTransaction assert coredump

Поиск
Список
Период
Сортировка
От Pengchengliu
Тема RE: Parallel scan with SubTransGetTopmostTransaction assert coredump
Дата
Msg-id 000601d74868$4cacca30$e6065e90$@tju.edu.cn
обсуждение исходный текст
Ответ на Re: Parallel scan with SubTransGetTopmostTransaction assert coredump  (Greg Nancarrow <gregn4422@gmail.com>)
Ответы Re: Parallel scan with SubTransGetTopmostTransaction assert coredump  (Greg Nancarrow <gregn4422@gmail.com>)
Список pgsql-hackers

Hi Greg,

  Thanks for you replay and test.

 

  When main process gets the transaction snapshot in InitializeParallelDSM->GetTransactionSnapshot,  the transaction snapshot xmin is very likely follows active snapshot xmin.

 

  Use gdb it is easy to verify it.

  Create env as blow:

 

  1, Use PG13.2(3fb4c75e857adee3da4386e947ba58a75f3e74b7), init a cluster database.

  2, Append the postgres.conf as below:

 

  max_connections = 2000

  parallel_setup_cost=0

  parallel_tuple_cost=0

  min_parallel_table_scan_size=0

  max_parallel_workers_per_gather=8

  max_parallel_workers = 128

  

  3, Start the cluster database. Use the init_test.sql script in attachment to create some test tables.

  4, Use the sub_120.sql script in attachment with pgbench to test it.

 

  pgbench  -d postgres -p 33550  -n -r -f sub_120.sql   -c 200 -j 200 -T 1800

 

 

 

  Then you can login the database, and use GDB to verify it.

  1, First use explain, make sure force Parallel is OK.

 

  postgres=# explain (verbose,analyze) select count(*) from contend1;

                                                                     QUERY PLAN                                                     

                

-------------------------------------------------------------------------------------------------------------------------------------

----------------

Finalize Aggregate  (cost=12006.11..12006.12 rows=1 width=8) (actual time=1075.214..1075.449 rows=1 loops=1)

   Output: count(*)

   ->  Gather  (cost=12006.08..12006.09 rows=8 width=8) (actual time=1075.198..1075.433 rows=1 loops=1)

         Output: (PARTIAL count(*))

         Workers Planned: 8

         Workers Launched: 0

         ->  Partial Aggregate  (cost=12006.08..12006.09 rows=1 width=8) (actual time=1074.674..1074.676 rows=1 loops=1)

               Output: PARTIAL count(*)

               ->  Parallel Seq Scan on public.contend1  (cost=0.00..11690.06 rows=126406 width=0) (actual time=0.008..587.454 rows=1

010200 loops=1)

                     Output: id, val, c2, c3, c4, c5, c6, c7, c8, c9, c10, crt_time

Planning Time: 0.123 ms

Execution Time: 1075.588 ms

 postgres=# select pg_backend_pid();

pg_backend_pid

----------------

        2182678

                               

 

  2, use gdb to debug our backend process. Add the breakpoint in parallel.c:219 and continue.

 

  gdb  -q -p 2182678

  ...

  (gdb) b parallel.c:219

Breakpoint 1 at 0x55d085: file /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c, line 219.

  (gdb) c

Continuing.

 

  3, In the psql clinet, we can execute the explain command in step1 again.

     After we get the breakpoint in gdb, we wait a moment. Then we execute next.

     Use gdb check active_snapshot and transaction_snapshot, active_snapshot->xmin is 158987 and transaction_snapshot->xmin is 162160.

When I use gdb test it, sometimes active_snapshot is the same as transaction_snapshot. Then you can try it multiple times, and before execute next, try wait longer time.

  

   Breakpoint 1, InitializeParallelDSM (pcxt=0x2d53670)

    at /home/liupc/build/build_postgres2/../../devel/postgres2/src/backend/access/transam/parallel.c:219

219             Snapshot        transaction_snapshot = GetTransactionSnapshot();

(gdb) n

220             Snapshot        active_snapshot = GetActiveSnapshot();

(gdb)

223             oldcontext = MemoryContextSwitchTo(TopTransactionContext);

(gdb) p *transaction_snapshot

$1 = {snapshot_type = SNAPSHOT_MVCC, xmin = 162160, xmax = 183011, xip = 0x2d50d10, xcnt = 179, subxip = 0x148a9cddf010,

  subxcnt = 0, suboverflowed = true, takenDuringRecovery = false, copied = false, curcid = 0, speculativeToken = 0,

  active_count = 0, regd_count = 0, ph_node = {first_child = 0x0, next_sibling = 0x0, prev_or_parent = 0x0}, whenTaken = 0, lsn = 0}

(gdb) p *active_snapshot

$2 = {snapshot_type = SNAPSHOT_MVCC, xmin = 158987, xmax = 173138, xip = 0x2d53288, xcnt = 178, subxip = 0x0, subxcnt = 0,

  suboverflowed = true, takenDuringRecovery = false, copied = true, curcid = 0, speculativeToken = 0, active_count = 1,

  regd_count = 2, ph_node = {first_child = 0x0, next_sibling = 0x0, prev_or_parent = 0x2d52e48}, whenTaken = 0, lsn = 0}

(gdb) 

 

Thanks

Pengcheng

 

  

 

-----Original Message-----
From: Greg Nancarrow <gregn4422@gmail.com>
Sent: 2021513 22:15
To: Pengchengliu <pengchengliu@tju.edu.cn>
Cc: Andres Freund <andres@anarazel.de>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

 

On Thu, May 13, 2021 at 11:25 AM Pengchengliu <pengchengliu@tju.edu.cn> wrote:

 

> Hi Andres,

>  Thanks for you replay.

 

Er, it's Greg who has replied so far (not Andres).

 

>   And If you still cannot reproduce it in 2 minitus. Could you run pgbench longer time, such as 30 or 60 minutes.

 

Actually, I did run it, multiple times, for more than 60 minutes, but no assert/crash/coredump occurred in my environment.

 

>   The parallel work process:

>   1, Get Snapshot and set TransactionXmin itself, in ParallelWorkerMain->BackgroundWorkerInitializeConnectionByOid->GetTransactionSnapshot->GetSnapshotData.

>   2, Acooding PARALLEL_KEY_TRANSACTION_SNAPSHOT(xmin 799425, xmax 82229) from main process, and set TransactionXmin 799425 in ParallelWorkerMain->RestoreTransactionSnapshot->SetTransactionSnapshot->ProcArrayInstallRestoredXmin.

>   3, ExecParallelInitializeWorker->ExecSeqScanInitializeWorker->table_beginscan_parallel get the active snapshot(xmin 799162, xmax  82206) from main process, and set this snapshot to scan->rs_base.rs_snapshot.

>   4, parallel scan begin, with active snapshot(xmin 799162, xmax  82206) and TransactionXmin(799425),when scan tuple(xmin 799225) SubTransGetTopmostTransaction assert got.

>      In  HeapTupleSatisfiesMVCC->XidInMVCCSnapshot->SubTransGetTopmostTransaction.

 

I added some logging at a couple of points in the code:

1) In the Worker process code - ParallelWorkerMain() - where it restores the serialized transaction and active snapshots (i.e. passed to the Worker from the main process).

2) In the HeapTupleSatisfiesMVCC() function, immediately before it calls XidInMVCCSnapshot()

 

After running it for an hour, examination of the log showed that in ALL cases, the transaction snapshot xmin,xmax was always THE SAME as the active snapshot xmin,xmax.

(Can you verify that this occurs on your system when things are working, prior to the coredump?)

 

This is different to what you are getting in your environment (at least, different to what you described when the problem occurs).

In your case, you say that the main process gets "the newer transaction snapshot" - where exactly is this happening in your case?

(or this is what you don't yet know?)

Perhaps very occasionally this somehow happens on your system and triggers the Assert (and coredump)?  I have not been able to reproduce that on my system.

 

Have you reproduced this issue on any other system, using the same steps as you provided?

I'm wondering if there might be something else in your environment that may be influencing this problem.

 

 

Regards,

Greg Nancarrow

Fujitsu Australia

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Julien Rouhaud
Дата:
Сообщение: Re: compute_query_id and pg_stat_statements
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: Teaching users how they can get the most out of HOT in Postgres 14