Re: Logical replication stops dropping used initial-sync replication slots

From: hubert depesz lubaczewski
Subject: Re: Logical replication stops dropping used initial-sync replication slots
Date:
Msg-id: 20220325083856.GA26455@depesz.com
In reply to: Re: Logical replication stops dropping used initial-sync replication slots (hubert depesz lubaczewski <depesz@depesz.com>)
Responses: Re: Logical replication stops dropping used initial-sync replication slots (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-bugs
On Tue, Mar 22, 2022 at 12:26:46PM +0100, hubert depesz lubaczewski wrote:
> On Tue, Mar 22, 2022 at 12:32:41PM +0530, vignesh C wrote:
> > > That wouldn't do me any good. The server I'm trying to upgrade now is
> > > test bench to make sure that I can upgrade using logical replication *at
> > > all*.
> > > Once I will have verified that I can, I have couple hundred servers to
> > > upgrade, that will have concurrent access all the time.
> > Is it possible to get  a reproducible test or a script for this problem?
> 
> Well, given that we're talking about a 38GB database with 60k tables,
> I find that unlikely.
> 
> But I'll try if I can figure out a way to replicate the problem with
> something that is shareable.

OK. I don't think it will be simple/possible to replicate.

Reason:

I set up 10 separate environments, all with the same database, on bionic
pg 12.9.

Then I created a publication with all tables in specific schemas, and
subscribed to it from separate focal pg 14.2 servers.
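
For reference, the setup looked more or less like this (just a sketch - the schema,
table, and connection names below are made up; on 12.9 there is no FOR TABLES IN
SCHEMA, so the real publication listed all ~60k tables explicitly):

-- on the bionic pg 12.9 publisher (one ADD TABLE per table, generated from the catalogs)
CREATE PUBLICATION focal14 FOR TABLE some_schema.some_table;
ALTER PUBLICATION focal14 ADD TABLE some_schema.another_table;

-- on the focal pg 14.2 subscriber (connection string is illustrative)
CREATE SUBSCRIPTION focal14
    CONNECTION 'host=bionic-source dbname=dbname user=replication_user'
    PUBLICATION focal14;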

In total, I used 20 servers to replicate the test 10 times.

Effect: in one case, replication didn't finish the initial sync.

Currently, it's 2022-03-25 08:32:14.10081+00, and on the one source where the
subscription failed, I see:
                  slot_name                  |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn  | confirmed_flush_lsn
---------------------------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+--------------+---------------------
 focal14                                     | pgoutput | logical   |  16606 | dbname   | f         | f      |            |      |    161822728 | 8E9/3E70BB58 | 8E9/3E70BB58
 pg_3548429_sync_3343043_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32385 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3079712_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32378 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3145162_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32386 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_2993718_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32387 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3355696_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32379 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3099460_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32380 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3162651_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32381 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3157454_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32382 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3084198_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32383 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
 pg_3548429_sync_3142573_7077521990523695096 | pgoutput | logical   |  16606 | dbname   | f         | t      |      32384 |      |    161824826 | 8F0/18327650 | 8F0/1ABF89C0
(11 rows)
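
For the record, that listing is just the publisher-side pg_replication_slots view;
a narrowed-down version of the query looks like this:

SELECT slot_name, active, active_pid, catalog_xmin, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 ORDER BY slot_name;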

pg_stat_replication shows 10 sync connections, and that's it. Their sent_lsn,
write_lsn, flush_lsn, and replay_lsn seem to be progressing nicely, but otherwise
nothing changes.
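
(That's from watching pg_stat_replication on the publisher, more or less like this:)

SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
  FROM pg_stat_replication;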

pg_stat_activity contains:

=# select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
          xact_start           │          query_start          │         backend_type         │                                                                query
═══════════════════════════════╪═══════════════════════════════╪══════════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 [null]                        │ [null]                        │ autovacuum launcher          │
 [null]                        │ 2022-03-24 20:45:30.757582+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_2993718_7077521990523695096" LOGICAL 8E9/40FF0170 (proto_version '1', publication_names '"focal14"')
 [null]                        │ [null]                        │ pglogical supervisor         │
 [null]                        │ [null]                        │ logical replication launcher │
 [null]                        │ 2022-03-24 20:45:30.730818+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3343043_7077521990523695096" LOGICAL 8E9/40FF0100 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.65015+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3079712_7077521990523695096" LOGICAL 8E9/40FEFF60 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:30.733366+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3145162_7077521990523695096" LOGICAL 8E9/40FF0138 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:30.76285+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3355696_7077521990523695096" LOGICAL 8E9/40FEFF98 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.663299+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3099460_7077521990523695096" LOGICAL 8E9/40FEFFD0 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.676412+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3162651_7077521990523695096" LOGICAL 8E9/40FF0020 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.691807+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3157454_7077521990523695096" LOGICAL 8E9/40FF0058 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:31.69548+00  │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3084198_7077521990523695096" LOGICAL 8E9/40FF0090 (proto_version '1', publication_names '"focal14"')
 [null]                        │ 2022-03-24 20:45:30.974833+00 │ walsender                    │ START_REPLICATION SLOT "pg_3548429_sync_3142573_7077521990523695096" LOGICAL 8E9/40FF00C8 (proto_version '1', publication_names '"focal14"')
 2022-03-25 08:35:27.971161+00 │ 2022-03-25 08:35:27.971161+00 │ client backend               │ select xact_start, query_start, backend_type, query from pg_stat_activity where state is distinct from 'idle';
 [null]                        │ [null]                        │ background writer            │
 [null]                        │ [null]                        │ checkpointer                 │
 [null]                        │ [null]                        │ walwriter                    │
(17 rows)

So, it looks to me like there is some random-ish condition that causes
the problem.

I tried to insulate the Pgs from the world as well as I could: no
monitoring, no cron, no external access. Just replication and my simple
load generator, running this loop:

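-- create a 100k-row table, index it, then drop it, to generate WAL and catalog churn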
CREATE TABLE depesz_test as SELECT i, repeat('payload', 50) as payload FROM generate_series(1,100000) i;
CREATE INDEX depesz_test_idx on depesz_test(i);
DROP TABLE depesz_test;

every minute.

I'm out of ideas. The good-ish news is that when I tried replication in
smaller sets (30k tables, not 60k), and/or with just
2 max_sync_workers_per_subscription instead of 10, it never failed. So
I guess I'll go that way.
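
For the record, lowering the sync worker count on the subscribers is just the
usual GUC change (it only needs a reload, not a restart), e.g.:

-- on each focal pg 14.2 subscriber
ALTER SYSTEM SET max_sync_workers_per_subscription = 2;
SELECT pg_reload_conf();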

Best regards,

depesz




