Обсуждение: BUG #19041: Logical replication locks wal processing

Поиск
Список
Период
Сортировка

BUG #19041: Logical replication locks wal processing

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      19041
Logged by:          Sergey Belyashov
Email address:      sergey.belyashov@gmail.com
PostgreSQL version: 17.6
Operating system:   Debian bookworm x86_64
Description:

I have few Postgresql servers: A, B, C, D... Each servers has tables: users,
settings and t with partitions t1, t2, t3... Server A publish tables users
and settings, and other servers are subscribed on them using one
subscription (logical replication is used). Other servers publish tables t
and server A subscribed on it (it does not matter, I think). When I create
table s (without indexes) on server A and copy huge about of rows (30M+) all
workers which do replication users,settings tables loads cpu cores by 35-80%
for a long time (5+ hours, I do not wait more). WAL is raising at this time
(37GB+). When I drop subscriptions to these tables from servers B, C... then
WAL is processed very fast. When I keep only one subscription undropped then
WAL is processed (reduced to 2-3 GB) in a hour and only one process eats
35-50% of CPU core time until WAL is not processed. Tables users and
settings are not changed or using (there are not foreign keys in the table
s) during the issued case. So the impact on their publication/subscriptions
is unexpected.

server_a# create table users (user_id integer, name text);
server_a# create table settings (user_id integer, name text, value text);
server_a# create publication users_pub for table users WITH (publish =
'insert');
server_a# create publication settings_pub for table settings;
...
server_b# create table users (user_id integer, name text);
server_b# create table settings (user_id integer, name text, value text);
server_b# create subscription usr_stt_sub connection '...' publication
users_pub, settings_pub;
...
server_a# create table s (trx_id integer, amount numeric);
server_a# COPY s FROM ...;


Re: BUG #19041: Logical replication locks wal processing

От
Dilip Kumar
Дата:
On Wed, Sep 3, 2025 at 4:55 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> The following bug has been logged on the website:
>
> Bug reference:      19041
> Logged by:          Sergey Belyashov
> Email address:      sergey.belyashov@gmail.com
> PostgreSQL version: 17.6
> Operating system:   Debian bookworm x86_64
> Description:
>
> I have few Postgresql servers: A, B, C, D... Each servers has tables: users,
> settings and t with partitions t1, t2, t3... Server A publish tables users
> and settings, and other servers are subscribed on them using one
> subscription (logical replication is used). Other servers publish tables t
> and server A subscribed on it (it does not matter, I think). When I create
> table s (without indexes) on server A and copy huge about of rows (30M+) all
> workers which do replication users,settings tables loads cpu cores by 35-80%
> for a long time (5+ hours, I do not wait more). WAL is raising at this time
> (37GB+). When I drop subscriptions to these tables from servers B, C... then
> WAL is processed very fast. When I keep only one subscription undropped then
> WAL is processed (reduced to 2-3 GB) in a hour and only one process eats
> 35-50% of CPU core time until WAL is not processed. Tables users and
> settings are not changed or using (there are not foreign keys in the table
> s) during the issued case. So the impact on their publication/subscriptions
> is unexpected.

Just to give some context, a separate walsender process will be
created for each subscription and every walsender will decode(process)
all the WALs, so that's the reason that when you have multiple
subscriber there would be multiple walsender process will be their to
decode the WAL and each will consume CPU for decoding OTOH if you if
you only keep one subscription then there will only be one walsender
for decoding the WAL.  And WAL can not be removed until all the data
are streamed to all the subscribers and if you drop all the
subscribers WAL can be removed by the checkpoint if it is not required
by any other purpose like hot standby or wal archiver.

--
Regards,
Dilip Kumar
Google



Re: BUG #19041: Logical replication locks wal processing

От
Sergey Belyashov
Дата:
Thank you for your explanation. It is what I'm about.
Is it possible to optimize something to prevent decoding the whole WAL
if there are no affected tables there? For example, just skip a lot of
WAL blocks with unpublished tables. And/or combine WAL decoding in one
separate process for each publication which works for all active
affected subscriptions, if subscription is not active (server down or
too busy) then it is switched to legacy selfdecoding.

Best regards,
Sergey Belyashov

ср, 3 сент. 2025 г. в 14:56, Dilip Kumar <dilipbalaut@gmail.com>:
>
> On Wed, Sep 3, 2025 at 4:55 PM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> >
> > The following bug has been logged on the website:
> >
> > Bug reference:      19041
> > Logged by:          Sergey Belyashov
> > Email address:      sergey.belyashov@gmail.com
> > PostgreSQL version: 17.6
> > Operating system:   Debian bookworm x86_64
> > Description:
> >
> > I have few Postgresql servers: A, B, C, D... Each servers has tables: users,
> > settings and t with partitions t1, t2, t3... Server A publish tables users
> > and settings, and other servers are subscribed on them using one
> > subscription (logical replication is used). Other servers publish tables t
> > and server A subscribed on it (it does not matter, I think). When I create
> > table s (without indexes) on server A and copy huge about of rows (30M+) all
> > workers which do replication users,settings tables loads cpu cores by 35-80%
> > for a long time (5+ hours, I do not wait more). WAL is raising at this time
> > (37GB+). When I drop subscriptions to these tables from servers B, C... then
> > WAL is processed very fast. When I keep only one subscription undropped then
> > WAL is processed (reduced to 2-3 GB) in a hour and only one process eats
> > 35-50% of CPU core time until WAL is not processed. Tables users and
> > settings are not changed or using (there are not foreign keys in the table
> > s) during the issued case. So the impact on their publication/subscriptions
> > is unexpected.
>
> Just to give some context, a separate walsender process will be
> created for each subscription and every walsender will decode(process)
> all the WALs, so that's the reason that when you have multiple
> subscriber there would be multiple walsender process will be their to
> decode the WAL and each will consume CPU for decoding OTOH if you if
> you only keep one subscription then there will only be one walsender
> for decoding the WAL.  And WAL can not be removed until all the data
> are streamed to all the subscribers and if you drop all the
> subscribers WAL can be removed by the checkpoint if it is not required
> by any other purpose like hot standby or wal archiver.
>
> --
> Regards,
> Dilip Kumar
> Google



Re: BUG #19041: Logical replication locks wal processing

От
Dilip Kumar
Дата:
On Wed, Sep 3, 2025 at 5:40 PM Sergey Belyashov
<sergey.belyashov@gmail.com> wrote:
>
> Thank you for your explanation. It is what I'm about.
> Is it possible to optimize something to prevent decoding the whole WAL
> if there are no affected tables there? For example, just skip a lot of
> WAL blocks with unpublished tables. And/or combine WAL decoding in one
> separate process for each publication which works for all active
> affected subscriptions, if subscription is not active (server down or
> too busy) then it is switched to legacy selfdecoding.

IMHO this is a valid optimization to have a single decoding worker and
all the walsender can get the required WALs decoded by a single
worker.  I think it's been discussed in the past as well and we might
do it sometime in the future but I don't see anything is in progress
for this as of now.

--
Regards,
Dilip Kumar
Google



RE: BUG #19041: Logical replication locks wal processing

От
"Hayato Kuroda (Fujitsu)"
Дата:
Dear Sergey,

> Thank you for your explanation. It is what I'm about.
> Is it possible to optimize something to prevent decoding the whole WAL
> if there are no affected tables there?

I recalled there was a proposal which skips decoding changes for unpublished
tables, but it was withdrawn [1]. One of reasons was that a new transaction must
be started to check whether the table must be published or not, thus the
performance can be degraded in some cases.

I have also considered Dilip's idea [2] once, but it needs some architectural
changes. I cannot find on-going thread neither.

[1]: https://commitfest.postgresql.org/patch/5585/
[2]: https://www.postgresql.org/message-id/CAFiTN-uwgQ%3DHVFRdz2%3DJUVgQHCPubatKrK14xD-0Ov71HRpbTQ%40mail.gmail.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED