Discussion: memory leak in logical WAL sender with pgoutput's cachectx
Hi all,

We recently ran into a memory leak in a production logical-replication WAL-sender process. A simplified reproduction script is attached.

If you run the script and then call MemoryContextStats(TopMemoryContext), you will see something like:

```
logical replication cache context: 562044928 total in 77 blocks;
```

meaning "cachectx" has grown to ~500 MB, and it keeps growing as the number of tables increases.

The workload can be summarised as follows:
1. CREATE PUBLICATION FOR ALL TABLES
2. CREATE SUBSCRIPTION
3. Repeatedly CREATE TABLE and DROP TABLE

cachectx is used mainly for the entry->old_slot, entry->new_slot and entry->attrmap allocations. When a DROP TABLE causes an invalidation we only set entry->replicate_valid = false; we do not free those allocations immediately. They are freed only if the same entry is used again. In some workloads an entry may never be reused, or it may be reused briefly and then become unreachable forever. (The WAL sender may still need to decode WAL records for tables that have already been dropped while it is processing the invalidation.)

Given the current design I don't see a simple fix. Perhaps RelationSyncCache needs some kind of eviction/cleanup policy to prevent this memory growth in such scenarios.

Does anyone have ideas or suggestions?
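The lazy-free pattern described above can be modelled with a small self-contained C program. This is a toy model only, with invented names (ToyEntry, toy_get_entry) and a flat array standing in for the real dynahash table; it is not actual pgoutput code, but it shows why memory held by invalidated-but-never-reused entries accumulates:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_ENTRIES 1024

/* Toy stand-in for RelationSyncEntry: invalidation only clears the
 * valid flag; the substructure is freed lazily, on the next lookup of
 * the *same* entry. */
typedef struct ToyEntry
{
	int		relid;
	bool	in_cache;
	bool	replicate_valid;
	char   *old_slot;			/* stands in for old_slot/new_slot/attrmap */
} ToyEntry;

static ToyEntry cache[MAX_ENTRIES];
static size_t	held_bytes;		/* bytes held, including by invalid entries */

static ToyEntry *
toy_get_entry(int relid)
{
	ToyEntry   *e = &cache[relid % MAX_ENTRIES];

	if (!e->in_cache)
	{
		e->in_cache = true;
		e->relid = relid;
		e->replicate_valid = false;
		e->old_slot = NULL;
	}
	if (!e->replicate_valid)
	{
		/* Lazy cleanup: only happens if this entry is looked up again. */
		if (e->old_slot)
		{
			held_bytes -= 1000;
			free(e->old_slot);
		}
		e->old_slot = malloc(1000);	/* rebuild the cached state */
		held_bytes += 1000;
		e->replicate_valid = true;
	}
	return e;
}

static void
toy_invalidate(int relid)
{
	ToyEntry   *e = &cache[relid % MAX_ENTRIES];

	/* Mirrors the callback behaviour: flag only, no free. */
	if (e->in_cache)
		e->replicate_valid = false;
}
```

Looking up and then dropping 100 distinct tables leaves all 100 allocations in place; only re-looking-up a given entry recycles that one entry's memory.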
Attachments
Dear Zhao,

Thanks for raising the issue.

> If you run the script and then call MemoryContextStats(TopMemoryContext). you
> will see something like:
> "logical replication cache context: 562044928 total in 77 blocks;"
> meaning “cachectx” has grown to ~500 MB, and it keeps growing as the number

I also ran your script with count=1000, and confirmed that cachectx had grown to around 50 MB:

```
logical replication cache context: 58728448 total in 17 blocks; 3239568 free (62 chunks); 55488880 used
```

> cachectx is used mainly for entry->old_slot, entry->new_slot and entry->attrmap
> allocations. When a DROP TABLE causes an invalidation we only set
> entry->replicate_valid = false; we do not free those allocations immediately.
> They are freed only if the same entry is used again. In some workloads an entry
> may never be reused, or it may be reused briefly and then become unreachable
> forever (The WAL sender may still need to decode WAL records for tables that
> have already been dropped while it is processing the invalidation.)

So your suggestion is that we should sometimes free the memory allocated for them, right? Valid point.

> Given the current design I don’t see a simple fix. Perhaps RelationSyncCache
> needs some kind of eviction/cleanup policy to prevent this memory growth in
> such scenarios.
>
> Does anyone have ideas or suggestions?

Naively considered, the relsync cache can be cleaned up once entries have been invalidated many times. The attached patch implements this idea. It reduced the memory used on my env:

```
logical replication cache context: 1056768 total in 8 blocks; 556856 free (51 chunks); 499912 used
```

Can you verify that?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments
Hi, Thanks for the patch. On Thu, Aug 14, 2025 at 6:39 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Zhao, > > Thanks for raising the issue. > > > If you run the script and then call MemoryContextStats(TopMemoryContext). you > > will see something like: > > "logical replication cache context: 562044928 total in 77 blocks;" > > meaning “cachectx” has grown to ~500 MB, and it keeps growing as the number > > I also ran your script with count=1000, and confirmed that cachectx was grown > to around 50MB: > > ``` > logical replication cache context: 58728448 total in 17 blocks; 3239568 free (62 chunks); 55488880 used > ``` > > > cachectx is used mainly for entry->old_slot, entry->new_slot and entry->attrmap > > allocations. When a DROP TABLE causes an invalidation we only set > > entry->replicate_valid = false; we do not free those allocations immediately. > > They are freed only if the same entry is used again. In some workloads an entry > > may never be reused, or it may be reused briefly and then become unreachable > > forever (The WAL sender may still need to decode WAL records for tables that > > have already been dropped while it is processing the invalidation.) > > So, your suggestion is that we should sometimes free allocated memory for them, > right? Valid point. > > > Given the current design I don’t see a simple fix. Perhaps RelationSyncCache > > needs some kind of eviction/cleanup policy to prevent this memory growth in > > such scenarios. > > > > Does anyone have ideas or suggestions? > > Naively considered, relsync cahe can be cleaned up if entries were invalidated > many times. Attached patch implemented idea. It could reduce the used memory on > my env: > > ``` > logical replication cache context: 1056768 total in 8 blocks; 556856 free (51 chunks); 499912 used > ``` > > Can you verify that? > I had a quick look at it, and have some questions. Is it safe to free the substructure from within rel_sync_cache_relation_cb()? 
I'm also interested in the reasoning behind setting NINVALIDATION_THRESHOLD to 100.

Best,
Xuneng
Dear Xuneng,

> Is it safe to free the substructure from within rel_sync_cache_relation_cb()?

You are referring to the comment in rel_sync_cache_relation_cb(), right? My understanding of that comment is that we must not access any *system caches* from the callback. Here we do not re-build the cache entries, so we do not touch the syscaches - that should be permitted. I'd be happy if you could confirm this point as well.

> I'm also interested in the reasoning behind setting
> NINVALIDATION_THRESHOLD to 100.

This is the debatable point of this implementation. I set it to 100 because that was sufficient for the provided workload, but it may cause replication lag. We may have to devise a benchmark workload, measure the data, and choose an appropriate value; 100 is just a starting point.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thursday, August 14, 2025 8:49 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Xuneng, > > > Is it safe to free the substructure from within rel_sync_cache_relation_cb()? > > You referred the comment in rel_sync_cache_relation_cb() right? I understood > like that we must not access to any *system caches*, from the comment. Here > we do not re-build caches so that we do not access to the syscaches - it is > permitted. We should also avoid freeing the cache within the callback function, as entries might still be accessed after the callback is invoked. This callback can be triggered during the execution of pgoutput_change, which poses a risk of concurrent access issues. For reference, see commit 7f481b8, which addresses a similar issue. Ideally, cache release should occur within the normal code execution path, such as within pgoutput_change() or get_rel_sync_entry(). > > > I’ also interested in the reasoning behind setting > > NINVALIDATION_THRESHOLD to 100. > > This is the debatable point of this implementation. I set to 100 because it was > sufficient with the provided workload, but this may cause the replication lag. > We may have to consider a benchmark workload, measure data, and consider > the appropriate value. 100 is just an initial point. I think it's worthwhile to test the performance impact of this approach. If the table is altered but not dropped, removing the entries could introduce additional overhead. Although this overhead might be negligible if within an acceptable threshold, it still warrants consideration. Additionally, if the number of invalidations is significant, the cleanup process may take a considerable amount of time, potentially resulting in performance degradation. In such cases, it might be beneficial to consider destroying the hash table and creating a new one. Therefore, analyzing these scenarios is essential. Best Regards, Hou zj
Hi Zhijie and Hayato-san, Thanks for your clarification! On Fri, Aug 15, 2025 at 10:10 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, August 14, 2025 8:49 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Xuneng, > > > > > Is it safe to free the substructure from within rel_sync_cache_relation_cb()? > > > > You referred the comment in rel_sync_cache_relation_cb() right? I understood > > like that we must not access to any *system caches*, from the comment. Here > > we do not re-build caches so that we do not access to the syscaches - it is > > permitted. > > We should also avoid freeing the cache within the callback function, as > entries might still be accessed after the callback is invoked. This callback can > be triggered during the execution of pgoutput_change, which poses a risk of > concurrent access issues. For reference, see commit 7f481b8, which addresses a > similar issue. Ideally, cache release should occur within the normal code > execution path, such as within pgoutput_change() or get_rel_sync_entry(). I am still getting familiar with this module, so I cannot make a sound confirmation for this yet. I'll take a look at the commit mentioned by Zhijie and trace the callback paths. > > > > > I’ also interested in the reasoning behind setting > > > NINVALIDATION_THRESHOLD to 100. > > > > This is the debatable point of this implementation. I set to 100 because it was > > sufficient with the provided workload, but this may cause the replication lag. > > We may have to consider a benchmark workload, measure data, and consider > > the appropriate value. 100 is just an initial point. > > I think it's worthwhile to test the performance impact of this approach. If the > table is altered but not dropped, removing the entries could introduce > additional overhead. Although this overhead might be negligible if within an > acceptable threshold, it still warrants consideration. 
> Additionally, if the
> number of invalidations is significant, the cleanup process may take a
> considerable amount of time, potentially resulting in performance degradation.
> In such cases, it might be beneficial to consider destroying the hash table and
> creating a new one. Therefore, analyzing these scenarios is essential.

Intuitively, a heuristic threshold is straightforward, but it can be hard to get "right" since workloads vary, and designing tests that reflect real-world use cases well may not be easy either. I'm wondering whether it's worth brainstorming other potential alternatives before settling on the current approach and starting to benchmark and tune performance.

Sequential cleaning could be costly in certain cases. I don't know much about RelationSyncCache's life cycle, but it seems we should be cautious about shortening it, to avoid subtle issues.

Best,
Xuneng
On Friday, August 15, 2025 10:59 AM Xuneng Zhou <xunengzhou@gmail.com> wrote: > Thanks for your clarification! > > On Fri, Aug 15, 2025 at 10:10 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Thursday, August 14, 2025 8:49 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Dear Xuneng, > > > > > > > Is it safe to free the substructure from within > rel_sync_cache_relation_cb()? > > > > > > You referred the comment in rel_sync_cache_relation_cb() right? I > > > understood like that we must not access to any *system caches*, from > > > the comment. Here we do not re-build caches so that we do not access > > > to the syscaches - it is permitted. > > > > We should also avoid freeing the cache within the callback function, > > as entries might still be accessed after the callback is invoked. This > > callback can be triggered during the execution of pgoutput_change, > > which poses a risk of concurrent access issues. For reference, see > > commit 7f481b8, which addresses a similar issue. Ideally, cache > > release should occur within the normal code execution path, such as within > pgoutput_change() or get_rel_sync_entry(). > > I am still getting familiar with this module, so I cannot make a sound > confirmation for this yet. I'll take a look at the commit mentioned by Zhijie and > trace the callback paths. > > > > > > > > I’ also interested in the reasoning behind setting > > > > NINVALIDATION_THRESHOLD to 100. > > > > > > This is the debatable point of this implementation. I set to 100 > > > because it was sufficient with the provided workload, but this may cause > the replication lag. > > > We may have to consider a benchmark workload, measure data, and > > > consider the appropriate value. 100 is just an initial point. > > > > I think it's worthwhile to test the performance impact of this > > approach. If the table is altered but not dropped, removing the > > entries could introduce additional overhead. 
> > Although this overhead
> > might be negligible if within an acceptable threshold, it still
> > warrants consideration. Additionally, if the number of invalidations
> > is significant, the cleanup process may take a considerable amount of time,
> > potentially resulting in performance degradation.
> > In such cases, it might be beneficial to consider destroying the hash
> > table and creating a new one. Therefore, analyzing these scenarios is
> > essential.
>
> Intuitively, a heuristic threshold is straightforward but also can be hard to get it
> "right" since the workload varies.
> Designing and writing tests to reflect the real-world use cases well may not be
> that easy as well. I'm wondering whether it's beneficial to brainstorm other
> potential alternatives before settling on the current approach and start
> benchmarking and performance tuning.
>
> Sequential cleaning could be costful in certain cases. I don't know much about
> RelationSyncCache's life cycle, but it seems that we should be cautious on
> shortening it without causing subtle issues.

I don't have a great idea that can avoid introducing overhead in all cases.

An alternative approach I considered before is deleting the hash entry upon invalidation if the entry is not in use. This would require adding an in_use flag in RelationSyncEntry, setting it to true when get_rel_sync_entry is called, and marking it as false once the entry is no longer in use. This approach is inspired by relcache invalidation. I am not entirely sure it is worth the effort, but I'm sharing the idea with a POC patch for reference; it has been modified based on Kuroda-San's version.

Best Regards,
Hou zj
Attachments
On Thu, Aug 14, 2025 at 10:26 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, August 15, 2025 10:59 AM Xuneng Zhou <xunengzhou@gmail.com> wrote: > > Thanks for your clarification! > > > > On Fri, Aug 15, 2025 at 10:10 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > > wrote: > > > > > > On Thursday, August 14, 2025 8:49 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > > > Dear Xuneng, > > > > > > > > > Is it safe to free the substructure from within > > rel_sync_cache_relation_cb()? > > > > > > > > You referred the comment in rel_sync_cache_relation_cb() right? I > > > > understood like that we must not access to any *system caches*, from > > > > the comment. Here we do not re-build caches so that we do not access > > > > to the syscaches - it is permitted. > > > > > > We should also avoid freeing the cache within the callback function, > > > as entries might still be accessed after the callback is invoked. This > > > callback can be triggered during the execution of pgoutput_change, > > > which poses a risk of concurrent access issues. For reference, see > > > commit 7f481b8, which addresses a similar issue. Ideally, cache > > > release should occur within the normal code execution path, such as within > > pgoutput_change() or get_rel_sync_entry(). > > > > I am still getting familiar with this module, so I cannot make a sound > > confirmation for this yet. I'll take a look at the commit mentioned by Zhijie and > > trace the callback paths. > > > > > > > > > > > I’ also interested in the reasoning behind setting > > > > > NINVALIDATION_THRESHOLD to 100. > > > > > > > > This is the debatable point of this implementation. I set to 100 > > > > because it was sufficient with the provided workload, but this may cause > > the replication lag. > > > > We may have to consider a benchmark workload, measure data, and > > > > consider the appropriate value. 100 is just an initial point. 
> > > > > > I think it's worthwhile to test the performance impact of this > > > approach. If the table is altered but not dropped, removing the > > > entries could introduce additional overhead. Although this overhead > > > might be negligible if within an acceptable threshold, it still > > > warrants consideration. Additionally, if the number of invalidations > > > is significant, the cleanup process may take a considerable amount of time, > > potentially resulting in performance degradation. > > > In such cases, it might be beneficial to consider destroying the hash > > > table and creating a new one. Therefore, analyzing these scenarios is > > essential. > > > > Intuitively, a heuristic threshold is straightforward but also can be hard to get it > > "right" since the workload varies. > > Designing and writing tests to reflect the real-world use cases well may not be > > that easy as well. I'm wondering whether it's beneficial to brainstorm other > > potential alternatives before settling on the current approach and start > > benchmarking and performance tuning. > > > > Sequential cleaning could be costful in certain cases. I don't know much about > > RelationSyncCache's life cycle, but it seems that we should be cautious on > > shortening it without causing subtle issues. > > I don't have a great idea that can avoid introducing overhead in all cases. > > An alternative approach I considered before is deleting the hash entry upon > invalidation if the entry is not in use. This would require adding an in_use > flag in RelationSyncEntry, setting it to true when get_rel_sync_entry is called, > and marking it as false once the entry is no longer in use. This approach is > inspired from the relcache invalidation. I am slightly not sure if it's worth > the effort, but just share this idea with a POC patch for reference, which has > been modified based on Kuroda-San's version. 
IIUC the patch adds logic to close entries, setting in_use to false after processing each change. For example, in pgoutput_change() we close the entry even after we have sent its change:

@@ -1621,6 +1635,8 @@ cleanup:
 		ExecDropSingleTupleTableSlot(new_slot);
 	}

+	close_rel_sync_entry(relentry);
+
 	MemoryContextSwitchTo(old);
 	MemoryContextReset(data->context);

Given that cache invalidation is executed upon replaying REORDER_BUFFER_CHANGE_INVALIDATION and at the end of a transaction replay, in which case do we keep the relcache entry (i.e. just setting replicate_valid=false) because of in_use=true?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,

> Given that cache invalidation is executed upon replaying
> REORDER_BUFFER_CHANGE_INVALIDATION and the end of a transaction
> replay, in which case do we keep the relcache (i.e. just setting
> replicate_valid=false) because of in_use=true?

Per an old discussion [1], logicalrep_write_tuple -> SearchSysCache1 can trigger cache invalidation (I have not verified this, though). I felt that in this case in_use can be true and the cleanup can be skipped. Thoughts?

[1]: https://www.postgresql.org/message-id/OS0PR01MB57168F454E33D53551B8D20294259%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Fri, Aug 15, 2025 at 5:07 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > Given that cache invalidation is executed upon replaying > > REORDER_BUFFER_CHANGE_INVALIDATION and the end of a transaction > > replay, in which case do we keep the relcache (i.e. just setting > > replicate_valid=false) because of in_use=true? > > Per old discussion [1], logicalrep_write_tuple ->SearchSysCache1 can cause the > cache invalidation (I did not verify though). > I felt in this case in_use can be true and validation can be skipped. Thought? I've not verified, but even if that's true, IIUC only one relation's cache entry can set in_use to true at a time. If my understanding is correct, when the walsender accepts invalidation messages in logicalrep_write_tuple() as you mentioned, it doesn't release the cache of the relation whose change is being sent but does for other relations. It seems to me too aggressive to release caches compared to the current behavior. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Aug 13, 2025 at 11:43 PM 赵宇鹏(宇彭) <zhaoyupeng.zyp@alibaba-inc.com> wrote:
>
> Hi all,
>
> We recently ran into a memory leak in a production logical-replication WAL-sender
> process. A simplified reproduction script is attached.
>
> If you run the script and then call MemoryContextStats(TopMemoryContext). you
> will see something like:
> "logical replication cache context: 562044928 total in 77 blocks;"
> meaning “cachectx” has grown to ~500 MB, and it keeps growing as the number of
> tables increases.
>
> The workload can be summarised as follows:
> 1. CREATE PUBLICATION FOR ALL TABLES
> 2. CREATE SUBSCRIPTION
> 3. Repeatedly CREATE TABLE and DROP TABLE

I'd like to know more about your use cases so that we can better understand this problem. In the workload you summarized, many tables are created and dropped. Do the changes to these tables need to be replicated to the subscriber? It seems to me that if the tables exist for only short periods of time, we could use temp tables instead, unless their changes need to be replicated to the subscribers.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,

> I've not verified, but even if that's true, IIUC only one relation's
> cache entry can set in_use to true at a time.

I also think so.

> If my understanding is
> correct, when the walsender accepts invalidation messages in
> logicalrep_write_tuple() as you mentioned, it doesn't release the
> cache of the relation whose change is being sent but does for other
> relations. It seems to me too aggressive to release caches compared to
> the current behavior.

Valid point. In some cases publications can be altered and then the same relsync entries would be added again.

So the first approach may be better from that perspective. The attached patch counts the number of invalidated entries. They are cleaned up only at COMMIT/ABORT/PREPARE time, hence the in_use flag is not needed. Thoughts?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments
On Sun, Aug 17, 2025 at 11:30 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > I've not verified, but even if that's true, IIUC only one relation's
> > cache entry can set in_use to true at a time.
>
> I also think so.
>
> > If my understanding is
> > correct, when the walsender accepts invalidation messages in
> > logicalrep_write_tuple() as you mentioned, it doesn't release the
> > cache of the relation whose change is being sent but does for other
> > relations. It seems to me too aggressive to release caches compared to
> > the current behavior.
>
> Valid point. In some cases, publications can be altered then same relsync entries
> would be added again.
>
> So... first approach may be better from the perspective. Attached patch counts
> the number of invalidated entries. There is a chance to cleanup them only at
> COMMIT/ABORT/PREPARE time - hence in_use flag is not needed. Thought?
>

In the proposed patch, we have the following change:

@@ -2060,6 +2081,11 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 			entry->columns = NULL;
 			entry->attrmap = NULL;
 		}
+		else
+		{
+			/* The entry would be re-used, decrement the counter */
+			RelationSyncCacheInvalidateCounter--;
+		}

It decrements the counter whenever we successfully find the entry from the cache, but I'm not sure this is the right approach. What if no cache invalidation happens at all but we retrieve entries from the cache many times?

I have concerns about the performance implications of iterating through all entries in the caches within maybe_cleanup_rel_sync_cache(). If the cache contains numerous entries, this iteration could potentially cause the walsender to stall. If we use a larger NINVALIDATION_THRESHOLD, we can reduce the number of times we need sequential scans on the hash table, but it would in turn need to free more entries (probably we can have a cap on the number of entries we can free in one cycle?).
An alternative approach would be to implement a dedicated list (such as dclist) specifically for tracking invalidated entries. Entries would be removed from this list when they are reused. We could then implement a threshold-based cleanup mechanism where invalidated entries are freed once the list exceeds a predetermined size. While this approach would minimize the overhead of freeing invalidated entries, it would incur some additional cost for maintaining the list. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi, On Wed, Aug 20, 2025 at 8:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Aug 17, 2025 at 11:30 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Sawada-san, > > > > > I've not verified, but even if that's true, IIUC only one relation's > > > cache entry can set in_use to true at a time. > > > > I also think so. > > > > > If my understanding is > > > correct, when the walsender accepts invalidation messages in > > > logicalrep_write_tuple() as you mentioned, it doesn't release the > > > cache of the relation whose change is being sent but does for other > > > relations. It seems to me too aggressive to release caches compared to > > > the current behavior. > > > > Valid point. In some cases, publications can be altered then same relsync entries > > would be added again. > > > > So... first approach may be better from the perspective. Attached patch counts > > the number of invalidated entries. There is a chance to cleanup them only at > > COMMIT/ABORT/PREPARE time - hence in_use flag is not needed. Thought? > > > > In the proposed patch, we have the following change: > > @@ -2060,6 +2081,11 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) > entry->columns = NULL; > entry->attrmap = NULL; > } > + else > + { > + /* The entry would be re-used, decrement the counter */ > + RelationSyncCacheInvalidateCounter--; > + } > > It decrements the counter whenever we successfully find the entry from > the cache but I'm not sure this is the right approach. What if no > cache invalidation happens at all but we retrieve entries from the > cache many times? > This may not be ideal. It decrements on every lookup of an existing entry, not just when consuming an invalidation, which could make the counter go negative. Do we need decrementing logic? Not perfect 1:1 tracking seems ok in here; though it might make the clean-up a bit more aggressive. Best, Xuneng
Dear Sawada-san,

> It decrements the counter whenever we successfully find the entry from
> the cache but I'm not sure this is the right approach. What if no
> cache invalidation happens at all but we retrieve entries from the
> cache many times?

Oh, right. I tried to handle the case where invalidated entries are re-validated again, but that approach was not correct.

> I have concerns about the performance implications of iterating
> through all entries in the caches within
> maybe_cleanup_rel_sync_cache(). If the cache contains numerous
> entries, this iteration could potentially cause the walsender to
> stall. If we use a larger number NINVALIDATION_THRESHOLD, we can
> reduce the number of times we need sequential scans on the hash table
> but it would in turn need to free more entries (probably we can have a
> cap of the number of entries we can free in one cycle?).

Exactly.

> An alternative approach would be to implement a dedicated list (such
> as dclist) specifically for tracking invalidated entries. Entries
> would be removed from this list when they are reused. We could then
> implement a threshold-based cleanup mechanism where invalidated
> entries are freed once the list exceeds a predetermined size. While
> this approach would minimize the overhead of freeing invalidated
> entries, it would incur some additional cost for maintaining the list.

I also considered this at first, but did not choose it because of the code complexity. After thinking about it more, it is not so difficult; PSA a new file. It introduces a global dclist, invalidated_caches, and a corresponding node in RelationSyncEntry. An entry is pushed onto the list when it is invalidated, and the list length is checked at the end of the transaction; the listed entries are released if the length exceeds the threshold. A flag "already_pushed" is also needed to avoid queuing the same entry twice, because the same entry can be invalidated twice before it is cleaned up. It took me quite a while to recognize that :-(.
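For readers who have not opened the patch, the scheme can be sketched as a small self-contained C program. This is only a rough model with invented names (SketchEntry, the threshold value of 4, a plain singly-linked list); the real patch uses PostgreSQL's dclist and RelationSyncEntry:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define INVAL_THRESHOLD 4		/* assumed; the patch uses a larger value */

typedef struct SketchEntry
{
	bool	replicate_valid;
	bool	already_pushed;		/* guards against pushing the entry twice */
	struct SketchEntry *next;	/* a singly linked list is enough here */
	bool	freed;				/* observable outcome for the demo */
} SketchEntry;

static SketchEntry *inval_list;
static int			inval_len;

/* Invalidation callback analogue: flag the entry and queue it once. */
static void
sketch_invalidate(SketchEntry *e)
{
	e->replicate_valid = false;
	if (!e->already_pushed)
	{
		e->already_pushed = true;
		e->next = inval_list;
		inval_list = e;
		inval_len++;
	}
}

/* Entry got rebuilt before the sweep; it stays queued but the sweep
 * will skip it because it is valid again. */
static void
sketch_revalidate(SketchEntry *e)
{
	e->replicate_valid = true;
}

/* End-of-transaction analogue: sweep the list if it grew too long. */
static void
sketch_end_of_txn(void)
{
	if (inval_len <= INVAL_THRESHOLD)
		return;
	for (SketchEntry *e = inval_list; e != NULL;)
	{
		SketchEntry *next = e->next;

		e->already_pushed = false;
		e->next = NULL;
		if (!e->replicate_valid)
			e->freed = true;	/* a real patch would free the substructure */
		e = next;
	}
	inval_list = NULL;
	inval_len = 0;
}
```

The already_pushed flag is what prevents a doubly-invalidated entry from corrupting the list, and the replicate_valid re-check at sweep time is what lets revived entries survive.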
I'm planning to do a benchmark with this patch. Best regards, Hayato Kuroda FUJITSU LIMITED
Attachments
Dear Xuneng,

> This may not be ideal. It decrements on every lookup of an existing
> entry, not just when consuming an invalidation, which could make the
> counter go
> negative. Do we need decrementing logic? Not perfect 1:1 tracking
> seems ok in here; though it might make the clean-up a bit more
> aggressive.

Yeah, it was not needed. I posted a new approach [1], and even this one has the possibility that some of the listed entries are revived. In that case they are not removed immediately; they are simply skipped during cleanup.

[1]: https://www.postgresql.org/message-id/OSCPR01MB14966E0225BC4AA1BC956E1E3F532A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED
> Firstly I also considered but did not choose because of the code complexity.
> After considering more, it is not so difficult, PSA new file.

v3 contained 100_cachectm_oom.pl, which would not succeed. Here is a patch with that test file removed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments
Hi,

From what we see in our users' production environments, the situation is exactly as previously described. Creating a "publication for all tables" is very common, because manually choosing individual tables to publish can be cumbersome. Regular CREATE/DROP TABLE activity is also normal, and the tables are not necessarily short-lived. Since the walsender is intended to be a long-lived process, its memory footprint keeps accumulating over time.

Even if we ignore DROP TABLE entirely and only consider a large number of tables that must be published, RelationSyncEntry alone can consume substantial memory. Many users run multiple walsenders on the same instance, which further increases memory pressure.

In normal backend processes, many cache structures are never evicted. That already causes issues, but it is at least somewhat tolerable because a backend is considered short-lived and a periodic reconnect can release the memory. A walsender, however, is expected to stay alive much longer; nobody wants replication sessions to be dropped regularly. So I am genuinely curious why structures like RelationSyncEntry were not given an LRU-style eviction mechanism from the start.

Adding an LRU mechanism to RelationSyncEntry has another benefit: it puts an upper bound on the workload of callbacks such as invalidation_cb, preventing the walsender from stalling when there are a large number of tables.

I have therefore implemented a prototype of this idea (borrowing some code from Hayato Kuroda). It should keep memory usage under control in more scenarios while introducing only minimal overhead in theory. I will run additional performance tests to confirm this. What do you think of this approach?

Best regards,
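To make the idea concrete (this is not the attached prototype, just a toy C model with an assumed capacity of 3 and a linear scan in place of a real LRU list): a capped cache evicts the least-recently-used entry when a new one is inserted, bounding both memory and the number of entries any callback must visit.

```c
#include <assert.h>

#define LRU_CAPACITY 3			/* assumed tiny cap for the demo */

typedef struct LruSlot
{
	int		relid;				/* -1 = empty slot */
	long	last_used;			/* logical timestamp of last access */
} LruSlot;

static LruSlot	lru[LRU_CAPACITY];
static long		lru_clock;
static int		lru_evictions;	/* observable outcome for the demo */

static void
lru_init(void)
{
	for (int i = 0; i < LRU_CAPACITY; i++)
		lru[i].relid = -1;
	lru_clock = 0;
	lru_evictions = 0;
}

/* Lookup-or-insert: on a hit just refresh the timestamp; on a miss
 * take an empty slot, or evict the least-recently-used entry. */
static void
lru_access(int relid)
{
	int		victim = 0;

	for (int i = 0; i < LRU_CAPACITY; i++)
	{
		if (lru[i].relid == relid)
		{
			lru[i].last_used = ++lru_clock;
			return;
		}
		if (lru[i].relid == -1)			/* prefer an empty slot */
			victim = i;
		else if (lru[victim].relid != -1 &&
				 lru[i].last_used < lru[victim].last_used)
			victim = i;
	}
	if (lru[victim].relid != -1)
		lru_evictions++;				/* a real patch would free here */
	lru[victim].relid = relid;
	lru[victim].last_used = ++lru_clock;
}
```

With capacity 3, touching relations 1, 2, 3, then 1 again, then 4 evicts relation 2 (the least recently used), while the hot relation 1 stays cached.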
Attachments
On Thu, Aug 21, 2025 at 2:03 PM 赵宇鹏(宇彭) <zhaoyupeng.zyp@alibaba-inc.com> wrote:
>
> From what we see in our users’ production environments, the situation is exactly
> as previously described. Creating a “publication for all tables” is very common,
> because manually choosing individual tables to publish can be cumbersome.
>

I understand the difficulty in choosing individual tables, but OTOH "all tables" may lead to replicating tables that are not even required. This consumes CPU, network, and disk space without any real need, which sounds more harmful than carefully defining the publications once. We have another way to combine tables, which is to use TABLES IN SCHEMA. I don't know if that can help, but anyway, we can't do much if the user wants to replicate all tables.

> Regular CREATE/DROP TABLE activity is also normal, and the tables are not
> necessarily short-lived. Since walsender is intended to be a long-lived process,
> its memory footprint keeps accumulating over time.
>
> Even if we ignore DROP TABLE entirely and only consider a large number of tables
> that must be published, RelationSyncEntry alone can consume substantial memory.
> Many users run multiple walsenders on the same instance, which further increases
> memory pressure.
>

Agreed, there is a possibility of consuming a large amount of memory, but we should still do some experiments to see the real results. We haven't seen complaints about walsenders consuming large amounts of memory due to RelationSyncEntry, so it is also possible that this is just a theoretical case, especially after we have some handling for dropped tables.

> In normal backend processes, many cache structures are never evicted. That
> already causes issues, but it is at least somewhat tolerable because a backend
> is considered short-lived and a periodic reconnect can release the memory.
> A walsender, however, is expected to stay alive much longer, nobody wants > replication sessions to be dropped regularly, so I am genuinely curious why > structures like RelationSyncEntry were not given an LRU-style eviction mechanism > from the start. > > Adding an LRU mechanism to RelationSyncEntry has another benefit: it puts an > upper bound on the workload of callbacks such as invalidation_cb, preventing > walsender from stalling when there are a large number of tables. I have > therefore implemented a prototype of this idea (borrowing some code from > Hayato Kuroda). It should keep memory usage under control in more scenarios > while introducing only minimal overhead in theory. Yeah, such an idea is worth considering after we have some tests to show the high memory usage. BTW, I think it would be better to run some algorithm to evict entries when we reach the threshold limit as we do for shared buffers, see StrategyGetBuffer. Otherwise, we may be paying the cost to maintain such a list when in practice it may be required only very few times. -- With Regards, Amit Kapila.
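The StrategyGetBuffer-style idea mentioned above can be illustrated with a minimal usage-count clock sweep over a fixed set of slots. This is only a sketch under invented names (`CacheSlot`, `NSLOTS`, `MAX_USAGE`); the real buffer-manager code additionally deals with pinning, concurrency, and the free list, none of which is shown here.

```c
/* Illustrative clock-sweep victim selection, loosely modeled on
 * StrategyGetBuffer(): each access bumps a saturating usage count, and
 * the sweep decays counts until it finds a slot with usage == 0. */
#include <assert.h>

#define NSLOTS 4
#define MAX_USAGE 5

typedef struct CacheSlot
{
	int			relid;			/* 0 = empty; unused in this sketch */
	unsigned	usage;			/* bumped on access, decayed by the sweep */
} CacheSlot;

static CacheSlot slots[NSLOTS];
static int	clock_hand;

/* Record an access: saturate at a cap so hot entries cannot become
 * permanently unevictable. */
static void
slot_touch(CacheSlot *s)
{
	if (s->usage < MAX_USAGE)
		s->usage++;
}

/* Advance the hand until a slot with usage == 0 is found, decrementing
 * usage counts as we pass; that slot becomes the eviction victim. */
static CacheSlot *
clock_sweep_victim(void)
{
	for (;;)
	{
		CacheSlot  *s = &slots[clock_hand];

		clock_hand = (clock_hand + 1) % NSLOTS;
		if (s->usage == 0)
			return s;
		s->usage--;
	}
}
```

Unlike a strict LRU list, this needs no per-access pointer manipulation, only a counter bump, which is why it is attractive when evictions are expected to be rare relative to accesses.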
On Thu, Aug 21, 2025 at 10:53 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> > I have concerns about the performance implications of iterating
> > through all entries in the caches within
> > maybe_cleanup_rel_sync_cache(). If the cache contains numerous
> > entries, this iteration could potentially cause the walsender to
> > stall. If we use a larger number NINVALIDATION_THRESHOLD, we can
> > reduce the number of times we need sequential scans on the hash table
> > but it would in turn need to free more entries (probably we can have a
> > cap of the number of entries we can free in one cycle?).
>
> Exactly.

So, at least we can try some tests before completely giving up on this idea.

> > An alternative approach would be to implement a dedicated list (such
> > as dclist) specifically for tracking invalidated entries. Entries
> > would be removed from this list when they are reused. We could then
> > implement a threshold-based cleanup mechanism where invalidated
> > entries are freed once the list exceeds a predetermined size. While
> > this approach would minimize the overhead of freeing invalidated
> > entries, it would incur some additional cost for maintaining the list.
>
> Firstly, I also considered this but did not choose it because of the code
> complexity. After considering it more, it is not so difficult; PSA the new file.

The other idea I was thinking of is whether we can somehow decode the DROP TABLE WAL record, say the deletion of the relid from pg_class; then we could use that to remove the corresponding entry from RelationSyncCache.

--
With Regards,
Amit Kapila.
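The quoted alternative, a dedicated list of invalidated entries with threshold-based cleanup, could look roughly like the standalone sketch below. The names (`Entry`, `INVALIDATED_THRESHOLD`, the helpers) and the singly linked list are invented for illustration; the attached patch would instead use PostgreSQL's dclist API (`dclist_push_tail`, `dclist_delete_from`) on the real RelationSyncEntry, and would free only the per-entry slots and attrmap while keeping the hash entry itself.

```c
/* Illustrative sketch: invalidated entries are threaded onto a list so
 * that bulk cleanup never has to scan the whole hash table. */
#include <assert.h>
#include <stdlib.h>

#define INVALIDATED_THRESHOLD 3

typedef struct Entry
{
	int			relid;
	int			replicate_valid;
	struct Entry *inval_next;	/* link in the invalidated list */
} Entry;

static Entry *inval_list;
static int	inval_count;

/* Invalidation callback: mark the entry invalid and remember it.
 * Idempotent, so repeated invalidations do not grow the list. */
static void
entry_invalidate(Entry *e)
{
	if (!e->replicate_valid)
		return;					/* already on the list */
	e->replicate_valid = 0;
	e->inval_next = inval_list;
	inval_list = e;
	inval_count++;
}

/* Once enough invalidated entries have piled up, free their
 * allocations eagerly instead of waiting for a reuse that may never
 * happen.  Returns the number of entries freed. */
static int
maybe_cleanup_invalidated(void)
{
	int			freed = 0;

	if (inval_count < INVALIDATED_THRESHOLD)
		return 0;
	while (inval_list)
	{
		Entry	   *e = inval_list;

		inval_list = e->inval_next;
		free(e);				/* real code: free slots/attrmap only */
		freed++;
	}
	inval_count = 0;
	return freed;
}
```

The cost of the cleanup pass is proportional to the number of invalidated entries rather than the total cache size, which is the point of keeping the dedicated list.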
On Thu, Aug 21, 2025 at 2:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 21, 2025 at 2:03 PM 赵宇鹏(宇彭) <zhaoyupeng.zyp@alibaba-inc.com> wrote:
> >
> > From what we see in our users' production environments, the situation is exactly
> > as previously described. Creating a "publication for all tables" is very common,
> > because manually choosing individual tables to publish can be cumbersome.
> >
>
> I understand the difficulty in choosing individual tables, but OTOH
> all tables may lead to replicating tables that are not even required.
> This consumes CPU, network, and disk space without the real need. This
> sounds more harmful than once carefully describing publications. We
> have another way to combine tables, which is to use TABLES IN SCHEMA.
> I don't know if that can help, but anyway, we can't do much if the
> user wants to replicate all tables.
>
> > Regular CREATE/DROP TABLE activity is also normal, and the tables are not
> > necessarily short-lived. Since walsender is intended to be a long-lived process,
> > its memory footprint keeps accumulating over time.
> >
> > Even if we ignore DROP TABLE entirely and only consider a large number of tables
> > that must be published, RelationSyncEntry alone can consume substantial memory.
> > Many users run multiple walsenders on the same instance, which further increases
> > memory pressure.
> >
>
> Agreed, there is a possibility of consuming a large amount of memory,
> but still, we should do some experiments to see the real results. We
> haven't seen complaints of walsenders consuming large amounts of
> memory due to RelationSyncEntry's, so it is also possible that this is
> just a theoretical case especially after we have some handling for
> dropped tables.
>
> > In normal backend processes, many cache structures are never evicted. That
> > already causes issues, but it is at least somewhat tolerable because a backend
> > is considered short-lived and a periodic reconnect can release the memory.
> > A walsender, however, is expected to stay alive much longer, nobody wants
> > replication sessions to be dropped regularly, so I am genuinely curious why
> > structures like RelationSyncEntry were not given an LRU-style eviction mechanism
> > from the start.

FYI, as far as I know, we've discussed that a similar problem exists in normal backend processes too, especially for users who are using a connection pool[1] (that discussion was about syscache bloat due to negative cache entries).

Your proposed patch maintains cache entries with a hard limit of 1024 entries using LRU eviction. This could become a performance bottleneck when replicating more than 1024 tables, and even for databases with fewer than 1024 tables it could have a negative performance impact due to the cost of maintaining LRU order. Also, while having a maximum capacity with an eviction mechanism makes sense, we should make this limit configurable (probably by size?). If we limit by size, it might be worth noting that cache entries consume variable amounts of memory, and their underlying memory blocks might persist even after entry eviction.

> > Adding an LRU mechanism to RelationSyncEntry has another benefit: it puts an
> > upper bound on the workload of callbacks such as invalidation_cb, preventing
> > walsender from stalling when there are a large number of tables. I have
> > therefore implemented a prototype of this idea (borrowing some code from
> > Hayato Kuroda). It should keep memory usage under control in more scenarios
> > while introducing only minimal overhead in theory.
>
> Yeah, such an idea is worth considering after we have some tests to
> show the high memory usage. BTW, I think it would be better to run
> some algorithm to evict entries when we reach the threshold limit as
> we do for shared buffers, see StrategyGetBuffer. Otherwise, we may be
> paying the cost to maintain such a list when in practice it may be
> required only very few times.

Yeah, something like clock-sweep could work too. I think we need to carefully select the cache management algorithm. While it's a well-known fact that maintaining LRU ordering is expensive, I'm really not sure clock-sweep is a suitable algorithm for logical replication's cache use cases. Also, we might want to avoid iterating over all cache entries to find ones to evict.

Regards,

[1] https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com