Discussion: memory leak in logical WAL sender with pgoutput's cachectx
Hi all,

We recently ran into a memory leak in a production logical-replication WAL-sender process. A simplified reproduction script is attached.

If you run the script and then call MemoryContextStats(TopMemoryContext), you will see something like:

```
logical replication cache context: 562044928 total in 77 blocks;
```

meaning "cachectx" has grown to ~500 MB, and it keeps growing as the number of tables increases.

The workload can be summarised as follows:
1. CREATE PUBLICATION FOR ALL TABLES
2. CREATE SUBSCRIPTION
3. Repeatedly CREATE TABLE and DROP TABLE

cachectx is used mainly for the entry->old_slot, entry->new_slot and entry->attrmap allocations. When a DROP TABLE causes an invalidation we only set entry->replicate_valid = false; we do not free those allocations immediately. They are freed only if the same entry is used again. In some workloads an entry may never be reused, or it may be reused briefly and then become unreachable forever. (The WAL sender may still need to decode WAL records for tables that have already been dropped while it is processing the invalidation.)

Given the current design I don't see a simple fix. Perhaps RelationSyncCache needs some kind of eviction/cleanup policy to prevent this memory growth in such scenarios.

Does anyone have ideas or suggestions?
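The lazy-free pattern described above can be modelled with a small self-contained C program. This is a toy model only, with invented names (ToyEntry, toy_get_entry) and a flat array standing in for the real dynahash table; it is not actual pgoutput code, but it shows why memory held by invalidated-but-never-reused entries accumulates:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_ENTRIES 1024

/* Toy stand-in for RelationSyncEntry: invalidation only clears the
 * valid flag; the substructure is freed lazily, on the next lookup of
 * the *same* entry. */
typedef struct ToyEntry
{
	int		relid;
	bool	in_cache;
	bool	replicate_valid;
	char   *old_slot;			/* stands in for old_slot/new_slot/attrmap */
} ToyEntry;

static ToyEntry cache[MAX_ENTRIES];
static size_t	held_bytes;		/* bytes held, including by invalid entries */

static ToyEntry *
toy_get_entry(int relid)
{
	ToyEntry   *e = &cache[relid % MAX_ENTRIES];

	if (!e->in_cache)
	{
		e->in_cache = true;
		e->relid = relid;
		e->replicate_valid = false;
		e->old_slot = NULL;
	}
	if (!e->replicate_valid)
	{
		/* Lazy cleanup: only happens if this entry is looked up again. */
		if (e->old_slot)
		{
			held_bytes -= 1000;
			free(e->old_slot);
		}
		e->old_slot = malloc(1000);	/* rebuild the cached state */
		held_bytes += 1000;
		e->replicate_valid = true;
	}
	return e;
}

static void
toy_invalidate(int relid)
{
	ToyEntry   *e = &cache[relid % MAX_ENTRIES];

	/* Mirrors the callback behaviour: flag only, no free. */
	if (e->in_cache)
		e->replicate_valid = false;
}
```

Looking up and then dropping 100 distinct tables leaves all 100 allocations in place; only re-looking-up a given entry recycles that one entry's memory.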
Attachments
Dear Zhao,

Thanks for raising the issue.

> If you run the script and then call MemoryContextStats(TopMemoryContext). you
> will see something like:
> "logical replication cache context: 562044928 total in 77 blocks;"
> meaning “cachectx” has grown to ~500 MB, and it keeps growing as the number

I also ran your script with count=1000, and confirmed that cachectx had grown to around 50 MB:

```
logical replication cache context: 58728448 total in 17 blocks; 3239568 free (62 chunks); 55488880 used
```

> cachectx is used mainly for entry->old_slot, entry->new_slot and entry->attrmap
> allocations. When a DROP TABLE causes an invalidation we only set
> entry->replicate_valid = false; we do not free those allocations immediately.
> They are freed only if the same entry is used again. In some workloads an entry
> may never be reused, or it may be reused briefly and then become unreachable
> forever (The WAL sender may still need to decode WAL records for tables that
> have already been dropped while it is processing the invalidation.)

So your suggestion is that we should sometimes free the memory allocated for them, right? Valid point.

> Given the current design I don’t see a simple fix. Perhaps RelationSyncCache
> needs some kind of eviction/cleanup policy to prevent this memory growth in
> such scenarios.
>
> Does anyone have ideas or suggestions?

Naively considered, the relsync cache can be cleaned up once entries have been invalidated many times. The attached patch implements this idea. It reduced the memory used on my env:

```
logical replication cache context: 1056768 total in 8 blocks; 556856 free (51 chunks); 499912 used
```

Can you verify that?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments
Hi, Thanks for the patch. On Thu, Aug 14, 2025 at 6:39 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Zhao, > > Thanks for raising the issue. > > > If you run the script and then call MemoryContextStats(TopMemoryContext). you > > will see something like: > > "logical replication cache context: 562044928 total in 77 blocks;" > > meaning “cachectx” has grown to ~500 MB, and it keeps growing as the number > > I also ran your script with count=1000, and confirmed that cachectx was grown > to around 50MB: > > ``` > logical replication cache context: 58728448 total in 17 blocks; 3239568 free (62 chunks); 55488880 used > ``` > > > cachectx is used mainly for entry->old_slot, entry->new_slot and entry->attrmap > > allocations. When a DROP TABLE causes an invalidation we only set > > entry->replicate_valid = false; we do not free those allocations immediately. > > They are freed only if the same entry is used again. In some workloads an entry > > may never be reused, or it may be reused briefly and then become unreachable > > forever (The WAL sender may still need to decode WAL records for tables that > > have already been dropped while it is processing the invalidation.) > > So, your suggestion is that we should sometimes free allocated memory for them, > right? Valid point. > > > Given the current design I don’t see a simple fix. Perhaps RelationSyncCache > > needs some kind of eviction/cleanup policy to prevent this memory growth in > > such scenarios. > > > > Does anyone have ideas or suggestions? > > Naively considered, relsync cahe can be cleaned up if entries were invalidated > many times. Attached patch implemented idea. It could reduce the used memory on > my env: > > ``` > logical replication cache context: 1056768 total in 8 blocks; 556856 free (51 chunks); 499912 used > ``` > > Can you verify that? > I had a quick look at it, and have some questions. Is it safe to free the substructure from within rel_sync_cache_relation_cb()? 
I'm also interested in the reasoning behind setting NINVALIDATION_THRESHOLD to 100.

Best,
Xuneng
Dear Xuneng,

> Is it safe to free the substructure from within rel_sync_cache_relation_cb()?

You are referring to the comment in rel_sync_cache_relation_cb(), right? My understanding of that comment is that we must not access any *system caches* from the callback. Here we do not re-build the cache entries, so we do not touch the syscaches - that should be permitted. I'd be happy if you could confirm this point as well.

> I'm also interested in the reasoning behind setting
> NINVALIDATION_THRESHOLD to 100.

This is the debatable point of this implementation. I set it to 100 because that was sufficient for the provided workload, but it may cause replication lag. We may have to devise a benchmark workload, measure the data, and choose an appropriate value; 100 is just a starting point.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Thursday, August 14, 2025 8:49 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Xuneng, > > > Is it safe to free the substructure from within rel_sync_cache_relation_cb()? > > You referred the comment in rel_sync_cache_relation_cb() right? I understood > like that we must not access to any *system caches*, from the comment. Here > we do not re-build caches so that we do not access to the syscaches - it is > permitted. We should also avoid freeing the cache within the callback function, as entries might still be accessed after the callback is invoked. This callback can be triggered during the execution of pgoutput_change, which poses a risk of concurrent access issues. For reference, see commit 7f481b8, which addresses a similar issue. Ideally, cache release should occur within the normal code execution path, such as within pgoutput_change() or get_rel_sync_entry(). > > > I’ also interested in the reasoning behind setting > > NINVALIDATION_THRESHOLD to 100. > > This is the debatable point of this implementation. I set to 100 because it was > sufficient with the provided workload, but this may cause the replication lag. > We may have to consider a benchmark workload, measure data, and consider > the appropriate value. 100 is just an initial point. I think it's worthwhile to test the performance impact of this approach. If the table is altered but not dropped, removing the entries could introduce additional overhead. Although this overhead might be negligible if within an acceptable threshold, it still warrants consideration. Additionally, if the number of invalidations is significant, the cleanup process may take a considerable amount of time, potentially resulting in performance degradation. In such cases, it might be beneficial to consider destroying the hash table and creating a new one. Therefore, analyzing these scenarios is essential. Best Regards, Hou zj
Hi Zhijie and Hayato-san, Thanks for your clarification! On Fri, Aug 15, 2025 at 10:10 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, August 14, 2025 8:49 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Xuneng, > > > > > Is it safe to free the substructure from within rel_sync_cache_relation_cb()? > > > > You referred the comment in rel_sync_cache_relation_cb() right? I understood > > like that we must not access to any *system caches*, from the comment. Here > > we do not re-build caches so that we do not access to the syscaches - it is > > permitted. > > We should also avoid freeing the cache within the callback function, as > entries might still be accessed after the callback is invoked. This callback can > be triggered during the execution of pgoutput_change, which poses a risk of > concurrent access issues. For reference, see commit 7f481b8, which addresses a > similar issue. Ideally, cache release should occur within the normal code > execution path, such as within pgoutput_change() or get_rel_sync_entry(). I am still getting familiar with this module, so I cannot make a sound confirmation for this yet. I'll take a look at the commit mentioned by Zhijie and trace the callback paths. > > > > > I’ also interested in the reasoning behind setting > > > NINVALIDATION_THRESHOLD to 100. > > > > This is the debatable point of this implementation. I set to 100 because it was > > sufficient with the provided workload, but this may cause the replication lag. > > We may have to consider a benchmark workload, measure data, and consider > > the appropriate value. 100 is just an initial point. > > I think it's worthwhile to test the performance impact of this approach. If the > table is altered but not dropped, removing the entries could introduce > additional overhead. Although this overhead might be negligible if within an > acceptable threshold, it still warrants consideration. 
> Additionally, if the
> number of invalidations is significant, the cleanup process may take a
> considerable amount of time, potentially resulting in performance degradation.
> In such cases, it might be beneficial to consider destroying the hash table and
> creating a new one. Therefore, analyzing these scenarios is essential.

Intuitively, a heuristic threshold is straightforward, but it can be hard to get "right" since workloads vary, and designing tests that reflect real-world use cases well may not be easy either. I'm wondering whether it's worth brainstorming other potential alternatives before settling on the current approach and starting to benchmark and tune performance.

Sequential cleaning could be costly in certain cases. I don't know much about RelationSyncCache's life cycle, but it seems we should be cautious about shortening it, to avoid subtle issues.

Best,
Xuneng
On Friday, August 15, 2025 10:59 AM Xuneng Zhou <xunengzhou@gmail.com> wrote: > Thanks for your clarification! > > On Fri, Aug 15, 2025 at 10:10 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Thursday, August 14, 2025 8:49 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Dear Xuneng, > > > > > > > Is it safe to free the substructure from within > rel_sync_cache_relation_cb()? > > > > > > You referred the comment in rel_sync_cache_relation_cb() right? I > > > understood like that we must not access to any *system caches*, from > > > the comment. Here we do not re-build caches so that we do not access > > > to the syscaches - it is permitted. > > > > We should also avoid freeing the cache within the callback function, > > as entries might still be accessed after the callback is invoked. This > > callback can be triggered during the execution of pgoutput_change, > > which poses a risk of concurrent access issues. For reference, see > > commit 7f481b8, which addresses a similar issue. Ideally, cache > > release should occur within the normal code execution path, such as within > pgoutput_change() or get_rel_sync_entry(). > > I am still getting familiar with this module, so I cannot make a sound > confirmation for this yet. I'll take a look at the commit mentioned by Zhijie and > trace the callback paths. > > > > > > > > I’ also interested in the reasoning behind setting > > > > NINVALIDATION_THRESHOLD to 100. > > > > > > This is the debatable point of this implementation. I set to 100 > > > because it was sufficient with the provided workload, but this may cause > the replication lag. > > > We may have to consider a benchmark workload, measure data, and > > > consider the appropriate value. 100 is just an initial point. > > > > I think it's worthwhile to test the performance impact of this > > approach. If the table is altered but not dropped, removing the > > entries could introduce additional overhead. 
> > Although this overhead
> > might be negligible if within an acceptable threshold, it still
> > warrants consideration. Additionally, if the number of invalidations
> > is significant, the cleanup process may take a considerable amount of time,
> > potentially resulting in performance degradation.
> > In such cases, it might be beneficial to consider destroying the hash
> > table and creating a new one. Therefore, analyzing these scenarios is
> > essential.
>
> Intuitively, a heuristic threshold is straightforward but also can be hard to get it
> "right" since the workload varies.
> Designing and writing tests to reflect the real-world use cases well may not be
> that easy as well. I'm wondering whether it's beneficial to brainstorm other
> potential alternatives before settling on the current approach and start
> benchmarking and performance tuning.
>
> Sequential cleaning could be costful in certain cases. I don't know much about
> RelationSyncCache's life cycle, but it seems that we should be cautious on
> shortening it without causing subtle issues.

I don't have a great idea that can avoid introducing overhead in all cases.

An alternative approach I considered before is deleting the hash entry upon invalidation if the entry is not in use. This would require adding an in_use flag in RelationSyncEntry, setting it to true when get_rel_sync_entry is called, and marking it as false once the entry is no longer in use. This approach is inspired by relcache invalidation. I am not entirely sure it is worth the effort, but I'm sharing the idea with a POC patch for reference; it has been modified based on Kuroda-San's version.

Best Regards,
Hou zj
Attachments
On Thu, Aug 14, 2025 at 10:26 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, August 15, 2025 10:59 AM Xuneng Zhou <xunengzhou@gmail.com> wrote: > > Thanks for your clarification! > > > > On Fri, Aug 15, 2025 at 10:10 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > > wrote: > > > > > > On Thursday, August 14, 2025 8:49 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > > > Dear Xuneng, > > > > > > > > > Is it safe to free the substructure from within > > rel_sync_cache_relation_cb()? > > > > > > > > You referred the comment in rel_sync_cache_relation_cb() right? I > > > > understood like that we must not access to any *system caches*, from > > > > the comment. Here we do not re-build caches so that we do not access > > > > to the syscaches - it is permitted. > > > > > > We should also avoid freeing the cache within the callback function, > > > as entries might still be accessed after the callback is invoked. This > > > callback can be triggered during the execution of pgoutput_change, > > > which poses a risk of concurrent access issues. For reference, see > > > commit 7f481b8, which addresses a similar issue. Ideally, cache > > > release should occur within the normal code execution path, such as within > > pgoutput_change() or get_rel_sync_entry(). > > > > I am still getting familiar with this module, so I cannot make a sound > > confirmation for this yet. I'll take a look at the commit mentioned by Zhijie and > > trace the callback paths. > > > > > > > > > > > I’ also interested in the reasoning behind setting > > > > > NINVALIDATION_THRESHOLD to 100. > > > > > > > > This is the debatable point of this implementation. I set to 100 > > > > because it was sufficient with the provided workload, but this may cause > > the replication lag. > > > > We may have to consider a benchmark workload, measure data, and > > > > consider the appropriate value. 100 is just an initial point. 
> > > > > > I think it's worthwhile to test the performance impact of this > > > approach. If the table is altered but not dropped, removing the > > > entries could introduce additional overhead. Although this overhead > > > might be negligible if within an acceptable threshold, it still > > > warrants consideration. Additionally, if the number of invalidations > > > is significant, the cleanup process may take a considerable amount of time, > > potentially resulting in performance degradation. > > > In such cases, it might be beneficial to consider destroying the hash > > > table and creating a new one. Therefore, analyzing these scenarios is > > essential. > > > > Intuitively, a heuristic threshold is straightforward but also can be hard to get it > > "right" since the workload varies. > > Designing and writing tests to reflect the real-world use cases well may not be > > that easy as well. I'm wondering whether it's beneficial to brainstorm other > > potential alternatives before settling on the current approach and start > > benchmarking and performance tuning. > > > > Sequential cleaning could be costful in certain cases. I don't know much about > > RelationSyncCache's life cycle, but it seems that we should be cautious on > > shortening it without causing subtle issues. > > I don't have a great idea that can avoid introducing overhead in all cases. > > An alternative approach I considered before is deleting the hash entry upon > invalidation if the entry is not in use. This would require adding an in_use > flag in RelationSyncEntry, setting it to true when get_rel_sync_entry is called, > and marking it as false once the entry is no longer in use. This approach is > inspired from the relcache invalidation. I am slightly not sure if it's worth > the effort, but just share this idea with a POC patch for reference, which has > been modified based on Kuroda-San's version. 
IIUC the patch adds logic to close entries, setting in_use to false after processing each change. For example, in pgoutput_change() we close the entry even after we have sent its change:

@@ -1621,6 +1635,8 @@ cleanup:
 		ExecDropSingleTupleTableSlot(new_slot);
 	}

+	close_rel_sync_entry(relentry);
+
 	MemoryContextSwitchTo(old);
 	MemoryContextReset(data->context);

Given that cache invalidation is executed upon replaying REORDER_BUFFER_CHANGE_INVALIDATION and at the end of a transaction replay, in which case do we keep the relcache entry (i.e. just setting replicate_valid=false) because of in_use=true?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,

> Given that cache invalidation is executed upon replaying
> REORDER_BUFFER_CHANGE_INVALIDATION and the end of a transaction
> replay, in which case do we keep the relcache (i.e. just setting
> replicate_valid=false) because of in_use=true?

Per an old discussion [1], logicalrep_write_tuple -> SearchSysCache1 can trigger cache invalidation (I have not verified this, though). I felt that in this case in_use can be true and the cleanup can be skipped. Thoughts?

[1]: https://www.postgresql.org/message-id/OS0PR01MB57168F454E33D53551B8D20294259%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED
On Fri, Aug 15, 2025 at 5:07 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Sawada-san, > > > Given that cache invalidation is executed upon replaying > > REORDER_BUFFER_CHANGE_INVALIDATION and the end of a transaction > > replay, in which case do we keep the relcache (i.e. just setting > > replicate_valid=false) because of in_use=true? > > Per old discussion [1], logicalrep_write_tuple ->SearchSysCache1 can cause the > cache invalidation (I did not verify though). > I felt in this case in_use can be true and validation can be skipped. Thought? I've not verified, but even if that's true, IIUC only one relation's cache entry can set in_use to true at a time. If my understanding is correct, when the walsender accepts invalidation messages in logicalrep_write_tuple() as you mentioned, it doesn't release the cache of the relation whose change is being sent but does for other relations. It seems to me too aggressive to release caches compared to the current behavior. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Aug 13, 2025 at 11:43 PM 赵宇鹏(宇彭) <zhaoyupeng.zyp@alibaba-inc.com> wrote:
>
> Hi all,
>
> We recently ran into a memory leak in a production logical-replication WAL-sender
> process. A simplified reproduction script is attached.
>
> If you run the script and then call MemoryContextStats(TopMemoryContext). you
> will see something like:
> "logical replication cache context: 562044928 total in 77 blocks;"
> meaning “cachectx” has grown to ~500 MB, and it keeps growing as the number of
> tables increases.
>
> The workload can be summarised as follows:
> 1. CREATE PUBLICATION FOR ALL TABLES
> 2. CREATE SUBSCRIPTION
> 3. Repeatedly CREATE TABLE and DROP TABLE

I'd like to know more about your use cases so that we can better understand this problem. In the workload you summarized, many tables are created and dropped. Do the changes to these tables need to be replicated to the subscriber? It seems to me that if the tables exist for only short periods of time, we could use temp tables instead, unless their changes need to be replicated to the subscribers.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Dear Sawada-san,

> I've not verified, but even if that's true, IIUC only one relation's
> cache entry can set in_use to true at a time.

I also think so.

> If my understanding is
> correct, when the walsender accepts invalidation messages in
> logicalrep_write_tuple() as you mentioned, it doesn't release the
> cache of the relation whose change is being sent but does for other
> relations. It seems to me too aggressive to release caches compared to
> the current behavior.

Valid point. In some cases publications can be altered and then the same relsync entries would be added again.

So the first approach may be better from that perspective. The attached patch counts the number of invalidated entries. They are cleaned up only at COMMIT/ABORT/PREPARE time, hence the in_use flag is not needed. Thoughts?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments
On Sun, Aug 17, 2025 at 11:30 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > I've not verified, but even if that's true, IIUC only one relation's
> > cache entry can set in_use to true at a time.
>
> I also think so.
>
> > If my understanding is
> > correct, when the walsender accepts invalidation messages in
> > logicalrep_write_tuple() as you mentioned, it doesn't release the
> > cache of the relation whose change is being sent but does for other
> > relations. It seems to me too aggressive to release caches compared to
> > the current behavior.
>
> Valid point. In some cases, publications can be altered then same relsync entries
> would be added again.
>
> So... first approach may be better from the perspective. Attached patch counts
> the number of invalidated entries. There is a chance to cleanup them only at
> COMMIT/ABORT/PREPARE time - hence in_use flag is not needed. Thought?
>

In the proposed patch, we have the following change:

@@ -2060,6 +2081,11 @@ get_rel_sync_entry(PGOutputData *data, Relation relation)
 			entry->columns = NULL;
 			entry->attrmap = NULL;
 		}
+		else
+		{
+			/* The entry would be re-used, decrement the counter */
+			RelationSyncCacheInvalidateCounter--;
+		}

It decrements the counter whenever we successfully find the entry from the cache, but I'm not sure this is the right approach. What if no cache invalidation happens at all but we retrieve entries from the cache many times?

I have concerns about the performance implications of iterating through all entries in the caches within maybe_cleanup_rel_sync_cache(). If the cache contains numerous entries, this iteration could potentially cause the walsender to stall. If we use a larger NINVALIDATION_THRESHOLD, we can reduce the number of times we need sequential scans on the hash table, but it would in turn need to free more entries (probably we can have a cap on the number of entries we can free in one cycle?).
An alternative approach would be to implement a dedicated list (such as dclist) specifically for tracking invalidated entries. Entries would be removed from this list when they are reused. We could then implement a threshold-based cleanup mechanism where invalidated entries are freed once the list exceeds a predetermined size. While this approach would minimize the overhead of freeing invalidated entries, it would incur some additional cost for maintaining the list. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Hi, On Wed, Aug 20, 2025 at 8:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sun, Aug 17, 2025 at 11:30 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Sawada-san, > > > > > I've not verified, but even if that's true, IIUC only one relation's > > > cache entry can set in_use to true at a time. > > > > I also think so. > > > > > If my understanding is > > > correct, when the walsender accepts invalidation messages in > > > logicalrep_write_tuple() as you mentioned, it doesn't release the > > > cache of the relation whose change is being sent but does for other > > > relations. It seems to me too aggressive to release caches compared to > > > the current behavior. > > > > Valid point. In some cases, publications can be altered then same relsync entries > > would be added again. > > > > So... first approach may be better from the perspective. Attached patch counts > > the number of invalidated entries. There is a chance to cleanup them only at > > COMMIT/ABORT/PREPARE time - hence in_use flag is not needed. Thought? > > > > In the proposed patch, we have the following change: > > @@ -2060,6 +2081,11 @@ get_rel_sync_entry(PGOutputData *data, Relation relation) > entry->columns = NULL; > entry->attrmap = NULL; > } > + else > + { > + /* The entry would be re-used, decrement the counter */ > + RelationSyncCacheInvalidateCounter--; > + } > > It decrements the counter whenever we successfully find the entry from > the cache but I'm not sure this is the right approach. What if no > cache invalidation happens at all but we retrieve entries from the > cache many times? > This may not be ideal. It decrements on every lookup of an existing entry, not just when consuming an invalidation, which could make the counter go negative. Do we need decrementing logic? Not perfect 1:1 tracking seems ok in here; though it might make the clean-up a bit more aggressive. Best, Xuneng
Dear Sawada-san,

> It decrements the counter whenever we successfully find the entry from
> the cache but I'm not sure this is the right approach. What if no
> cache invalidation happens at all but we retrieve entries from the
> cache many times?

Oh, right. I tried to handle the case where invalidated entries are re-validated again, but that approach was not correct.

> I have concerns about the performance implications of iterating
> through all entries in the caches within
> maybe_cleanup_rel_sync_cache(). If the cache contains numerous
> entries, this iteration could potentially cause the walsender to
> stall. If we use a larger number NINVALIDATION_THRESHOLD, we can
> reduce the number of times we need sequential scans on the hash table
> but it would in turn need to free more entries (probably we can have a
> cap of the number of entries we can free in one cycle?).

Exactly.

> An alternative approach would be to implement a dedicated list (such
> as dclist) specifically for tracking invalidated entries. Entries
> would be removed from this list when they are reused. We could then
> implement a threshold-based cleanup mechanism where invalidated
> entries are freed once the list exceeds a predetermined size. While
> this approach would minimize the overhead of freeing invalidated
> entries, it would incur some additional cost for maintaining the list.

I also considered this at first, but did not choose it because of the code complexity. After thinking about it more, it is not so difficult; PSA a new file. It introduces a global dclist, invalidated_caches, and a corresponding node in RelationSyncEntry. An entry is pushed onto the list when it is invalidated, and the list length is checked at the end of the transaction; the listed entries are released if the length exceeds the threshold. A flag "already_pushed" is also needed to avoid queuing the same entry twice, because the same entry can be invalidated twice before it is cleaned up. It took me quite a while to recognize that :-(.
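For readers who have not opened the patch, the scheme can be sketched as a small self-contained C program. This is only a rough model with invented names (SketchEntry, the threshold value of 4, a plain singly-linked list); the real patch uses PostgreSQL's dclist and RelationSyncEntry:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define INVAL_THRESHOLD 4		/* assumed; the patch uses a larger value */

typedef struct SketchEntry
{
	bool	replicate_valid;
	bool	already_pushed;		/* guards against pushing the entry twice */
	struct SketchEntry *next;	/* a singly linked list is enough here */
	bool	freed;				/* observable outcome for the demo */
} SketchEntry;

static SketchEntry *inval_list;
static int			inval_len;

/* Invalidation callback analogue: flag the entry and queue it once. */
static void
sketch_invalidate(SketchEntry *e)
{
	e->replicate_valid = false;
	if (!e->already_pushed)
	{
		e->already_pushed = true;
		e->next = inval_list;
		inval_list = e;
		inval_len++;
	}
}

/* Entry got rebuilt before the sweep; it stays queued but the sweep
 * will skip it because it is valid again. */
static void
sketch_revalidate(SketchEntry *e)
{
	e->replicate_valid = true;
}

/* End-of-transaction analogue: sweep the list if it grew too long. */
static void
sketch_end_of_txn(void)
{
	if (inval_len <= INVAL_THRESHOLD)
		return;
	for (SketchEntry *e = inval_list; e != NULL;)
	{
		SketchEntry *next = e->next;

		e->already_pushed = false;
		e->next = NULL;
		if (!e->replicate_valid)
			e->freed = true;	/* a real patch would free the substructure */
		e = next;
	}
	inval_list = NULL;
	inval_len = 0;
}
```

The already_pushed flag is what prevents a doubly-invalidated entry from corrupting the list, and the replicate_valid re-check at sweep time is what lets revived entries survive.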
I'm planning to do a benchmark with this patch. Best regards, Hayato Kuroda FUJITSU LIMITED
Attachments
Dear Xuneng,

> This may not be ideal. It decrements on every lookup of an existing
> entry, not just when consuming an invalidation, which could make the
> counter go
> negative. Do we need decrementing logic? Not perfect 1:1 tracking
> seems ok in here; though it might make the clean-up a bit more
> aggressive.

Yeah, it was not needed. I posted a new approach [1], and even this one has the possibility that some of the listed entries are revived. In that case they are not removed immediately; they are simply skipped during cleanup.

[1]: https://www.postgresql.org/message-id/OSCPR01MB14966E0225BC4AA1BC956E1E3F532A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED
> Firstly I also considered but did not choose because of the code complexity.
> After considering more, it is not so difficult, PSA new file.

v3 contained 100_cachectm_oom.pl, which would not succeed. Here is a patch with that test file removed.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments
Hi,

From what we see in our users' production environments, the situation is exactly as previously described. Creating a "publication for all tables" is very common, because manually choosing individual tables to publish can be cumbersome. Regular CREATE/DROP TABLE activity is also normal, and the tables are not necessarily short-lived. Since the walsender is intended to be a long-lived process, its memory footprint keeps accumulating over time.

Even if we ignore DROP TABLE entirely and only consider a large number of tables that must be published, RelationSyncEntry alone can consume substantial memory. Many users run multiple walsenders on the same instance, which further increases memory pressure.

In normal backend processes, many cache structures are never evicted. That already causes issues, but it is at least somewhat tolerable because a backend is considered short-lived and a periodic reconnect can release the memory. A walsender, however, is expected to stay alive much longer; nobody wants replication sessions to be dropped regularly. So I am genuinely curious why structures like RelationSyncEntry were not given an LRU-style eviction mechanism from the start.

Adding an LRU mechanism to RelationSyncEntry has another benefit: it puts an upper bound on the workload of callbacks such as invalidation_cb, preventing the walsender from stalling when there are a large number of tables.

I have therefore implemented a prototype of this idea (borrowing some code from Hayato Kuroda). It should keep memory usage under control in more scenarios while introducing only minimal overhead in theory. I will run additional performance tests to confirm this. What do you think of this approach?

Best regards,
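To make the idea concrete (this is not the attached prototype, just a toy C model with an assumed capacity of 3 and a linear scan in place of a real LRU list): a capped cache evicts the least-recently-used entry when a new one is inserted, bounding both memory and the number of entries any callback must visit.

```c
#include <assert.h>

#define LRU_CAPACITY 3			/* assumed tiny cap for the demo */

typedef struct LruSlot
{
	int		relid;				/* -1 = empty slot */
	long	last_used;			/* logical timestamp of last access */
} LruSlot;

static LruSlot	lru[LRU_CAPACITY];
static long		lru_clock;
static int		lru_evictions;	/* observable outcome for the demo */

static void
lru_init(void)
{
	for (int i = 0; i < LRU_CAPACITY; i++)
		lru[i].relid = -1;
	lru_clock = 0;
	lru_evictions = 0;
}

/* Lookup-or-insert: on a hit just refresh the timestamp; on a miss
 * take an empty slot, or evict the least-recently-used entry. */
static void
lru_access(int relid)
{
	int		victim = 0;

	for (int i = 0; i < LRU_CAPACITY; i++)
	{
		if (lru[i].relid == relid)
		{
			lru[i].last_used = ++lru_clock;
			return;
		}
		if (lru[i].relid == -1)			/* prefer an empty slot */
			victim = i;
		else if (lru[victim].relid != -1 &&
				 lru[i].last_used < lru[victim].last_used)
			victim = i;
	}
	if (lru[victim].relid != -1)
		lru_evictions++;				/* a real patch would free here */
	lru[victim].relid = relid;
	lru[victim].last_used = ++lru_clock;
}
```

With capacity 3, touching relations 1, 2, 3, then 1 again, then 4 evicts relation 2 (the least recently used), while the hot relation 1 stays cached.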
Attachments
On Thu, Aug 21, 2025 at 2:03 PM 赵宇鹏(宇彭) <zhaoyupeng.zyp@alibaba-inc.com> wrote:
>
> From what we see in our users’ production environments, the situation is exactly
> as previously described. Creating a “publication for all tables” is very common,
> because manually choosing individual tables to publish can be cumbersome.
>

I understand the difficulty in choosing individual tables, but OTOH "all tables" may lead to replicating tables that are not even required. This consumes CPU, network, and disk space without any real need, which sounds more harmful than carefully defining the publications once. We have another way to combine tables, which is to use TABLES IN SCHEMA. I don't know if that can help, but anyway, we can't do much if the user wants to replicate all tables.

> Regular CREATE/DROP TABLE activity is also normal, and the tables are not
> necessarily short-lived. Since walsender is intended to be a long-lived process,
> its memory footprint keeps accumulating over time.
>
> Even if we ignore DROP TABLE entirely and only consider a large number of tables
> that must be published, RelationSyncEntry alone can consume substantial memory.
> Many users run multiple walsenders on the same instance, which further increases
> memory pressure.
>

Agreed, there is a possibility of consuming a large amount of memory, but we should still do some experiments to see the real results. We haven't seen complaints about walsenders consuming large amounts of memory due to RelationSyncEntry, so it is also possible that this is just a theoretical case, especially after we have some handling for dropped tables.

> In normal backend processes, many cache structures are never evicted. That
> already causes issues, but it is at least somewhat tolerable because a backend
> is considered short-lived and a periodic reconnect can release the memory.
> A walsender, however, is expected to stay alive much longer, nobody wants > replication sessions to be dropped regularly, so I am genuinely curious why > structures like RelationSyncEntry were not given an LRU-style eviction mechanism > from the start. > > Adding an LRU mechanism to RelationSyncEntry has another benefit: it puts an > upper bound on the workload of callbacks such as invalidation_cb, preventing > walsender from stalling when there are a large number of tables. I have > therefore implemented a prototype of this idea (borrowing some code from > Hayato Kuroda). It should keep memory usage under control in more scenarios > while introducing only minimal overhead in theory. Yeah, such an idea is worth considering after we have some tests to show the high memory usage. BTW, I think it would be better to run some algorithm to evict entries when we reach the threshold limit as we do for shared buffers, see StrategyGetBuffer. Otherwise, we may be paying the cost to maintain such a list when in practice it may be required only very few times. -- With Regards, Amit Kapila.
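The StrategyGetBuffer-style idea mentioned above can be illustrated with a minimal usage-count clock sweep over a fixed set of slots. This is only a sketch under invented names (`CacheSlot`, `NSLOTS`, `MAX_USAGE`); the real buffer-manager code additionally deals with pinning, concurrency, and the free list, none of which is shown here.

```c
/* Illustrative clock-sweep victim selection, loosely modeled on
 * StrategyGetBuffer(): each access bumps a saturating usage count, and
 * the sweep decays counts until it finds a slot with usage == 0. */
#include <assert.h>

#define NSLOTS 4
#define MAX_USAGE 5

typedef struct CacheSlot
{
	int			relid;			/* 0 = empty; unused in this sketch */
	unsigned	usage;			/* bumped on access, decayed by the sweep */
} CacheSlot;

static CacheSlot slots[NSLOTS];
static int	clock_hand;

/* Record an access: saturate at a cap so hot entries cannot become
 * permanently unevictable. */
static void
slot_touch(CacheSlot *s)
{
	if (s->usage < MAX_USAGE)
		s->usage++;
}

/* Advance the hand until a slot with usage == 0 is found, decrementing
 * usage counts as we pass; that slot becomes the eviction victim. */
static CacheSlot *
clock_sweep_victim(void)
{
	for (;;)
	{
		CacheSlot  *s = &slots[clock_hand];

		clock_hand = (clock_hand + 1) % NSLOTS;
		if (s->usage == 0)
			return s;
		s->usage--;
	}
}
```

Unlike a strict LRU list, this needs no per-access pointer manipulation, only a counter bump, which is why it is attractive when evictions are expected to be rare relative to accesses.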
On Thu, Aug 21, 2025 at 10:53 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> > I have concerns about the performance implications of iterating
> > through all entries in the caches within
> > maybe_cleanup_rel_sync_cache(). If the cache contains numerous
> > entries, this iteration could potentially cause the walsender to
> > stall. If we use a larger number NINVALIDATION_THRESHOLD, we can
> > reduce the number of times we need sequential scans on the hash table
> > but it would in turn need to free more entries (probably we can have a
> > cap of the number of entries we can free in one cycle?).
>
> Exactly.

So, at least we can try some tests before completely giving up on this idea.

> > An alternative approach would be to implement a dedicated list (such
> > as dclist) specifically for tracking invalidated entries. Entries
> > would be removed from this list when they are reused. We could then
> > implement a threshold-based cleanup mechanism where invalidated
> > entries are freed once the list exceeds a predetermined size. While
> > this approach would minimize the overhead of freeing invalidated
> > entries, it would incur some additional cost for maintaining the list.
>
> Firstly, I also considered this but did not choose it because of the code
> complexity. After considering it more, it is not so difficult; PSA the new file.

The other idea I was thinking of is whether we can somehow decode the DROP TABLE WAL record, say the deletion of the relid from pg_class; then we could use that to remove the corresponding entry from RelationSyncCache.

--
With Regards,
Amit Kapila.
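The quoted alternative, a dedicated list of invalidated entries with threshold-based cleanup, could look roughly like the standalone sketch below. The names (`Entry`, `INVALIDATED_THRESHOLD`, the helpers) and the singly linked list are invented for illustration; the attached patch would instead use PostgreSQL's dclist API (`dclist_push_tail`, `dclist_delete_from`) on the real RelationSyncEntry, and would free only the per-entry slots and attrmap while keeping the hash entry itself.

```c
/* Illustrative sketch: invalidated entries are threaded onto a list so
 * that bulk cleanup never has to scan the whole hash table. */
#include <assert.h>
#include <stdlib.h>

#define INVALIDATED_THRESHOLD 3

typedef struct Entry
{
	int			relid;
	int			replicate_valid;
	struct Entry *inval_next;	/* link in the invalidated list */
} Entry;

static Entry *inval_list;
static int	inval_count;

/* Invalidation callback: mark the entry invalid and remember it.
 * Idempotent, so repeated invalidations do not grow the list. */
static void
entry_invalidate(Entry *e)
{
	if (!e->replicate_valid)
		return;					/* already on the list */
	e->replicate_valid = 0;
	e->inval_next = inval_list;
	inval_list = e;
	inval_count++;
}

/* Once enough invalidated entries have piled up, free their
 * allocations eagerly instead of waiting for a reuse that may never
 * happen.  Returns the number of entries freed. */
static int
maybe_cleanup_invalidated(void)
{
	int			freed = 0;

	if (inval_count < INVALIDATED_THRESHOLD)
		return 0;
	while (inval_list)
	{
		Entry	   *e = inval_list;

		inval_list = e->inval_next;
		free(e);				/* real code: free slots/attrmap only */
		freed++;
	}
	inval_count = 0;
	return freed;
}
```

The cost of the cleanup pass is proportional to the number of invalidated entries rather than the total cache size, which is the point of keeping the dedicated list.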
On Thu, Aug 21, 2025 at 2:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 21, 2025 at 2:03 PM 赵宇鹏(宇彭) <zhaoyupeng.zyp@alibaba-inc.com> wrote:
> >
> > From what we see in our users' production environments, the situation is exactly
> > as previously described. Creating a "publication for all tables" is very common,
> > because manually choosing individual tables to publish can be cumbersome.
> >
>
> I understand the difficulty in choosing individual tables, but OTOH
> all tables may lead to replicating tables that are not even required.
> This consumes CPU, network, and disk space without the real need. This
> sounds more harmful than once carefully describing publications. We
> have another way to combine tables, which is to use TABLES IN SCHEMA.
> I don't know if that can help, but anyway, we can't do much if the
> user wants to replicate all tables.
>
> > Regular CREATE/DROP TABLE activity is also normal, and the tables are not
> > necessarily short-lived. Since walsender is intended to be a long-lived process,
> > its memory footprint keeps accumulating over time.
> >
> > Even if we ignore DROP TABLE entirely and only consider a large number of tables
> > that must be published, RelationSyncEntry alone can consume substantial memory.
> > Many users run multiple walsenders on the same instance, which further increases
> > memory pressure.
> >
>
> Agreed, there is a possibility of consuming a large amount of memory,
> but still, we should do some experiments to see the real results. We
> haven't seen complaints of walsenders consuming large amounts of
> memory due to RelationSyncEntry's, so it is also possible that this is
> just a theoretical case especially after we have some handling for
> dropped tables.
>
> > In normal backend processes, many cache structures are never evicted. That
> > already causes issues, but it is at least somewhat tolerable because a backend
> > is considered short-lived and a periodic reconnect can release the memory.
> > A walsender, however, is expected to stay alive much longer, nobody wants
> > replication sessions to be dropped regularly, so I am genuinely curious why
> > structures like RelationSyncEntry were not given an LRU-style eviction mechanism
> > from the start.

FYI, as far as I know, we've discussed that a similar problem exists in normal backend processes too, especially for users who are using a connection pool[1] (that discussion was about syscache bloat due to negative cache entries).

Your proposed patch maintains cache entries with a hard limit of 1024 entries using LRU eviction. This could become a performance bottleneck when replicating more than 1024 tables, and even for databases with fewer than 1024 tables it could have a negative performance impact due to the cost of maintaining LRU order. Also, while having a maximum capacity with an eviction mechanism makes sense, we should make this limit configurable (probably by size?). If we limit by size, it might be worth noting that cache entries consume variable amounts of memory, and their underlying memory blocks might persist even after entry eviction.

> > Adding an LRU mechanism to RelationSyncEntry has another benefit: it puts an
> > upper bound on the workload of callbacks such as invalidation_cb, preventing
> > walsender from stalling when there are a large number of tables. I have
> > therefore implemented a prototype of this idea (borrowing some code from
> > Hayato Kuroda). It should keep memory usage under control in more scenarios
> > while introducing only minimal overhead in theory.
>
> Yeah, such an idea is worth considering after we have some tests to
> show the high memory usage. BTW, I think it would be better to run
> some algorithm to evict entries when we reach the threshold limit as
> we do for shared buffers, see StrategyGetBuffer. Otherwise, we may be
> paying the cost to maintain such a list when in practice it may be
> required only very few times.

Yeah, something like clock-sweep could work too. I think we need to carefully select the cache management algorithm. While it's a well-known fact that maintaining LRU ordering is expensive, I'm really not sure clock-sweep is a suitable algorithm for logical replication's cache use cases. Also, we might want to avoid iterating over all cache entries to find ones to evict.

Regards,

[1] https://www.postgresql.org/message-id/20161219.201505.11562604.horiguchi.kyotaro%40lab.ntt.co.jp

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com