Обсуждение: RE: Invalid pointer access in logical decoding after error
On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote: > > Hi, > > I encountered an invalid pointer access issue. Below are the steps to > reproduce the issue: ... > The error occurs because entry->columns is allocated in the entry > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This > context is a child of the PortalContext, which is cleared after an > error via: AbortTransaction > -> AtAbort_Portals -> > MemoryContextDeleteChildren -> MemoryContextDelete -> > MemoryContextDeleteOnly > As a result, the memory backing entry->columns is freed, but the > RelationSyncCache which resides in CacheMemoryContext and thus > survives the error still holds a dangling pointer to this freed > memory, causing it to pfree an invalid pointer. > In the normal (positive) execution flow, pgoutput_shutdown() is called > to clean up the RelationSyncCache. This happens via: > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But > this is not called in case of an error case. To handle this case > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately. > Attached patch has the changes for the same. > Thoughts? Thank you for reporting the issue and providing a fix. I recall that we identified this general issue with the hash table in pgoutput in other threads as well [1]. The basic consensus [2] is that calling FreeDecodingContext() within PG_CATCH is not ideal, as this function includes user code, increasing the risk of encountering another error within PG_CATCH. This scenario could prevent execution of subsequent code to invalidate syscache entries, which is problematic. I think a better fix could be to introduce a memory context reset callback(on data->cachectx) and perform the actions of pgoutput_shutdown() within it. [1] https://www.postgresql.org/message-id/OS0PR01MB57167C62D7DA4A8EBBC92B0A941BA@OS0PR01MB5716.jpnprd01.prod.outlook.com [2] https://www.postgresql.org/message-id/20230820210545.plmlk3rnxxgkokkj%40awork3.anarazel.de Best Regards, Hou zj
On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote: > > > > > > > > Hi, > > > > > > I encountered an invalid pointer access issue. Below are the steps to > > > reproduce the issue: > > ... > > > The error occurs because entry->columns is allocated in the entry > > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This > > > context is a child of the PortalContext, which is cleared after an > > > error via: AbortTransaction > > > -> AtAbort_Portals -> > > > MemoryContextDeleteChildren -> MemoryContextDelete -> > > > MemoryContextDeleteOnly > > > As a result, the memory backing entry->columns is freed, but the > > > RelationSyncCache which resides in CacheMemoryContext and thus > > > survives the error still holds a dangling pointer to this freed > > > memory, causing it to pfree an invalid pointer. > > > In the normal (positive) execution flow, pgoutput_shutdown() is called > > > to clean up the RelationSyncCache. This happens via: > > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But > > > this is not called in case of an error case. To handle this case > > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to > > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately. > > > Attached patch has the changes for the same. > > > Thoughts? > > > > Thank you for reporting the issue and providing a fix. > > > > I recall that we identified this general issue with the hash table in pgoutput > > in other threads as well [1]. The basic consensus [2] is that calling > > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes > > user code, increasing the risk of encountering another error within PG_CATCH. > > This scenario could prevent execution of subsequent code to invalidate syscache > > entries, which is problematic. > > Yes, let's avoid this. > > > I think a better fix could be to introduce a memory context reset callback(on > > data->cachectx) and perform the actions of pgoutput_shutdown() within it. > > The attached v2 version patch has the changes for the same. We've addressed several memory-related issues in pgoutput. While most of these issues didn't affect logical replication, they did impact logical decoding called via SQL API. I find that these problems stem from RelationSyncCache being defined as a file-scope static variable and being allocated in CacheMemoryContext. I'm wondering if we could move it to PGOutputData and create it under the logical decoding context. This would ensure it's automatically cleaned up along with the logical decoding context. I also noticed another concerning issue: the entry->streamed_txns list is maintained in CacheMemoryContext (see set_schema_sent_in_streamed_txn()). This could lead to memory leaks when logical decoding called via the SQL API encounters an error. This issue isn't addressed in the current patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, 26 Sept 2025 at 05:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote: > > > > > > > > > > > Hi, > > > > > > > > I encountered an invalid pointer access issue. Below are the steps to > > > > reproduce the issue: > > > ... > > > > The error occurs because entry->columns is allocated in the entry > > > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This > > > > context is a child of the PortalContext, which is cleared after an > > > > error via: AbortTransaction > > > > -> AtAbort_Portals -> > > > > MemoryContextDeleteChildren -> MemoryContextDelete -> > > > > MemoryContextDeleteOnly > > > > As a result, the memory backing entry->columns is freed, but the > > > > RelationSyncCache which resides in CacheMemoryContext and thus > > > > survives the error still holds a dangling pointer to this freed > > > > memory, causing it to pfree an invalid pointer. > > > > In the normal (positive) execution flow, pgoutput_shutdown() is called > > > > to clean up the RelationSyncCache. This happens via: > > > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But > > > > this is not called in case of an error case. To handle this case > > > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to > > > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately. > > > > Attached patch has the changes for the same. > > > > Thoughts? > > > > > > Thank you for reporting the issue and providing a fix. > > > > > > I recall that we identified this general issue with the hash table in pgoutput > > > in other threads as well [1]. The basic consensus [2] is that calling > > > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes > > > user code, increasing the risk of encountering another error within PG_CATCH. > > > This scenario could prevent execution of subsequent code to invalidate syscache > > > entries, which is problematic. > > > > Yes, let's avoid this. > > > > > I think a better fix could be to introduce a memory context reset callback(on > > > data->cachectx) and perform the actions of pgoutput_shutdown() within it. > > > > The attached v2 version patch has the changes for the same. > > We've addressed several memory-related issues in pgoutput. While most > of these issues didn't affect logical replication, they did impact > logical decoding called via SQL API. I find that these problems stem > from RelationSyncCache being defined as a file-scope static variable > and being allocated in CacheMemoryContext. I'm wondering if we could > move it to PGOutputData and create it under the logical decoding > context. This would ensure it's automatically cleaned up along with > the logical decoding context. I see one issue with keeping RelationSyncCache inside PGOutputData. Both rel_sync_cache_publication_cb and rel_sync_cache_relation_cb rely on this cache: they check its validity and update it in response to invalidation events. The problem is that these callbacks cannot be unregistered once they are registered. After pgoutput_shutdown() runs, the hash table is destroyed and later FreeDecodingContext deletes the logical decoding context. At that point, the registered callbacks may still fire, but they’ll be holding onto a freed memory. This is also noted directly in the callbacks: /* * We can get here if the plugin was used in SQL interface as the * RelationSyncCache is destroyed when the decoding finishes, but there is * no way to unregister the relcache invalidation callback. */ if (RelationSyncCache == NULL) return; So the core issue is that the callbacks may outlive the cache they reference, and we don’t have a mechanism to unregister them cleanly. Thoughts? Regards, Vignesh