Обсуждение: RE: Invalid pointer access in logical decoding after error

Поиск
Список
Период
Сортировка

RE: Invalid pointer access in logical decoding after error

От
"Zhijie Hou (Fujitsu)"
Дата:
On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:

> 
> Hi,
> 
> I encountered an invalid pointer access issue. Below are the steps to 
> reproduce the issue:
...
> The error occurs because entry->columns is allocated in the entry 
> private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This 
> context is a child of the PortalContext, which is cleared after an 
> error via: AbortTransaction
> -> AtAbort_Portals  ->
> MemoryContextDeleteChildren  -> MemoryContextDelete   ->
> MemoryContextDeleteOnly
> As a result, the memory backing entry->columns is freed, but the 
> RelationSyncCache which resides in CacheMemoryContext and thus 
> survives the error still holds a dangling pointer to this freed 
> memory, causing it to pfree an invalid pointer.
> In the normal (positive) execution flow, pgoutput_shutdown() is called 
> to clean up the RelationSyncCache. This happens via:
> FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But 
> this is not called in case of an error case. To handle this case 
> safely, I suggest calling FreeDecodingContext in the PG_CATCH block to 
> ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
> Attached patch has the changes for the same.
> Thoughts?

Thank you for reporting the issue and providing a fix.

I recall that we identified this general issue with the hash table in pgoutput
in other threads as well [1]. The basic consensus [2] is that calling
FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
user code, increasing the risk of encountering another error within PG_CATCH.
This scenario could prevent execution of subsequent code to invalidate syscache
entries, which is problematic.

I think a better fix could be to introduce a memory context reset callback(on
data->cachectx) and perform the actions of pgoutput_shutdown() within it.

[1]
https://www.postgresql.org/message-id/OS0PR01MB57167C62D7DA4A8EBBC92B0A941BA@OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] https://www.postgresql.org/message-id/20230820210545.plmlk3rnxxgkokkj%40awork3.anarazel.de

Best Regards,
Hou zj

Re: Invalid pointer access in logical decoding after error

От
Masahiko Sawada
Дата:
On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:
> >
> > >
> > > Hi,
> > >
> > > I encountered an invalid pointer access issue. Below are the steps to
> > > reproduce the issue:
> > ...
> > > The error occurs because entry->columns is allocated in the entry
> > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
> > > context is a child of the PortalContext, which is cleared after an
> > > error via: AbortTransaction
> > > -> AtAbort_Portals  ->
> > > MemoryContextDeleteChildren  -> MemoryContextDelete   ->
> > > MemoryContextDeleteOnly
> > > As a result, the memory backing entry->columns is freed, but the
> > > RelationSyncCache which resides in CacheMemoryContext and thus
> > > survives the error still holds a dangling pointer to this freed
> > > memory, causing it to pfree an invalid pointer.
> > > In the normal (positive) execution flow, pgoutput_shutdown() is called
> > > to clean up the RelationSyncCache. This happens via:
> > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
> > > this is not called in case of an error case. To handle this case
> > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
> > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
> > > Attached patch has the changes for the same.
> > > Thoughts?
> >
> > Thank you for reporting the issue and providing a fix.
> >
> > I recall that we identified this general issue with the hash table in pgoutput
> > in other threads as well [1]. The basic consensus [2] is that calling
> > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
> > user code, increasing the risk of encountering another error within PG_CATCH.
> > This scenario could prevent execution of subsequent code to invalidate syscache
> > entries, which is problematic.
>
> Yes, let's avoid this.
>
> > I think a better fix could be to introduce a memory context reset callback(on
> > data->cachectx) and perform the actions of pgoutput_shutdown() within it.
>
> The attached v2 version patch has the changes for the same.

We've addressed several memory-related issues in pgoutput. While most
of these issues didn't affect logical replication, they did impact
logical decoding called via SQL API. I find that these problems stem
from RelationSyncCache being defined as a file-scope static variable
and being allocated in CacheMemoryContext. I'm wondering if we could
move it to PGOutputData and create it under the logical decoding
context. This would ensure it's automatically cleaned up along with
the logical decoding context.

I also noticed another concerning issue: the entry->streamed_txns list
is maintained in CacheMemoryContext (see
set_schema_sent_in_streamed_txn()). This could lead to memory leaks
when logical decoding called via the SQL API encounters an error. This
issue isn't addressed in the current patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Invalid pointer access in logical decoding after error

От
vignesh C
Дата:
On Fri, 26 Sept 2025 at 05:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:
> > >
> > > >
> > > > Hi,
> > > >
> > > > I encountered an invalid pointer access issue. Below are the steps to
> > > > reproduce the issue:
> > > ...
> > > > The error occurs because entry->columns is allocated in the entry
> > > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
> > > > context is a child of the PortalContext, which is cleared after an
> > > > error via: AbortTransaction
> > > > -> AtAbort_Portals  ->
> > > > MemoryContextDeleteChildren  -> MemoryContextDelete   ->
> > > > MemoryContextDeleteOnly
> > > > As a result, the memory backing entry->columns is freed, but the
> > > > RelationSyncCache which resides in CacheMemoryContext and thus
> > > > survives the error still holds a dangling pointer to this freed
> > > > memory, causing it to pfree an invalid pointer.
> > > > In the normal (positive) execution flow, pgoutput_shutdown() is called
> > > > to clean up the RelationSyncCache. This happens via:
> > > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
> > > > this is not called in case of an error case. To handle this case
> > > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
> > > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
> > > > Attached patch has the changes for the same.
> > > > Thoughts?
> > >
> > > Thank you for reporting the issue and providing a fix.
> > >
> > > I recall that we identified this general issue with the hash table in pgoutput
> > > in other threads as well [1]. The basic consensus [2] is that calling
> > > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
> > > user code, increasing the risk of encountering another error within PG_CATCH.
> > > This scenario could prevent execution of subsequent code to invalidate syscache
> > > entries, which is problematic.
> >
> > Yes, let's avoid this.
> >
> > > I think a better fix could be to introduce a memory context reset callback(on
> > > data->cachectx) and perform the actions of pgoutput_shutdown() within it.
> >
> > The attached v2 version patch has the changes for the same.
>
> We've addressed several memory-related issues in pgoutput. While most
> of these issues didn't affect logical replication, they did impact
> logical decoding called via SQL API. I find that these problems stem
> from RelationSyncCache being defined as a file-scope static variable
> and being allocated in CacheMemoryContext. I'm wondering if we could
> move it to PGOutputData and create it under the logical decoding
> context. This would ensure it's automatically cleaned up along with
> the logical decoding context.

I see one issue with keeping RelationSyncCache inside PGOutputData.
Both rel_sync_cache_publication_cb and rel_sync_cache_relation_cb rely
on this cache: they check its validity and update it in response to
invalidation events. The problem is that these callbacks cannot be
unregistered once they are registered. After pgoutput_shutdown() runs,
the hash table is destroyed and later FreeDecodingContext deletes the
logical decoding context. At that point, the registered callbacks may
still fire, but they’ll be holding onto a freed memory.

This is also noted directly in the callbacks:
/*
 * We can get here if the plugin was used in SQL interface as the
 * RelationSyncCache is destroyed when the decoding finishes, but there is
 * no way to unregister the relcache invalidation callback.
 */
if (RelationSyncCache == NULL)
    return;

So the core issue is that the callbacks may outlive the cache they
reference, and we don’t have a mechanism to unregister them cleanly.

Thoughts?

Regards,
Vignesh