Обсуждение: RE: Invalid pointer access in logical decoding after error

Поиск

Список

Период

Сортировка

RE: Invalid pointer access in logical decoding after error

От

"Zhijie Hou (Fujitsu)"

Дата:

02 июля 2025 г., 10:51:03

On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:

> 
> Hi,
> 
> I encountered an invalid pointer access issue. Below are the steps to 
> reproduce the issue:
...
> The error occurs because entry->columns is allocated in the entry 
> private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This 
> context is a child of the PortalContext, which is cleared after an 
> error via: AbortTransaction
> -> AtAbort_Portals  ->
> MemoryContextDeleteChildren  -> MemoryContextDelete   ->
> MemoryContextDeleteOnly
> As a result, the memory backing entry->columns is freed, but the 
> RelationSyncCache which resides in CacheMemoryContext and thus 
> survives the error still holds a dangling pointer to this freed 
> memory, causing it to pfree an invalid pointer.
> In the normal (positive) execution flow, pgoutput_shutdown() is called 
> to clean up the RelationSyncCache. This happens via:
> FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But 
> this is not called in case of an error case. To handle this case 
> safely, I suggest calling FreeDecodingContext in the PG_CATCH block to 
> ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
> Attached patch has the changes for the same.
> Thoughts?

Thank you for reporting the issue and providing a fix.

I recall that we identified this general issue with the hash table in pgoutput
in other threads as well [1]. The basic consensus [2] is that calling
FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
user code, increasing the risk of encountering another error within PG_CATCH.
This scenario could prevent execution of subsequent code to invalidate syscache
entries, which is problematic.

I think a better fix could be to introduce a memory context reset callback(on
data->cachectx) and perform the actions of pgoutput_shutdown() within it.

[1]
https://www.postgresql.org/message-id/OS0PR01MB57167C62D7DA4A8EBBC92B0A941BA@OS0PR01MB5716.jpnprd01.prod.outlook.com
[2] https://www.postgresql.org/message-id/20230820210545.plmlk3rnxxgkokkj%40awork3.anarazel.de

Best Regards,
Hou zj

Re: Invalid pointer access in logical decoding after error

От

Masahiko Sawada

Дата:

26 сентября 2025 г., 02:58:46

On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:
> >
> > >
> > > Hi,
> > >
> > > I encountered an invalid pointer access issue. Below are the steps to
> > > reproduce the issue:
> > ...
> > > The error occurs because entry->columns is allocated in the entry
> > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
> > > context is a child of the PortalContext, which is cleared after an
> > > error via: AbortTransaction
> > > -> AtAbort_Portals  ->
> > > MemoryContextDeleteChildren  -> MemoryContextDelete   ->
> > > MemoryContextDeleteOnly
> > > As a result, the memory backing entry->columns is freed, but the
> > > RelationSyncCache which resides in CacheMemoryContext and thus
> > > survives the error still holds a dangling pointer to this freed
> > > memory, causing it to pfree an invalid pointer.
> > > In the normal (positive) execution flow, pgoutput_shutdown() is called
> > > to clean up the RelationSyncCache. This happens via:
> > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
> > > this is not called in case of an error case. To handle this case
> > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
> > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
> > > Attached patch has the changes for the same.
> > > Thoughts?
> >
> > Thank you for reporting the issue and providing a fix.
> >
> > I recall that we identified this general issue with the hash table in pgoutput
> > in other threads as well [1]. The basic consensus [2] is that calling
> > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
> > user code, increasing the risk of encountering another error within PG_CATCH.
> > This scenario could prevent execution of subsequent code to invalidate syscache
> > entries, which is problematic.
>
> Yes, let's avoid this.
>
> > I think a better fix could be to introduce a memory context reset callback(on
> > data->cachectx) and perform the actions of pgoutput_shutdown() within it.
>
> The attached v2 version patch has the changes for the same.

We've addressed several memory-related issues in pgoutput. While most
of these issues didn't affect logical replication, they did impact
logical decoding called via SQL API. I find that these problems stem
from RelationSyncCache being defined as a file-scope static variable
and being allocated in CacheMemoryContext. I'm wondering if we could
move it to PGOutputData and create it under the logical decoding
context. This would ensure it's automatically cleaned up along with
the logical decoding context.

I also noticed another concerning issue: the entry->streamed_txns list
is maintained in CacheMemoryContext (see
set_schema_sent_in_streamed_txn()). This could lead to memory leaks
when logical decoding called via the SQL API encounters an error. This
issue isn't addressed in the current patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Invalid pointer access in logical decoding after error

От

vignesh C

Дата:

29 сентября 2025 г., 11:52:29

On Fri, 26 Sept 2025 at 05:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:
> > >
> > > >
> > > > Hi,
> > > >
> > > > I encountered an invalid pointer access issue. Below are the steps to
> > > > reproduce the issue:
> > > ...
> > > > The error occurs because entry->columns is allocated in the entry
> > > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
> > > > context is a child of the PortalContext, which is cleared after an
> > > > error via: AbortTransaction
> > > > -> AtAbort_Portals  ->
> > > > MemoryContextDeleteChildren  -> MemoryContextDelete   ->
> > > > MemoryContextDeleteOnly
> > > > As a result, the memory backing entry->columns is freed, but the
> > > > RelationSyncCache which resides in CacheMemoryContext and thus
> > > > survives the error still holds a dangling pointer to this freed
> > > > memory, causing it to pfree an invalid pointer.
> > > > In the normal (positive) execution flow, pgoutput_shutdown() is called
> > > > to clean up the RelationSyncCache. This happens via:
> > > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
> > > > this is not called in case of an error case. To handle this case
> > > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
> > > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
> > > > Attached patch has the changes for the same.
> > > > Thoughts?
> > >
> > > Thank you for reporting the issue and providing a fix.
> > >
> > > I recall that we identified this general issue with the hash table in pgoutput
> > > in other threads as well [1]. The basic consensus [2] is that calling
> > > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
> > > user code, increasing the risk of encountering another error within PG_CATCH.
> > > This scenario could prevent execution of subsequent code to invalidate syscache
> > > entries, which is problematic.
> >
> > Yes, let's avoid this.
> >
> > > I think a better fix could be to introduce a memory context reset callback(on
> > > data->cachectx) and perform the actions of pgoutput_shutdown() within it.
> >
> > The attached v2 version patch has the changes for the same.
>
> We've addressed several memory-related issues in pgoutput. While most
> of these issues didn't affect logical replication, they did impact
> logical decoding called via SQL API. I find that these problems stem
> from RelationSyncCache being defined as a file-scope static variable
> and being allocated in CacheMemoryContext. I'm wondering if we could
> move it to PGOutputData and create it under the logical decoding
> context. This would ensure it's automatically cleaned up along with
> the logical decoding context.

I see one issue with keeping RelationSyncCache inside PGOutputData.
Both rel_sync_cache_publication_cb and rel_sync_cache_relation_cb rely
on this cache: they check its validity and update it in response to
invalidation events. The problem is that these callbacks cannot be
unregistered once they are registered. After pgoutput_shutdown() runs,
the hash table is destroyed and later FreeDecodingContext deletes the
logical decoding context. At that point, the registered callbacks may
still fire, but they’ll be holding onto a freed memory.

This is also noted directly in the callbacks:
/*
 * We can get here if the plugin was used in SQL interface as the
 * RelationSyncCache is destroyed when the decoding finishes, but there is
 * no way to unregister the relcache invalidation callback.
 */
if (RelationSyncCache == NULL)
    return;

So the core issue is that the callbacks may outlive the cache they
reference, and we don’t have a mechanism to unregister them cleanly.

Thoughts?

Regards,
Vignesh

Re: Invalid pointer access in logical decoding after error

От

Masahiko Sawada

Дата:

06 октября 2025 г., 23:00:10

On Mon, Sep 29, 2025 at 1:52 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, 26 Sept 2025 at 05:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:
> > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > I encountered an invalid pointer access issue. Below are the steps to
> > > > > reproduce the issue:
> > > > ...
> > > > > The error occurs because entry->columns is allocated in the entry
> > > > > private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
> > > > > context is a child of the PortalContext, which is cleared after an
> > > > > error via: AbortTransaction
> > > > > -> AtAbort_Portals  ->
> > > > > MemoryContextDeleteChildren  -> MemoryContextDelete   ->
> > > > > MemoryContextDeleteOnly
> > > > > As a result, the memory backing entry->columns is freed, but the
> > > > > RelationSyncCache which resides in CacheMemoryContext and thus
> > > > > survives the error still holds a dangling pointer to this freed
> > > > > memory, causing it to pfree an invalid pointer.
> > > > > In the normal (positive) execution flow, pgoutput_shutdown() is called
> > > > > to clean up the RelationSyncCache. This happens via:
> > > > > FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
> > > > > this is not called in case of an error case. To handle this case
> > > > > safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
> > > > > ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
> > > > > Attached patch has the changes for the same.
> > > > > Thoughts?
> > > >
> > > > Thank you for reporting the issue and providing a fix.
> > > >
> > > > I recall that we identified this general issue with the hash table in pgoutput
> > > > in other threads as well [1]. The basic consensus [2] is that calling
> > > > FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
> > > > user code, increasing the risk of encountering another error within PG_CATCH.
> > > > This scenario could prevent execution of subsequent code to invalidate syscache
> > > > entries, which is problematic.
> > >
> > > Yes, let's avoid this.
> > >
> > > > I think a better fix could be to introduce a memory context reset callback(on
> > > > data->cachectx) and perform the actions of pgoutput_shutdown() within it.
> > >
> > > The attached v2 version patch has the changes for the same.
> >
> > We've addressed several memory-related issues in pgoutput. While most
> > of these issues didn't affect logical replication, they did impact
> > logical decoding called via SQL API. I find that these problems stem
> > from RelationSyncCache being defined as a file-scope static variable
> > and being allocated in CacheMemoryContext. I'm wondering if we could
> > move it to PGOutputData and create it under the logical decoding
> > context. This would ensure it's automatically cleaned up along with
> > the logical decoding context.
>
> I see one issue with keeping RelationSyncCache inside PGOutputData.
> Both rel_sync_cache_publication_cb and rel_sync_cache_relation_cb rely
> on this cache: they check its validity and update it in response to
> invalidation events. The problem is that these callbacks cannot be
> unregistered once they are registered. After pgoutput_shutdown() runs,
> the hash table is destroyed and later FreeDecodingContext deletes the
> logical decoding context. At that point, the registered callbacks may
> still fire, but they’ll be holding onto a freed memory.
>
> This is also noted directly in the callbacks:
> /*
>  * We can get here if the plugin was used in SQL interface as the
>  * RelationSyncCache is destroyed when the decoding finishes, but there is
>  * no way to unregister the relcache invalidation callback.
>  */
> if (RelationSyncCache == NULL)
>     return;
>
> So the core issue is that the callbacks may outlive the cache they
> reference, and we don’t have a mechanism to unregister them cleanly.
>
> Thoughts?

I agree with your analysis. It seems there is no convenient way to
move RelationSyncCache inside PGOutputData. So let's proceed with the
proposed approach.

I've done some minor changes to your v2 patch and updated the commit
message. IIUC this patch needs to be backpatched to v15. Please review
the attached patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Вложения

v3-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patch

Re: Invalid pointer access in logical decoding after error

От

"Euler Taveira"

Дата:

07 октября 2025 г., 04:54:33

On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:
> I agree with your analysis. It seems there is no convenient way to
> move RelationSyncCache inside PGOutputData. So let's proceed with the
> proposed approach.
>

+1. Shouldn't we add a comment mentioning that it is a possibility?

> I've done some minor changes to your v2 patch and updated the commit
> message. IIUC this patch needs to be backpatched to v15. Please review
> the attached patch.
>

I verified that the bug exists since v15 as reported. Despite of the test case
provided by Vignesh (which I attached a modified version to be used in v15 or
later), I also added another test case that has a similar problem with
generated columns. This 2nd test case only works for v18 (where the feature was
introduced). This patch also fixes this case.

I'm curious about other cases related to RelationSyncCache. Is there any other
cases that this patch doesn't fix?

This patch looks good to me. Do we really need a new function with the same
content as pgoutput_shutdown? I don't like mcallback. It seems 'm' stands for
'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
_c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
already used in another place with MemoryContextCallback.

-- 
Euler Taveira
EDB   https://www.enterprisedb.com/

Вложения

Re: Invalid pointer access in logical decoding after error

От

Masahiko Sawada

Дата:

07 октября 2025 г., 21:09:34

On Mon, Oct 6, 2025 at 6:55 PM Euler Taveira <euler@eulerto.com> wrote:
>
> On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:
> > I agree with your analysis. It seems there is no convenient way to
> > move RelationSyncCache inside PGOutputData. So let's proceed with the
> > proposed approach.
> >
>
> +1. Shouldn't we add a comment mentioning that it is a possibility?

I think the primary reason why we cannot simply move RelationSyncCache
to PGOutputData is there is no way to unregister the invalidation
callback, which is already mentioned in some places:

      /*
       * We can get here if the plugin was used in SQL interface as the
       * RelationSyncCache is destroyed when the decoding finishes, but there is
       * no way to unregister the relcache invalidation callback.
       */
      if (RelationSyncCache == NULL)

>
> > I've done some minor changes to your v2 patch and updated the commit
> > message. IIUC this patch needs to be backpatched to v15. Please review
> > the attached patch.
> >
>
> I verified that the bug exists since v15 as reported. Despite of the test case
> provided by Vignesh (which I attached a modified version to be used in v15 or
> later), I also added another test case that has a similar problem with
> generated columns. This 2nd test case only works for v18 (where the feature was
> introduced). This patch also fixes this case.

Thank you for the tests!

> I'm curious about other cases related to RelationSyncCache. Is there any other
> cases that this patch doesn't fix?

I think that with the patch we can ensure to clean up
RelationSyncCache alongside with other memory contexts used in
pgoutput, fixing problems stemming from RelationSyncCache referencing
already-freed-memory in the pgoutput memory contexts.

IIUC there is another memory leak issue in pgoutput as I reported on
this thread[1]. It should be fixed in a separate patch.

> This patch looks good to me. Do we really need a new function with the same
> content as pgoutput_shutdown?

Probably we can call pgoutput_shutdown() in rel_sync_cache_reset_cb().

> I don't like mcallback. It seems 'm' stands for
> 'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
> _c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
> already used in another place with MemoryContextCallback.

I think it's better to use 'mcallback' since back branches are already
using it (see commit bbe68c13a for example). While considering
backpatching this patch, I noticed that the memory context reset
callback function should be registered to ctx->context but not
ctx->cachectx.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoCgqZ0BUpXjVY6tD1jSLtVSdWpG%2BLZyZimq4Uu3TymTAA%40mail.gmail.com


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Invalid pointer access in logical decoding after error

От

vignesh C

Дата:

08 октября 2025 г., 21:39:38

On Tue, 7 Oct 2025 at 23:40, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Oct 6, 2025 at 6:55 PM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:
> > > I agree with your analysis. It seems there is no convenient way to
> > > move RelationSyncCache inside PGOutputData. So let's proceed with the
> > > proposed approach.
> > >
> >
> > +1. Shouldn't we add a comment mentioning that it is a possibility?
>
> I think the primary reason why we cannot simply move RelationSyncCache
> to PGOutputData is there is no way to unregister the invalidation
> callback, which is already mentioned in some places:
>
>       /*
>        * We can get here if the plugin was used in SQL interface as the
>        * RelationSyncCache is destroyed when the decoding finishes, but there is
>        * no way to unregister the relcache invalidation callback.
>        */
>       if (RelationSyncCache == NULL)
>
> >
> > > I've done some minor changes to your v2 patch and updated the commit
> > > message. IIUC this patch needs to be backpatched to v15. Please review
> > > the attached patch.
> > >
> >
> > I verified that the bug exists since v15 as reported. Despite of the test case
> > provided by Vignesh (which I attached a modified version to be used in v15 or
> > later), I also added another test case that has a similar problem with
> > generated columns. This 2nd test case only works for v18 (where the feature was
> > introduced). This patch also fixes this case.
>
> Thank you for the tests!

I also have verified that the test works till the PG15 branch.

> > I'm curious about other cases related to RelationSyncCache. Is there any other
> > cases that this patch doesn't fix?
>
> I think that with the patch we can ensure to clean up
> RelationSyncCache alongside with other memory contexts used in
> pgoutput, fixing problems stemming from RelationSyncCache referencing
> already-freed-memory in the pgoutput memory contexts.
>
> IIUC there is another memory leak issue in pgoutput as I reported on
> this thread[1]. It should be fixed in a separate patch.

+1 to fix it as a separate patch.

> > This patch looks good to me. Do we really need a new function with the same
> > content as pgoutput_shutdown?
>
> Probably we can call pgoutput_shutdown() in rel_sync_cache_reset_cb().

Changed it to pgoutput_shutdown as it has the same functionality

> > I don't like mcallback. It seems 'm' stands for
> > 'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
> > _c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
> > already used in another place with MemoryContextCallback.
>
> I think it's better to use 'mcallback' since back branches are already
> using it (see commit bbe68c13a for example).

Yes, that is better

> While considering
> backpatching this patch, I noticed that the memory context reset
> callback function should be registered to ctx->context but not
> ctx->cachectx.

Modified

The attached v4 version patch has the changes for the same.

Regards,
Vignesh

Вложения

v4-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patch

Re: Invalid pointer access in logical decoding after error

От

Masahiko Sawada

Дата:

08 октября 2025 г., 21:45:40

On Wed, Oct 8, 2025 at 11:39 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, 7 Oct 2025 at 23:40, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Oct 6, 2025 at 6:55 PM Euler Taveira <euler@eulerto.com> wrote:
> > >
> > > On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:
> > > > I agree with your analysis. It seems there is no convenient way to
> > > > move RelationSyncCache inside PGOutputData. So let's proceed with the
> > > > proposed approach.
> > > >
> > >
> > > +1. Shouldn't we add a comment mentioning that it is a possibility?
> >
> > I think the primary reason why we cannot simply move RelationSyncCache
> > to PGOutputData is there is no way to unregister the invalidation
> > callback, which is already mentioned in some places:
> >
> >       /*
> >        * We can get here if the plugin was used in SQL interface as the
> >        * RelationSyncCache is destroyed when the decoding finishes, but there is
> >        * no way to unregister the relcache invalidation callback.
> >        */
> >       if (RelationSyncCache == NULL)
> >
> > >
> > > > I've done some minor changes to your v2 patch and updated the commit
> > > > message. IIUC this patch needs to be backpatched to v15. Please review
> > > > the attached patch.
> > > >
> > >
> > > I verified that the bug exists since v15 as reported. Despite of the test case
> > > provided by Vignesh (which I attached a modified version to be used in v15 or
> > > later), I also added another test case that has a similar problem with
> > > generated columns. This 2nd test case only works for v18 (where the feature was
> > > introduced). This patch also fixes this case.
> >
> > Thank you for the tests!
>
> I also have verified that the test works till the PG15 branch.
>
> > > I'm curious about other cases related to RelationSyncCache. Is there any other
> > > cases that this patch doesn't fix?
> >
> > I think that with the patch we can ensure to clean up
> > RelationSyncCache alongside with other memory contexts used in
> > pgoutput, fixing problems stemming from RelationSyncCache referencing
> > already-freed-memory in the pgoutput memory contexts.
> >
> > IIUC there is another memory leak issue in pgoutput as I reported on
> > this thread[1]. It should be fixed in a separate patch.
>
> +1 to fix it as a separate patch.
>
> > > This patch looks good to me. Do we really need a new function with the same
> > > content as pgoutput_shutdown?
> >
> > Probably we can call pgoutput_shutdown() in rel_sync_cache_reset_cb().
>
> Changed it to pgoutput_shutdown as it has the same functionality
>
> > > I don't like mcallback. It seems 'm' stands for
> > > 'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
> > > _c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
> > > already used in another place with MemoryContextCallback.
> >
> > I think it's better to use 'mcallback' since back branches are already
> > using it (see commit bbe68c13a for example).
>
> Yes, that is better
>
> > While considering
> > backpatching this patch, I noticed that the memory context reset
> > callback function should be registered to ctx->context but not
> > ctx->cachectx.
>
> Modified
>
> The attached v4 version patch has the changes for the same.

Thank you for updating the patch! I was working on creating patches
for backbranches too, so let me share these patches.

I've made some cosmetic changes to align back branch codes, and
attached the patches for master to v15.

One thing we might want to consider is for v14 and v13. They don't
have this bug since the entry_ctx was introduced in v15, but it's
still true for them that RelationSyncCache is not cleaned up in error
cases if pgoutput is used via SQL API. For example, if
RelationSyncCache hash table gets corrupted for some reason, logical
decoding could repeat an error until logical decoding completes
successfully and its shutdown callback is called. Also, it might be a
good idea in general to ensure cleaning up the hash table after use.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Вложения

Re: Invalid pointer access in logical decoding after error

От

vignesh C

Дата:

09 октября 2025 г., 16:40:32

On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Oct 8, 2025 at 11:39 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, 7 Oct 2025 at 23:40, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Oct 6, 2025 at 6:55 PM Euler Taveira <euler@eulerto.com> wrote:
> > > >
> > > > On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:
> > > > > I agree with your analysis. It seems there is no convenient way to
> > > > > move RelationSyncCache inside PGOutputData. So let's proceed with the
> > > > > proposed approach.
> > > > >
> > > >
> > > > +1. Shouldn't we add a comment mentioning that it is a possibility?
> > >
> > > I think the primary reason why we cannot simply move RelationSyncCache
> > > to PGOutputData is there is no way to unregister the invalidation
> > > callback, which is already mentioned in some places:
> > >
> > >       /*
> > >        * We can get here if the plugin was used in SQL interface as the
> > >        * RelationSyncCache is destroyed when the decoding finishes, but there is
> > >        * no way to unregister the relcache invalidation callback.
> > >        */
> > >       if (RelationSyncCache == NULL)
> > >
> > > >
> > > > > I've done some minor changes to your v2 patch and updated the commit
> > > > > message. IIUC this patch needs to be backpatched to v15. Please review
> > > > > the attached patch.
> > > > >
> > > >
> > > > I verified that the bug exists since v15 as reported. Despite of the test case
> > > > provided by Vignesh (which I attached a modified version to be used in v15 or
> > > > later), I also added another test case that has a similar problem with
> > > > generated columns. This 2nd test case only works for v18 (where the feature was
> > > > introduced). This patch also fixes this case.
> > >
> > > Thank you for the tests!
> >
> > I also have verified that the test works till the PG15 branch.
> >
> > > > I'm curious about other cases related to RelationSyncCache. Is there any other
> > > > cases that this patch doesn't fix?
> > >
> > > I think that with the patch we can ensure to clean up
> > > RelationSyncCache alongside with other memory contexts used in
> > > pgoutput, fixing problems stemming from RelationSyncCache referencing
> > > already-freed-memory in the pgoutput memory contexts.
> > >
> > > IIUC there is another memory leak issue in pgoutput as I reported on
> > > this thread[1]. It should be fixed in a separate patch.
> >
> > +1 to fix it as a separate patch.
> >
> > > > This patch looks good to me. Do we really need a new function with the same
> > > > content as pgoutput_shutdown?
> > >
> > > Probably we can call pgoutput_shutdown() in rel_sync_cache_reset_cb().
> >
> > Changed it to pgoutput_shutdown as it has the same functionality
> >
> > > > I don't like mcallback. It seems 'm' stands for
> > > > 'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
> > > > _c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
> > > > already used in another place with MemoryContextCallback.
> > >
> > > I think it's better to use 'mcallback' since back branches are already
> > > using it (see commit bbe68c13a for example).
> >
> > Yes, that is better
> >
> > > While considering
> > > backpatching this patch, I noticed that the memory context reset
> > > callback function should be registered to ctx->context but not
> > > ctx->cachectx.
> >
> > Modified
> >
> > The attached v4 version patch has the changes for the same.
>
> Thank you for updating the patch! I was working on creating patches
> for backbranches too, so let me share these patches.
>
> I've made some cosmetic changes to align back branch codes, and
> attached the patches for master to v15.
>
> One thing we might want to consider is for v14 and v13. They don't
> have this bug since the entry_ctx was introduced in v15, but it's
> still true for them that RelationSyncCache is not cleaned up in error
> cases if pgoutput is used via SQL API. For example, if
> RelationSyncCache hash table gets corrupted for some reason, logical
> decoding could repeat an error until logical decoding completes
> successfully and its shutdown callback is called. Also, it might be a
> good idea in general to ensure cleaning up the hash table after use.

Agreed, let's backpatch to PG13. Should we also add a test case in the
master branch, given that this issue has been around for a while?

Regards,
Vignesh

Re: Invalid pointer access in logical decoding after error

От

"Euler Taveira"

Дата:

09 октября 2025 г., 18:22:59

On Thu, Oct 9, 2025, at 10:40 AM, vignesh C wrote:
> On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> One thing we might want to consider is for v14 and v13. They don't
>> have this bug since the entry_ctx was introduced in v15, but it's
>> still true for them that RelationSyncCache is not cleaned up in error
>> cases if pgoutput is used via SQL API. For example, if
>> RelationSyncCache hash table gets corrupted for some reason, logical
>> decoding could repeat an error until logical decoding completes
>> successfully and its shutdown callback is called. Also, it might be a
>> good idea in general to ensure cleaning up the hash table after use.
>
> Agreed, let's backpatch to PG13. Should we also add a test case in the
> master branch, given that this issue has been around for a while?
>

I'm wondering if it is a good idea because the bug doesn't manifest in v13 and
v14. At least the v13 has its final minor version in less than a month and EOL.
I would have caution when applying fixes to the latest minor version of a
stable branch; there won't be a chance to fix the fix in the next minor
release. Furthermore, in these back branches, the patch doesn't fix a known
issue.  I wouldn't bother with these back branches. For v14, if, in a couple of
months, we have some reports that justify the backpatch, fine.

-- 
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: Invalid pointer access in logical decoding after error

От

Masahiko Sawada

Дата:

09 октября 2025 г., 20:07:22

On Thu, Oct 9, 2025 at 8:23 AM Euler Taveira <euler@eulerto.com> wrote:
>
> On Thu, Oct 9, 2025, at 10:40 AM, vignesh C wrote:
> > On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >>
> >> One thing we might want to consider is for v14 and v13. They don't
> >> have this bug since the entry_ctx was introduced in v15, but it's
> >> still true for them that RelationSyncCache is not cleaned up in error
> >> cases if pgoutput is used via SQL API. For example, if
> >> RelationSyncCache hash table gets corrupted for some reason, logical
> >> decoding could repeat an error until logical decoding completes
> >> successfully and its shutdown callback is called. Also, it might be a
> >> good idea in general to ensure cleaning up the hash table after use.
> >
> > Agreed, let's backpatch to PG13. Should we also add a test case in the
> > master branch, given that this issue has been around for a while?
> >
>
> I'm wondering if it is a good idea because the bug doesn't manifest in v13 and
> v14. At least the v13 has its final minor version in less than a month and EOL.
> I would have caution when applying fixes to the latest minor version of a
> stable branch; there won't be a chance to fix the fix in the next minor
> release. Furthermore, in these back branches, the patch doesn't fix a known
> issue.  I wouldn't bother with these back branches. For v14, if, in a couple of
> months, we have some reports that justify the backpatch, fine.

Agreed. I'm hesitant with patching to v13 and v14. We've never got
such a bug report yet and the next minior version of v13 would be the
final release. I'll add some comments in the commit message.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Invalid pointer access in logical decoding after error

От

Masahiko Sawada

Дата:

09 октября 2025 г., 21:02:43

On Thu, Oct 9, 2025 at 10:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Oct 9, 2025 at 8:23 AM Euler Taveira <euler@eulerto.com> wrote:
> >
> > On Thu, Oct 9, 2025, at 10:40 AM, vignesh C wrote:
> > > On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >>
> > >> One thing we might want to consider is for v14 and v13. They don't
> > >> have this bug since the entry_ctx was introduced in v15, but it's
> > >> still true for them that RelationSyncCache is not cleaned up in error
> > >> cases if pgoutput is used via SQL API. For example, if
> > >> RelationSyncCache hash table gets corrupted for some reason, logical
> > >> decoding could repeat an error until logical decoding completes
> > >> successfully and its shutdown callback is called. Also, it might be a
> > >> good idea in general to ensure cleaning up the hash table after use.
> > >
> > > Agreed, let's backpatch to PG13. Should we also add a test case in the
> > > master branch, given that this issue has been around for a while?
> > >
> >
> > I'm wondering if it is a good idea because the bug doesn't manifest in v13 and
> > v14. At least the v13 has its final minor version in less than a month and EOL.
> > I would have caution when applying fixes to the latest minor version of a
> > stable branch; there won't be a chance to fix the fix in the next minor
> > release. Furthermore, in these back branches, the patch doesn't fix a known
> > issue.  I wouldn't bother with these back branches. For v14, if, in a couple of
> > months, we have some reports that justify the backpatch, fine.
>
> Agreed. I'm hesitant with patching to v13 and v14. We've never got
> such a bug report yet and the next minior version of v13 would be the
> final release. I'll add some comments in the commit message.
>

And now pushed (from master to v15).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Invalid pointer access in logical decoding after error

От

vignesh C

Дата:

11 октября 2025 г., 17:19:48

On Thu, 9 Oct 2025 at 23:33, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Oct 9, 2025 at 10:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Thu, Oct 9, 2025 at 8:23 AM Euler Taveira <euler@eulerto.com> wrote:
> > >
> > > On Thu, Oct 9, 2025, at 10:40 AM, vignesh C wrote:
> > > > On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >>
> > > >> One thing we might want to consider is for v14 and v13. They don't
> > > >> have this bug since the entry_ctx was introduced in v15, but it's
> > > >> still true for them that RelationSyncCache is not cleaned up in error
> > > >> cases if pgoutput is used via SQL API. For example, if
> > > >> RelationSyncCache hash table gets corrupted for some reason, logical
> > > >> decoding could repeat an error until logical decoding completes
> > > >> successfully and its shutdown callback is called. Also, it might be a
> > > >> good idea in general to ensure cleaning up the hash table after use.
> > > >
> > > > Agreed, let's backpatch to PG13. Should we also add a test case in the
> > > > master branch, given that this issue has been around for a while?
> > > >
> > >
> > > I'm wondering if it is a good idea because the bug doesn't manifest in v13 and
> > > v14. At least the v13 has its final minor version in less than a month and EOL.
> > > I would have caution when applying fixes to the latest minor version of a
> > > stable branch; there won't be a chance to fix the fix in the next minor
> > > release. Furthermore, in these back branches, the patch doesn't fix a known
> > > issue.  I wouldn't bother with these back branches. For v14, if, in a couple of
> > > months, we have some reports that justify the backpatch, fine.
> >
> > Agreed. I'm hesitant with patching to v13 and v14. We've never got
> > such a bug report yet and the next minior version of v13 would be the
> > final release. I'll add some comments in the commit message.
> >
>
> And now pushed (from master to v15).

Thanks for committing the changes. I monitored the build farm, and the
runs completed successfully. I've also closed the Cf entry at [1].
[1] - https://commitfest.postgresql.org/patch/5903/

Regards,
Vignesh

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: RE: Invalid pointer access in logical decoding after error

Вложения

Вложения

Вложения

Вложения