Обсуждение: "unexpected duplicate for tablespace" problem in logical replication

Поиск

Список

Период

Сортировка

"unexpected duplicate for tablespace" problem in logical replication

От

"wangsh.fnst@fujitsu.com"

Дата:

06 апреля 2022 г., 02:14:26

Hi,


I met a problem while using logical replication in PG11 and I think all 
the PG version have this problem.


The log looks like:
> ERROR: unexpected duplicate for tablespace 0, relfilenode xxxxxxx
Someone also reported this problem in [1], but no one has responded to it.



I did some investigation, and found a way to reproduce this problem. The 
steps are:


1. create a table (call it tableX) and truncate it.


2. cycle through 2^32 OIDs.


3. restart the database to clear all the cache.


4. create a temp table which make the temp table's OID equals to the 
tableX's relfilenode and insert any data into tableX.


The attachment(run.sh) can reproduce this problem in PG10 and PG11with 
the help of option 'WITH OIDS'. I don't find any way to cycle the OIDs 
quickly in branch master, but I use the gdb to reproduce this problem too.



Now, function GetNewRelFileNode() only checks:


1. duplicated OIDs in pg_class.


2. relpath(rnode) is exists in disk.


However, the result of relpath(temp table) and relpath(non-temp table) 
are different, temp table's relpath() has a prefix "t%d". That means, if 
there is a table that value of relfilenode is 20000(but the value of oid 
isn't 20000), it's possible to create a temp table that value of 
relfilenode is also 20000.


I think function GetNewRelFileNode() should always check the duplicated 
relfilenode, see the patch(a simple to way to fix this problem is master 
branch).


Any comment?


Regards,

Shenhao Wang



[1] 
https://www.postgresql.org/message-id/flat/CAM5YvKTPxmMT%3DS7iPcu5SgmaOv4S4nhE1HZRO_sdFX9cXeXXOQ%40mail.gmail.com

Вложения

RE: "unexpected duplicate for tablespace" problem in logical replication

От

"osumi.takamichi@fujitsu.com"

Дата:

08 апреля 2022 г., 09:44:07

On Wednesday, April 6, 2022 11:14 AM wangsh.fnst@fujitsu.com <wangsh.fnst@fujitsu.com> wrote:
> I met a problem while using logical replication in PG11 and I think all the PG
> version have this problem.
> 
> 
> The log looks like:
> > ERROR: unexpected duplicate for tablespace 0, relfilenode xxxxxxx
> Someone also reported this problem in [1], but no one has responded to it.
> 
> 
> 
> I did some investigation, and found a way to reproduce this problem. The
> steps are:
> 
> 
> 1. create a table (call it tableX) and truncate it.
> 
> 
> 2. cycle through 2^32 OIDs.
> 
> 
> 3. restart the database to clear all the cache.
> 
> 
> 4. create a temp table which make the temp table's OID equals to the
> tableX's relfilenode and insert any data into tableX.
> 
> 
> The attachment(run.sh) can reproduce this problem in PG10 and PG11with
> the help of option 'WITH OIDS'. I don't find any way to cycle the OIDs
> quickly in branch master, but I use the gdb to reproduce this problem too.
> 
> 
> 
> Now, function GetNewRelFileNode() only checks:
> 
> 
> 1. duplicated OIDs in pg_class.
> 
> 
> 2. relpath(rnode) is exists in disk.
> 
> 
> However, the result of relpath(temp table) and relpath(non-temp table)
> are different, temp table's relpath() has a prefix "t%d". That means, if
> there is a table that value of relfilenode is 20000(but the value of oid
> isn't 20000), it's possible to create a temp table that value of
> relfilenode is also 20000.
> 
> 
> I think function GetNewRelFileNode() should always check the duplicated
> relfilenode, see the patch(a simple to way to fix this problem is master
> branch).
> 
> 
> Any comment?
Hi, thank you for your report.

It seems correct that there's room that wraparounded oid can be used for temp table,
and we get duplicate result when we retrieve it and face the error.

I reproduced your issue with HEAD and gdb, by replacing rnode.node.relNode with
an existing relfilenode in GetNewRelFileNode(), immediately before the call of relpath().

At the same time, I also confirmed that with your patch, I had another iteration
call of GetNewOidWithIndex() and got different oid, even if I inserted an existing
relfilenode in the same way. I didn't get the same error with your patch.

Below are my several review comments on your patch.

(1) Name of isRelNodeCollision

How about changing it to "IsCollidedRelNode" ?

When I see around of this code, there are functions
named as "IsReservedName", "IsSharedRelation" or "IsPinnedObject".
So, my suggestion would be more aligned.

(2) Remove curly brackets for one sentence

+       if (pg_class != NULL)
+       {
+               usedAsOID = true;
+       }
+       else
+       {
+               pg_class = table_open(RelationRelationId, AccessShareLock);
+               usedAsOID = false;
+       }
+

Please remove curly brackets for one sentence.

Also, I suggest you add a comment something like
"Ensure we have opened pg_class in this function"
at the top of this check for clarification.

(3) variables of isRelNodeCollision

+       SysScanDesc scandesc;
+       static ScanKeyData skey[2];

We can move those declaration into the inside of
the conditional branch for scankey.

(4) Having Assert in isRelNodeCollision

How about add an Assert to check the argument
relation is not NULL ?

(5) argument name of isRelNodeCollision

I suggest you change it to pg_class.

Best Regards,
    Takamichi Osumi

RE: "unexpected duplicate for tablespace" problem in logical replication

От

"osumi.takamichi@fujitsu.com"

Дата:

11 апреля 2022 г., 07:40:11

On Friday, April 8, 2022 6:44 PM I wrote:
> On Wednesday, April 6, 2022 11:14 AM wangsh.fnst@fujitsu.com
> <wangsh.fnst@fujitsu.com> wrote:
> > I met a problem while using logical replication in PG11 and I think
> > all the PG version have this problem.
> >
> >
> > The log looks like:
> > > ERROR: unexpected duplicate for tablespace 0, relfilenode xxxxxxx
> > Someone also reported this problem in [1], but no one has responded to it.
> >
> >
> >
> > I did some investigation, and found a way to reproduce this problem.
> > The steps are:
> >
> >
> > 1. create a table (call it tableX) and truncate it.
> >
> >
> > 2. cycle through 2^32 OIDs.
> >
> >
> > 3. restart the database to clear all the cache.
> >
> >
> > 4. create a temp table which make the temp table's OID equals to the
> > tableX's relfilenode and insert any data into tableX.
> >
> >
> > The attachment(run.sh) can reproduce this problem in PG10 and PG11with
> > the help of option 'WITH OIDS'. I don't find any way to cycle the OIDs
> > quickly in branch master, but I use the gdb to reproduce this problem too.
> >
> >
> >
> > Now, function GetNewRelFileNode() only checks:
> >
> >
> > 1. duplicated OIDs in pg_class.
> >
> >
> > 2. relpath(rnode) is exists in disk.
> >
> >
> > However, the result of relpath(temp table) and relpath(non-temp table)
> > are different, temp table's relpath() has a prefix "t%d". That means,
> > if there is a table that value of relfilenode is 20000(but the value
> > of oid isn't 20000), it's possible to create a temp table that value
> > of relfilenode is also 20000.
> >
> >
> > I think function GetNewRelFileNode() should always check the
> > duplicated relfilenode, see the patch(a simple to way to fix this
> > problem is master branch).
> >
> >
> > Any comment?
> Hi, thank you for your report.
> 
> 
> It seems correct that there's room that wraparounded oid can be used for temp
> table, and we get duplicate result when we retrieve it and face the error.
> 
> I reproduced your issue with HEAD and gdb, by replacing rnode.node.relNode
> with an existing relfilenode in GetNewRelFileNode(), immediately before the
> call of relpath().
One thing I forgot to note is that this bug is not unique to the logical replication.
There is other path to hit it for example, pg_filenode_relation
in the same procedures with gdb.

In the below output, I created tempa table with the same filenode with gdb
without having a pair of logical replication and got the same error you reported.

postgres=# select oid, relname, relfilenode, reltablespace from pg_class where relname in ('c', 'tempa');
  oid  | relname | relfilenode | reltablespace
-------+---------+-------------+---------------
 16387 | c       |       16390 |             0
 16390 | tempa   |       16390 |             0
(2 rows)

postgres=# select pg_filenode_relation(0, 16390);
ERROR:  unexpected duplicate for tablespace 0, relfilenode 16390


Best Regards,
    Takamichi Osumi

Re: "unexpected duplicate for tablespace" problem in logical replication

От

"wangsh.fnst@fujitsu.com"

Дата:

11 апреля 2022 г., 10:24:26

Hi, Osumi-san, Thank you for your comment.

On 4/11/22 15:40, Osumi, Takamichi/大墨 昂道 wrote:
> (1) Name of isRelNodeCollision
changed.
> (2) Remove curly brackets for one sentence
changed.
> (3) variables of isRelNodeCollision
changed.
> (4) Having Assert in isRelNodeCollision
added.
> (5) argument name of isRelNodeCollision
changed.

BTW, I also change the parameter from null to SnapshotAny

         scandesc = systable_beginscan(pg_class,
                                       ClassTblspcRelfilenodeIndexId,
-                                      true, NULL, 2, skey);
+                                      true, SnapshotAny, 2, skey);

I'm not sure, but I think SnapshotAny is more suitable after reading the 
source(comment) in function GetNewOidWithIndex().

Please see the attachment

Regards,
Shenhao Wang

Вложения

v2-0001-Check-the-duplicated-relfilenode-when-create-tabl.patch

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Kyotaro Horiguchi

Дата:

12 апреля 2022 г., 02:45:39

At Mon, 11 Apr 2022 10:24:26 +0000, "wangsh.fnst@fujitsu.com" <wangsh.fnst@fujitsu.com> wrote in 
> Please see the attachment

Sorry for being late, but this seems to be in the wrong direction.

In the problem case, as you know, GetNewRelFileNode does *not* check
oid uniqueness in pg_class.  The on-catalog duplicate check is done
only when the result is to be used as the *table oid* at creation.  In
other cases it chooses a relfilenode that does not duplicate in
storage file names irrelevantly to the relation oid.  As the result
Non-temporary and temporary tables have separate relfilenode oid
spaces, either intentional or not..

Thus, I think we don't want GetNewRelFileNode to check for duplicate
oid on pg_class if pg_class is not passed since that significantly
narrow the oid space for relfilenode.  If we did something like that,
we might do check the existence of file names of the both non-temp and
temp in GetNewRelFileNode(), but that also narrows the relfilenode oid
space.

RelidByRelfilenode is called by autoprewarm, logical replication, and
pg_filenode_relation. In the autoprewarm and logical replication
cases, it is called only for non-temprary relations so letting the
function ignore (oid duplicating) temp tables works.
pg_relfilienode_relation is somewhat dubious.  It is affected by this
bug.  In the attached PoC, to avoid API change (in case for
back-patching), RelidByRelfilenode ignores duplcates of differernt
persistence.  Also the PoC prioritize on persistent tables to
temporary ones but I'm not sure it's good decision, but otherwise we
need to change the signature of pg_filenode_relation.

The attached is an PoC of that.  The first one is a reproduction-aid
code. With it applied, the following steps create duplicate
relfilenode relations.

(apply the first diff)
$ rm /tmp/hoge
$ (start postgres)
# touch /tmp/hoge
$ psql
=# create table xp (a int);
=# truncate xp;
=# create temp table xt (a int);
=# truncate xt;
=# select oid, relname, relfilenode from pg_class where relname like 'x%';
  oid  | relname | relfilenode 
-------+---------+-------------
 16384 | xp      |       55555
 55558 | xt      |       55555
(2 rows)
=# select pg_filenode_relation(0, 55555);
ERROR:  unexpected duplicate for tablespace 0, relfilenode 55555


(apply the second diff)
=# create temp table xt (a int);
=# truncate xt;
=# select pg_filenode_relation(0, 55555);
 pg_filenode_relation 
----------------------
 xp
(1 row)


What do you think about this direction?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

Re: "unexpected duplicate for tablespace" problem in logical replication

От

"wangsh.fnst@fujitsu.com"

Дата:

12 апреля 2022 г., 03:36:43

On 4/12/22 10:45, Kyotaro Horiguchi wrote:
> What do you think about this direction?

I think this direction is better then mine.  This POC looks good to me.


Regards,
Shenhao Wang

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Masahiko Sawada

Дата:

19 апреля 2022 г., 06:23:37

On Tue, Apr 12, 2022 at 11:45 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 11 Apr 2022 10:24:26 +0000, "wangsh.fnst@fujitsu.com" <wangsh.fnst@fujitsu.com> wrote in
> > Please see the attachment
>
> Sorry for being late, but this seems to be in the wrong direction.
>
> In the problem case, as you know, GetNewRelFileNode does *not* check
> oid uniqueness in pg_class.  The on-catalog duplicate check is done
> only when the result is to be used as the *table oid* at creation.  In
> other cases it chooses a relfilenode that does not duplicate in
> storage file names irrelevantly to the relation oid.  As the result
> Non-temporary and temporary tables have separate relfilenode oid
> spaces, either intentional or not..
>
> Thus, I think we don't want GetNewRelFileNode to check for duplicate
> oid on pg_class if pg_class is not passed since that significantly
> narrow the oid space for relfilenode.  If we did something like that,
> we might do check the existence of file names of the both non-temp and
> temp in GetNewRelFileNode(), but that also narrows the relfilenode oid
> space.

Agreed.

>
> RelidByRelfilenode is called by autoprewarm, logical replication, and
> pg_filenode_relation. In the autoprewarm and logical replication
> cases, it is called only for non-temprary relations so letting the
> function ignore (oid duplicating) temp tables works.
> pg_relfilienode_relation is somewhat dubious.  It is affected by this
> bug.  In the attached PoC, to avoid API change (in case for
> back-patching), RelidByRelfilenode ignores duplcates of differernt
> persistence.  Also the PoC prioritize on persistent tables to
> temporary ones but I'm not sure it's good decision, but otherwise we
> need to change the signature of pg_filenode_relation.
>
> The attached is an PoC of that.  The first one is a reproduction-aid
> code. With it applied, the following steps create duplicate
> relfilenode relations.
>
> What do you think about this direction?

Sounds good to me. If we go in this direction, probably we also need
to change the cache checks in RelidByRelfilenode() so that we
prioritize non-temporary relations. Otherwise, we will end up
returning the OID of temporary relation if it's already cached.

Another idea I came up with is that we have RelidByRelfilenode() check
only non-temporary relations by adding relpersistence !=
RELPERSISTENCE_TEMP to the scan key. If we want to support temporary
relations too, I think that, in v16, we can add relpersistence to
RelidByRelfilenode() as a function argument (and a flag to
pg_filenode_relation() accordingly) so that the user can get the name
of the temporary relation by like pg_filenode_relation(0, 16386,
true). On the other hand, in back branches, if an application is using
pg_filenode_relation() to get the name of temporary relations, it
would break it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: "unexpected duplicate for tablespace" problem in logical replication

От

vignesh C

Дата:

08 января 2024 г., 05:02:03

On Tue, 19 Apr 2022 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Apr 12, 2022 at 11:45 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Mon, 11 Apr 2022 10:24:26 +0000, "wangsh.fnst@fujitsu.com" <wangsh.fnst@fujitsu.com> wrote in
> > > Please see the attachment
> >
> > Sorry for being late, but this seems to be in the wrong direction.
> >
> > In the problem case, as you know, GetNewRelFileNode does *not* check
> > oid uniqueness in pg_class.  The on-catalog duplicate check is done
> > only when the result is to be used as the *table oid* at creation.  In
> > other cases it chooses a relfilenode that does not duplicate in
> > storage file names irrelevantly to the relation oid.  As the result
> > Non-temporary and temporary tables have separate relfilenode oid
> > spaces, either intentional or not..
> >
> > Thus, I think we don't want GetNewRelFileNode to check for duplicate
> > oid on pg_class if pg_class is not passed since that significantly
> > narrow the oid space for relfilenode.  If we did something like that,
> > we might do check the existence of file names of the both non-temp and
> > temp in GetNewRelFileNode(), but that also narrows the relfilenode oid
> > space.
>
> Agreed.
>
> >
> > RelidByRelfilenode is called by autoprewarm, logical replication, and
> > pg_filenode_relation. In the autoprewarm and logical replication
> > cases, it is called only for non-temprary relations so letting the
> > function ignore (oid duplicating) temp tables works.
> > pg_relfilienode_relation is somewhat dubious.  It is affected by this
> > bug.  In the attached PoC, to avoid API change (in case for
> > back-patching), RelidByRelfilenode ignores duplcates of differernt
> > persistence.  Also the PoC prioritize on persistent tables to
> > temporary ones but I'm not sure it's good decision, but otherwise we
> > need to change the signature of pg_filenode_relation.
> >
> > The attached is an PoC of that.  The first one is a reproduction-aid
> > code. With it applied, the following steps create duplicate
> > relfilenode relations.
> >
> > What do you think about this direction?
>
> Sounds good to me. If we go in this direction, probably we also need
> to change the cache checks in RelidByRelfilenode() so that we
> prioritize non-temporary relations. Otherwise, we will end up
> returning the OID of temporary relation if it's already cached.
>
> Another idea I came up with is that we have RelidByRelfilenode() check
> only non-temporary relations by adding relpersistence !=
> RELPERSISTENCE_TEMP to the scan key. If we want to support temporary
> relations too, I think that, in v16, we can add relpersistence to
> RelidByRelfilenode() as a function argument (and a flag to
> pg_filenode_relation() accordingly) so that the user can get the name
> of the temporary relation by like pg_filenode_relation(0, 16386,
> true). On the other hand, in back branches, if an application is using
> pg_filenode_relation() to get the name of temporary relations, it
> would break it.

I have changed the status to "Waiting on Author" as the points raised
by Masahiko Sawada-san is pending.

Regards,
Vignesh

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Kyotaro Horiguchi

Дата:

12 января 2024 г., 08:37:46

At Mon, 8 Jan 2024 10:32:03 +0530, vignesh C <vignesh21@gmail.com> wrote in 
> I have changed the status to "Waiting on Author" as the points raised
> by Masahiko Sawada-san is pending.

Oops! This slipped off from my mind. Thank you for the reminder.

> On Tue, 19 Apr 2022 at 11:54, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > What do you think about this direction?
> >
> > Sounds good to me. If we go in this direction, probably we also need
> > to change the cache checks in RelidByRelfilenode() so that we
> > prioritize non-temporary relations. Otherwise, we will end up
> > returning the OID of temporary relation if it's already cached.

Maybe. It might be better for the cache not to register temprary
relations at all.

> > Another idea I came up with is that we have RelidByRelfilenode() check
> > only non-temporary relations by adding relpersistence !=
> > RELPERSISTENCE_TEMP to the scan key.

I'm not sure it is good considering the case pg_relationumber_relation
for a temporary relation.

> > If we want to support temporary
> > relations too, I think that, in v16, we can add relpersistence to
> > RelidByRelfilenode() as a function argument (and a flag to
> > pg_filenode_relation() accordingly) so that the user can get the name
> > of the temporary relation by like pg_filenode_relation(0, 16386,
> > true). On the other hand, in back branches, if an application is using
> > pg_filenode_relation() to get the name of temporary relations, it
> > would break it.

Agreed. So, the attachd files are the patches for the master and older
versions respectively.

If there is a concern with the patch for the master, the parameter
handling of pg_filenode_relation might be not good. If the feature of
searching for relations regardless of their persistence is
unnecessary, it could be simplified. (Anyway it lacks documentation at
all).

Also, while the patch for version 11 is attached, I am unable to build
this version on my system (although I haven't investigated this
deeply, but I'm on Rocky 9 now).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Robert Haas

Дата:

15 января 2024 г., 21:32:07

On Fri, Jan 12, 2024 at 3:38 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Maybe. It might be better for the cache not to register temprary
> relations at all.

This point seems worthy of serious consideration to me. Is there any
reason why we need RelidByRelfilenumber() to work with temporary
relations at all? I understand that the current behavior is exposed
via the SQL-callable function, but maybe that's not really
intentional. If there's no other use of RelidByRelfilenumber() that
needs to care about permanent relations intrinsically, I think we
shouldn't hesitate to just cut them out of the mechanism entirely.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Kyotaro Horiguchi

Дата:

16 января 2024 г., 01:58:04

At Mon, 15 Jan 2024 16:32:07 -0500, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Fri, Jan 12, 2024 at 3:38 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > Maybe. It might be better for the cache not to register temprary
> > relations at all.
> 
> This point seems worthy of serious consideration to me. Is there any
> reason why we need RelidByRelfilenumber() to work with temporary
> relations at all? I understand that the current behavior is exposed
> via the SQL-callable function, but maybe that's not really
> intentional. If there's no other use of RelidByRelfilenumber() that
> needs to care about permanent relations intrinsically, I think we
> shouldn't hesitate to just cut them out of the mechanism entirely.

Do you mean that the current behavior of the SQL-callable function is
being treated as a bug and should it be corrected?

Simply doing so will result in the functions pg_relation_filenode()
and pg_filenode_relation() behaving asymmetrically. While there is no
need to purposely change the behavior of the former, it is necessary
to document the behavior of the latter regardless.

The attached patch does the above for the master head. If we apply
this approach to older versions, I can adapt and create similar
patches for them.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Вложения

v2_RelidByRelfilenumber_prefer_perm_rels_master.patch

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Robert Haas

Дата:

02 февраля 2024 г., 15:49:17

On Mon, Jan 15, 2024 at 8:58 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Do you mean that the current behavior of the SQL-callable function is
> being treated as a bug and should it be corrected?
>
> Simply doing so will result in the functions pg_relation_filenode()
> and pg_filenode_relation() behaving asymmetrically. While there is no
> need to purposely change the behavior of the former, it is necessary
> to document the behavior of the latter regardless.
>
> The attached patch does the above for the master head. If we apply
> this approach to older versions, I can adapt and create similar
> patches for them.

Yes, this patch shows the approach I was asking about.

Andres, what do you think about this idea? I wonder if you just
momentarily forgot about temporary relations when coding
RelidByRelfilenumber -- because for that function to give well-defined
answers with temporary relations included, it would need the backend
ID as an additional argument.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Michael Paquier

Дата:

18 марта 2024 г., 06:35:38

On Fri, Feb 02, 2024 at 10:49:17AM -0500, Robert Haas wrote:
> Andres, what do you think about this idea? I wonder if you just
> momentarily forgot about temporary relations when coding
> RelidByRelfilenumber -- because for that function to give well-defined
> answers with temporary relations included, it would need the backend
> ID as an additional argument.

No idea what Andres thinks, but seeing that pg_filenode_relation()
uses in input a tablespace OID and a filenode OID while ignoring the
prefix that would be used for a temp relation path (with a 't' and the
backend number), it is clear that the current function is not suited
to make the difference between temporary and persistent relations as
we'd need to have a priority order to choose one over the other.  And
that may not lead to the correct choice.

Ignoring temporary relations entirely makes sense: one cannot get a
regclass from only a tablespace and a relfilenode, the persistence, as
well as a backend ID would also be required.  I've not checked the
patch in details, but it's to say that the idea to cut temporary
relations sounds rather right here.
--
Michael

Вложения

signature.asc

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Masahiko Sawada

Дата:

24 июля 2024 г., 21:51:20

On Sun, Mar 17, 2024 at 11:35 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Feb 02, 2024 at 10:49:17AM -0500, Robert Haas wrote:
> > Andres, what do you think about this idea? I wonder if you just
> > momentarily forgot about temporary relations when coding
> > RelidByRelfilenumber -- because for that function to give well-defined
> > answers with temporary relations included, it would need the backend
> > ID as an additional argument.
>
>
> Ignoring temporary relations entirely makes sense: one cannot get a
> regclass from only a tablespace and a relfilenode, the persistence, as
> well as a backend ID would also be required.  I've not checked the
> patch in details, but it's to say that the idea to cut temporary
> relations sounds rather right here.

That makes sense to me too.

Regarding the patch, filtering by the relpersistence in
systable_getnext() loop seems to be good to me. Alternatively we can
add "relpersistence == RELPERSISTENCE_TEMP" to the scan key. The patch
would need regression tests too.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: "unexpected duplicate for tablespace" problem in logical replication

От

vignesh C

Дата:

25 марта 2025 г., 13:01:01

On Thu, 25 Jul 2024 at 03:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sun, Mar 17, 2024 at 11:35 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Feb 02, 2024 at 10:49:17AM -0500, Robert Haas wrote:
> > > Andres, what do you think about this idea? I wonder if you just
> > > momentarily forgot about temporary relations when coding
> > > RelidByRelfilenumber -- because for that function to give well-defined
> > > answers with temporary relations included, it would need the backend
> > > ID as an additional argument.
> >
> >
> > Ignoring temporary relations entirely makes sense: one cannot get a
> > regclass from only a tablespace and a relfilenode, the persistence, as
> > well as a backend ID would also be required.  I've not checked the
> > patch in details, but it's to say that the idea to cut temporary
> > relations sounds rather right here.
>
> That makes sense to me too.
>
> Regarding the patch, filtering by the relpersistence in
> systable_getnext() loop seems to be good to me. Alternatively we can
> add "relpersistence == RELPERSISTENCE_TEMP" to the scan key. The patch
> would need regression tests too.

The attached patch adds a new test and resolves an existing test
failure. However, a downside is that we can no longer verify the
mapping of the temporary tables.

Regards,
Vignesh

Вложения

v3_RelidByRelfilenumber_prefer_perm_rels_master.patch

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Ashutosh Bapat

Дата:

19 июня 2025 г., 11:47:40

On Tue, Jun 17, 2025 at 6:29 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, 25 Jul 2024 at 03:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Sun, Mar 17, 2024 at 11:35 PM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Fri, Feb 02, 2024 at 10:49:17AM -0500, Robert Haas wrote:
> > > > Andres, what do you think about this idea? I wonder if you just
> > > > momentarily forgot about temporary relations when coding
> > > > RelidByRelfilenumber -- because for that function to give well-defined
> > > > answers with temporary relations included, it would need the backend
> > > > ID as an additional argument.
> > >
> > >
> > > Ignoring temporary relations entirely makes sense: one cannot get a
> > > regclass from only a tablespace and a relfilenode, the persistence, as
> > > well as a backend ID would also be required.  I've not checked the
> > > patch in details, but it's to say that the idea to cut temporary
> > > relations sounds rather right here.
> >
> > That makes sense to me too.
> >
> > Regarding the patch, filtering by the relpersistence in
> > systable_getnext() loop seems to be good to me. Alternatively we can
> > add "relpersistence == RELPERSISTENCE_TEMP" to the scan key.

Do you mean relpersistence != RELPERSISTENCE_TEMP? I tried that. It's
not possible to add this check to the scankey since indexes do not
work with != conditions. We will need to use a sequential scan to do
so, but that will affect the performance. I think what Vignesh has
done in his patch is good enough.

>
> The attached patch adds a new test and resolves an existing test
> failure. However, a downside is that we can no longer verify the
> mapping of the temporary tables.

Yes, but I think the current way of verification wasn't correct anyway
because it couldn't match the temporary table's relation exactly. We
will need to devise another way to do that, maybe creating a version
of pg_filenode_relation() for temporary tables.

Some more comments, some of which I have applied in the attached
patches. Please review those. Let me know what you think.

I feel that we should mention permanent tables in the prologue of
pg_filenode_relation() for somebody who just looks at
pg_filenode_relation(). Also in its pg_proc.dat description for one
who looks at \df+ output. Attached patch does that.

- * Returns InvalidOid if no relation matching the criteria could be found.
+ * Returns InvalidOid if no permanent relation matching the criteria could be
+ * found.

The relpersistence enum has values
#define   RELPERSISTENCE_PERMANENT 'p' /* regular table */
#define   RELPERSISTENCE_UNLOGGED 'u' /* unlogged permanent table */
#define   RELPERSISTENCE_TEMP 't' /* temporary table */

The description of UNLOGGED mentions "permanent" so it looks like your
use of "permanent" covers both regular and unlogged tables. However
looking purely at the RELPERSISTENCE_ names, it' s possible that one
will associate the word "permanent" with RELPERSISTENCE_PERMANENT
only, and not with RELPERSISTENCE_UNLOGGED. Should we use
"non-temporary" instead to avoid confusion?

+ /*
+ * Temporary relations should be excluded. This exclusion is
+ * necessary because they can lead to the wrong result; the
+ * relfilenumber space is shared with persistent
+ * relations. Additionally, excluding them is possible since they
+ * are not the focus in this context.
+ */
+ if (classform->relpersistence == RELPERSISTENCE_TEMP)
+ continue;
+

I slightly rephrased the comment and moved it to the function prologue
since it defines the function's scope.

WHERE relkind IN ('r', 'i', 'S', 't', 'm') AND mapped_oid IS DISTINCT FROM oid;
+WHERE relkind IN ('r', 'i', 'S', 'm') AND mapped_oid IS DISTINCT FROM oid;

Why do we need to remove permanent toast tables from the query?
Instead adjustment below

 SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
-WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
+WHERE (c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL) AND
c.relpersistence != 't';

 seems enough. Actually, we shouldn't pass temporary tables to
pg_filenode_relation(), since it doesn't map temporary relations now.
Adjusted that way in the attached patch.

While reviewing the patch, I found something else. The
RelidByRelfilenumber() enters negative cache entries.
RelfilenumberMapInvalidateCallback() treats the negative entries
specifically which indicates that it's intentional. But if somebody
calls pg_filenode_relation() repeatedly with invalid relfilenodes,
that would bloat the cache with "kinda useless entries". It's a way to
make a backend consume a lot of memory quickly and thus perform a DOS.
For example, on my laptop, I could make a backend consume almost 3GB
of memory in 7 minutes.

#SELECT pg_size_pretty(total_bytes) total_bytes,
pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
where name = 'RelfilenumberMap cache';
 total_bytes | used_bytes
-------------+------------
(0 rows)

#\timing
Timing is on.
#select r is null, count(*) from (select pg_filenode_relation(i, j) r
from generate_series(1, 1000) i, generate_series(1, 1000) j) q group
by r is null;
 ?column? |  count
----------+---------
 t        | 1000000
(1 row)

Time: 4705.483 ms (00:04.705)
#SELECT pg_size_pretty(total_bytes) total_bytes,
pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
where name = 'RelfilenumberMap cache';
 total_bytes | used_bytes
-------------+------------
 40 MB       | 39 MB
(1 row)

Time: 2.351 ms
#select r is null, count(*) from (select pg_filenode_relation(i, j) r
from generate_series(1001, 10000) i, generate_series(1001, 10000) j) q
group by r is null;
 ?column? |  count
----------+----------
 f        |   153215
 t        | 80846785
(2 rows)

Time: 421998.039 ms (07:01.998)
#SELECT pg_size_pretty(total_bytes) total_bytes,
pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
where name = 'RelfilenumberMap cache';
 total_bytes | used_bytes
-------------+------------
 3180 MB     | 3176 MB
(1 row)

Time: 132.187 ms

Logical replication and autoprewarm may not cause such a large bloat
but there is no limit to passing invalid combinations of reltablespace
and relfilenumber to pg_filenode_relation(). Do we want to prohibit
that case by passing a flag from logical pg_filenode_relation() to not
cache invalid entries?

I have moved the CF entry to the July commitfest.

--
Best Wishes,
Ashutosh Bapat

Вложения

0001-Ignore-temporary-relations-in-RelidByRelfil-20250619.patch

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Ashutosh Bapat

Дата:

13 августа 2025 г., 12:48:31

On Thu, Jun 19, 2025 at 2:17 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Jun 17, 2025 at 6:29 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Thu, 25 Jul 2024 at 03:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Sun, Mar 17, 2024 at 11:35 PM Michael Paquier <michael@paquier.xyz> wrote:
> > > >
> > > > On Fri, Feb 02, 2024 at 10:49:17AM -0500, Robert Haas wrote:
> > > > > Andres, what do you think about this idea? I wonder if you just
> > > > > momentarily forgot about temporary relations when coding
> > > > > RelidByRelfilenumber -- because for that function to give well-defined
> > > > > answers with temporary relations included, it would need the backend
> > > > > ID as an additional argument.
> > > >
> > > >
> > > > Ignoring temporary relations entirely makes sense: one cannot get a
> > > > regclass from only a tablespace and a relfilenode, the persistence, as
> > > > well as a backend ID would also be required.  I've not checked the
> > > > patch in details, but it's to say that the idea to cut temporary
> > > > relations sounds rather right here.
> > >
> > > That makes sense to me too.
> > >
> > > Regarding the patch, filtering by the relpersistence in
> > > systable_getnext() loop seems to be good to me. Alternatively we can
> > > add "relpersistence == RELPERSISTENCE_TEMP" to the scan key.
>
> Do you mean relpersistence != RELPERSISTENCE_TEMP? I tried that. It's
> not possible to add this check to the scankey since indexes do not
> work with != conditions. We will need to use a sequential scan to do
> so, but that will affect the performance. I think what Vignesh has
> done in his patch is good enough.
>
> >
> > The attached patch adds a new test and resolves an existing test
> > failure. However, a downside is that we can no longer verify the
> > mapping of the temporary tables.
>
> Yes, but I think the current way of verification wasn't correct anyway
> because it couldn't match the temporary table's relation exactly. We
> will need to devise another way to do that, maybe creating a version
> of pg_filenode_relation() for temporary tables.
>
> Some more comments, some of which I have applied in the attached
> patches. Please review those. Let me know what you think.
>
> I feel that we should mention permanent tables in the prologue of
> pg_filenode_relation() for somebody who just looks at
> pg_filenode_relation(). Also in its pg_proc.dat description for one
> who looks at \df+ output. Attached patch does that.
>
> - * Returns InvalidOid if no relation matching the criteria could be found.
> + * Returns InvalidOid if no permanent relation matching the criteria could be
> + * found.
>
> The relpersistence enum has values
> #define   RELPERSISTENCE_PERMANENT 'p' /* regular table */
> #define   RELPERSISTENCE_UNLOGGED 'u' /* unlogged permanent table */
> #define   RELPERSISTENCE_TEMP 't' /* temporary table */
>
> The description of UNLOGGED mentions "permanent" so it looks like your
> use of "permanent" covers both regular and unlogged tables. However
> looking purely at the RELPERSISTENCE_ names, it' s possible that one
> will associate the word "permanent" with RELPERSISTENCE_PERMANENT
> only, and not with RELPERSISTENCE_UNLOGGED. Should we use
> "non-temporary" instead to avoid confusion?
>
> + /*
> + * Temporary relations should be excluded. This exclusion is
> + * necessary because they can lead to the wrong result; the
> + * relfilenumber space is shared with persistent
> + * relations. Additionally, excluding them is possible since they
> + * are not the focus in this context.
> + */
> + if (classform->relpersistence == RELPERSISTENCE_TEMP)
> + continue;
> +
>
> I slightly rephrased the comment and moved it to the function prologue
> since it defines the function's scope.
>
> WHERE relkind IN ('r', 'i', 'S', 't', 'm') AND mapped_oid IS DISTINCT FROM oid;
> +WHERE relkind IN ('r', 'i', 'S', 'm') AND mapped_oid IS DISTINCT FROM oid;
>
> Why do we need to remove permanent toast tables from the query?
> Instead adjustment below
>
>  SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
> -WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
> +WHERE (c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL) AND
> c.relpersistence != 't';
>
>  seems enough. Actually, we shouldn't pass temporary tables to
> pg_filenode_relation(), since it doesn't map temporary relations now.
> Adjusted that way in the attached patch.
>
> While reviewing the patch, I found something else. The
> RelidByRelfilenumber() enters negative cache entries.
> RelfilenumberMapInvalidateCallback() treats the negative entries
> specifically which indicates that it's intentional. But if somebody
> calls pg_filenode_relation() repeatedly with invalid relfilenodes,
> that would bloat the cache with "kinda useless entries". It's a way to
> make a backend consume a lot of memory quickly and thus perform a DOS.
> For example, on my laptop, I could make a backend consume almost 3GB
> of memory in 7 minutes.
>
> #SELECT pg_size_pretty(total_bytes) total_bytes,
> pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
> where name = 'RelfilenumberMap cache';
>  total_bytes | used_bytes
> -------------+------------
> (0 rows)
>
> #\timing
> Timing is on.
> #select r is null, count(*) from (select pg_filenode_relation(i, j) r
> from generate_series(1, 1000) i, generate_series(1, 1000) j) q group
> by r is null;
>  ?column? |  count
> ----------+---------
>  t        | 1000000
> (1 row)
>
> Time: 4705.483 ms (00:04.705)
> #SELECT pg_size_pretty(total_bytes) total_bytes,
> pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
> where name = 'RelfilenumberMap cache';
>  total_bytes | used_bytes
> -------------+------------
>  40 MB       | 39 MB
> (1 row)
>
> Time: 2.351 ms
> #select r is null, count(*) from (select pg_filenode_relation(i, j) r
> from generate_series(1001, 10000) i, generate_series(1001, 10000) j) q
> group by r is null;
>  ?column? |  count
> ----------+----------
>  f        |   153215
>  t        | 80846785
> (2 rows)
>
> Time: 421998.039 ms (07:01.998)
> #SELECT pg_size_pretty(total_bytes) total_bytes,
> pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
> where name = 'RelfilenumberMap cache';
>  total_bytes | used_bytes
> -------------+------------
>  3180 MB     | 3176 MB
> (1 row)
>
> Time: 132.187 ms
>
> Logical replication and autoprewarm may not cause such a large bloat
> but there is no limit to passing invalid combinations of reltablespace
> and relfilenumber to pg_filenode_relation(). Do we want to prohibit
> that case by passing a flag from logical pg_filenode_relation() to not
> cache invalid entries?
>
> I have moved the CF entry to the July commitfest.

The patch needed a rebase because of func.sgml refactoring. Here's a
rebased patchset.

--
Best Wishes,
Ashutosh Bapat

Вложения

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Ashutosh Bapat

Дата:

13 августа 2025 г., 12:59:50

On Wed, Aug 13, 2025 at 3:18 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Thu, Jun 19, 2025 at 2:17 PM Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> >
> > On Tue, Jun 17, 2025 at 6:29 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Thu, 25 Jul 2024 at 03:21, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Sun, Mar 17, 2024 at 11:35 PM Michael Paquier <michael@paquier.xyz> wrote:
> > > > >
> > > > > On Fri, Feb 02, 2024 at 10:49:17AM -0500, Robert Haas wrote:
> > > > > > Andres, what do you think about this idea? I wonder if you just
> > > > > > momentarily forgot about temporary relations when coding
> > > > > > RelidByRelfilenumber -- because for that function to give well-defined
> > > > > > answers with temporary relations included, it would need the backend
> > > > > > ID as an additional argument.
> > > > >
> > > > >
> > > > > Ignoring temporary relations entirely makes sense: one cannot get a
> > > > > regclass from only a tablespace and a relfilenode, the persistence, as
> > > > > well as a backend ID would also be required.  I've not checked the
> > > > > patch in details, but it's to say that the idea to cut temporary
> > > > > relations sounds rather right here.
> > > >
> > > > That makes sense to me too.
> > > >
> > > > Regarding the patch, filtering by the relpersistence in
> > > > systable_getnext() loop seems to be good to me. Alternatively we can
> > > > add "relpersistence == RELPERSISTENCE_TEMP" to the scan key.
> >
> > Do you mean relpersistence != RELPERSISTENCE_TEMP? I tried that. It's
> > not possible to add this check to the scankey since indexes do not
> > work with != conditions. We will need to use a sequential scan to do
> > so, but that will affect the performance. I think what Vignesh has
> > done in his patch is good enough.
> >
> > >
> > > The attached patch adds a new test and resolves an existing test
> > > failure. However, a downside is that we can no longer verify the
> > > mapping of the temporary tables.
> >
> > Yes, but I think the current way of verification wasn't correct anyway
> > because it couldn't match the temporary table's relation exactly. We
> > will need to devise another way to do that, maybe creating a version
> > of pg_filenode_relation() for temporary tables.
> >
> > Some more comments, some of which I have applied in the attached
> > patches. Please review those. Let me know what you think.
> >
> > I feel that we should mention permanent tables in the prologue of
> > pg_filenode_relation() for somebody who just looks at
> > pg_filenode_relation(). Also in its pg_proc.dat description for one
> > who looks at \df+ output. Attached patch does that.
> >
> > - * Returns InvalidOid if no relation matching the criteria could be found.
> > + * Returns InvalidOid if no permanent relation matching the criteria could be
> > + * found.
> >
> > The relpersistence enum has values
> > #define   RELPERSISTENCE_PERMANENT 'p' /* regular table */
> > #define   RELPERSISTENCE_UNLOGGED 'u' /* unlogged permanent table */
> > #define   RELPERSISTENCE_TEMP 't' /* temporary table */
> >
> > The description of UNLOGGED mentions "permanent" so it looks like your
> > use of "permanent" covers both regular and unlogged tables. However
> > looking purely at the RELPERSISTENCE_ names, it' s possible that one
> > will associate the word "permanent" with RELPERSISTENCE_PERMANENT
> > only, and not with RELPERSISTENCE_UNLOGGED. Should we use
> > "non-temporary" instead to avoid confusion?
> >
> > + /*
> > + * Temporary relations should be excluded. This exclusion is
> > + * necessary because they can lead to the wrong result; the
> > + * relfilenumber space is shared with persistent
> > + * relations. Additionally, excluding them is possible since they
> > + * are not the focus in this context.
> > + */
> > + if (classform->relpersistence == RELPERSISTENCE_TEMP)
> > + continue;
> > +
> >
> > I slightly rephrased the comment and moved it to the function prologue
> > since it defines the function's scope.
> >
> > WHERE relkind IN ('r', 'i', 'S', 't', 'm') AND mapped_oid IS DISTINCT FROM oid;
> > +WHERE relkind IN ('r', 'i', 'S', 'm') AND mapped_oid IS DISTINCT FROM oid;
> >
> > Why do we need to remove permanent toast tables from the query?
> > Instead adjustment below
> >
> >  SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
> > -WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
> > +WHERE (c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL) AND
> > c.relpersistence != 't';
> >
> >  seems enough. Actually, we shouldn't pass temporary tables to
> > pg_filenode_relation(), since it doesn't map temporary relations now.
> > Adjusted that way in the attached patch.
> >
> > While reviewing the patch, I found something else. The
> > RelidByRelfilenumber() enters negative cache entries.
> > RelfilenumberMapInvalidateCallback() treats the negative entries
> > specifically which indicates that it's intentional. But if somebody
> > calls pg_filenode_relation() repeatedly with invalid relfilenodes,
> > that would bloat the cache with "kinda useless entries". It's a way to
> > make a backend consume a lot of memory quickly and thus perform a DOS.
> > For example, on my laptop, I could make a backend consume almost 3GB
> > of memory in 7 minutes.
> >
> > #SELECT pg_size_pretty(total_bytes) total_bytes,
> > pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
> > where name = 'RelfilenumberMap cache';
> >  total_bytes | used_bytes
> > -------------+------------
> > (0 rows)
> >
> > #\timing
> > Timing is on.
> > #select r is null, count(*) from (select pg_filenode_relation(i, j) r
> > from generate_series(1, 1000) i, generate_series(1, 1000) j) q group
> > by r is null;
> >  ?column? |  count
> > ----------+---------
> >  t        | 1000000
> > (1 row)
> >
> > Time: 4705.483 ms (00:04.705)
> > #SELECT pg_size_pretty(total_bytes) total_bytes,
> > pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
> > where name = 'RelfilenumberMap cache';
> >  total_bytes | used_bytes
> > -------------+------------
> >  40 MB       | 39 MB
> > (1 row)
> >
> > Time: 2.351 ms
> > #select r is null, count(*) from (select pg_filenode_relation(i, j) r
> > from generate_series(1001, 10000) i, generate_series(1001, 10000) j) q
> > group by r is null;
> >  ?column? |  count
> > ----------+----------
> >  f        |   153215
> >  t        | 80846785
> > (2 rows)
> >
> > Time: 421998.039 ms (07:01.998)
> > #SELECT pg_size_pretty(total_bytes) total_bytes,
> > pg_size_pretty(used_bytes) used_bytes FROM pg_backend_memory_contexts
> > where name = 'RelfilenumberMap cache';
> >  total_bytes | used_bytes
> > -------------+------------
> >  3180 MB     | 3176 MB
> > (1 row)
> >
> > Time: 132.187 ms
> >
> > Logical replication and autoprewarm may not cause such a large bloat
> > but there is no limit to passing invalid combinations of reltablespace
> > and relfilenumber to pg_filenode_relation(). Do we want to prohibit
> > that case by passing a flag from logical pg_filenode_relation() to not
> > cache invalid entries?
> >
> > I have moved the CF entry to the July commitfest.
>
> The patch needed a rebase because of func.sgml refactoring. Here's a
> rebased patchset.

Please ignore the 0002 patch in the earlier patchset. It's an
experimental change related to negative entries bloat in
RelfilenumberMap cache.

--
Best Wishes,
Ashutosh Bapat

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Michael Paquier

Дата:

21 августа 2025 г., 09:12:18

On Wed, Aug 13, 2025 at 03:29:50PM +0530, Ashutosh Bapat wrote:

Sorry for the delay.  This has been on my TODO list for some time.
Back to it now.

> On Wed, Aug 13, 2025 at 3:18 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>> On Thu, Jun 19, 2025 at 2:17 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>>> The description of UNLOGGED mentions "permanent" so it looks like your
>>> use of "permanent" covers both regular and unlogged tables. However
>>> looking purely at the RELPERSISTENCE_ names, it' s possible that one
>>> will associate the word "permanent" with RELPERSISTENCE_PERMANENT
>>> only, and not with RELPERSISTENCE_UNLOGGED. Should we use
>>> "non-temporary" instead to avoid confusion?

FWIW, I see a pretty good point in backpatching what we have here, as
active users of logical replication can be impacted by this issue
randomly.  Even if this requires a wraparound, for some users it may
be a window that can open in a couple of days, mostly for workloads
with a high temp table exhaustion, I guess.  In short, let's not touch
pg_proc.dat, and I'm planning for a backpatch all the way down due to
the possible random intrusions with the logirep paths.

The changes at the top of RelidByRelfilenumber() describing the
problem with temporary tables feels a bit bloated.  I see no need to
mention directly autoprewarm and logical replication, two of the three
callers of this routine, with the explanation coming down to the fact
that this is only a temp relation problem because we cannot pinpoint
to an OID without knowing a proc number.  So let's cut a few lines
here at the top of RelidByRelfilenumber(), let's add a simpler line in
func-admin.sgml to say directly and only that "temporary relations are
not supported", passing on the "permanent" vs "no permanent" business.
I don't see a direct need to mention that at the top of
pg_relation_filenode(), as long as RelidByRelfilenumber(), but OK by
me to add a simpler "temporary relations are not supported", it looks
like most prefer than based on the contents of the thread.

>>> Logical replication and autoprewarm may not cause such a large bloat
>>> but there is no limit to passing invalid combinations of reltablespace
>>> and relfilenumber to pg_filenode_relation(). Do we want to prohibit
>>> that case by passing a flag from logical pg_filenode_relation() to not
>>> cache invalid entries?

Fun.  Yes, I agree that we could do better here.  It does not strike
me as an issue as invasive as the original report, direct calls of
pg_filenode_relation() are much rarer than the code paths of
autoprewarm and logical replication touched by RelidByRelfilenumber().
That's a potential HEAD improvement to me.

> Please ignore the 0002 patch in the earlier patchset. It's an
> experimental change related to negative entries bloat in
> RelfilenumberMap cache.

And I was wondering what was going on in the latest patch set..  I'm
ignoring this part.
--
Michael

Вложения

signature.asc

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Ashutosh Bapat

Дата:

21 августа 2025 г., 17:24:55

On Thu, Aug 21, 2025 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Aug 13, 2025 at 03:29:50PM +0530, Ashutosh Bapat wrote:
>
> Sorry for the delay.  This has been on my TODO list for some time.
> Back to it now.
>
> > On Wed, Aug 13, 2025 at 3:18 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> >> On Thu, Jun 19, 2025 at 2:17 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> >>> The description of UNLOGGED mentions "permanent" so it looks like your
> >>> use of "permanent" covers both regular and unlogged tables. However
> >>> looking purely at the RELPERSISTENCE_ names, it' s possible that one
> >>> will associate the word "permanent" with RELPERSISTENCE_PERMANENT
> >>> only, and not with RELPERSISTENCE_UNLOGGED. Should we use
> >>> "non-temporary" instead to avoid confusion?
>
> FWIW, I see a pretty good point in backpatching what we have here, as
> active users of logical replication can be impacted by this issue
> randomly.  Even if this requires a wraparound, for some users it may
> be a window that can open in a couple of days, mostly for workloads
> with a high temp table exhaustion, I guess.  In short, let's not touch
> pg_proc.dat, and I'm planning for a backpatch all the way down due to
> the possible random intrusions with the logirep paths.

The only change in pg_proc.dat is change in the function description.
We can revert it for the benefit of backpatching. Its usefulness is
arguable. +1 for not touching pg_proc.dat.

>
> The changes at the top of RelidByRelfilenumber() describing the
> problem with temporary tables feels a bit bloated.  I see no need to
> mention directly autoprewarm and logical replication, two of the three
> callers of this routine, with the explanation coming down to the fact
> that this is only a temp relation problem because we cannot pinpoint
> to an OID without knowing a proc number.

How about the following:

---
Map a relation's (tablespace, relfilenumber) to a relation's oid and
cache the result.

A temporary relation may have the same relfilenumber as a permanent
relation, or as other temporary relations. In order to uniquely
identify a temporary relation we also need the
proc number of the backend which created it. The proc number is not
available to this function. Hence ignore temporary relations.

Returns InvalidOid if no permanent relation matching the criteria
could be found.
---

> So let's cut a few lines
> here at the top of RelidByRelfilenumber(), let's add a simpler line in
> func-admin.sgml to say directly and only that "temporary relations are
> not supported", passing on the "permanent" vs "no permanent" business.
> I don't see a direct need to mention that at the top of
> pg_relation_filenode(), as long as RelidByRelfilenumber(), but OK by
> me to add a simpler "temporary relations are not supported", it looks
> like most prefer than based on the contents of the thread.

If we just say that temporary relations are not supported, the user
may wonder as to what will happen if they pass tablespace and
relfilenode of a temporary relation. Instead it's better to clarify
that behaviour. How about adding following line to the current
description of the function in func-admin.sgml

"Also, returns <literal>NULL</literal> if a temporary relation in the
current database is associated with the given values.".

Said that if you insist on adding just "Temporary relations are not
supported", we can go ahead with it.

>
> >>> Logical replication and autoprewarm may not cause such a large bloat
> >>> but there is no limit to passing invalid combinations of reltablespace
> >>> and relfilenumber to pg_filenode_relation(). Do we want to prohibit
> >>> that case by passing a flag from logical pg_filenode_relation() to not
> >>> cache invalid entries?
>
> Fun.  Yes, I agree that we could do better here.  It does not strike
> me as an issue as invasive as the original report, direct calls of
> pg_filenode_relation() are much rarer than the code paths of
> autoprewarm and logical replication touched by RelidByRelfilenumber().
> That's a potential HEAD improvement to me.

Ok. I will make a patch to ignore negative entries when
RelidByRelfilenumber() is invoked from pg_filenode_relation(). I am ok
with head-only improvement. If we find reports from the field that
this behaviour creates problems in the previous releases, we will be
able to back patch it. I will try to come up with a back-patchable
patch.

Vignesh, since you are the original author of the patch, do you want
to take care of addressing the comments? Otherwise I can.

--
Best Wishes,
Ashutosh Bapat

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Michael Paquier

Дата:

22 августа 2025 г., 00:36:56

On Thu, Aug 21, 2025 at 07:54:55PM +0530, Ashutosh Bapat wrote:
> How about the following:
>
> ---
> Map a relation's (tablespace, relfilenumber) to a relation's oid and
> cache the result.
>
> A temporary relation may have the same relfilenumber as a permanent
> relation, or as other temporary relations. In order to uniquely
> identify a temporary relation we also need the
> proc number of the backend which created it. The proc number is not
> available to this function. Hence ignore temporary relations.
>
> Returns InvalidOid if no permanent relation matching the criteria
> could be found.
> ---

That's simpler.

> Said that if you insist on adding just "Temporary relations are not
> supported", we can go ahead with it.

That would be simpler as well, providing the information.

> Vignesh, since you are the original author of the patch, do you want
> to take care of addressing the comments? Otherwise I can.

It's fine.  I am going to take care of that.
--
Michael

Вложения

signature.asc

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Ashutosh Bapat

Дата:

18 сентября 2025 г., 15:07:10

On Thu, Aug 21, 2025 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Aug 13, 2025 at 03:29:50PM +0530, Ashutosh Bapat wrote:
>
> >>> Logical replication and autoprewarm may not cause such a large bloat
> >>> but there is no limit to passing invalid combinations of reltablespace
> >>> and relfilenumber to pg_filenode_relation(). Do we want to prohibit
> >>> that case by passing a flag from logical pg_filenode_relation() to not
> >>> cache invalid entries?
>
> Fun.  Yes, I agree that we could do better here.  It does not strike
> me as an issue as invasive as the original report, direct calls of
> pg_filenode_relation() are much rarer than the code paths of
> autoprewarm and logical replication touched by RelidByRelfilenumber().
> That's a potential HEAD improvement to me.

Sorry for taking time. Here's a patch to address this. Added to the commitfest.

--
Best Wishes,
Ashutosh Bapat

Вложения

0001-Negative-RelfilenumberMap-cache-entries-fro-20250918.patch

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Andres Freund

Дата:

18 сентября 2025 г., 15:17:49

Hi,

On 2025-09-18 17:37:10 +0530, Ashutosh Bapat wrote:
> From 6a3562b4ac8917c8b577797e5468416a90cc04f5 Mon Sep 17 00:00:00 2001
> From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> Date: Thu, 18 Sep 2025 17:24:09 +0530
> Subject: [PATCH] Negative RelfilenumberMap cache entries from
>  pg_filenode_relation()
> 
> RelidByRelfilenumber() adds negative entries to the cache. It has three
> users, logical replication, autoprewarm and pg_filenode_relation(). The
> first two need negative entries in the cache in case they happen to
> lookup non-existent mapping again and again.  However such mappings will
> be smaller in number and usually come from some database object e.g. WAL
> or autoprewarm metadata.
> 
> But pg_filenode_relation(), which is SQL callable, may be invoked many
> times with invalid tablespace and relfilenode pairs, causing the cache
> to be bloated with negative cache entries. This can be used as a denial
> of service attack since any user can execute it. This commit avoids such
> a bloat.

I don't really understand why this is worth fixing for the relfilenode stuff
specifically - isn't this true for just about *all* of our caches? Many, if
not most, can be reached via SQL?

Greetings,

Andres Freund

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Andres Freund

Дата:

18 сентября 2025 г., 15:22:58

Hi,

On 2025-09-18 08:17:49 -0400, Andres Freund wrote:
> On 2025-09-18 17:37:10 +0530, Ashutosh Bapat wrote:
> > From 6a3562b4ac8917c8b577797e5468416a90cc04f5 Mon Sep 17 00:00:00 2001
> > From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> > Date: Thu, 18 Sep 2025 17:24:09 +0530
> > Subject: [PATCH] Negative RelfilenumberMap cache entries from
> >  pg_filenode_relation()
> >
> > RelidByRelfilenumber() adds negative entries to the cache. It has three
> > users, logical replication, autoprewarm and pg_filenode_relation(). The
> > first two need negative entries in the cache in case they happen to
> > lookup non-existent mapping again and again.  However such mappings will
> > be smaller in number and usually come from some database object e.g. WAL
> > or autoprewarm metadata.
> >
> > But pg_filenode_relation(), which is SQL callable, may be invoked many
> > times with invalid tablespace and relfilenode pairs, causing the cache
> > to be bloated with negative cache entries. This can be used as a denial
> > of service attack since any user can execute it. This commit avoids such
> > a bloat.
>
> I don't really understand why this is worth fixing for the relfilenode stuff
> specifically - isn't this true for just about *all* of our caches? Many, if
> not most, can be reached via SQL?

Example:

postgres[315631][1]=# SELECT count(*), sum(total_bytes) total_bytes, sum(total_nblocks) total_nblocks, sum(free_bytes)
free_bytes,sum(free_chunks) free_chunks, sum(used_bytes) used_bytes FROM pg_backend_memory_contexts WHERE path @>
(SELECTpath FROM pg_backend_memory_contexts WHERE name = 'CacheMemoryContext');
 
┌───────┬─────────────┬───────────────┬────────────┬─────────────┬────────────┐
│ count │ total_bytes │ total_nblocks │ free_bytes │ free_chunks │ used_bytes │
├───────┼─────────────┼───────────────┼────────────┼─────────────┼────────────┤
│    89 │      747200 │           187 │     130336 │         216 │     616864 │
└───────┴─────────────┴───────────────┴────────────┴─────────────┴────────────┘
(1 row)

Time: 1.540 ms
postgres[315631][1]=# SELECT to_regclass(g.i::text||'.'||g.i::text) is NULL, count(*) FROM generate_series(1, 10000000)
g(i)GROUP BY 1;
 
┌──────────┬──────────┐
│ ?column? │  count   │
├──────────┼──────────┤
│ t        │ 10000000 │
└──────────┴──────────┘
(1 row)

Time: 23238.662 ms (00:23.239)
postgres[315631][1]=# SELECT count(*), sum(total_bytes) total_bytes, sum(total_nblocks) total_nblocks, sum(free_bytes)
free_bytes,sum(free_chunks) free_chunks, sum(used_bytes) used_bytes FROM pg_backend_memory_contexts WHERE path @>
(SELECTpath FROM pg_backend_memory_contexts WHERE name = 'CacheMemoryContext');
 
┌───────┬─────────────┬───────────────┬────────────┬─────────────┬────────────┐
│ count │ total_bytes │ total_nblocks │ free_bytes │ free_chunks │ used_bytes │
├───────┼─────────────┼───────────────┼────────────┼─────────────┼────────────┤
│    89 │  2223205104 │           441 │    8331448 │         472 │ 2214873656 │
└───────┴─────────────┴───────────────┴────────────┴─────────────┴────────────┘
(1 row)


There are so so so many ways to cause arbitrarily large memory usage,
e.g. just by definining a lot of prepared statements, that I just don't see it
worth adding any complexity to the relfilenode case specifically.

Greetings,

Andres Freund

Re: "unexpected duplicate for tablespace" problem in logical replication

От

Ashutosh Bapat

Дата:

18 сентября 2025 г., 18:30:34

On Thu, Sep 18, 2025 at 5:53 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-09-18 08:17:49 -0400, Andres Freund wrote:
> > On 2025-09-18 17:37:10 +0530, Ashutosh Bapat wrote:
> > > From 6a3562b4ac8917c8b577797e5468416a90cc04f5 Mon Sep 17 00:00:00 2001
> > > From: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
> > > Date: Thu, 18 Sep 2025 17:24:09 +0530
> > > Subject: [PATCH] Negative RelfilenumberMap cache entries from
> > >  pg_filenode_relation()
> > >
> > > RelidByRelfilenumber() adds negative entries to the cache. It has three
> > > users, logical replication, autoprewarm and pg_filenode_relation(). The
> > > first two need negative entries in the cache in case they happen to
> > > lookup non-existent mapping again and again.  However such mappings will
> > > be smaller in number and usually come from some database object e.g. WAL
> > > or autoprewarm metadata.
> > >
> > > But pg_filenode_relation(), which is SQL callable, may be invoked many
> > > times with invalid tablespace and relfilenode pairs, causing the cache
> > > to be bloated with negative cache entries. This can be used as a denial
> > > of service attack since any user can execute it. This commit avoids such
> > > a bloat.
> >
> > I don't really understand why this is worth fixing for the relfilenode stuff
> > specifically - isn't this true for just about *all* of our caches? Many, if
> > not most, can be reached via SQL?
>
> Example:
>
> postgres[315631][1]=# SELECT count(*), sum(total_bytes) total_bytes, sum(total_nblocks) total_nblocks,
sum(free_bytes)free_bytes, sum(free_chunks) free_chunks, sum(used_bytes) used_bytes FROM pg_backend_memory_contexts
WHEREpath @> (SELECT path FROM pg_backend_memory_contexts WHERE name = 'CacheMemoryContext');
 
> ┌───────┬─────────────┬───────────────┬────────────┬─────────────┬────────────┐
> │ count │ total_bytes │ total_nblocks │ free_bytes │ free_chunks │ used_bytes │
> ├───────┼─────────────┼───────────────┼────────────┼─────────────┼────────────┤
> │    89 │      747200 │           187 │     130336 │         216 │     616864 │
> └───────┴─────────────┴───────────────┴────────────┴─────────────┴────────────┘
> (1 row)
>
> Time: 1.540 ms
> postgres[315631][1]=# SELECT to_regclass(g.i::text||'.'||g.i::text) is NULL, count(*) FROM generate_series(1,
10000000)g(i) GROUP BY 1;
 
> ┌──────────┬──────────┐
> │ ?column? │  count   │
> ├──────────┼──────────┤
> │ t        │ 10000000 │
> └──────────┴──────────┘
> (1 row)
>

Interesting! I didn't know that cat cache could have negative entries
in it. But SearchCatCacheMiss says so explicitly
/*
* Tuple was not found in cache, so we have to try to retrieve it directly
* from the relation. If found, we will add it to the cache; if not
* found, we will add a negative cache entry instead.
*/

That settles it. Use of negative entries spread wider than I thought
and in places where they may not even be useful. Thanks for the
example. I don't see any reason to pursue this patch specifically.

-- 
Best Wishes,
Ashutosh Bapat

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: "unexpected duplicate for tablespace" problem in logical replication

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения