Discussion: pg_upgrade: transfer pg_largeobject_metadata's files when possible


pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
(new thread)

On Fri, Jul 18, 2025 at 11:05:04AM -0500, Nathan Bossart wrote:
> I'm cautiously optimistic that we can find some better gains for upgrades
> from v16 and newer.  That would involve dumping lo_create() commands for
> all LOs with comments/seclabels, dumping the relevant pg_shdepend rows, and
> then copying/linking the pg_largeobject_metadata files like we did prior to
> v12.

Here is a patch.  For background, the reason this is limited to upgrades
from v16 and newer is because the aclitem data type (needed by
pg_largeobject_metadata.lomacl) changed its storage format in v16 (see
commit 7b378237aa).  Note that the patch is essentially a revert of commit
12a53c732c, but there are enough differences that it should be considered a
fresh effort.

Something I hadn't anticipated is that we need to take special care to
transfer the relfilenode of pg_largeobject_metadata and its index, as was
done for pg_largeobject in commits d498e052b4 and bbe08b8869.  In fact, the
majority of the patch is dedicated to that.

My testing showed some decent, but not earth-shattering performance
improvements from this patch.  For upgrades with many large objects with
NULL lomacl/lomowner columns, pg_upgrade was 50% faster.  With non-NULL
lomacl/lomowner, that dropped to 25%.  When each large object had a
comment, there was no change.  I'm assuming that it's rare to have lots of
large objects with comments or security labels, so I don't see any need to
expend energy trying to optimize that use-case.
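For reference, a workload like the one measured here can be generated with something along these lines (a sketch only; the object counts are arbitrary and are not necessarily the ones used for the numbers above):

```sql
-- NULL lomowner/lomacl case: bare large objects.
SELECT lo_create(0) FROM generate_series(1, 100000);

-- Non-NULL lomacl case: a GRANT populates pg_largeobject_metadata.lomacl.
DO $$
DECLARE lo oid;
BEGIN
    FOR i IN 1..100000 LOOP
        lo := lo_create(0);
        EXECUTE format('GRANT SELECT ON LARGE OBJECT %s TO PUBLIC', lo);
    END LOOP;
END $$;

-- Comment case: a pg_description row forces pg_dump to dump the LO itself.
DO $$
DECLARE lo oid;
BEGIN
    FOR i IN 1..100000 LOOP
        lo := lo_create(0);
        EXECUTE format('COMMENT ON LARGE OBJECT %s IS ''x''', lo);
    END LOOP;
END $$;
```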

I am a bit concerned that we'll forget to add checks for new types of
dependencies similar to comments and security labels.  If we do, pg_upgrade
should just fail to restore the schema, and fixing the code should be easy
enough.  Also, we'll need to remember to revisit this code if there's
another storage format change for one of pg_largeobject_metadata's columns,
but that seems unlikely to happen anytime soon.  On the whole, I'm not too
worried about either of these points.

-- 
nathan

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Hannu Krosing
Date:
Have you considered re-creating pg_shdepend from
pg_largeobject_metadata directly instead of having special cases for
dumping it ?

It would also be useful for old databases (pg_upgraded since before PG
12) which might be missing those rows anyway.
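Very roughly, the idea might look like this (a hypothetical sketch only: it covers just the owner dependencies, ignores ACL references in lomacl, and a direct INSERT into a shared catalog would additionally require allow_system_table_mods; real code would go through the C-level dependency machinery instead):

```sql
-- Rebuild the owner rows of pg_shdepend for large objects from
-- pg_largeobject_metadata.  deptype 'o' is SHARED_DEPENDENCY_OWNER;
-- objects owned by the bootstrap superuser (OID 10) get no entry
-- because pinned objects are never recorded in pg_shdepend.
INSERT INTO pg_shdepend
    (dbid, classid, objid, objsubid, refclassid, refobjid, deptype)
SELECT (SELECT oid FROM pg_database WHERE datname = current_database()),
       'pg_largeobject'::regclass, l.oid, 0,
       'pg_authid'::regclass, l.lomowner, 'o'
FROM pg_largeobject_metadata AS l
WHERE l.lomowner <> 10;
```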


On Thu, Aug 14, 2025 at 5:22 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> (new thread)
>
> On Fri, Jul 18, 2025 at 11:05:04AM -0500, Nathan Bossart wrote:
> > I'm cautiously optimistic that we can find some better gains for upgrades
> > from v16 and newer.  That would involve dumping lo_create() commands for
> > all LOs with comments/seclabels, dumping the relevant pg_shdepend rows, and
> > then copying/linking the pg_largeobject_metadata files like we did prior to
> > v12.
>
> Here is a patch.  For background, the reason this is limited to upgrades
> from v16 and newer is because the aclitem data type (needed by
> pg_largeobject_metadata.lomacl) changed its storage format in v16 (see
> commit 7b378237aa).  Note that the patch is essentially a revert of commit
> 12a53c732c, but there are enough differences that it should be considered a
> fresh effort.
>
> Something I hadn't anticipated is that we need to take special care to
> transfer the relfilenode of pg_largeobject_metadata and its index, as was
> done for pg_largeobject in commits d498e052b4 and bbe08b8869.  In fact, the
> majority of the patch is dedicated to that.
>
> My testing showed some decent, but not earth-shattering performance
> improvements from this patch.  For upgrades with many large objects with
> NULL lomacl/lomowner columns, pg_upgrade was 50% faster.  With non-NULL
> lomacl/lomowner, that dropped to 25%.  When each large object had a
> comment, there was no change.  I'm assuming that it's rare to have lots of
> large objects with comments or security labels, so I don't see any need to
> expend energy trying to optimize that use-case.
>
> I am a bit concerned that we'll forget to add checks for new types of
> dependencies similar to comments and security labels.  If we do, pg_upgrade
> should just fail to restore the schema, and fixing the code should be easy
> enough.  Also, we'll need to remember to revisit this code if there's
> another storage format change for one of pg_largeobject_metadata's columns,
> but that seems unlikely to happen anytime soon.  On the whole, I'm not too
> worried about either of these points.
>
> --
> nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
On Tue, Aug 19, 2025 at 09:49:26AM +0200, Hannu Krosing wrote:
> Have you considered re-creating pg_shdepend from
> pg_largeobject_metadata directly instead of having special cases for
> dumping it ?

I considered it when you last brought up the idea [0], but my testing
indicated that it's measurably slower.

> It would also be useful in cases of old (pg_upgraded since before pg
> 12) databases which might be missing it anyway.

We only use COPY for upgrades from v12 and newer, and the patch at hand
only applies to v16 and newer.  There should be no need to repair
pg_shdepend for any such upgrades because we haven't copied/linked the
files since before v12.

[0] https://postgr.es/m/CAMT0RQTXiqH7zdQEVSVd2L7_Cw4wQ1eHOD8hfZ%2B0vecMXJWc-w%40mail.gmail.com

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Michael Paquier
Date:
On Thu, Aug 14, 2025 at 10:22:02AM -0500, Nathan Bossart wrote:
> Here is a patch.  For background, the reason this is limited to upgrades
> from v16 and newer is because the aclitem data type (needed by
> pg_largeobject_metadata.lomacl) changed its storage format in v16 (see
> commit 7b378237aa).  Note that the patch is essentially a revert of commit
> 12a53c732c, but there are enough differences that it should be considered a
> fresh effort.

Noted.

> Something I hadn't anticipated is that we need to take special care to
> transfer the relfilenode of pg_largeobject_metadata and its index, as was
> done for pg_largeobject in commits d498e052b4 and bbe08b8869.  In fact, the
> majority of the patch is dedicated to that.
>
> My testing showed some decent, but not earth-shattering performance
> improvements from this patch.  For upgrades with many large objects with
> NULL lomacl/lomowner columns, pg_upgrade was 50% faster.  With non-NULL
> lomacl/lomowner, that dropped to 25%.  When each large object had a
> comment, there was no change.  I'm assuming that it's rare to have lots of
> large objects with comments or security labels, so I don't see any need to
> expend energy trying to optimize that use-case.

I highly doubt that there are a lot of comments assigned to LOs, so
these numbers are pretty cool IMO.  Security labels are a pain to test
in the upgrade path, though dummy_seclabel could be extended with a new
TAP test and a pg_upgrade command.  There is some coverage of
comments on LOs in src/bin/pg_dump's 002, so that should be enough for
the comment part, at least.

> I am a bit concerned that we'll forget to add checks for new types of
> dependencies similar to comments and security labels.  If we do, pg_upgrade
> should just fail to restore the schema, and fixing the code should be easy
> enough.  Also, we'll need to remember to revisit this code if there's
> another storage format change for one of pg_largeobject_metadata's columns,
> but that seems unlikely to happen anytime soon.  On the whole, I'm not too
> worried about either of these points.

This part does not worry me much, TBH.  This stuff would require
dump/restore support and the pg_dump test suite would catch that for
commands with --binary-upgrade.  So it should be hard to miss.

-        /*
-         * pg_largeobject
-         */
         if (fout->remoteVersion >= 90300)
             appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, relminmxid, relfilenode, oid\n"
                               "FROM pg_catalog.pg_class\n"
-                              "WHERE oid IN (%u, %u);\n",
-                              LargeObjectRelationId, LargeObjectLOidPNIndexId);
+                              "WHERE oid IN (%u, %u, %u, %u);\n",
+                              LargeObjectRelationId, LargeObjectLOidPNIndexId,
+                              LargeObjectMetadataRelationId, LargeObjectMetadataOidIndexId);
[...]
         appendPQExpBufferStr(loHorizonQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid\n");
+        appendPQExpBufferStr(lomHorizonQry, "\n-- For binary upgrade, set pg_largeobject_metadata relfrozenxid and relminmxid\n");
         appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, preserve pg_largeobject and index relfilenodes\n");
+        appendPQExpBufferStr(lomOutQry, "\n-- For binary upgrade, preserve pg_largeobject_metadata and index relfilenodes\n");

Is all that really required when upgrading from a cluster in the
9.3~15 range?
--
Michael

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
On Mon, Sep 01, 2025 at 03:19:46PM +0900, Michael Paquier wrote:
> I highly doubt that there are a lot of comments assigned to LOs, so
> these numbers are pretty cool IMO.  Security labels are a pain to test
> in the upgrade path, or test_dummy_label could be extended with a new
> TAP test and a pg_upgrade command..  There is some coverage with
> comments on LOs in src/bin/pg_dump's 002, so that would be enough for
> the comment part, at least.

Do you think a new pg_upgrade test for security labels is worth the
trouble?  It seems doable, but it'd be an awfully expensive test for this.
On the other hand, I'm not sure there's any coverage for pg_upgrade with
security labels, so perhaps this is a good time to establish some tests.

> -        /*
> -         * pg_largeobject
> -         */
>          if (fout->remoteVersion >= 90300)
>              appendPQExpBuffer(loFrozenQry, "SELECT relfrozenxid, relminmxid, relfilenode, oid\n"
>                                "FROM pg_catalog.pg_class\n"
> -                              "WHERE oid IN (%u, %u);\n",
> -                              LargeObjectRelationId, LargeObjectLOidPNIndexId);
> +                              "WHERE oid IN (%u, %u, %u, %u);\n",
> +                              LargeObjectRelationId, LargeObjectLOidPNIndexId,
> +                              LargeObjectMetadataRelationId, LargeObjectMetadataOidIndexId);
> [...]
>          appendPQExpBufferStr(loHorizonQry, "\n-- For binary upgrade, set pg_largeobject relfrozenxid and relminmxid\n");
> +        appendPQExpBufferStr(lomHorizonQry, "\n-- For binary upgrade, set pg_largeobject_metadata relfrozenxid and relminmxid\n");
>          appendPQExpBufferStr(loOutQry, "\n-- For binary upgrade, preserve pg_largeobject and index relfilenodes\n");
> +        appendPQExpBufferStr(lomOutQry, "\n-- For binary upgrade, preserve pg_largeobject_metadata and index relfilenodes\n");
> 
> Is all that really required when upgrading from a cluster in the
> 9.3~15 range?

No, that stuff is discarded for upgrades from those versions.  My intent
was to maintain readability by avoiding lots of version checks.  FWIW I
originally put all of the pg_largeobject_metadata stuff in a separate
block, but that resulted in a lot of copy/pasted code.  I'm happy to adjust
it as you see fit.

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Michael Paquier
Date:
On Tue, Sep 02, 2025 at 09:43:40AM -0500, Nathan Bossart wrote:
> Do you think a new pg_upgrade test for security labels is worth the
> trouble?  It seems doable, but it'd be an awfully expensive test for this.
> On the other hand, I'm not sure there's any coverage for pg_upgrade with
> security labels, so perhaps this is a good time to establish some tests.

I would argue in favor of these additions.  Security labels are not
the most popular thing ever, AFAIK, but your patch makes the need for
coverage more relevant.  The cheapest approach would be to add a LO creation
pattern in src/bin/pg_dump/t/002_pg_dump.pl, with an EXTRA_INSTALL
pointing at src/test/modules/dummy_seclabel/ to be able to create the
security label (we already do that in pg_upgrade and pg_basebackup so
the trick works).  That should be enough to make sure that the binary
upgrade dumps have the seclabel data included.  It's a bit funky, I
agree.  So if you think that this is not worth the test cycles, I
won't push hard on this point, either.

> No, that stuff is discarded for upgrades from those versions.  My intent
> was to maintain readability by avoiding lots of version checks.  FWIW I
> originally put all of the pg_largeobject_metadata stuff in a separate
> block, but that resulted in a lot of copy/pasted code.  I'm happy to adjust
> it as you see fit.

+        if (fout->remoteVersion >= 160000)
+            ArchiveEntry(fout, nilCatalogId, createDumpId(),
+                         ARCHIVE_OPTS(.tag = "pg_largeobject_metadata",
+                                      .description = "pg_largeobject_metadata",
+                                      .section = SECTION_PRE_DATA,
+                                      .createStmt = lomOutQry->data));

So it's filtered out depending on the old cluster version here.  I
had managed to miss this part.  Some tests with LOs across
multiple backend versions (pg_dump -C --binary-upgrade) are showing me
that I am wrong and that you are right, with diffs showing up properly
in the binary upgrade dumps for pg_largeobject_metadata.  This logic
looks OK to me, even if it's a waste to fill lomOutQry when upgrading
from an instance in the 9.3~15 range, discarding ArchiveEntry() at the
end.  It's not a big deal.
--
Michael

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
On Thu, Sep 04, 2025 at 01:59:36PM +0900, Michael Paquier wrote:
> On Tue, Sep 02, 2025 at 09:43:40AM -0500, Nathan Bossart wrote:
>> Do you think a new pg_upgrade test for security labels is worth the
>> trouble?  It seems doable, but it'd be an awfully expensive test for this.
>> On the other hand, I'm not sure there's any coverage for pg_upgrade with
>> security labels, so perhaps this is a good time to establish some tests.
> 
> I would argue in favor of these additions.  Security labels are not
> the most popular thing ever, AFAIK, but your patch makes the need more
> relevant to have.  The cheapest approach would be to add a LO creation
> pattern in src/bin/pg_dump/t/002_pg_dump.pl, with an EXTRA_INSTALL
> pointing at src/test/modules/dummy_seclabel/ to be able to create the
> security label (we already do that in pg_upgrade and pg_basebackup so
> the trick works).  That should be enough to make sure that the binary
> upgrade dumps have the seclabel data included.  It's a bit funky, I
> agree.  So if you think that this is not worth the test cycles, I
> won't push hard on this point, either.

Ah, I'd forgotten about EXTRA_INSTALL.  That simplifies things.  There's
enough special handling for large objects in pg_upgrade that I think we
ought to test it end-to-end, so I sneaked it into 006_transfer_modes.pl.
WDYT?

-- 
nathan

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Michael Paquier
Date:
On Thu, Sep 04, 2025 at 08:23:58PM -0500, Nathan Bossart wrote:
> Ah, I'd forgotten about EXTRA_INSTALL.  That simplifies things.  There's
> enough special handling for large objects in pg_upgrade that I think we
> ought to test it end-to-end, so I sneaked it into 006_transfer_modes.pl.
> WDYT?

Neat.  That works for me.

+        $old->safe_psql('postgres', q|
+            CREATE EXTENSION dummy_seclabel;

Still, I think that this bit is going to fail with installcheck,
because src/test/modules/ is not installed by default :)

You can make that conditional with check_extension(), like the tricks
for injection_points in other TAP tests.  The security label creation
and test also need to be made conditional.  The work is already done
with the check on oldinstall, it just needs an extra tweak, so I would
store the condition in a separate variable at the top of test_mode()
in 006_transfer_modes.pl, or something equivalent.
--
Michael

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
On Fri, Sep 05, 2025 at 03:35:21PM +0900, Michael Paquier wrote:
> +        $old->safe_psql('postgres', q|
> +            CREATE EXTENSION dummy_seclabel;
> 
> Still I think that this bit is going to fail with installcheck,
> because src/test/modules/ is not installed by default :)
> 
> You can make that conditional with check_extension(), like the tricks
> for injection_points in other TAP tests.  The security label creation
> and test also need to be made conditional.  The work is already done
> with the check on oldinstall, it just needs an extra tweak, so I would
> store the condition in a separate variable at the top of test_mode()
> in 006_transfer_modes.pl, or something equivalent.

How does this look?

-- 
nathan

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Michael Paquier
Date:
On Fri, Sep 05, 2025 at 01:12:49PM -0500, Nathan Bossart wrote:
> How does this look?

+    # We can only test security labels if both the old and new installations
+    # have dummy_seclabel.
+    my $test_seclabel = 1;
+    $old->start;
+    if (!$old->check_extension('dummy_seclabel'))
+    {
+        $test_seclabel = 0;
+    }
+    $old->stop;
+    $new->start;
+    if (!$new->check_extension('dummy_seclabel'))
+    {
+        $test_seclabel = 0;
+    }
+    $new->stop;

Yep.  This plan is safe to rely on.

A tiny comment I may have is that the LO numbers are hardcoded and
duplicated.  I would have used a variable to store these numbers.
Please feel free to ignore my picky-ism.
--
Michael

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
On Sat, Sep 06, 2025 at 10:12:11AM +0900, Michael Paquier wrote:
> Yep.  This plan is safe to rely on.

Committed, thanks for reviewing!

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Michael Paquier
Date:
On Mon, Sep 08, 2025 at 02:20:33PM -0500, Nathan Bossart wrote:
> Committed, thanks for reviewing!

Thanks!
--
Michael

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Andres Freund
Date:
Hi,

On 2025-09-08 14:20:33 -0500, Nathan Bossart wrote:
> On Sat, Sep 06, 2025 at 10:12:11AM +0900, Michael Paquier wrote:
> > Yep.  This plan is safe to rely on.
>
> Committed, thanks for reviewing!

I just had occasion to look at this code and I found the way this works quite
confusing:

When upgrading from a new enough server, we migrate
pg_largeobject_metadata.  So far so good.  But we *also* do a COPY FROM:

    COPY "pg_catalog"."pg_largeobject_metadata" ("oid", "lomowner", "lomacl") FROM stdin;
    ...

for all the large objects that have a description or a security label.


For a while I was somewhat baffled, because that sure looks like it ought to
lead to uniqueness violations.  But it doesn't.

The reason, I think, is that the COPY is happening into a relfilenode that
will be overwritten later, it doesn't yet contain the contents of the old
cluster.

Presumably we do this because we need the temporary pg_largeobject_metadata to
make COMMENT ON and security label commands not fail.


If this is the reasoning / how it works, shouldn't there be a comment in the
code or the commit message explaining that? Because it sure seems non-obvious
to me.

It's also not entirely obvious to me that this is safe - after all
(bbe08b8869bd, revised in 0e758ae89) appeared to have taken some pains to
ensure that the file gets unlinked immediately during the "binary upgrade
mode" TRUNCATE. But now we are actually filling that file again, after the
relation had been truncated?



Separately, if we don't handle large objects that don't have comments or
labels via pg_dump, why do we do dependency tracking for all LOs in getLOs(),
rather than just the ones that have a comment / label? Given how much memory
both the query results and the dependency tracking take...


Greetings,

Andres Freund



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Andres Freund
Date:
Hi,

On 2026-02-03 18:16:17 -0500, Andres Freund wrote:
> On 2025-09-08 14:20:33 -0500, Nathan Bossart wrote:
> > On Sat, Sep 06, 2025 at 10:12:11AM +0900, Michael Paquier wrote:
> > > Yep.  This plan is safe to rely on.
> >
> > Committed, thanks for reviewing!
> 
> I just had occasion to look at this code and I found the way this works quite
> confusing:
> 
> When upgrading from a new enough server, we migrate
> pg_largeobject_metadata. So far so good. But we *also* do a COPY FROM
> COPY "pg_catalog"."pg_largeobject_metadata" ("oid", "lomowner", "lomacl") FROM stdin;
> ...
> 
> for all the large objects that have a description or a security label.
> 
> 
> For a while I was somewhat baffled, because that sure looks like it ought to
> lead to uniqueness violations.  But it doesn't.
> 
> The reason, I think, is that the COPY is happening into a relfilenode that
> will be overwritten later, it doesn't yet contain the contents of the old
> cluster.
> 
> Presumably we do this because we need the temporary pg_largeobject_metadata to
> make COMMENT ON and security label commands not fail.
> 
> 
> If this is the reasoning / how it works, shouldn't there be a comment in the
> code or the commit message explaining that? Because it sure seems non-obvious
> to me.
> 
> It's also not entirely obvious to me that this is safe - after all
> (bbe08b8869bd, revised in 0e758ae89) appeared to have taken some pains to
> ensure that the file gets unlinked immediately during the "binary upgrade
> mode" TRUNCATE. But now we are actually filling that file again, after the
> relation had been truncated?

An example of what could go wrong:

Imagine that we add support for freezing pages on-access (as we are
discussing). If pg_largeobject_metadata was *not* frozen / vacuumed on the old
server, there might not be a _vm on the old server, but due to the new
on-access freezing, there might be one created for the data in the ephemeral
pg_largeobject_metadata during the accesses from COMMENT ON.

Because there's no VM on the old server, transfer_relfile() would just return
false:

        /* Is it an extent, fsm, or vm file? */
        if (type_suffix[0] != '\0' || segno != 0)
        {
            /* Did file open fail? */
            if (stat(old_file, &statbuf) != 0)
            {
                /* File does not exist?  That's OK, just return */
                if (errno == ENOENT)
                    return;

therefore not reaching the:
        unlink(new_file);

Leaving a completely wrong visibilitymap in place.


There are other scary possibilities. Imagine that the ephemeral
pg_largeobject_metadata ends up bigger than on the old server, e.g. because we
are a bit more aggressive about bulk relation extension in the new version
(which we should be!). If that size difference ends up with a different
segment count between the old server and the ephemeral relation in the new
server, we're in trouble: transfer_relfile() wouldn't unlink the additional
segments on the target system.

Greetings,

Andres Freund



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
On Tue, Feb 03, 2026 at 06:46:25PM -0500, Andres Freund wrote:
>> The reason, I think, is that the COPY is happening into a relfilenode that
>> will be overwritten later, it doesn't yet contain the contents of the old
>> cluster.
>> 
>> Presumably we do this because we need the temporary pg_largeobject_metadata to
>> make COMMENT ON and security label commands not fail.
>> 
>> If this is the reasoning / how it works, shouldn't there be a comment in the
>> code or the commit message explaining that? Because it sure seems non-obvious
>> to me.

Right, the COPY for LOs with comments and security labels is solely meant
to avoid failure when restoring the comments and security labels, since we
won't have transferred the relation files yet.  This was the case before
commit 12a53c732c, where we had this comment in getBlobs():

        * We *do* dump out the definition of the blob because we need that to
        * make the restoration of the comments, and anything else, work since
        * pg_upgrade copies the files behind pg_largeobject and
        * pg_largeobject_metadata after the dump is restored.

Commit 3bcfcd815e (mine) added this one to pg_dump.c:

        * If upgrading from v16 or newer, only dump large objects with
        * comments/seclabels.  For these upgrades, pg_upgrade can copy/link
        * pg_largeobject_metadata's files (which is usually faster) but we
        * still need to dump LOs with comments/seclabels here so that the
        * subsequent COMMENT and SECURITY LABEL commands work.  pg_upgrade
        * can't copy/link the files from older versions because aclitem
        * (needed by pg_largeobject_metadata.lomacl) changed its storage
        * format in v16.

IIUC your critique is that this doesn't explain the overwriting behavior
like the older comment does.  I'll work on adding that.

>> It's also not entirely obvious to me that this is safe - after all
>> (bbe08b8869bd, revised in 0e758ae89) appeared to have taken some pains to
>> ensure that the file gets unlinked immediately during the "binary upgrade
>> mode" TRUNCATE. But now we are actually filling that file again, after the
>> relation had been truncated?
> 
> An example of what could go wrong:
>
> [... examples of what could go wrong ...] 

I'm considering a couple of options here, but it seems like the easiest
thing to do is to move the TRUNCATE commands to the end of the dump file.
At least, that seems to be sufficient for our existing tests.  If that
seems okay to you, I can work on putting together a patch.

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
On Wed, Feb 04, 2026 at 10:06:29AM -0600, Nathan Bossart wrote:
> IIUC your critique is that this doesn't explain the overwriting behavior
> like the older comment does.  I'll work on adding that.
> 
> [...]
> 
> I'm considering a couple of options here, but it seems like the easiest
> thing to do is to move the TRUNCATE commands to the end of the dump file.
> At least, that seems to be sufficient for our existing tests.  If that
> seems okay to you, I can work on putting together a patch.

Here is a rough first draft of a patch that does this.

-- 
nathan

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Andres Freund
Date:
Hi,

On 2026-02-04 10:06:29 -0600, Nathan Bossart wrote:
> IIUC your critique is that this doesn't explain the overwriting behavior
> like the older comment does.  I'll work on adding that.

I think even the old comment was woefully under-documenting that the commands
are all just make-work that's going to be thrown out almost immediately afterward.

Thanks for addressing that!


On 2026-02-04 14:08:47 -0600, Nathan Bossart wrote:
> On Wed, Feb 04, 2026 at 10:06:29AM -0600, Nathan Bossart wrote:
> > IIUC your critique is that this doesn't explain the overwriting behavior
> > like the older comment does.  I'll work on adding that.
> > 
> > [...]
> > 
> > I'm considering a couple of options here, but it seems like the easiest
> > thing to do is to move the TRUNCATE commands to the end of the dump file.
> > At least, that seems to be sufficient for our existing tests.  If that
> > seems okay to you, I can work on putting together a patch.
> 
> Here is a rough first draft of a patch that does this.

It certainly seems better than what we do now.  Still feels pretty grotty and
error prone to me that we fill the catalog table and then throw the contents
out.


> @@ -1157,7 +1158,10 @@ main(int argc, char **argv)
>           * subsequent COMMENT and SECURITY LABEL commands work.  pg_upgrade
>           * can't copy/link the files from older versions because aclitem
>           * (needed by pg_largeobject_metadata.lomacl) changed its storage
> -         * format in v16.
> +         * format in v16.  At the end of the dump, we'll generate a TRUNCATE
> +         * command for pg_largeobject_metadata so that its contents are
> +         * cleared in preparation for the subsequent file transfer by
> +         * pg_upgrade.
>           */

I'd move that comment to earlier in the paragraph, it sounds a bit like it
applies to the v16 specific bits.


>          if (fout->remoteVersion >= 160000)
>              lo_metadata->dataObj->filtercond = "WHERE oid IN "
> @@ -1243,6 +1247,13 @@ main(int argc, char **argv)
>      for (i = 0; i < numObjs; i++)
>          dumpDumpableObject(fout, dobjs[i]);
>  
> +    /*
> +     * For binary upgrades, set relfrozenxids, relminmxids, and relfilenodes
> +     * of pg_largeobject and maybe pg_largeobject_metadata, and remove all
> +     * their files.  We will transfer them from the old cluster as needed.
> +     */
> +    dumpLOTruncation(fout);
> +
>      /*
>       * Set up options info to ensure we dump what we want.
>       */

Seems good to move this to a dedicated function, regardless of anything else.


Do I see correctly that we just rely on the ordering in the file, rather than
dependencies? That's not a complaint, I just don't know that code very well.

Greetings,

Andres Freund



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
On Thu, Feb 05, 2026 at 11:19:46AM -0500, Andres Freund wrote:
> It certainly seems better than what we do now.  Still feels pretty grotty and
> error prone to me that we fill the catalog table and then throw the contents
> out.

Before I go any further with this approach, I thought of something else we
could do that I believe is worth considering...

As of commit 3bcfcd815e, the only reason we are dumping any of
pg_largeobject_metadata at all is to avoid an ERROR during COMMENT ON or
SECURITY LABEL ON because the call to LargeObjectExists() in
get_object_address() returns false.  If we bypass that check in
binary-upgrade mode, we can skip dumping pg_largeobject_metadata entirely.

The attached patch passes our existing tests, and it seems to create the
expected binary-upgrade-mode dump files, too.  I haven't updated any of the
comments yet.

-- 
nathan

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Nathan Bossart
Date:
On Thu, Feb 05, 2026 at 11:36:00AM -0600, Nathan Bossart wrote:
> @@ -1046,7 +1046,7 @@ get_object_address(ObjectType objtype, Node *object,
>                  address.classId = LargeObjectRelationId;
>                  address.objectId = oidparse(object);
>                  address.objectSubId = 0;
> -                if (!LargeObjectExists(address.objectId))
> +                if (!LargeObjectExists(address.objectId) && !IsBinaryUpgrade)
>                  {
>                      if (!missing_ok)
>                          ereport(ERROR,

(Probably better to set missing_ok for only the specific commands we want
to bypass this check, but you get the idea...)

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From: Andres Freund
Date:
Hi,

On 2026-02-05 11:36:00 -0600, Nathan Bossart wrote:
> On Thu, Feb 05, 2026 at 11:19:46AM -0500, Andres Freund wrote:
> > It certainly seems better than what we do now.  Still feels pretty grotty and
> > error prone to me that we fill the catalog table and then throw the contents
> > out.
> 
> Before I go any further with this approach, I thought of something else we
> could do that I believe is worth considering...
> 
> As of commit 3bcfcd815e, the only reason we are dumping any of
> pg_largeobject_metadata at all is to avoid an ERROR during COMMENT ON or
> SECURITY LABEL ON because the call to LargeObjectExists() in
> get_object_address() returns false.  If we bypass that check in
> binary-upgrade mode, we can skip dumping pg_largeobject_metadata entirely.

Yea, I think that's worth considering.  As you say downthread, the check for
binary upgrade should probably be moved, but that's details.

Upthread I was also wondering why we do all the work in getLOs() if we don't
actually need most of it (only if there are comments or labels).  Right now
that's a very slow and very memory-intensive part of upgrading a
system with a lot of large objects.  Do we need *any* of that if we go the
path you suggest?

Greetings,

Andres Freund



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Thu, Feb 05, 2026 at 01:02:17PM -0500, Andres Freund wrote:
> Upthread I was also wondering why we do all the work in getLOs() if we don't
> actually need most of it (only if there are comments or labels).  Right now
> that's a very slow and very memory-intensive part of upgrading a
> system with a lot of large objects.  Do we need *any* of that if we go the
> path you suggest?

AFAICT we only need it for the comments and security labels later on.
Commit a45c78e3 did batch 1000 large objects into each ArchiveEntry, but of
course there can still be a ton of entries.  In theory, we could update the
pg_largeobject_metadata query to only retrieve LOs with comments and
security labels.  I'm not sure it's worth trying to optimize further than
that; we've long operated under the assumption that comments/seclabels on
LOs are pretty rare.
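
A filtered query along these lines would do it (an illustrative sketch only, not
the committed query; it assumes the usual catalog layout, where comments on
large objects live in pg_description under the pg_largeobject classoid and
security labels in pg_seclabel):

```sql
-- Sketch: fetch only the large objects that actually carry a comment or
-- security label, instead of all of pg_largeobject_metadata.
SELECT l.oid, l.lomowner, l.lomacl,
       acldefault('L', l.lomowner) AS acldefault
FROM pg_largeobject_metadata l
WHERE EXISTS (SELECT 1 FROM pg_description d
              WHERE d.classoid = 'pg_catalog.pg_largeobject'::regclass
                AND d.objoid = l.oid)
   OR EXISTS (SELECT 1 FROM pg_seclabel s
              WHERE s.classoid = 'pg_catalog.pg_largeobject'::regclass
                AND s.objoid = l.oid);
```

With comments/seclabels rare, both EXISTS probes return almost nothing and the
result set stays tiny regardless of how many LOs exist.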

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Andres Freund
Date:
Hi,

On 2026-02-05 13:31:23 -0600, Nathan Bossart wrote:
> On Thu, Feb 05, 2026 at 01:02:17PM -0500, Andres Freund wrote:
> > Upthread I was also wondering why we do all the work in getLOs() if we don't
> > actually need most of it (only if there are comments or labels).  Right now
> > that's a very slow and very memory-intensive part of upgrading a
> > system with a lot of large objects.  Do we need *any* of that if we go the
> > path you suggest?
>
> AFAICT we only need it for the comments and security labels later on.

And in binary upgrade mode not even for that, if we do the thing we talked
about re not checking references in binary upgrade mode? Right?


> Commit a45c78e3 did batch 1000 large objects into each ArchiveEntry, but of
> course there can still be a ton of entries.

Entries and then also the PGresult (almost as large as all the entries). The
peak memory usage is pretty bad. We could address the PGresult memory usage
with PQsetChunkedRowsMode() or explicit cursor use, but it doesn't seem
entirely trivial to combine with the batching.


> In theory, we could update the pg_largeobject_metadata query to only
> retrieve LOs with comments and security labels.  I'm not sure it's worth
> trying to optimize further than that; we've long operated under the
> assumption that comments/seclabels on LOs are pretty rare.

I think that'd be a huge improvement. Right now it's not hard to get into a
situation where you have so many LOs that there isn't enough memory to do a
pg_upgrade.

Memory usage aside, it's also slow and expensive from the query execution and
data transfer side. Because of the ORDER BY that the batching requires, the
server needs to sort all of pg_largeobject_metadata before any rows can be
returned.

For 5M LOs:

                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=715978.34..728478.39 rows=5000020 width=72) (actual time=10292.252..10652.950 rows=5000020.00 loops=1)
   Sort Key: pg_largeobject_metadata.lomowner, ((pg_largeobject_metadata.lomacl)::text), pg_largeobject_metadata.oid
   Sort Method: quicksort  Memory: 509110kB
   ->  Seq Scan on pg_largeobject_metadata  (cost=0.00..159638.55 rows=5000020 width=72) (actual time=0.034..2284.442 rows=5000020.00 loops=1)
         SubPlan expr_1
           ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.000..0.000 rows=1.00 loops=5000020)
 Planning Time: 0.117 ms
 Serialization: time=3961.343 ms  output=218686kB  format=text
 Execution Time: 14930.747 ms
(9 rows)

/usr/bin/time -v pg_dump --binary-upgrade --no-data --quote-all-identifiers --no-statistics --format=custom -f /tmp/dumplo_5m

    Command being timed: "pg_dump --binary-upgrade --no-data --quote-all-identifiers --no-statistics --format=custom -f /tmp/dumplo_5m"
    User time (seconds): 1.85
    System time (seconds): 0.54
    Percent of CPU this job got: 16%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.55
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 935084
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 133379
    Voluntary context switches: 30486
    Involuntary context switches: 6
    Swaps: 0
    File system inputs: 0
    File system outputs: 16
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

Peak memory usage ~1GB.


And unfortunately 5M LOs is not a whole lot.


Filtering the LOs in the query to only return ones that have a comment / label
would typically make the query much faster and pg_dump consume a lot less
memory.


Greetings,

Andres Freund



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Thu, Feb 05, 2026 at 03:23:38PM -0500, Andres Freund wrote:
> On 2026-02-05 13:31:23 -0600, Nathan Bossart wrote:
>> On Thu, Feb 05, 2026 at 01:02:17PM -0500, Andres Freund wrote:
>> > Upthread I was also wondering why we do all the work in getLOs() if we don't
>> > actually need most of it (only if there are comments or labels).  Right now
>> > that's a very slow and very memory-intensive part of upgrading a
>> > system with a lot of large objects.  Do we need *any* of that if we go the
>> > path you suggest?
>>
>> AFAICT we only need it for the comments and security labels later on.
> 
> And in binary upgrade mode not even for that, if we do the thing we talked
> about re not checking references in binary upgrade mode? Right?

Well, we need _something_ to do what dumpLO() does today, i.e., call
dumpComment() and dumpSecLabel() for each large object that has one or the
other.  I'm sure we could decouple that from the getLOs() machinery if we
tried hard enough, but I'm not seeing any great advantage in doing so if we
teach getLOs() to only get what it needs.

>> In theory, we could update the pg_largeobject_metadata query to only
>> retrieve LOs with comments and security labels.  I'm not sure it's worth
>> trying to optimize further than that; we've long operated under the
>> assumption that comments/seclabels on LOs are pretty rare.
> 
> I think that'd be a huge improvement. Right now it's not hard to get into a
> situation where you have so many LOs that there isn't enough memory to do a
> pg_upgrade.

It seems to work.  I'm working on a nicer patch for all this stuff.

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Andres Freund
Date:
Hi,

On 2026-02-05 15:23:38 -0500, Andres Freund wrote:
> Memory usage aside, it's also slow and expensive from the query execution and
> data transfer side. Because of the ORDER BY that the batching requires, the
> server needs to sort all of pg_largeobject_metadata before any rows can be
> returned.
> 
> For 5M LOs:
>
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │                                                                  QUERY PLAN
                         │
 
>
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Sort  (cost=715978.34..728478.39 rows=5000020 width=72) (actual time=10292.252..10652.950 rows=5000020.00 loops=1)
                         │
 
> │   Sort Key: pg_largeobject_metadata.lomowner, ((pg_largeobject_metadata.lomacl)::text), pg_largeobject_metadata.oid
                         │
 
> │   Sort Method: quicksort  Memory: 509110kB
                         │
 
> │   ->  Seq Scan on pg_largeobject_metadata  (cost=0.00..159638.55 rows=5000020 width=72) (actual
time=0.034..2284.442rows=5000020.00 loops=1) │
 
> │         SubPlan expr_1
                         │
 
> │           ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.000..0.000 rows=1.00 loops=5000020)
                         │
 
> │ Planning Time: 0.117 ms
                         │
 
> │ Serialization: time=3961.343 ms  output=218686kB  format=text
                         │
 
> │ Execution Time: 14930.747 ms
                         │
 
>
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> (9 rows)

This isn't quite the right query, sorry for that. I just tried moving the
acldefault() into a subselect, because I was wondering whether that'd prevent it
from being computed below the sort.  Unfortunately no.  It's a bit faster
without that, but not much:

EXPLAIN (ANALYZE, SERIALIZE, BUFFERS OFF, VERBOSE)
    SELECT oid, lomowner, lomacl::pg_catalog.text, acldefault('L', lomowner) AS acldefault
    FROM pg_largeobject_metadata
    ORDER BY lomowner, lomacl::pg_catalog.text, oid;

                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=665978.14..678478.19 rows=5000020 width=72) (actual time=8887.007..9243.088 rows=5000020.00 loops=1)
   Output: oid, lomowner, ((lomacl)::text), (acldefault('L'::"char", lomowner))
   Sort Key: pg_largeobject_metadata.lomowner, ((pg_largeobject_metadata.lomacl)::text), pg_largeobject_metadata.oid
   Sort Method: quicksort  Memory: 509110kB
   ->  Seq Scan on pg_catalog.pg_largeobject_metadata  (cost=0.00..109638.35 rows=5000020 width=72) (actual time=0.029..823.895 rows=5000020.00 loops=1)
         Output: oid, lomowner, (lomacl)::text, acldefault('L'::"char", lomowner)
 Planning Time: 0.087 ms
 Serialization: time=3965.649 ms  output=218686kB  format=text
 Execution Time: 13516.925 ms
(9 rows)

I might start a separate thread about this misoptimization...

Greetings,

Andres Freund



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Thu, Feb 05, 2026 at 02:30:28PM -0600, Nathan Bossart wrote:
> On Thu, Feb 05, 2026 at 03:23:38PM -0500, Andres Freund wrote:
>> On 2026-02-05 13:31:23 -0600, Nathan Bossart wrote:
>>> In theory, we could update the pg_largeobject_metadata query to only
>>> retrieve LOs with comments and security labels.  I'm not sure it's worth
>>> trying to optimize further than that; we've long operated under the
>>> assumption that comments/seclabels on LOs are pretty rare.
>> 
>> I think that'd be a huge improvement. Right now it's not hard to get into a
>> situation where you have so many LOs that there isn't enough memory to do a
>> pg_upgrade.
> 
> It seems to work.  I'm working on a nicer patch for all this stuff.

Here is what I have so far.

-- 
nathan

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Thu, Feb 05, 2026 at 03:29:55PM -0600, Nathan Bossart wrote:
> Here is what I have so far.

BTW I ran 006_transfer_modes.pl (which tests LOs with comments and security
labels) for upgrades from v10, v13, v16, v17, and v18, and all tests
succeeded.

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Hannu Krosing
Date:
Parts of this discussion are very much related to
https://commitfest.postgresql.org/patch/5774/ - "Extending skipping FK
checks on replicas to include ADD FK and TRUNCATE"

I have seen cases where adding a foreign key after copying data takes
several days for no benefit at all.

We should be allowing more "skip unuseful work" options for all the
cases where we can be pretty sure the data is ok - upgrades, replica
setups, and backup restores.

We probably should also have an option to tolerate slight corruption,
like duplicate PK entries if the source is an existing database.

99 times out of 100 the user would rather have a slight corruption
than no database at all, for example when they had to do an
emergency restore and it fails after running 3 hours because ADD
PRIMARY KEY hits a duplicate row.


On Fri, Feb 6, 2026 at 8:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Thu, Feb 05, 2026 at 03:29:55PM -0600, Nathan Bossart wrote:
> > Here is what I have so far.
>
> BTW I ran 006_transfer_modes.pl (which tests LOs with comments and security
> labels) for upgrades from v10, v13, v16, v17, and v18, and all tests
> succeeded.
>
> --
> nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Andres Freund
Date:
Hi,

On 2026-02-05 15:29:55 -0600, Nathan Bossart wrote:
> Here is what I have so far.

Thanks!


> --- a/src/bin/pg_dump/pg_dump.c
> +++ b/src/bin/pg_dump/pg_dump.c
> @@ -214,12 +214,6 @@ static int    nbinaryUpgradeClassOids = 0;
>  static SequenceItem *sequences = NULL;
>  static int    nsequences = 0;
>  
> -/*
> - * For binary upgrade, the dump ID of pg_largeobject_metadata is saved for use
> - * as a dependency for pg_shdepend and any large object comments/seclabels.
> - */
> -static DumpId lo_metadata_dumpId;
> -
>  /* Maximum number of relations to fetch in a fetchAttributeStats() call. */
>  #define MAX_ATTR_STATS_RELS 64
>  
> @@ -1121,27 +1115,20 @@ main(int argc, char **argv)
>          getTableData(&dopt, tblinfo, numTables, RELKIND_SEQUENCE);
>  
>      /*
> -     * For binary upgrade mode, dump pg_largeobject_metadata and the
> -     * associated pg_shdepend rows. This is faster to restore than the
> -     * equivalent set of large object commands.  We can only do this for
> -     * upgrades from v12 and newer; in older versions, pg_largeobject_metadata
> -     * was created WITH OIDS, so the OID column is hidden and won't be dumped.
> +     * For binary upgrade mode, dump the pg_shdepend rows for large objects
> +     * and maybe even pg_largeobject_metadata (see comment below for details).
> +     * This is faster to restore than the equivalent set of large object
> +     * commands.  We can only do this for upgrades from v12 and newer; in
> +     * older versions, pg_largeobject_metadata was created WITH OIDS, so the
> +     * OID column is hidden and won't be dumped.
>       */

It's not really related to this change, but what is that WITH OIDS bit about?
Sure, they aren't shown by default, but all it takes to change that is to
explicitly add the output column? I'm not saying we have to do that, I just
don't understand the reasoning as written here.


> @@ -3979,7 +3964,25 @@ getLOs(Archive *fout)
>      appendPQExpBufferStr(loQry,
>                           "SELECT oid, lomowner, lomacl, "
>                           "acldefault('L', lomowner) AS acldefault "
> -                         "FROM pg_largeobject_metadata "
> +                         "FROM pg_largeobject_metadata ");
> +
> +    /*
> +     * For upgrades from v12 or newer, we transfer pg_largeobject_metadata via

*binary upgrades.


The improvements in pg_upgrade time and memory in a cluster with 50M LOs [1]
are really quite impressive, even if upgrading from < 16. It's rare to improve
memory usage by several orders of magnitude.

Greetings,

Andres Freund

[1] c=20;pgbench -n -c$c -j$c -P1 -t $((50000000 / $c)) -f <(echo -e "SELECT lo_create(0) AS loid\n") 50m_los



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Sun, Feb 08, 2026 at 04:26:41PM -0500, Andres Freund wrote:
> On 2026-02-05 15:29:55 -0600, Nathan Bossart wrote:
>> +     * commands.  We can only do this for upgrades from v12 and newer; in
>> +     * older versions, pg_largeobject_metadata was created WITH OIDS, so the
>> +     * OID column is hidden and won't be dumped.
>>       */
> 
> It's not really related to this change, but what is that WITH OIDS bit about?
> Sure, they aren't shown by default, but all it takes to change that is to
> explicitly add the output column? I'm not saying we have to do that, I just
> don't understand the reasoning as written here.

IIRC the issue is that getTableAttrs() won't pick up the OID column on
older versions.  It might be easy to fix that by adjusting its query for
binary upgrades from <v12.  That could be worth doing, if for no other
reason than to simplify some of the pg_dump code.  I'll make a note of it.

> The improvements in pg_upgrade time and memory in a cluster with 50M LOs [1]
> are really quite impressive, even if upgrading from < 16. It's rare to improve
> memory usage by several orders of magnitude.

Nice!

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
Here is what I have staged for commit, which I'm planning to do this
afternoon.

-- 
nathan

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Mon, Feb 09, 2026 at 11:05:50AM -0600, Nathan Bossart wrote:
> Here is what I have staged for commit, which I'm planning to do this
> afternoon.

Committed.

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Sun, Feb 08, 2026 at 04:00:40PM -0600, Nathan Bossart wrote:
> On Sun, Feb 08, 2026 at 04:26:41PM -0500, Andres Freund wrote:
>> On 2026-02-05 15:29:55 -0600, Nathan Bossart wrote:
>>> +     * commands.  We can only do this for upgrades from v12 and newer; in
>>> +     * older versions, pg_largeobject_metadata was created WITH OIDS, so the
>>> +     * OID column is hidden and won't be dumped.
>>>       */
>> 
>> It's not really related to this change, but what is that WITH OIDS bit about?
>> Sure, they aren't shown by default, but all it takes to change that is to
>> explicitly add the output column? I'm not saying we have to do that, I just
>> don't understand the reasoning as written here.
> 
> IIRC the issue is that getTableAttrs() won't pick up the OID column on
> older versions.  It might be easy to fix that by adjusting its query for
> binary upgrades from <v12.  That could be worth doing, if for no other
> reason than to simplify some of the pg_dump code.  I'll make a note of it.

This was a little more painful than I expected, but this seems to be what
is required to allow COPY-ing pg_largeobject_metadata during binary
upgrades from < v12.

-- 
nathan

Attachments

Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Wed, Feb 11, 2026 at 03:00:51PM -0600, Nathan Bossart wrote:
> This was a little more painful than I expected, but this seems to be what
> is required to allow COPY-ing pg_largeobject_metadata during binary
> upgrades from < v12.

I don't think there's anything terribly controversial here, so I'd like to
proceed with committing this in the next couple of days if there are no
objections.

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Andres Freund
Date:
Hi,

On 2026-02-11 15:00:51 -0600, Nathan Bossart wrote:
> On Sun, Feb 08, 2026 at 04:00:40PM -0600, Nathan Bossart wrote:
> > IIRC the issue is that getTableAttrs() won't pick up the OID column on
> > older versions.  It might be easy to fix that by adjusting its query for
> > binary upgrades from <v12.  That could be worth doing, if for no other
> > reason than to simplify some of the pg_dump code.  I'll make a note of it.
> 
> This was a little more painful than I expected, but this seems to be what
> is required to allow COPY-ing pg_largeobject_metadata during binary
> upgrades from < v12.

Nice!


> @@ -2406,11 +2404,14 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
>      column_list = fmtCopyColumnList(tbinfo, clistBuf);
>  
>      /*
> -     * Use COPY (SELECT ...) TO when dumping a foreign table's data, and when
> -     * a filter condition was specified.  For other cases a simple COPY
> -     * suffices.
> +     * Use COPY (SELECT ...) TO when dumping a foreign table's data, when a
> +     * filter condition was specified, and when in binary upgrade mode and
> +     * dumping an old pg_largeobject_metadata defined WITH OIDS.  For other
> +     * cases a simple COPY suffices.
>       */
> -    if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
> +    if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
> +        (fout->dopt->binary_upgrade && fout->remoteVersion < 120000 &&
> +         tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId))
>      {
>          /* Temporary allows to access to foreign tables to dump data */
>          if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)

Not really the fault of this patch, but it seems somewhat grotty that this has
binary upgrade specific code in this place.  I was certainly confused when
first trying to use pg_dump in binary upgrade mode with large objects, because
no data was dumped when using plain text mode, which is what I had been using
for simplicity...


> @@ -2406,11 +2404,14 @@ dumpTableData_copy(Archive *fout, const void *dcontext)
>      column_list = fmtCopyColumnList(tbinfo, clistBuf);
>  
>      /*
> -     * Use COPY (SELECT ...) TO when dumping a foreign table's data, and when
> -     * a filter condition was specified.  For other cases a simple COPY
> -     * suffices.
> +     * Use COPY (SELECT ...) TO when dumping a foreign table's data, when a
> +     * filter condition was specified, and when in binary upgrade mode and
> +     * dumping an old pg_largeobject_metadata defined WITH OIDS.  For other
> +     * cases a simple COPY suffices.
>       */
> -    if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
> +    if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE ||
> +        (fout->dopt->binary_upgrade && fout->remoteVersion < 120000 &&
> +         tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId))
>      {
>          /* Temporary allows to access to foreign tables to dump data */
>          if (tbinfo->relkind == RELKIND_FOREIGN_TABLE)

I guess you could instead generate a COPY using WITH OIDS. But it's probably
not worth having that path, given we already need to support COPY (SELECT ..).


OTOH, I think it'd perhaps avoid needing to deal with this:

> @@ -9442,7 +9428,18 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
>                               "(pt.classoid = co.tableoid AND pt.objoid = co.oid)\n");
>  
>      appendPQExpBufferStr(q,
> -                         "WHERE a.attnum > 0::pg_catalog.int2\n"
> +                         "WHERE a.attnum > 0::pg_catalog.int2\n");
> +
> +    /*
> +     * For binary upgrades from <v12, be sure to pick up
> +     * pg_largeobject_metadata's oid column.
> +     */
> +    if (fout->dopt->binary_upgrade && fout->remoteVersion < 120000)
> +        appendPQExpBufferStr(q,
> +                             "OR (a.attnum = -2::pg_catalog.int2 AND src.tbloid = "
> +                             CppAsString2(LargeObjectMetadataRelationId) ")\n");
> +

as we'd just include the oid column without needing to somehow include it in
the attribute list.


> @@ -9544,7 +9541,9 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
>  
>          for (int j = 0; j < numatts; j++, r++)
>          {
> -            if (j + 1 != atoi(PQgetvalue(res, r, i_attnum)))
> +            if (j + 1 != atoi(PQgetvalue(res, r, i_attnum)) &&
> +                !(fout->dopt->binary_upgrade && fout->remoteVersion < 120000 &&
> +                  tbinfo->dobj.catId.oid == LargeObjectMetadataRelationId))
>                  pg_fatal("invalid column numbering in table \"%s\"",
>                           tbinfo->dobj.name);
>              tbinfo->attnames[j] = pg_strdup(PQgetvalue(res, r, i_attname));
> -- 
> 2.50.1 (Apple Git-155)

I guess WITH OIDs also would avoid the need for this.

Greetings,

Andres Freund



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Thu, Feb 12, 2026 at 04:46:30PM -0500, Andres Freund wrote:
> On 2026-02-11 15:00:51 -0600, Nathan Bossart wrote:
>> This was a little more painful than I expected, but this seems to be what
>> is required to allow COPY-ing pg_largeobject_metadata during binary
>> upgrades from < v12.
> 
> Nice!

Thanks for looking.

> Not really the fault of this patch, but it seems somewhat grotty that this has
> binary upgrade specific code in this place.  I was certainly confused when
> first trying to use pg_dump in binary upgrade mode with large objects, because
> no data was dumped when using plain text mode, which is what I had been using
> for simplicity...

Yeah, this has been an annoyance while hacking in this area.  I haven't
looked into what it'd take to fix it, though.

> I guess you could instead generate a COPY using WITH OIDS. But it's probably
> not worth having that path, given we already need to support COPY (SELECT ..).

Ah, I forgot all about that COPY option...

> as we'd just include the oid column without needing to somehow include it in
> the attribute list.
> 
> [...]
> 
> I guess WITH OIDs also would avoid the need for this.

Will give it a try.

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Thu, Feb 12, 2026 at 03:59:24PM -0600, Nathan Bossart wrote:
> On Thu, Feb 12, 2026 at 04:46:30PM -0500, Andres Freund wrote:
>> I guess you could instead generate a COPY using WITH OIDS. But it's probably
>> not worth having that path, given we already need to support COPY (SELECT ..).
> 
> Ah, I forgot all about that COPY option...
> 
>> as we'd just include the oid column without needing to somehow include it in
>> the attribute list.
>> 
>> [...]
>> 
>> I guess WITH OIDs also would avoid the need for this.
> 
> Will give it a try.

So, in addition to hacking in the OIDS option to the COPY command executed
on the old server, we still need to hack the OID column into the column
list for the dumped COPY command (or omit the column list entirely for
pg_largeobject_metadata).  That seems to result in roughly the same level
of hackery as before.
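
For reference, the two shapes under discussion look roughly like this
(illustrative only; the WITH OIDS form is only accepted by pre-v12 servers,
which is exactly the case in question, and the column names are from the real
catalog):

```sql
-- Option discussed: rely on the old server's WITH OIDS support so the
-- hidden oid column is emitted implicitly.
COPY pg_largeobject_metadata TO STDOUT WITH OIDS;

-- What the patch does instead: name the hidden oid column explicitly.
COPY (SELECT oid, lomowner, lomacl FROM pg_largeobject_metadata) TO STDOUT;
```

Either way, the COPY ... FROM command written into the dump still has to list
the oid column explicitly for the new server, which is where the remaining
hackery lives.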

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
Committed.

-- 
nathan



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Andres Freund
Date:
On 2026-02-16 15:15:42 -0600, Nathan Bossart wrote:
> Committed.

Thanks for working on this!  I think the situation now is a lot better than it
was in 18...



Re: pg_upgrade: transfer pg_largeobject_metadata's files when possible

From
Nathan Bossart
Date:
On Mon, Feb 16, 2026 at 04:19:10PM -0500, Andres Freund wrote:
> Thanks for working on this!  I think the situation now is a lot better than it
> was in 18...

I hope so...  I appreciate the pointers/reviews.

-- 
nathan