Thread: In-order pg_dump (or in-order COPY TO)
Hello list,

I am storing dumps of a database (pg_dump custom format) in a de-duplicating backup server. Each dump is many terabytes in size, so deduplication is very important. And de-duplication itself is based on rolling checksums, which is pretty flexible: it can compensate for blocks moving by some offset.

Unfortunately, after I did pg_restore to a new server, I noticed that the dumps from the new server are not being de-duplicated; all blocks are considered new. This means that the data has been significantly altered. The new dumps contain the same rows, but probably in a very different order. Could the row order have changed when doing COPY FROM with pg_restore? No idea, but now that I think about it, this can happen through many operations (CLUSTER, VACUUM FULL, etc.), so the question still applies.

A *logical* dump of the data shouldn't be affected by on-disk order; the internal representation shouldn't affect the output. This makes me wonder: is there a way to COPY TO in primary-key order? If that is possible, then pg_dump could make use of it.

Thanks in advance,
Dimitris
On Tue, Aug 26, 2025 at 12:43 PM Dimitrios Apostolou <jimis@gmx.net> wrote:
Could the
row-order have changed when doing COPY FROM with pg_restore?
There is no reliable, meaningful, row ordering when it comes to the physical files. Sure, cluster does make an attempt, but it is quite limited in practice.
A *logical* dump of data shouldn't be affected by on-disk order.
Internal representation shouldn't affect the output.
The logical dump has no ordering - it will come out however it comes out. "COPY <table> TO ..." doesn't have an ORDER BY clause - there is no way to make or communicate to it that ordering is important. For ad hoc work you can use "COPY <query> TO ..." and put an ORDER BY in the query.
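For illustration, an ad hoc ordered export of a single table along those lines might look like the following sketch (database, table and key names are placeholders):

    # export one table in primary-key order; not something pg_dump can do itself
    psql -d db_name -c "COPY (SELECT * FROM my_table ORDER BY id) TO STDOUT" > my_table.copy

On a multi-terabyte table the ORDER BY of course forces a sort or a full index scan, which is presumably part of why pg_dump offers no such option.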
David J.
On Tue, Aug 26, 2025 at 3:44 PM Dimitrios Apostolou <jimis@gmx.net> wrote:
Hello list,
I am storing dumps of a database (pg_dump custom format) in a
de-duplicating backup server. Each dump is many terabytes in size, so
deduplication is very important. And de-duplication itself is based on
rolling checksums which is pretty flexible, it can compensate for blocks
moving by some offset.
This might be a silly question, but why are you using -Fc to create multi-TB dumps instead of -Fd? If nothing else, the parallelization granted by -Fd will greatly speed up the dumps.
Dimitrios Apostolou <jimis@gmx.net> writes:
> Unfortunately after I did pg_restore to a new server, I notice that the
> dumps from the new server are not being de-duplicated, all blocks are
> considered new.
> This means that the data has been significantly altered. The new dumps
> contain the same rows but probably in very different order. Could the
> row-order have changed when doing COPY FROM with pg_restore?

I'd expect pg_dump/pg_restore to preserve the physical row ordering,
simply because it doesn't do anything that would change that.

However, restoring into an empty table would result in a table with
minimal free space, whereas the original table probably had a
meaningful amount of free space thanks to updates and deletes. Thus
for example TIDs would not be the same. If your "rolling checksum"
methodology is at all sensitive to page boundaries, the table would
look quite different to it.

			regards, tom lane
On Tue, Aug 26, 2025 at 4:31 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Tue, Aug 26, 2025 at 12:43 PM Dimitrios Apostolou <jimis@gmx.net> wrote:
Could the row-order have changed when doing COPY FROM with pg_restore?

There is no reliable, meaningful, row ordering when it comes to the physical files. Sure, cluster does make an attempt, but it is quite limited in practice.

A *logical* dump of data shouldn't be affected by on-disk order.
Internal representation shouldn't affect the output.

The logical dump has no ordering - it will come out however it comes out. "COPY <table> TO ..." doesn't have an ORDER BY clause - there is no way to make or communicate to it that ordering is important.
Doesn't COPY TO copy out records in the order they appeared in the physical files? That _seems_ to mean that the records laid down by COPY FROM should be in the same order as they were in the old dump files.
On Tue, Aug 26, 2025 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dimitrios Apostolou <jimis@gmx.net> writes:
> Unfortunately after I did pg_restore to a new server, I notice that the
> dumps from the new server are not being de-duplicated, all blocks are
> considered new.
> This means that the data has been significantly altered. The new dumps
> contain the same rows but probably in very different order. Could the
> row-order have changed when doing COPY FROM with pg_restore?
I'd expect pg_dump/pg_restore to preserve the physical row ordering,
simply because it doesn't do anything that would change that.
However, restoring into an empty table would result in a table with
minimal free space, whereas the original table probably had a
meaningful amount of free space thanks to updates and deletes. Thus
for example TIDs would not be the same. If your "rolling checksum"
methodology is at all sensitive to page boundaries, the table would
look quite different to it.
But the rolling checksums are against a pg_dump file, not a pg_basebackup file.
What probably changed are table OIDs. Would that change the ordering of COPY data in post-restore dump files?
Ron Johnson <ronljohnsonjr@gmail.com> writes:
> On Tue, Aug 26, 2025 at 4:31 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
>> The logical dump has no ordering - it will come out however it comes out.
>> "COPY <table> TO ..." doesn't have an ORDER BY clause - there is no way to
>> make or communicate to it that ordering is important.
> Doesn't COPY TO copy out records in the order they appeared in the physical
> files?

It emits whatever a sequential-scan plan would emit. If you set
synchronize_seqscans = off (which pg_dump does), that will match
physical row order. At least with our standard table AM. If you're
using Aurora or one of those other PG forks with proprietary storage
layers, you'd have to ask them.

I suspect the OP's problem is not row order per se, but differing
TIDs or XIDs, which are things pg_dump does not endeavor to
replicate. Or, given that he said something about blocks, maybe
he's actually sensitive to where the free space is.

			regards, tom lane
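To make that concrete, a rough sketch of reproducing this behaviour for a single table outside pg_dump (database and table names are placeholders):

    # emit one table in physical row order, roughly as pg_dump's data section does
    psql -d db_name -c "SET synchronize_seqscans = off; COPY my_table TO STDOUT" > my_table.copy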
Ron Johnson <ronljohnsonjr@gmail.com> writes:
> On Tue, Aug 26, 2025 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd expect pg_dump/pg_restore to preserve the physical row ordering,
>> simply because it doesn't do anything that would change that.
> But the rolling checksums are against a pg_dump file, not a pg_basebackup
> file.

Oh, that wasn't clear to me.

> What probably changed are table OIDs. Would that change the ordering of
> COPY data in post-restore dump files?

It would not change the order of data within any one table. There are
corner cases in which different OID assignments can cause pg_dump to
emit database objects in a different order, see this recent thread:

https://www.postgresql.org/message-id/flat/20250707192654.9e.nmisch%40google.com

			regards, tom lane
On 8/26/25 12:43, Dimitrios Apostolou wrote:
> Hello list,
>
> I am storing dumps of a database (pg_dump custom format) in a de-
> duplicating backup server. Each dump is many terabytes in size, so
> deduplication is very important. And de-duplication itself is based on
> rolling checksums which is pretty flexible, it can compensate for blocks
> moving by some offset.
>
> Unfortunately after I did pg_restore to a new server, I notice that the
> dumps from the new server are not being de-duplicated, all blocks are
> considered new.

What are the pg_dump/pg_restore commands?

What are the Postgres versions involved?

Are they community versions of Postgres or something else?

What is the deduplication program?

> Thanks in advance,
> Dimitris

--
Adrian Klaver
adrian.klaver@aklaver.com
On Wednesday 2025-08-27 00:54, Adrian Klaver wrote:
> What are the pg_dump/pg_restore commands?
>
> What are the Postgres versions involved?
>
> Are they community versions of Postgres or something else?
>
> What is the deduplication program?

Dump is from PostgreSQL 16; it's pg_dump writing to stdout:

pg_dump -v --format=custom --compress=none --no-toast-compression --serializable-deferrable db_name | borg create ...

As you can see, the backup (and deduplicating) program is borgbackup.

Restore is into PostgreSQL 17: I first create the empty tables by running the DDL commands in version control to set up the database, and then I do pg_restore --data-only:

pg_restore -vvvv -j 8 -U db_owner -d db_name --schema=public --section=data dump_file

Worth noting is that the above pg_restore goes through the WAL, i.e. all writes are done by walwriter, not the backend directly.

Postgres is standard open source running on our own server. It has a couple of custom patches that shouldn't matter in this code path.
On Wednesday 2025-08-27 00:08, Tom Lane wrote:
> However, restoring into an empty table would result in a table with
> minimal free space, whereas the original table probably had a
> meaningful amount of free space thanks to updates and deletes. Thus
> for example TIDs would not be the same. If your "rolling checksum"
> methodology is at all sensitive to page boundaries, the table would
> look quite different to it.

Thanks Tom. I'm not following how the empty space in the tables could affect the custom-format dumps. Where can I read more about these TIDs? Are they stored in the pg_dump custom format archive?

The rolling checksum method should iron out shifting of data chunks that are around a couple of MB in size. Shorter shifts will not be caught, and I assume the "page boundary" changes you mentioned would happen every 8KB, so that is definitely too fine-grained for the deduplication algorithm.

FYI, something I forgot to mention is that pg_restore is --data-only and writes go through walwriter. The database with the tables has been created from scratch, so every table is empty before the pg_restore. Not sure how this affects the above.

> I'd expect pg_dump/pg_restore to preserve the physical row ordering,
> simply because it doesn't do anything that would change that.

Regardless of my specific case, it's scary to think that doing VACUUM FULL, CLUSTER, or who knows what other maintenance command will modify the logical dumps. Some implicit ordering could be enforced by pg_dump if possible, for example when a primary key exists. Does it make sense? Is it even possible?

Thanks,
Dimitris
On Tuesday 2025-08-26 22:31, David G. Johnston wrote:
> The logical dump has no ordering - it will come out however it comes out.
> "COPY <table> TO ..." doesn't have an ORDER BY clause - there is no way to
> make or communicate to it that ordering is important. For ad hoc work you
> can use "COPY <query> TO ..." and put an ORDER BY in the query.

Thank you, so it's not possible currently. How would "COPY <query> TO" behave for copying very large tables? Would it make sense to optionally have that in pg_dump?

Or would it make sense as a new feature, to optionally order "COPY <table> TO ..." by the primary key where available, and expose that as a pg_dump option?

Dimitris
On Wednesday 2025-08-27 00:00, Ron Johnson wrote:
> This might be a silly question, but why are you using -Fc to create
> multi-TB dumps instead of -Fd? If nothing else, the parallelization
> granted by -Fd will greatly speed up the dumps.

Hi Ron, the primary reason is space. With -Fc I don't have to store the 10TB archive locally, and I don't have to wait for pg_dump to finish: I pipe it directly to the "borg create" command, which does compression and deduplication and sends new chunks to a remote borgbackup server.

Regards,
Dimitris

P.S. Please use "Reply-all" when replying; I've missed quite a few replies that I found on the list archives. :-)
On Tue, Aug 26, 2025 at 3:44 PM Dimitrios Apostolou <jimis@gmx.net> wrote:
I am storing dumps of a database (pg_dump custom format) in a
de-duplicating backup server. Each dump is many terabytes in size, so
deduplication is very important. And de-duplication itself is based on
rolling checksums which is pretty flexible, it can compensate for blocks
moving by some offset.
I suggest looking into pgBackRest and its block incremental feature, which sounds similar to what you are doing. But it also does it with parallel processes, and can do things like grab backup files from your replicas, plus a lot of other features.
Cheers,
Greg
--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support
Dimitrios Apostolou <jimis@gmx.net> writes:
> Dump is from PostgreSQL 16, it's pg_dump writing to stdout:
> pg_dump -v --format=custom --compress=none --no-toast-compression --serializable-deferrable db_name | borg create ...

Don't use --format=custom (and not -v either). That causes pg_dump to
include the OIDs and pg_dump object IDs of all the tables and other
objects, which will all be different in a dump from the new server.
The actual data contents of the tables should be the same, but
apparently the differences in the entry headers are enough to mislead
borgbackup.

You might be well advised to manually examine the data you are
stuffing into borgbackup. Right now we seem to be operating on
hypotheses, not facts, about what that looks like and how it's
different between your old and new server.

			regards, tom lane
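For what it's worth, one reading of that advice is a plain-format dump, sketched below with the rest of the pipeline left as in the original command. Note that a plain-format dump is restored with psql, so pg_restore's parallel and selective restore features are lost:

    # plain text format and no -v; the borg side of the pipeline is unchanged
    pg_dump --format=plain --compress=none --no-toast-compression \
            --serializable-deferrable db_name | borg create ...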
On Wed, Aug 27, 2025 at 10:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dimitrios Apostolou <jimis@gmx.net> writes:
> Dump is from PostgreSQL 16, it's pg_dump writing to stdout:
> pg_dump -v --format=custom --compress=none --no-toast-compression --serializable-deferrable db_name | borg create ...
Don't use --format=custom (and not -v either). That causes pg_dump to
include the OIDs and pg_dump object IDs of all the tables and other
objects,
That's interesting. Why? (Since isn't it supposed to be Bad to rely on OIDs?)
Ron Johnson <ronljohnsonjr@gmail.com> writes:
> On Wed, Aug 27, 2025 at 10:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Don't use --format=custom (and not -v either). That causes pg_dump to
>> include the OIDs and pg_dump object IDs of all the tables and other
>> objects,
> That's interesting. Why? (Since isn't it supposed to be Bad to rely on
> OIDs?)

-v in a text-format dump includes that data for debugging purposes:

--
-- TOC entry 1401 (class 1255 OID 16499)
-- Name: fipshash(text); Type: FUNCTION; Schema: public; Owner: postgres
--

(The "TOC entry" comment line wouldn't be there without -v.)
Then custom format has to store the same info so that pg_restore
can produce this identical text output on demand.

Yeah, this is all pretty historical, but nobody wants to change it
at this point.

			regards, tom lane
On Wed, Aug 27, 2025 at 10:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Ron Johnson <ronljohnsonjr@gmail.com> writes:
> On Wed, Aug 27, 2025 at 10:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Don't use --format=custom (and not -v either). That causes pg_dump to
>> include the OIDs and pg_dump object IDs of all the tables and other
>> objects,
> That's interesting. Why? (Since isn't it supposed to be Bad to rely on
> OIDs?)
-v in a text-format dump includes that data for debugging purposes:
--
-- TOC entry 1401 (class 1255 OID 16499)
-- Name: fipshash(text); Type: FUNCTION; Schema: public; Owner: postgres
--
(The "TOC entry" comment line wouldn't be there without -v.)
Then custom format has to store the same info so that pg_restore
can produce this identical text output on demand.
Ah, so the culprit is "-v". I like using -v, redirecting it to a log file (more info is almost always better), but then I rarely use pg_dump, and never pipe it to de-duplicators. (ExaGrid is supposed to deduplicate, but that's not going to stop me from using pgbackrest, compression and encryption; PCI auditors care about that, not deduplication.)
On 8/27/25 05:09, Dimitrios Apostolou wrote:
> Dump is from PostgreSQL 16, it's pg_dump writing to stdout:
>
> pg_dump -v --format=custom --compress=none --no-toast-compression --serializable-deferrable db_name | borg create ...
>
> As you can see, the backup (and deduplicating) program is borgbackup.

Ok, I use BorgBackup and it is fairly forgiving of normal changes.

FYI, if you ever want to use compression check out gzip --rsyncable; I have found it plays well with Borg. For more information see:

https://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/

> Restore is into PostgreSQL 17:
>
> I first create the empty tables by running the DDL commands in version
> control to set up the database, and then I do pg_restore --data-only:
>
> pg_restore -vvvv -j 8 -U db_owner -d db_name --schema=public --section=data dump_file

If you are using only the --data section, why not --data-only in the pg_dump? Or is the pg_dump output used for other purposes?

> Worth noting is that the above pg_restore goes through the WAL, i.e. all
> writes are done by walwriter, not the backend directly.

Please explain the above further.

The problem occurs when you do the pg_dump after this restore, correct?

Is it the same pg_dump command as you show above?

> Postgres is standard open source running on our own server. It has a couple
> of custom patches that shouldn't matter in this code path.

For completeness, and just in case they may affect the output, what do the patches do?

--
Adrian Klaver
adrian.klaver@aklaver.com
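For illustration, the --rsyncable idea would slot into the existing pipeline roughly like the sketch below (flag availability depends on the gzip build; zstd offers the same option):

    # compress with dedup-friendly block boundaries before handing the stream to borg
    pg_dump --format=custom --compress=none --no-toast-compression \
            --serializable-deferrable db_name | gzip --rsyncable | borg create ...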
On Wednesday 2025-08-27 17:25, Adrian Klaver wrote:
> Ok, I use BorgBackup and it is fairly forgiving of normal changes.
>
> FYI, if you ever want to use compression check out gzip --rsyncable; I have
> found it plays well with Borg. For more information see:
>
> https://beeznest.wordpress.com/2005/02/03/rsyncable-gzip/

Yes, zstd also has --rsyncable. In this case I let borg do per-chunk compression after deduplication; it has worked well so far.

> If you are using only the --data section, why not --data-only in the pg_dump?

I want the dump to be as complete as possible. I didn't think it would create issues.

> Or is the pg_dump output used for other purposes?

It has happened that I have selectively restored user schemas from that dump.

>> Worth noting is that the above pg_restore goes through the WAL, i.e. all
>> writes are done by walwriter, not the backend directly.
>
> Please explain the above further.

The COPY FROM data is going through the WAL, as usual INSERTs do. The writes to disk happen by the walwriter process.

On the other hand, if you have configured the server with wal_level=minimal and you BEGIN a transaction, CREATE or TRUNCATE a table, and then COPY FROM into that table, then the backend process writes directly to the table without logging to the WAL. This can be much faster, but most importantly it avoids situations of WAL overflow that are very difficult to predict and can mess your server up completely. [1]

[1] https://www.postgresql.org/message-id/flat/076464ad-3d70-dd25-9e8f-e84f27decfba%40gmx.net

My patches are for activating that codepath in pg_restore, but they were deliberately not used here, and I took note that the writes went via WAL.

> The problem occurs when you do the pg_dump after this restore, correct?

Correct. The first pg_dump from the restored pg17 is not deduplicated at all. Most of the tables have not changed (logically at least; apparently they have changed physically).

> Is it the same pg_dump command as you show above?

Yes.

> For completeness, and just in case they may affect the output, what do the
> patches do?

Two patches for speeding up scanning an archive without TOC, like the one I'm having (because it is piped into borg, instead of written to file). These were activated, but shouldn't matter; they only build the TOC in pg_restore's memory.

https://commitfest.postgresql.org/patch/5809/
https://commitfest.postgresql.org/patch/5817/

And two patches for speeding up pg_restore like mentioned above, under specific arguments that I didn't provide (one speedup needs --clean, and the other needs --freeze).

https://commitfest.postgresql.org/patch/5821/
https://commitfest.postgresql.org/patch/5826/

IIRC I did not activate them (via --clean) because TRUNCATE fails when foreign keys exist. See the discussion threads.

Dimitris
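For reference, the WAL-skipping load path described above looks roughly like the following sketch (it assumes wal_level = minimal; the database, table and file names are placeholders):

    # truncate and load in the same transaction so the backend can skip WAL-logging
    # the bulk data (requires wal_level = minimal); \copy reads the file client-side
    psql -d db_name <<'SQL'
    BEGIN;
    TRUNCATE my_table;
    \copy my_table FROM 'data.copy'
    COMMIT;
    SQL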
On 8/27/25 09:10, Dimitrios Apostolou wrote:
> On Wednesday 2025-08-27 17:25, Adrian Klaver wrote:
>> For completeness, and just in case they may affect the output, what do
>> the patches do?
>
> Two patches for speeding up scanning an archive without TOC, like the
> one I'm having (because it is piped into borg, instead of written to
> file). These were activated, but shouldn't matter. They only build the
> TOC in pg_restore's memory.

Are you sure about that?

I just did:

pg_dump -Fc --compress=none --no-toast-compression -d test -U postgres | borg create --stats --stdin-name pg_file --stdin-user aklaver --stdin-group aklaver borg_test/::PgTest -

Then:

borg mount borg_test/ mnt_tmp/
cd mnt_tmp/PgTest/

and then:

pg_restore -l pg_file

and I got a TOC.

Or are you streaming the data out of the Borg archive?

> https://commitfest.postgresql.org/patch/5809/
> https://commitfest.postgresql.org/patch/5817/
>
> And two patches for speeding up pg_restore like mentioned above, under
> specific arguments that I didn't provide (one speedup needs --clean,
> and the other needs --freeze).
>
> https://commitfest.postgresql.org/patch/5821/
> https://commitfest.postgresql.org/patch/5826/
>
> IIRC I did not activate them (via --clean) because TRUNCATE fails when
> foreign keys exist. See the discussion threads.

--
Adrian Klaver
adrian.klaver@aklaver.com
Sorry, I was not remembering the details. Probably there is a TOC in your dump file, but it does not contain any positions for the data. The pg_restore command has to scan the whole file in advance and fill in the TOC offsets in memory.

This scanning happens in a very inefficient way, with many seek calls and small block reads. Try strace to see them. This initial phase can take hours on a huge dump file, before even starting any actual restoration.

Thank you for testing.
Dimitris

On 30 August 2025 20:19:13 CEST, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>On 8/27/25 09:10, Dimitrios Apostolou wrote:
>> Two patches for speeding up scanning an archive without TOC, like the one I'm having (because it is piped into borg, instead of written to file). These were activated, but shouldn't matter. They only build the TOC in pg_restore's memory.
>
>Are you sure about that?
>
>I just did:
>
>pg_dump -Fc --compress=none --no-toast-compression -d test -U postgres | borg create --stats --stdin-name pg_file --stdin-user aklaver --stdin-group aklaver borg_test/::PgTest -
>
>Then:
>
>borg mount borg_test/ mnt_tmp/
>cd mnt_tmp/PgTest/
>
>and then:
>
>pg_restore -l pg_file
>
>and I got a TOC.
>
>Or are you streaming the data out of the Borg archive?
On 8/30/25 18:21, Dimitrios Apostolou wrote:
> Sorry, I was not remembering the details. Probably there is a TOC in your dump file, but it does not contain any positions for the data. The pg_restore command has to scan the whole file in advance and fill in the TOC offsets in memory.
>
> This scanning happens in a very inefficient way, with many seek calls and small block reads. Try strace to see them. This initial phase can take hours on a huge dump file, before even starting any actual restoration.

It may be that my coffee is not strong enough, but I don't understand what you are trying to say.

Are you using, from the previous post, the following?:

"Two patches for speeding up scanning an archive without TOC, like the one I'm having (because it is piped into borg, instead of written to file). These were activated, but shouldn't matter. They only build the TOC in pg_restore's memory."

The part I don't see is how you get a dump file without a TOC?

When I do the pg_dump and pipe it to Borg the resulting file has a TOC.

Can you show the rest of the | borg create ... command?

--
Adrian Klaver
adrian.klaver@aklaver.com
When I first said "dump file without TOC" I actually meant "without offsets in the TOC".

The fact that you get a TOC printed does not prove that the dump file includes a TOC with offsets.

All pg_dump -Fc commands that write to stdout produce a file without offsets in the TOC. It has nothing to do with borg. TOC offsets must be filled in right before streaming each table, but this is impossible when the whole TOC has already been written to stdout at the beginning.

Dimitris

On 31 August 2025 17:41:34 CEST, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>It may be that my coffee is not strong enough, but I don't understand what you are trying to say.
>
>The part I don't see is how you get a dump file without a TOC?
>
>When I do the pg_dump and pipe it to Borg the resulting file has a TOC.
>
>Can you show the rest of the | borg create ... command?
On 8/31/25 10:52, Dimitrios Apostolou wrote:
> When I first said "dump file without TOC" I actually meant "without offsets in the TOC".
>
> The fact that you get a TOC printed does not prove that the dump file includes a TOC with offsets.

I did some digging in the code and see that the TOC is more than that; it stores a range of data. I still have not found the part where the offsets are ignored for writes to stdout, but will keep on digging.

Getting back to your OP:

"Unfortunately after I did pg_restore to a new server, I notice that the dumps from the new server are not being de-duplicated, all blocks are considered new."

I ran a test on a much smaller database and I do not see the above. The commands and the Borg info for the archive are in the attached file.

> All pg_dump -Fc commands that write to stdout produce a file without offsets in the TOC. It has nothing to do with borg. TOC offsets must be filled in right before streaming each table, but this is impossible when the whole TOC has already been written to stdout at the beginning.
>
> Dimitris

--
Adrian Klaver
adrian.klaver@aklaver.com
Adrian Klaver <adrian.klaver@aklaver.com> writes:
> I did some digging in the code and see that the TOC is more than that;
> it stores a range of data. I still have not found the part where the
> offsets are ignored for writes to stdout, but will keep on digging.

The TOC is initially written out with zeroes for the offsets.
Then the per-table data parts are written out, tracking where
each one begins. At the end, if the output file is seekable,
pg_dump seeks back to the start and re-writes the whole TOC
section, now with data offsets populated. But output to a
pipe won't be seekable.

			regards, tom lane
On 9/1/25 10:54, Tom Lane wrote:
> The TOC is initially written out with zeroes for the offsets.
> Then the per-table data parts are written out, tracking where
> each one begins. At the end, if the output file is seekable,
> pg_dump seeks back to the start and re-writes the whole TOC
> section, now with data offsets populated. But output to a
> pipe won't be seekable.

Got it, thanks.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 2025-Aug-26, Dimitrios Apostolou wrote:
> I am storing dumps of a database (pg_dump custom format) in a de-duplicating
> backup server. Each dump is many terabytes in size, so deduplication is very
> important. And de-duplication itself is based on rolling checksums which is
> pretty flexible, it can compensate for blocks moving by some offset.

Hello,

It's generally considered nowadays that pg_dump is not the best option to create backups of very large databases. You may be better served by using a binary backup tool -- something like Barman. With current Postgres releases you can create incremental backups, which would probably be more effective at deduplicating than playing with pg_dump's TOC, because it's based on what actually happens to the data. Barman provides support for hook scripts, which perhaps can be used to transfer the backup files to Borg. (I haven't actually tried to do this, but the Barman developers talk about using them to transfer the backups to tape, so I imagine getting them to play with Borg is a Simple Matter of Programming.)

--
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
"On the other flipper, one wrong move and we're Fatal Exceptions"
(T.U.X.: Term Unit X - http://www.thelinuxreview.com/TUX/)
On 9/4/25 05:02, Álvaro Herrera wrote:
> It's generally considered nowadays that pg_dump is not the best option
> to create backups of very large databases. You may be better served by
> using a binary backup tool -- something like Barman. With current
> Postgres releases you can create incremental backups, which would
> probably be more effective at deduplicating than playing with pg_dump's
> TOC, because it's based on what actually happens to the data.

As I understand it, the TOC issue was with pg_restore having to generate the offsets, as they were not included in the backup file because it was streamed rather than written to a file.

The deduplication became an issue when changing Postgres versions, per:

https://www.postgresql.org/message-id/4ss66r31-558o-qq24-332q-no351p7n5osr%40tzk.arg

"
> The problem occurs when you do the pg_dump after this restore, correct?

Correct. The first pg_dump from the restored pg17 is not deduplicated at all. Most of the tables have not changed (logically at least; apparently they have changed physically).
"

--
Adrian Klaver
adrian.klaver@aklaver.com
Hi Álvaro and Greg,

On Thursday 2025-09-04 14:02, Álvaro Herrera wrote:
> It's generally considered nowadays that pg_dump is not the best option
> to create backups of very large databases. You may be better served by
> using a binary backup tool -- something like Barman. With current
> Postgres releases you can create incremental backups, which would
> probably be more effective at deduplicating than playing with pg_dump's
> TOC, because it's based on what actually happens to the data. Barman
> provides support for hook scripts, which perhaps can be used to transfer
> the backup files to Borg. (I haven't actually tried to do this, but the
> Barman developers talk about using them to transfer the backups to tape,
> so I imagine getting them to play with Borg is a Simple Matter of
> Programming.)

On Wed, 27 Aug 2025, Greg Sabino Mullane wrote:
> I suggest looking into pgBackRest and its block incremental feature,
> which sounds similar to what you are doing. But it also does it with
> parallel processes, and can do things like grab backup files from your
> replicas, plus a lot of other features.

If I'm not mistaken, both Barman and pgBackRest are based on physical dumps of the database (pg_basebackup). At the start of this project I had evaluated pg_basebackup, but decided logical backup fitted my needs better:

+ pg_basebackup was slower, measuring speeds of around 10MB/s, because of issues with the 8KB page size and compressed btrfs (see [1]; the situation has since improved both on the postgres side and the kernel side; I'm not sure how pg_basebackup fares today).

+ pg_basebackup was much bigger, because it includes indices etc. As a result of size and speed, pg_basebackup was also taking a longer time.

+ Physical dumps would change a lot during maintenance (VACUUM FULL, CLUSTER etc.) while the data would remain the same. This would reduce the effect of deduplication and increase size requirements even further. At that point in time I did not expect logical dumps to change too, when the data hasn't changed.

+ I use logical dumps as a tool, not only as a backup, to copy the database to other servers with different PostgreSQL versions.

+ I also use it to verify the VCS-committed SQL schema: doing pg_restore --data-only on an already created database will fail if the SQL schema had been modified on the original server without committing the changes.

+ Finally, I don't really need all the advanced features that physical replication offers, like HA, PITR and load balancing. It's a non-mission-critical service that can take a little time off in case of disaster recovery.

[1] https://www.postgresql.org/message-id/flat/218fa2e0-bc58-e469-35dd-c5cb35906064%40gmx.net

Regards,
Dimitris
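As an aside, the schema-verification flow from the list above can be sketched like this (database, schema file and dump names are placeholders):

    # recreate the schema from version control, then load only the data;
    # pg_restore errors out if the live schema has drifted from what is committed
    createdb verify_db
    psql -d verify_db -f schema_from_vcs.sql
    pg_restore --data-only --schema=public -d verify_db dump_file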
On 9/4/25 19:08, Dimitrios Apostolou wrote:
> If I'm not mistaken, both Barman and pgBackRest are based on physical
> dumps of the database (pg_basebackup). At the start of this project I
> had evaluated pg_basebackup, but decided logical backup fitted my needs
> better.

If that helps, this is a writeup I did some years ago comparing pgBackRest, Barman and pg_probackup:

https://severalnines.com/blog/current-state-open-source-backup-management-postgresql/