Discussion: Adding REPACK [concurrently]
Hello,

Here's a patch to add REPACK and eventually the CONCURRENTLY flag to it. This is coming from [1]. The ultimate goal is to have an in-core tool to allow concurrent table rewrite to get rid of bloat; right now, VACUUM FULL does that, but it's not concurrent. Users have resorted to using the pg_repack third-party tool, which is ancient and uses a weird internal implementation, as well as pg_squeeze, which uses logical decoding to capture changes that occur during the table rewrite. The patch submitted here, largely by Antonin Houska with some changes by me, is based on the pg_squeeze code which he authored, and first introduces a new command called REPACK to absorb both VACUUM FULL and CLUSTER, followed by addition of a CONCURRENTLY flag to allow some forms of REPACK to operate online using logical decoding.

Essentially, this first patch just reshuffles the CLUSTER code to create the REPACK command.

I made a few changes from Antonin's original at [2]. First, I modified the grammar to support "REPACK [tab] USING INDEX" without specifying the index name. With this change, all possibilities of the old commands are covered, which gives us the chance to flag them as obsolete. (This is good, because having VACUUM FULL do something completely different from regular VACUUM confuses users all the time; and on the other hand, having a command called CLUSTER which is at odds with what most people think of as a "database cluster" is also confusing.)

Here's a list of existing commands, and how to write them in the current patch's proposal for REPACK:

-- re-clusters all tables that have a clustered index set
CLUSTER                -> REPACK USING INDEX

-- clusters the given table using the given index
CLUSTER tab USING idx  -> REPACK tab USING INDEX idx

-- clusters this table using a clustered index; error if no index is clustered
CLUSTER tab            -> REPACK tab USING INDEX

-- vacuum-full all tables
VACUUM FULL            -> REPACK

-- vacuum-full the specified table
VACUUM FULL tab        -> REPACK tab

My other change to Antonin's patch is that I made REPACK USING INDEX set the 'indisclustered' flag on the index being used, so REPACK behaves identically to CLUSTER. We can discuss whether we really want this. For instance, we could add an option so that by default REPACK omits persisting the clustered index, and instead it only does that when you give it some special option, say something like "REPACK (persist_clustered_index=true) tab USING INDEX idx". Overall I'm not sure this is terribly interesting, since clustered indexes are not very useful for most users anyway.

I made a few other minor changes not worthy of individual mention, and there are a few others pending, such as updates to the pg_stat_progress_repack view infrastructure, as well as phasing out pg_stat_progress_cluster (maybe the latter would offer a subset of the former; not yet sure about this). Also, I'd like to work on adding a `repackdb` command for completeness.

On repackdb: I think it is going to be very similar to vacuumdb, mostly in that it is going to need to be able to run tasks in parallel; but there are things it doesn't have to deal with, such as analyze-in-stages, which I think is a large burden. I estimate about 1k LOC there, extremely similar to vacuumdb. Maybe it makes sense to share the source code and make the new executable a symlink instead, with some additional code to support the two different modes. Again, I'm not sure about this -- I like the idea, but I'd have to see the implementation.
I'll be rebasing the rest of Antonin's patch series afterwards, including the logical decoding changes necessary for CONCURRENTLY. In the meantime, if people want to review those, which would be very valuable, they can go back to branch master from around the time he submitted them and apply the old patches there.

[1] https://postgr.es/m/76278.1724760050@antos
[2] https://postgr.es/m/152010.1751307725@localhost

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Attachments
On Sat, Jul 26, 2025 at 5:56 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Hello,
>
> Here's a patch to add REPACK and eventually the CONCURRENTLY flag to it.
> This is coming from [1]. The ultimate goal is to have an in-core tool
> to allow concurrent table rewrite to get rid of bloat; right now, VACUUM
> FULL does that, but it's not concurrent. Users have resorted to using
> the pg_repack third-party tool, which is ancient and uses a weird
> internal implementation, as well as pg_squeeze, which uses logical
> decoding to capture changes that occur during the table rewrite. The
> patch submitted here, largely by Antonin Houska with some changes by me,
> is based on the pg_squeeze code which he authored, and first
> introduces a new command called REPACK to absorb both VACUUM FULL and
> CLUSTER, followed by addition of a CONCURRENTLY flag to allow some forms
> of REPACK to operate online using logical decoding.
>
> Essentially, this first patch just reshuffles the CLUSTER code to create
> the REPACK command.
>

Thanks for keeping this ball rolling.

>
> My other change to Antonin's patch is that I made REPACK USING INDEX set
> the 'indisclustered' flag on the index being used, so REPACK behaves
> identically to CLUSTER. We can discuss whether we really want this.
> For instance, we could add an option so that by default REPACK omits
> persisting the clustered index, and instead it only does that when you
> give it some special option, say something like
> "REPACK (persist_clustered_index=true) tab USING INDEX idx".
> Overall I'm not sure this is terribly interesting, since clustered
> indexes are not very useful for most users anyway.
>

I think I would lean towards having it work like CLUSTER (preserve the index), since that helps people making the transition, and it doesn't feel terribly useful to invent new syntax for a feature that I would agree isn't very useful for most people.

> I made a few other minor changes not worthy of individual mention, and
> there are a few others pending, such as updates to the
> pg_stat_progress_repack view infrastructure, as well as phasing out
> pg_stat_progress_cluster (maybe the latter would offer a subset of the
> former; not yet sure about this). Also, I'd like to work on adding a
> `repackdb` command for completeness.
>
> On repackdb: I think it is going to be very similar to vacuumdb, mostly in
> that it is going to need to be able to run tasks in parallel; but there
> are things it doesn't have to deal with, such as analyze-in-stages,
> which I think is a large burden. I estimate about 1k LOC there,
> extremely similar to vacuumdb. Maybe it makes sense to share the source
> code and make the new executable a symlink instead, with some additional
> code to support the two different modes. Again, I'm not sure about
> this -- I like the idea, but I'd have to see the implementation.
>
> I'll be rebasing the rest of Antonin's patch series afterwards,
> including the logical decoding changes necessary for CONCURRENTLY. In
> the meantime, if people want to review those, which would be very
> valuable, they can go back to branch master from around the time he
> submitted them and apply the old patches there.
>

For clarity, are you intending to commit this patch before having the other parts ready?
(If that sounds like an objection, it isn't.) After a first pass, I think there are some confusing bits in the new docs that could use straightening out, but those are likely going to overlap with changes once CONCURRENTLY is brought in, so it might make sense to hold off on them. Either way I definitely want to dive into this a bit deeper with some fresh eyes; there's a lot to digest... speaking of, for this bit in src/backend/commands/cluster.c:

+	switch (cmd)
+	{
+		case REPACK_COMMAND_REPACK:
+			return "REPACK";
+		case REPACK_COMMAND_VACUUMFULL:
+			return "VACUUM";
+		case REPACK_COMMAND_CLUSTER:
+			return "VACUUM";
+	}
+	return "???";

The last one should return "CLUSTER", no?

Robert Treat
https://xzilla.net
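For reference, the corrected version of the quoted hunk, with the last case returning the command name it actually represents (Álvaro confirms the fix downthread):

	switch (cmd)
	{
		case REPACK_COMMAND_REPACK:
			return "REPACK";
		case REPACK_COMMAND_VACUUMFULL:
			return "VACUUM";
		case REPACK_COMMAND_CLUSTER:
			return "CLUSTER";	/* was "VACUUM" in the posted patch */
	}
	return "???";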
On Sun, Jul 27, 2025 at 6:56 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> Hello,
>
> Here's a patch to add REPACK and eventually the CONCURRENTLY flag to it.
> This is coming from [1]. The ultimate goal is to have an in-core tool
> to allow concurrent table rewrite to get rid of bloat;

+1

> right now, VACUUM
> FULL does that, but it's not concurrent. Users have resorted to using
> the pg_repack third-party tool, which is ancient and uses a weird
> internal implementation, as well as pg_squeeze, which uses logical
> decoding to capture changes that occur during the table rewrite. The
> patch submitted here, largely by Antonin Houska with some changes by me,
> is based on the pg_squeeze code which he authored, and first
> introduces a new command called REPACK to absorb both VACUUM FULL and
> CLUSTER, followed by addition of a CONCURRENTLY flag to allow some forms
> of REPACK to operate online using logical decoding.

Does this mean REPACK CONCURRENTLY requires wal_level = logical, while plain REPACK (without CONCURRENTLY) works with any wal_level setting? If we eventually deprecate VACUUM FULL and CLUSTER, I think plain REPACK should still be allowed with wal_level = minimal or replica, so users with those settings can perform equivalent processing.

+	if (!cluster_is_permitted_for_relation(tableOid, userid,
+										   CLUSTER_COMMAND_CLUSTER))

As for the patch you attached, it seems to be an early WIP and might not be ready for review yet?? BTW, I got the following compilation failure on CLUSTER_COMMAND_CLUSTER, and probably the 'userid' above should be GetUserId().

-----------------
cluster.c:455:14: error: use of undeclared identifier 'CLUSTER_COMMAND_CLUSTER'
  455 |                         CLUSTER_COMMAND_CLUSTER))
      |                         ^
1 error generated.
-----------------

Regards,

--
Fujii Masao
On 2025-Jul-26, Robert Treat wrote:

> For clarity, are you intending to commit this patch before having the
> other parts ready? (If that sounds like an objection, it isn't.) After
> a first pass, I think there are some confusing bits in the new docs that
> could use straightening out, but those are likely going to overlap with
> changes once CONCURRENTLY is brought in, so it might make sense to hold
> off on them.

I'm aiming at getting 0001 committed during the September commitfest, and the CONCURRENTLY flag addition later in the pg19 cycle. But I'd rather have good-enough docs at every step of the way. They don't have to be *perfect* if we want to get everything in pg19, but I'd rather not leave anything openly confusing even transiently. That said, I did not review the docs this time around, so here they are, the same as they were in the previous post. But if you want to suggest changes for the docs in 0001, please do. Just don't get too carried away.

> speaking of, for this bit in src/backend/commands/cluster.c
>
> +	switch (cmd)
> +	{
> +		case REPACK_COMMAND_REPACK:
> +			return "REPACK";
> +		case REPACK_COMMAND_VACUUMFULL:
> +			return "VACUUM";
> +		case REPACK_COMMAND_CLUSTER:
> +			return "VACUUM";
> +	}
> +	return "???";
>
> The last one should return "CLUSTER", no?

Absolutely -- my blunder.

On 2025-Jul-27, Fujii Masao wrote:

> > The patch submitted here, largely by Antonin Houska with some
> > changes by me, is based on the pg_squeeze code which he
> > authored, and first introduces a new command called REPACK to absorb
> > both VACUUM FULL and CLUSTER, followed by addition of a CONCURRENTLY
> > flag to allow some forms of REPACK to operate online using logical
> > decoding.
>
> Does this mean REPACK CONCURRENTLY requires wal_level = logical, while
> plain REPACK (without CONCURRENTLY) works with any wal_level setting?
> If we eventually deprecate VACUUM FULL and CLUSTER, I think plain
> REPACK should still be allowed with wal_level = minimal or replica, so
> users with those settings can perform equivalent processing.

Absolutely. One of the later patches in the series, which I have not included yet, intends to implement the idea of transiently enabling wal_level=logical for the table being repacked concurrently, so that you can still use the concurrent mode if you have a non-logical-wal_level instance.

> +	if (!cluster_is_permitted_for_relation(tableOid, userid,
> +										   CLUSTER_COMMAND_CLUSTER))
>
> As for the patch you attached, it seems to be an early WIP and
> might not be ready for review yet?? BTW, I got the following
> compilation failure on CLUSTER_COMMAND_CLUSTER, and probably
> the 'userid' above should be GetUserId().

This was a silly merge mistake, caused by my squashing Antonin's 0004 (trivial code restructuring) into 0001 at the last minute and failing to "git add" the compile fixes before doing git-format-patch.

Here's v17. (I decided that calling my previous one "v1" after Antonin had gone all the way to v15 was stupid on my part.) The important part here is that I rebased Antonin's 0004, that is, the addition of the CONCURRENTLY flag, plus the 0005 regression tests. The only interesting change here is that I decided not to mess with the grammar by allowing an unparenthesized CONCURRENTLY keyword; if you want concurrent, you have to say "REPACK (CONCURRENTLY)". This is at odds with the way we use the keyword in other commands, but ISTM we don't _need_ to support that legacy syntax. Anyway, this is easy to put back afterwards, if enough people find it useful.
I've not reviewed 0003 in depth yet, just rebased it. But it works to the point that CI is happy with it. I've not yet included Antonin's 0006 and 0007.

TODO list for 0001:
- addition of src/bin/scripts/repackdb
- clean up the progress report infrastructure
- doc review

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for surely where thou typest "foo" someone someday shall type "supercalifragilisticexpialidocious" (5th Commandment for C programmers)
Attachments
On Fri, Aug 1, 2025 at 1:50 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> One of the later patches in the series, which I have not included yet,
> intends to implement the idea of transiently enabling wal_level=logical
> for the table being repacked concurrently, so that you can still use
> the concurrent mode if you have a non-logical-wal_level instance.

Sounds good to me!

> Here's v17.

I just tried the REPACK command and observed a few things:

When I repeatedly ran REPACK on the regression database while "make installcheck" was running, I got the following error:

  ERROR:  StartTransactionCommand: unexpected state STARTED

"REPACK (VERBOSE);" failed with the following error:

  ERROR:  syntax error at or near ";"

REPACK (CONCURRENTLY) ... USING INDEX failed with the following error, while the same command without CONCURRENTLY completed successfully:

  =# REPACK (CONCURRENTLY) parallel_vacuum_table using index regular_sized_index;
  ERROR:  cannot process relation "parallel_vacuum_table"
  HINT:  Relation "parallel_vacuum_table" has no identity index.

When I ran REPACK (CONCURRENTLY) on a table that's also a logical replication target, I saw the following log messages. Is this expected?

  =# REPACK (CONCURRENTLY) t;
  LOG:  logical decoding found consistent point at 1/00021F20
  DETAIL:  There are no running transactions.
  STATEMENT:  REPACK (CONCURRENTLY) t;

Regards,

--
Fujii Masao
Hello Álvaro,

Should we skip non-ready indexes in build_new_indexes?

Also, in the current implementation, concurrent mode is marked as non-MVCC-safe. From my point of view, this is a significant limitation for practical use. Should we consider an option to trade the non-MVCC issues for a short exclusive lock on ProcArrayLock plus cancellation of some transactions with older xmins?

Best regards,
Mikhail
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> I made a few changes from Antonin's original at [2]. First, I modified
> the grammar to support "REPACK [tab] USING INDEX" without specifying the
> index name. With this change, all possibilities of the old commands are
> covered,

...

> Here's a list of existing commands, and how to write them in the current
> patch's proposal for REPACK:
>
> -- re-clusters all tables that have a clustered index set
> CLUSTER                -> REPACK USING INDEX
>
> -- clusters the given table using the given index
> CLUSTER tab USING idx  -> REPACK tab USING INDEX idx
>
> -- clusters this table using a clustered index; error if no index is clustered
> CLUSTER tab            -> REPACK tab USING INDEX
>
> -- vacuum-full all tables
> VACUUM FULL            -> REPACK
>
> -- vacuum-full the specified table
> VACUUM FULL tab        -> REPACK tab

Now that we want to cover CLUSTER/VACUUM FULL completely, I've checked the options of VACUUM FULL. I found two items not supported by REPACK (but also not supported by CLUSTER): ANALYZE and SKIP_DATABASE_STATS. Maybe let's just mention that in the user documentation of REPACK?

(Besides that, VACUUM FULL accepts the TRUNCATE and INDEX_CLEANUP options, but I think these have no effect.)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello!

One more thing - I think build_new_indexes and index_concurrently_create_copy are very close in semantics, so it might be a good idea to refactor them a bit.

I'm still concerned about MVCC-related issues. For multiple applications, this is a dealbreaker, because in some cases correctness is a higher priority than availability. Possible options:

1) Terminate connections with old snapshots. Add a flag to terminate all connections with snapshots during the ExclusiveLock period for the swap. From the application's perspective, this is not a big deal - it's similar to a primary switchover. We would also need to prevent new snapshots from being taken during the swap transaction, so a short exclusive lock on ProcArrayLock would also be required.

2) MVCC-safe two-phase approach (inspired by CREATE INDEX).
- copy the data from T1 to the new table T2.
- apply the log.
- take a table-exclusive lock on T1.
- apply the log again.
- instead of swapping, mark T2 as a kind of shadow table - any transaction applying changes to T1 must also apply them to T2, while reads still use T1 as the source of truth.
- commit (and record the transaction ID as XID1).
- at this point, all changes are applied to both tables with the same XIDs because of the "shadow table" mechanism.
- wait until older snapshots no longer treat XID1 as uncommitted.
- now the tables are identical from the MVCC perspective.
- take an exclusive lock on both T1 and T2.
- perform the swap and drop T1.
- commit.

This is more complex and would require implementing some sort of "shadow table" mechanism, so it might not be worth the effort. Option 1 feels more appealing to me. If others think this is a good idea, I might try implementing a proof of concept.

Best regards,
Mikhail
On 2025-Aug-09, Mihail Nikalayeu wrote:

> Hello!
>
> One more thing - I think build_new_indexes and
> index_concurrently_create_copy are very close in semantics, so it
> might be a good idea to refactor them a bit.
>
> I'm still concerned about MVCC-related issues. For multiple
> applications, this is a dealbreaker, because in some cases correctness
> is a higher priority than availability.

Please note that Antonin already implemented this. See his patches here:
https://www.postgresql.org/message-id/77690.1725610115%40antos

I proposed to leave this part out initially, which is why it hasn't been reposted. We can review and discuss after the initial patches are in. Because having an MVCC-safe mode has drawbacks, IMO we should make it optional.

But you're welcome to review that part specifically if you're so inclined, and offer feedback on it. (I suggest rewinding your checked-out tree to the master branch as of the time that patch was posted, for easy application. We can deal with a rebase later.)

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you!
(West Side Story, "Gee, Officer Krupke")
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

> One more thing - I think build_new_indexes and
> index_concurrently_create_copy are very close in semantics, so it
> might be a good idea to refactor them a bit.

You're right. I think I even used the latter for reference when writing the former.

0002 in the attached series tries to fix that. build_new_indexes() (in 0004) is simpler now.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments
Antonin Houska <ah@cybertec.at> wrote:

> Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
>
> > One more thing - I think build_new_indexes and
> > index_concurrently_create_copy are very close in semantics, so it
> > might be a good idea to refactor them a bit.
>
> You're right. I think I even used the latter for reference when writing the
> former.
>
> 0002 in the attached series tries to fix that. build_new_indexes() (in 0004)
> is simpler now.

This is v18 again. Parts 0001 through 0004 are unchanged, however 0005 is added. It implements a new client application, pg_repackdb.

(If I posted 0005 alone, its regression tests would not work. I wonder if the cfbot handles the repeated occurrence of the 'v18-' prefix correctly.)

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments
On 2025-Aug-15, Antonin Houska wrote:

> This is v18 again.

Thanks for this!

> Parts 0001 through 0004 are unchanged, however 0005 is added. It
> implements a new client application, pg_repackdb. (If I posted 0005
> alone, its regression tests would not work. I wonder if the cfbot
> handles the repeated occurrence of the 'v18-' prefix correctly.)

Yeah, the cfbot is just going to take the attachments from the latest email in the thread that has any, and assume they make up the whole patch. It wouldn't work to post just v18-0005 and assume that the bot is going to grab patches 0001 through 0004 from a previous email, if that's what you're thinking. In short, what you did is correct and necessary.

--
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
On Tue, Aug 5, 2025 at 4:59 AM Antonin Houska <ah@cybertec.at> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> > I made a few changes from Antonin's original at [2]. First, I modified
> > the grammar to support "REPACK [tab] USING INDEX" without specifying the
> > index name. With this change, all possibilities of the old commands are
> > covered,
>
> ...
>
> > Here's a list of existing commands, and how to write them in the current
> > patch's proposal for REPACK:
> >
> > -- re-clusters all tables that have a clustered index set
> > CLUSTER                -> REPACK USING INDEX
> >
> > -- clusters the given table using the given index
> > CLUSTER tab USING idx  -> REPACK tab USING INDEX idx
> >
> > -- clusters this table using a clustered index; error if no index is clustered
> > CLUSTER tab            -> REPACK tab USING INDEX

In the v18 patch, the docs say that REPACK doesn't remember the index, but it seems we are still calling mark_index_clustered, so I think the above is true but we need to update the docs(?).

> > -- vacuum-full all tables
> > VACUUM FULL            -> REPACK
> >
> > -- vacuum-full the specified table
> > VACUUM FULL tab        -> REPACK tab
>
> Now that we want to cover CLUSTER/VACUUM FULL completely, I've checked the
> options of VACUUM FULL. I found two items not supported by REPACK (but also
> not supported by CLUSTER): ANALYZE and SKIP_DATABASE_STATS. Maybe let's
> just mention that in the user documentation of REPACK?

I would note that both pg_repack and pg_squeeze analyze by default, and running "vacuum full analyze" is the recommended behavior, so not having analyze included is a step backwards.

> (Besides that, VACUUM FULL accepts the TRUNCATE and INDEX_CLEANUP options,
> but I think these have no effect.)

Yeah, these seem safe to ignore.

Robert Treat
https://xzilla.net
On 2025-Aug-16, Robert Treat wrote:

> On Tue, Aug 5, 2025 at 4:59 AM Antonin Houska <ah@cybertec.at> wrote:
> > Now that we want to cover CLUSTER/VACUUM FULL completely, I've checked the
> > options of VACUUM FULL. I found two items not supported by REPACK (but also
> > not supported by CLUSTER): ANALYZE and SKIP_DATABASE_STATS. Maybe let's
> > just mention that in the user documentation of REPACK?
>
> I would note that both pg_repack and pg_squeeze analyze by default,
> and running "vacuum full analyze" is the recommended behavior, so not
> having analyze included is a step backwards.

Makes sense to add ANALYZE as an option to REPACK, yeah.

So if I repack a single table with

  REPACK (ANALYZE) table USING INDEX;

then do you expect that this would first cluster the table under AccessExclusiveLock and then release the lock to do the analyze step, or would the analyze be done under the same lock? This is significant for a query that starts while REPACK is running, because if we release the AEL, then the query is planned when there are no stats for the table, which might be bad.

I think the time to run the analyze step should be considerably shorter than the time to run the repacking step, so running both together under the same lock should be okay.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Computing is too important to be left to men." (Karen Spärck Jones)
On 2025-Aug-16, Robert Treat wrote:

> In the v18 patch, the docs say that REPACK doesn't remember the index,
> but it seems we are still calling mark_index_clustered, so I think the
> above is true but we need to update the docs(?).

Yes, the docs are obsolete on this point; I'm in the process of updating them. Thanks for pointing this out.

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Victory is for those who dare to stand alone"
Hello,

Here's a second cut of the initial REPACK work. Antonin added an implementation of pg_repackdb, and there are also a couple of fixes for bugs that were reported in the thread. I also added support for the ANALYZE option, as noted by Robert Treat, though it only works if you specify a single non-partitioned table. Adding it for the multi-table case is likely easy, but I didn't try.

I purposefully do not include the CONCURRENTLY work yet -- I want to get this part committable-clean first, then we can continue work on the logical decoding work on top of that.

Note the choice of shell command name: though all the other programs in src/bin/scripts do not use the "pg_" prefix, this one does; we thought it made no sense to follow the old programs as precedent, because there seems to be a lament for the lack of a pg_ prefix in those, and we only keep them as they are because of their long history. This one has no history.

Still on pg_repackdb, the implementation here is to install a symlink called pg_repackdb which points to vacuumdb, and make the program behave differently when called in this way. The amount of additional code for this is relatively small, so I think this is a worthy technique -- assuming it works. If it doesn't, Antonin proposed a separate binary that just calls some functions from vacuumdb. Or maybe we could have a common source file that both utilities call.

I edited the docs a bit, limiting the exposure of CLUSTER and VACUUM FULL, and instead redirecting the user to the REPACK docs. In the REPACK docs I modified things for additional clarity.

--
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
"Even the greatest genius would not get very far if he wanted to owe everything to his own inner self." (Johann Wolfgang von Goethe)
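A minimal sketch of the dispatch-on-invocation-name technique described above, assuming PostgreSQL's existing get_progname() from src/port/path.c; the RunMode enum and detect_mode() are illustrative names, not the patch's actual code:

	#include <string.h>
	#include "postgres_fe.h"	/* frontend environment; declares get_progname() */

	typedef enum
	{
		MODE_VACUUMDB,			/* classic vacuumdb behavior */
		MODE_REPACKDB			/* issue REPACK commands instead */
	} RunMode;

	/*
	 * Decide which personality to run based on how the binary was invoked:
	 * "pg_repackdb" (the installed symlink) selects REPACK mode; anything
	 * else behaves as plain vacuumdb.
	 */
	static RunMode
	detect_mode(const char *argv0)
	{
		const char *progname = get_progname(argv0); /* strips dir and .exe */

		if (strcmp(progname, "pg_repackdb") == 0)
			return MODE_REPACKDB;
		return MODE_VACUUMDB;
	}

This is the same trick busybox-style multi-call binaries use; the rest of the code only needs to branch on the mode in a few places (option parsing and command generation).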
Attachments
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> On 2025-Aug-16, Robert Treat wrote:
>
> > On Tue, Aug 5, 2025 at 4:59 AM Antonin Houska <ah@cybertec.at> wrote:
> >
> > > Now that we want to cover CLUSTER/VACUUM FULL completely, I've checked the
> > > options of VACUUM FULL. I found two items not supported by REPACK (but also
> > > not supported by CLUSTER): ANALYZE and SKIP_DATABASE_STATS. Maybe let's
> > > just mention that in the user documentation of REPACK?
> >
> > I would note that both pg_repack and pg_squeeze analyze by default,
> > and running "vacuum full analyze" is the recommended behavior, so not
> > having analyze included is a step backwards.
>
> Makes sense to add ANALYZE as an option to REPACK, yeah.
>
> So if I repack a single table with
>
>   REPACK (ANALYZE) table USING INDEX;
>
> then do you expect that this would first cluster the table under
> AccessExclusiveLock and then release the lock to do the analyze step, or
> would the analyze be done under the same lock? This is significant for
> a query that starts while REPACK is running, because if we release the
> AEL, then the query is planned when there are no stats for the table,
> which might be bad.
>
> I think the time to run the analyze step should be considerably shorter
> than the time to run the repacking step, so running both together under
> the same lock should be okay.

AFAICS, VACUUM FULL first releases the AEL, then it analyzes the table. If users have not complained so far, I'd assume that vacuum_rel() (effectively cluster_rel() in the FULL case) does not change the stats that much.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Álvaro Herrera <alvherre@kurilemu.de> wrote:

> Still on pg_repackdb, the implementation here is to install a symlink
> called pg_repackdb which points to vacuumdb, and make the program behave
> differently when called in this way. The amount of additional code for
> this is relatively small, so I think this is a worthy technique --
> assuming it works. If it doesn't, Antonin proposed a separate binary
> that just calls some functions from vacuumdb. Or maybe we could have a
> common source file that both utilities call.

There's an issue with the symlink; maybe some meson expert can help. In particular, the CI on Windows ends up with the following error:

  ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb

(The reason it does not happen on other platforms might be that the build is slower on Windows, and thus it's more prone to some specific race conditions.)

It appears that the 'point_to' argument of the install_symlink() function [1] is only a string rather than a "real target" [2]. That's likely the reason the function does not wait for the creation of the 'vacuumdb' executable.

I could not find another symlink of this kind in the tree. (AFAICS, the postmaster->postgres symlink had been removed before Meson was introduced.)

Does anyone happen to have a clue? Thanks.

[1] https://mesonbuild.com/Reference-manual_functions.html#install_symlink
[2] https://mesonbuild.com/Reference-manual_returned_tgt.html

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On 2025-Aug-20, Antonin Houska wrote:

> There's an issue with the symlink; maybe some meson expert can help. In
> particular, the CI on Windows ends up with the following error:
>
>   ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb

Hmm, that's not the problem I see in the CI run from the commitfest app:
https://cirrus-ci.com/task/5608274336153600

[19:11:00.642] FAILED: [code=2] src/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj
[19:11:00.642] "cl" "-Isrc\bin\scripts\vacuumdb.exe.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "-Isrc/interfaces/libpq" "-I..\src\interfaces\libpq" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/Zc:preprocessor" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "/Fdsrc\bin\scripts\vacuumdb.exe.p\vacuumdb.c.pdb" /Fosrc/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj "/c" ../src/bin/scripts/vacuumdb.c
[19:11:00.642] ../src/bin/scripts/vacuumdb.c(186): error C2059: syntax error: '}'
[19:11:00.642] ../src/bin/scripts/vacuumdb.c(197): warning C4034: sizeof returns 0

The real problem here seems to be the empty long_options_repack array. I removed it and started a new run to see what happens. Running now:
https://cirrus-ci.com/build/4961902171783168

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
Álvaro Herrera <alvherre@kurilemu.de> wrote:

> On 2025-Aug-20, Antonin Houska wrote:
>
> > There's an issue with the symlink; maybe some meson expert can help. In
> > particular, the CI on Windows ends up with the following error:
> >
> >   ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb
>
> Hmm, that's not the problem I see in the CI run from the commitfest app:
>
> https://cirrus-ci.com/task/5608274336153600

I was referring to the other build that you shared off-list (probably independent from the cfbot):

https://cirrus-ci.com/build/4726227505774592

> [19:11:00.642] FAILED: [code=2] src/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj
> [19:11:00.642] "cl" "-Isrc\bin\scripts\vacuumdb.exe.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "-Isrc/interfaces/libpq" "-I..\src\interfaces\libpq" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/Zc:preprocessor" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "/Fdsrc\bin\scripts\vacuumdb.exe.p\vacuumdb.c.pdb" /Fosrc/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj "/c" ../src/bin/scripts/vacuumdb.c
> [19:11:00.642] ../src/bin/scripts/vacuumdb.c(186): error C2059: syntax error: '}'
> [19:11:00.642] ../src/bin/scripts/vacuumdb.c(197): warning C4034: sizeof returns 0
>
> The real problem here seems to be the empty long_options_repack array.
> I removed it and started a new run to see what happens. Running now:
> https://cirrus-ci.com/build/4961902171783168

The symlink issue occurred on "Windows - Server 2019, MinGW64 - Meson", where the code compiled fine. The compilation failure mentioned above comes from "Windows - Server 2019, VS 2019 - Meson & ninja". I think it's still possible that the symlink issue will occur there once the compilation is fixed.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hi,

On 2025-08-20 16:22:41 +0200, Antonin Houska wrote:
> Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
> > On 2025-Aug-20, Antonin Houska wrote:
> >
> > > There's an issue with the symlink; maybe some meson expert can help. In
> > > particular, the CI on Windows ends up with the following error:
> > >
> > >   ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb
> >
> > Hmm, that's not the problem I see in the CI run from the commitfest app:
> >
> > https://cirrus-ci.com/task/5608274336153600
>
> I was referring to the other build that you shared off-list (probably
> independent from the cfbot):
>
> https://cirrus-ci.com/build/4726227505774592
>
> > [19:11:00.642] FAILED: [code=2] src/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj
> > [19:11:00.642] "cl" "-Isrc\bin\scripts\vacuumdb.exe.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "-Isrc/interfaces/libpq" "-I..\src\interfaces\libpq" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/Zc:preprocessor" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "/Fdsrc\bin\scripts\vacuumdb.exe.p\vacuumdb.c.pdb" /Fosrc/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj "/c" ../src/bin/scripts/vacuumdb.c
> > [19:11:00.642] ../src/bin/scripts/vacuumdb.c(186): error C2059: syntax error: '}'
> > [19:11:00.642] ../src/bin/scripts/vacuumdb.c(197): warning C4034: sizeof returns 0
> >
> > The real problem here seems to be the empty long_options_repack array.
> > I removed it and started a new run to see what happens. Running now:
> > https://cirrus-ci.com/build/4961902171783168
>
> The symlink issue occurred on "Windows - Server 2019, MinGW64 - Meson", where
> the code compiled fine. The compilation failure mentioned above comes from
> "Windows - Server 2019, VS 2019 - Meson & ninja". I think it's still possible
> that the symlink issue will occur there once the compilation is fixed.

FWIW, I don't think it's particularly wise to rely on symlinks on Windows -- IIRC they will often not be enabled outside of development environments.

Greetings,

Andres Freund
Hello everyone!

Alvaro Herrera <alvherre@alvh.no-ip.org>:

> Please note that Antonin already implemented this. See his patches here:
> https://www.postgresql.org/message-id/77690.1725610115%40antos
> I proposed to leave this part out initially, which is why it hasn't been
> reposted. We can review and discuss after the initial patches are in.

I think it is worth pushing it at least in the same release cycle.

> But you're welcome to review that part specifically if you're so
> inclined, and offer feedback on it. (I suggest rewinding your
> checked-out tree to the master branch as of the time that patch was
> posted, for easy application. We can deal with a rebase later.)

I have rebased that on top of v18 (attached).

Also, I think I found an issue (or lost something during the rebase): we must preserve xmin/cmin during the initial copy to make sure that the data is going to be visible to the snapshots of concurrent changes applied later:

static void
reform_and_rewrite_tuple(......)
{
	.....

	/*
	 * It is also crucial to stamp the new record with the exact same xid and
	 * cid, because the tuple must be visible to the snapshot of the applied
	 * concurrent change later.
	 */
	CommandId	cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
	TransactionId xid = HeapTupleHeaderGetXmin(tuple->t_data);

	heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);

I'll try to polish that part a little bit.

> Because having an MVCC-safe mode has drawbacks, IMO we should make it
> optional.

Do you mean some option for the command? Like REPACK (CONCURRENTLY, SAFE)?

Best regards,
Mikhail.
Attachments
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

> Also, I think I found an issue (or lost something during the rebase): we
> must preserve xmin/cmin during the initial copy
> to make sure that the data is going to be visible to the snapshots of
> concurrent changes applied later:
>
> static void
> reform_and_rewrite_tuple(......)
> {
> 	.....
>
> 	/*
> 	 * It is also crucial to stamp the new record with the exact same xid and
> 	 * cid, because the tuple must be visible to the snapshot of the applied
> 	 * concurrent change later.
> 	 */
> 	CommandId	cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
> 	TransactionId xid = HeapTupleHeaderGetXmin(tuple->t_data);
>
> 	heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);

When posting version 12 of the patch [1] I raised a concern that the MVCC safety is too expensive when it comes to logical decoding. Therefore, I abandoned the concept for now, and v13 [2] uses plain heap_insert(). Once we implement the MVCC safety, we simply rewrite the tuple like v12 did - that's the simplest way to preserve fields like xmin, cmin, ...

[1] https://www.postgresql.org/message-id/178741.1743514291%40localhost
[2] https://www.postgresql.org/message-id/97795.1744363522%40localhost

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2025-08-20 16:22:41 +0200, Antonin Houska wrote:
> > Álvaro Herrera <alvherre@kurilemu.de> wrote:
> >
> > > On 2025-Aug-20, Antonin Houska wrote:
> > >
> > > > There's an issue with the symlink; maybe some meson expert can help. In
> > > > particular, the CI on Windows ends up with the following error:
> > > >
> > > >   ERROR: Tried to install symlink to missing file C:/cirrus/build/tmp_install/usr/local/pgsql/bin/vacuumdb
> > >
> > > Hmm, that's not the problem I see in the CI run from the commitfest app:
> > >
> > > https://cirrus-ci.com/task/5608274336153600
> >
> > I was referring to the other build that you shared off-list (probably
> > independent from the cfbot):
> >
> > https://cirrus-ci.com/build/4726227505774592
> >
> > > [19:11:00.642] FAILED: [code=2] src/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj
> > > [19:11:00.642] "cl" "-Isrc\bin\scripts\vacuumdb.exe.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "-Isrc/interfaces/libpq" "-I..\src\interfaces\libpq" "/MDd" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/Zc:preprocessor" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "/Fdsrc\bin\scripts\vacuumdb.exe.p\vacuumdb.c.pdb" /Fosrc/bin/scripts/vacuumdb.exe.p/vacuumdb.c.obj "/c" ../src/bin/scripts/vacuumdb.c
> > > [19:11:00.642] ../src/bin/scripts/vacuumdb.c(186): error C2059: syntax error: '}'
> > > [19:11:00.642] ../src/bin/scripts/vacuumdb.c(197): warning C4034: sizeof returns 0
> > >
> > > The real problem here seems to be the empty long_options_repack array.
> > > I removed it and started a new run to see what happens. Running now:
> > > https://cirrus-ci.com/build/4961902171783168
> >
> > The symlink issue occurred on "Windows - Server 2019, MinGW64 - Meson", where
> > the code compiled fine. The compilation failure mentioned above comes from
> > "Windows - Server 2019, VS 2019 - Meson & ninja". I think it's still possible
> > that the symlink issue will occur there once the compilation is fixed.
>
> FWIW, I don't think it's particularly wise to rely on symlinks on Windows --
> IIRC they will often not be enabled outside of development environments.

OK, installing a copy of the same executable with a different name seems more reliable. At least that's how the postmaster->postgres link used to be handled, if I read the Makefile correctly. Thanks.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hi,

On 2025-08-21 20:14:14 +0200, Antonin Houska wrote:
> OK, installing a copy of the same executable with a different name seems more
> reliable. At least that's how the postmaster->postgres link used to be
> handled, if I read the Makefile correctly. Thanks.

I have not followed this thread, but I don't think the whole business of having a single executable with multiple names is worth doing. Just make it an option, instead of having multiple "executables".

Greetings,

Andres
On Tue, Aug 19, 2025 at 2:53 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
> Note the choice of shell command name: though all the other programs in
> src/bin/scripts do not use the "pg_" prefix, this one does; we thought
> it made no sense to follow the old programs as precedent, because there
> seems to be a lament for the lack of a pg_ prefix in those, and we only
> keep them as they are because of their long history. This one has no
> history.
>
> Still on pg_repackdb, the implementation here is to install a symlink
> called pg_repackdb which points to vacuumdb, and make the program behave
> differently when called in this way. The amount of additional code for
> this is relatively small, so I think this is a worthy technique --
> assuming it works. If it doesn't, Antonin proposed a separate binary
> that just calls some functions from vacuumdb. Or maybe we could have a
> common source file that both utilities call.

What's the plan for clusterdb? It seems like we'd ideally create a standalone pg_repackdb which replaces clusterdb and also allows us to remove the FULL options from vacuumdb.

Robert Treat
https://xzilla.net
On 2025-Aug-21, Robert Treat wrote:

> What's the plan for clusterdb? It seems like we'd ideally create a
> standalone pg_repackdb which replaces clusterdb and also allows us to
> remove the FULL options from vacuumdb.

I don't think we should remove clusterdb, to avoid breaking any scripts that work today. As you say, I would create the standalone pg_repackdb to do what we need it to do (namely: run the REPACK commands) and leave vacuumdb and clusterdb alone. Removing the obsolete commands and options can be done in a few years.

--
Álvaro Herrera               PostgreSQL Developer — https://www.EnterpriseDB.com/
On Fri, Aug 22, 2025, at 6:40 AM, Álvaro Herrera wrote:
> On 2025-Aug-21, Robert Treat wrote:
>
> > What's the plan for clusterdb? It seems like we'd ideally create a
> > standalone pg_repackdb which replaces clusterdb and also allows us to
> > remove the FULL options from vacuumdb.
>
> I don't think we should remove clusterdb, to avoid breaking any scripts
> that work today. As you say, I would create the standalone pg_repackdb
> to do what we need it to do (namely: run the REPACK commands) and leave
> vacuumdb and clusterdb alone. Removing the obsolete commands and
> options can be done in a few years.

I would say that we need to plan the removal of these binaries (clusterdb and vacuumdb). We can start with a warning in clusterdb saying users should use pg_repackdb. In a few years, we can remove clusterdb. There were complaints about binary names without a pg_ prefix in the past [1].

I don't think we need to keep vacuumdb. Packagers can keep a symlink (vacuumdb) to pg_repackdb. We can add a similar warning message saying users should use pg_repackdb if the symlink is used.

[1] https://www.postgresql.org/message-id/CAJgfmqXYYKXR%2BQUhEa3cq6pc8dV0Hu7QvOUccm7R0TkC%3DT-%2B%3DA%40mail.gmail.com

--
Euler Taveira
EDB   https://www.enterprisedb.com/
Hi,

On Fri, Aug 22, 2025 at 05:32:34PM -0300, Euler Taveira wrote:
> On Fri, Aug 22, 2025, at 6:40 AM, Álvaro Herrera wrote:
> > On 2025-Aug-21, Robert Treat wrote:
> >> What's the plan for clusterdb? It seems like we'd ideally create a
> >> standalone pg_repackdb which replaces clusterdb and also allows us to
> >> remove the FULL options from vacuumdb.
> >
> > I don't think we should remove clusterdb, to avoid breaking any scripts
> > that work today. As you say, I would create the standalone pg_repackdb
> > to do what we need it to do (namely: run the REPACK commands) and leave
> > vacuumdb and clusterdb alone. Removing the obsolete commands and
> > options can be done in a few years.
>
> I would say that we need to plan the removal of these binaries (clusterdb and
> vacuumdb). We can start with a warning in clusterdb saying users should use
> pg_repackdb. In a few years, we can remove clusterdb. There were complaints
> about binary names without a pg_ prefix in the past [1].

Yeah.

> I don't think we need to keep vacuumdb. Packagers can keep a symlink (vacuumdb)
> to pg_repackdb. We can add a similar warning message saying users should use
> pg_repackdb if the symlink is used.

Unless pg_repackdb has the same (or a superset of the) CLI and behaviour as vacuumdb (I haven't checked, but I doubt it?), I think replacing vacuumdb with a symlink to pg_repackdb will lead to much more breakage in existing scripts/automation than clusterdb, which I guess is used orders of magnitude less frequently than vacuumdb.

Michael
On 2025-08-23, Michael Banck wrote:
> On Fri, Aug 22, 2025 at 05:32:34PM -0300, Euler Taveira wrote:
>> I don't think we need to keep vacuumdb. Packagers can keep a symlink (vacuumdb)
>> to pg_repackdb. We can add a similar warning message saying users should use
>> pg_repackdb if the symlink is used.
>
> Unless pg_repackdb has the same (or a superset of the) CLI and behaviour as
> vacuumdb (I haven't checked, but I doubt it?), I think replacing vacuumdb
> with a symlink to pg_repackdb will lead to much more breakage in existing
> scripts/automation than clusterdb, which I guess is used orders of
> magnitude less frequently than vacuumdb.

Yeah, I completely disagree with the idea of getting rid of vacuumdb. We can, maybe, in a distant future, get rid of the --full option to vacuumdb. But the rest of the vacuumdb behavior must stay, I think, because REPACK is not VACUUM — it is only VACUUM FULL. And we want to make that distinction very clear.

We can also, in a few years, get rid of clusterdb. But I don't think we need to deprecate it just yet.

--
Álvaro Herrera
Hello, Antonin!

> When posting version 12 of the patch [1] I raised a concern that the MVCC
> safety is too expensive when it comes to logical decoding. Therefore, I
> abandoned the concept for now, and v13 [2] uses plain heap_insert(). Once we
> implement the MVCC safety, we simply rewrite the tuple like v12 did - that's
> the simplest way to preserve fields like xmin, cmin, ...

Thanks for the explanation.

I was looking into catalog-related logical decoding features, and it seems like they are clearly overkill for the repack case. We don't need CID tracking or even a snapshot for each commit if we're okay with passing xmin/xmax as arguments.

What do you think about the following approach for replaying:
* use the extracted XID as the value for xmin/xmax.
* use SnapshotSelf to find the tuple for update/delete operations.

SnapshotSelf seems like a good fit here:
* it sees the last "existing" version.
* any XID set as xmin/xmax in the repacked version is already committed - so each update/insert is effectively "committed" once written.
* it works with multiple updates of the same tuple within a single transaction - SnapshotSelf sees the last version.
* all updates are ordered and replayed sequentially - so the last version is always the one we want.

If I'm not missing anything, this looks like something worth including in the patch set. If so, I can try implementing a test version.

Best regards,
Mikhail
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

> I was looking into catalog-related logical decoding features, and it
> seems like they are clearly overkill for the repack case.
> We don't need CID tracking or even a snapshot for each commit if we're
> okay with passing xmin/xmax as arguments.

I assume you are concerned with part 0005 of the v12 patch ("Preserve visibility information of the concurrent data changes."), aren't you?

> What do you think about the following approach for replaying:
> * use the extracted XID as the value for xmin/xmax.
> * use SnapshotSelf to find the tuple for update/delete operations.
>
> SnapshotSelf seems like a good fit here:
> * it sees the last "existing" version.
> * any XID set as xmin/xmax in the repacked version is already
> committed - so each update/insert is effectively "committed" once
> written.
> * it works with multiple updates of the same tuple within a single
> transaction - SnapshotSelf sees the last version.
> * all updates are ordered and replayed sequentially - so the last
> version is always the one we want.
>
> If I'm not missing anything, this looks like something worth including
> in the patch set.
> If so, I can try implementing a test version.

Not sure I understand all the details, but I don't think SnapshotSelf is the correct snapshot. Note that HeapTupleSatisfiesSelf() does not use its 'snapshot' argument at all. Instead, it considers the set of running transactions as it is at the time the function is called.

One particular problem I imagine is replaying an UPDATE to a row that some later transaction will eventually delete, even though the transaction that ran the UPDATE obviously had to see it. When looking for the old version during the replay, HeapTupleSatisfiesMVCC() will find the old version as long as we pass the correct snapshot to it. However, at the time we're replaying the UPDATE in the new table, the tuple may have already been deleted from the old table, and the deleting transaction may already have committed. In such a case, HeapTupleSatisfiesSelf() will conclude the old version is invisible and we'll fail to replay the UPDATE.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hi, Antonin!

> I assume you are concerned with part 0005 of the v12 patch
> ("Preserve visibility information of the concurrent data changes."),
> aren't you?

Yes, of course. I got the idea while trying to find a way to optimize it.

> Not sure I understand all the details, but I don't think SnapshotSelf is the
> correct snapshot. Note that HeapTupleSatisfiesSelf() does not use its
> 'snapshot' argument at all. Instead, it considers the set of running
> transactions as it is at the time the function is called.

Yes, and it is almost the same behavior as when a typical MVCC snapshot encounters a tuple created by its own transaction.

So, here is how it works in the non-MVCC-safe case (current patch behaviour):

1) we have a whole initial table snapshot with all xmin = repack XID
2) the applying transaction sees ALL the live (no xmax) tuples in it, because all tuples were created/deleted by the transaction itself
3) each update/delete during the replay selects the last existing tuple version, updates its xmax, and inserts a new one
4) so, there is no real MVCC involved - just find the latest version and create a new version
5) and it works correctly because all ordering issues were resolved by the locking mechanisms on the original table or by the reorder buffer

And here is how it maps to the MVCC-safe case (SnapshotSelf):

1) we have a whole initial table snapshot with all xmin copied from the original table. All such xmin are committed.
2) the applying transaction sees ALL the live (no xmax) tuples in it, because their xmin/xmax are committed and SnapshotSelf is happy with that
3) each update/delete during the replay selects the last existing tuple version, updates its xmax = original xid, and inserts a new one with xmin = original xid
4) --//--
5) --//--

> However, at the time we're replaying the UPDATE in the new table, the tuple
> may have already been deleted from the old table, and the deleting
> transaction may already have committed. In such a case,
> HeapTupleSatisfiesSelf() will conclude the old version is invisible and
> we'll fail to replay the UPDATE.

No, it will see it - because its xmax will be empty in the repacked version of the table.

From your question I feel you understood the concept - but feel free to ask for additional explanation or a diagram.

Best regards,
Mikhail.
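To make the proposed replay step concrete, here is an illustrative C sketch. find_tuple_by_identity(), set_xmax() and insert_with_xid() are hypothetical helpers (as are the ConcurrentChange fields), standing in for the heap internals the patch would actually use:

	/*
	 * Replay one decoded UPDATE into the new table.  Every xmin/xmax
	 * written here belongs to an already-committed transaction, which is
	 * what lets SnapshotSelf treat the stamped tuples as visible.
	 */
	static void
	apply_concurrent_update(Relation newheap, ConcurrentChange *change)
	{
		/* 1. Find the latest live version; it carries no xmax yet. */
		HeapTuple	oldtup = find_tuple_by_identity(newheap, change->key,
													SnapshotSelf);

		/* 2. Stamp it as deleted by the original (committed) XID. */
		set_xmax(oldtup, change->xid);

		/* 3. Insert the new version, preserving the original XID as xmin. */
		insert_with_xid(newheap, change->newtuple, change->xid);
	}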
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:

> > Not sure I understand all the details, but I don't think SnapshotSelf is the
> > correct snapshot. Note that HeapTupleSatisfiesSelf() does not use its
> > 'snapshot' argument at all. Instead, it considers the set of running
> > transactions as it is at the time the function is called.
>
> Yes, and it is almost the same behavior as when a typical MVCC snapshot
> encounters a tuple created by its own transaction.
>
> So, here is how it works in the non-MVCC-safe case (current patch behaviour):
>
> 1) we have a whole initial table snapshot with all xmin = repack XID
> 2) the applying transaction sees ALL the live (no xmax) tuples in it,
> because all tuples were created/deleted by the transaction itself
> 3) each update/delete during the replay selects the last existing
> tuple version, updates its xmax, and inserts a new one
> 4) so, there is no real MVCC involved - just find the latest
> version and create a new version
> 5) and it works correctly because all ordering issues were resolved by
> the locking mechanisms on the original table or by the reorder buffer

ok

> And here is how it maps to the MVCC-safe case (SnapshotSelf):
>
> 1) we have a whole initial table snapshot with all xmin copied from
> the original table. All such xmin are committed.
> 2) the applying transaction sees ALL the live (no xmax) tuples in it,
> because their xmin/xmax are committed and SnapshotSelf is happy with that

How does HeapTupleSatisfiesSelf() recognize the status of any XID without using a snapshot? Do you mean by checking the commit log (TransactionIdDidCommit)?

> 3) each update/delete during the replay selects the last existing
> tuple version, updates its xmax = original xid, and inserts a new one
> with xmin = original xid
> 4) --//--
> 5) --//--

> > However, at the time we're replaying the UPDATE in the new table, the tuple
> > may have already been deleted from the old table, and the deleting
> > transaction may already have committed. In such a case,
> > HeapTupleSatisfiesSelf() will conclude the old version is invisible and
> > we'll fail to replay the UPDATE.
>
> No, it will see it - because its xmax will be empty in the repacked
> version of the table.

You're right, it'll be empty in the new table.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
On Sat, Aug 23, 2025 at 10:23 AM Álvaro Herrera <alvherre@kurilemu.de> wrote:
> On 2025-08-23, Michael Banck wrote:
> > On Fri, Aug 22, 2025 at 05:32:34PM -0300, Euler Taveira wrote:
> >
> >> I don't think we need to keep vacuumdb. Packagers can keep a symlink (vacuumdb)
> >> to pg_repackdb. We can add a similar warning message saying users should use
> >> pg_repackdb if the symlink is used.
> >
> > Unless pg_repackdb has the same (or a superset of the) CLI and behaviour as
> > vacuumdb (I haven't checked, but I doubt it?), I think replacing vacuumdb
> > with a symlink to pg_repackdb will lead to much more breakage in existing
> > scripts/automation than clusterdb, which I guess is used orders of
> > magnitude less frequently than vacuumdb.
>
> Yeah, I completely disagree with the idea of getting rid of vacuumdb. We can, maybe, in a distant future, get rid of the --full option to vacuumdb. But the rest of the vacuumdb behavior must stay, I think, because REPACK is not VACUUM — it is only VACUUM FULL. And we want to make that distinction very clear.

Or to put it the other way, VACUUM FULL is not really VACUUM either; it is really a form of "repack".

> We can also, in a few years, get rid of clusterdb. But I don't think we need to deprecate it just yet.

Yeah, ISTM the long-term goal should be two binaries, one of which manages clustering/repacking types of activities, and one which manages vacuum-type activities. I don't think that's different from what Alvaro is proposing; FWIW, my original question was about confirming that was the end goal, but also about trying to understand the coordination of when these changes would take place, because the changes to the code, the changes to the SQL commands and their docs, and the changes to the command-line tools seem to be moving at different cadences. That can be fine if it's on purpose, but maybe it needs to be tightened up if not; for example, the current patchset doesn't make any changes to clusterdb, which one might expect to emit a warning about being deprecated in favor of pg_repackdb, if not punt completely to pg_repackdb instead.

Robert Treat
https://xzilla.net
Hi, Antonin

> How does HeapTupleSatisfiesSelf() recognize the status of any XID w/o using a
> snapshot? Do you mean by checking the commit log (TransactionIdDidCommit)?

Yes, TransactionIdDidCommit.

Another option is to just invent a new snapshot type - SnapshotBelieveEverythingCommitted. For that particular case it should work, because all xmin/xmax values written into the new table are committed by design.
On Mon, Aug 25, 2025 at 10:15 AM Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> 1) we have a whole initial table snapshot with all xmin copied from
>    the original table. All such xmin are committed.
> 2) the applying transaction sees ALL the self-alive (no xmax) tuples in it,
>    because their xmin/xmax are committed and SnapshotSelf is happy with that
> 3) each update/delete during the replay selects the last existing
>    tuple version, sets its xmax = original xid and inserts a new one
>    with xmin = original xid
> 4) --//--
> 5) --//--

Advancing the table's min xid to at least the repack XID is a pretty big feature, but the above scenario sounds like it would result in any non-modified pre-existing tuples ending up with their original xmin rather than the repack XID, which seems like it could lead to weird side effects. Maybe I am mis-thinking it, though?

Robert Treat
https://xzilla.net
Robert Treat <rob@xzilla.net> wrote:
> On Mon, Aug 25, 2025 at 10:15 AM Mihail Nikalayeu
> <mihailnikalayeu@gmail.com> wrote:
> > 1) we have a whole initial table snapshot with all xmin copied from
> >    the original table. All such xmin are committed.
> > 2) the applying transaction sees ALL the self-alive (no xmax) tuples in it,
> >    because their xmin/xmax are committed and SnapshotSelf is happy with that
> > 3) each update/delete during the replay selects the last existing
> >    tuple version, sets its xmax = original xid and inserts a new one
> >    with xmin = original xid
> > 4) --//--
> > 5) --//--
>
> Advancing the table's min xid to at least the repack XID is a pretty big
> feature, but the above scenario sounds like it would result in any
> non-modified pre-existing tuples ending up with their original xmin
> rather than the repack XID, which seems like it could lead to weird side
> effects. Maybe I am mis-thinking it, though?

What we discuss here is how to keep the visibility information of tuples (xmin, xmax, ...) unchanged. Both CLUSTER and VACUUM FULL already do that. However, it's not trivial to ensure that REPACK with the CONCURRENTLY option does so as well.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> Hi, Antonin
>
> > How does HeapTupleSatisfiesSelf() recognize the status of any XID w/o using a
> > snapshot? Do you mean by checking the commit log (TransactionIdDidCommit)?
>
> Yes, TransactionIdDidCommit.

I think the problem is that HeapTupleSatisfiesSelf() uses TransactionIdIsInProgress() instead of checking the snapshot:

    ...
    else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
        return false;
    else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
    ...

When decoding (and replaying) data changes, you deal with the database state as it was (far) in the past. However, TransactionIdIsInProgress() is not suitable for this purpose. And since CommitTransaction() updates the commit log before removing the transaction from the ProcArray, I can even imagine race conditions: if a transaction is committed and decoded fast enough, TransactionIdIsInProgress() might still return true. In such a case, HeapTupleSatisfiesSelf() returns false instead of calling TransactionIdDidCommit().

> Another option is to just invent a new
> snapshot type - SnapshotBelieveEverythingCommitted. For that
> particular case it should work, because all xmin/xmax values written into
> the new table are committed by design.

I'd prefer optimizing the logical decoding for REPACK CONCURRENTLY, and using the MVCC snapshots.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at>:
> I think the problem is that HeapTupleSatisfiesSelf() uses
> TransactionIdIsInProgress() instead of checking the snapshot:

Yes, some issues might be possible for SnapshotSelf.

A possible solution is to override TransactionIdIsCurrentTransactionId to return true (like you did with nParallelCurrentXids, but just returning true). IIUC, in that case all checks are going to behave the same way as in the v5 version.

> I'd prefer optimizing the logical decoding for REPACK CONCURRENTLY, and
> using the MVCC snapshots.

That is also possible, but it is much more complex and feels like overkill to me. We just need a way to find the latest version of a row in a world of all-committed transactions without any concurrent writers - I am pretty sure that can be achieved in a simpler and more efficient way.
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> Antonin Houska <ah@cybertec.at>:
> > I think the problem is that HeapTupleSatisfiesSelf() uses
> > TransactionIdIsInProgress() instead of checking the snapshot:
>
> Yes, some issues might be possible for SnapshotSelf.
> A possible solution is to override TransactionIdIsCurrentTransactionId
> to return true (like you did with nParallelCurrentXids, but just returning
> true). IIUC, in that case all checks are going to behave the same way as
> in the v5 version.

I assume you mean v12-0005. Yes, that modifies TransactionIdIsCurrentTransactionId(), so that the transaction being replayed recognizes whether it (or its subtransaction) performed a particular change itself.

Although it could work, I think it'd be confusing to consider the transactions being replayed as "current" from the point of view of the backend that executes REPACK CONCURRENTLY.

But the primary issue is that in v12-0005, TransactionIdIsCurrentTransactionId() gets the information on "current transactions" from snapshots - see the calls of SetRepackCurrentXids() before each scan. That's probably not what you want.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at>:
> Although it could work, I think it'd be confusing to consider the transactions
> being replayed as "current" from the point of view of the backend that
> executes REPACK CONCURRENTLY.

Just realized SnapshotDirty is the thing that fits the role - it respects not-yet-committed transactions, giving enough information to wait for them. It is already used in a similar pattern in check_exclusion_or_unique_constraint and RelationFindReplTupleByIndex.

So, it is easy to detect the race you described previously and retry, and there is no need to hack around TransactionIdIsCurrentTransactionId.

BTW, btree + SnapshotDirty has an issue [0], but that is a different story and happens only with concurrent updates, which are not present in the current scope.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwXGhH_qD6RGqPyEeKdmHgr-HpA-tASYdi5onP%2BRyP5TCw%40mail.gmail.com#77f6426ef2d282198f2d930d5334e3fa
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> Antonin Houska <ah@cybertec.at>:
>
> > Although it could work, I think it'd be confusing to consider the transactions
> > being replayed as "current" from the point of view of the backend that
> > executes REPACK CONCURRENTLY.
>
> Just realized SnapshotDirty is the thing that fits the role - it
> respects not-yet-committed transactions, giving enough information to
> wait for them.
> It is already used in a similar pattern in
> check_exclusion_or_unique_constraint and RelationFindReplTupleByIndex.
>
> So, it is easy to detect the race you described previously and retry,
> and there is no need to hack around
> TransactionIdIsCurrentTransactionId.

Where exactly should HeapTupleSatisfiesDirty() conclude that the tuple is visible? TransactionIdIsCurrentTransactionId() will not do w/o the modifications that you proposed earlier [1], and TransactionIdIsInProgress() is not suitable, as I explained in [2].

I understand your idea that there are no transaction aborts in the new table, which makes things simpler. I cannot judge if it's worth inventing a new kind of snapshot. Anyway, I think you'd then also need to hack HeapTupleSatisfiesUpdate(). Isn't that too invasive?

[1] https://www.postgresql.org/message-id/CADzfLwUqyOmpkLmciecBy4aBN1sohQVZ2Hgc6m-tjSUqDRHwyQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/24483.1756142534%40localhost

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello, Antonin!

Antonin Houska <ah@cybertec.at>:
> Where exactly should HeapTupleSatisfiesDirty() conclude that the tuple is
> visible? TransactionIdIsCurrentTransactionId() will not do w/o the
> modifications that you proposed earlier [1], and TransactionIdIsInProgress()
> is not suitable, as I explained in [2].

HeapTupleSatisfiesDirty is designed to respect even not-yet-committed transactions, and it provides additional related information:

    else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
    {
        /*
         * Return the speculative token to caller.  Caller can worry about
         * xmax, since it requires a conclusively locked row version, and
         * a concurrent update to this tuple is a conflict of its
         * purposes.
         */
        if (HeapTupleHeaderIsSpeculative(tuple))
        {
            snapshot->speculativeToken =
                HeapTupleHeaderGetSpeculativeToken(tuple);

            Assert(snapshot->speculativeToken != 0);
        }

        snapshot->xmin = HeapTupleHeaderGetRawXmin(tuple);
        /* XXX shouldn't we fall through to look at xmax? */
        return true;            /* in insertion by other */
    }

So, it returns true when TransactionIdIsInProgress is true. However, that alone is not sufficient to trust the result in the common case.

You may check check_exclusion_or_unique_constraint or RelationFindReplTupleByIndex for the pattern: if xmin is set in the snapshot (a special hack in SnapshotDirty to provide additional information from the check), we wait for the ongoing transaction (or one that is actually committed but not yet properly reflected in the proc array), and then retry the entire tuple search.

So, the race condition you explained in [2] will be resolved by a retry, and the changes to TransactionIdIsCurrentTransactionId described in [1] are not necessary.

> I understand your idea that there are no transaction aborts in the new table,
> which makes things simpler. I cannot judge if it's worth inventing a new kind
> of snapshot. Anyway, I think you'd then also need to hack
> HeapTupleSatisfiesUpdate(). Isn't that too invasive?

It seems that HeapTupleSatisfiesUpdate is also fine as it currently exists (we'll see the committed version after the retry).

The solution appears to be non-invasive:
* it uses an existing snapshot type
* it follows an existing usage pattern
* it leaves TransactionIdIsInProgress and HeapTupleSatisfiesUpdate unchanged

The main change is that the xmin/xmax values are forced from the arguments - but that seems unavoidable in any case.

I'll try to make some kind of prototype this weekend and cover the race condition you mentioned in the specs. Maybe some corner cases will appear.

By the way, there's one more optimization we could apply in both the MVCC-safe and non-MVCC-safe cases: setting the HEAP_XMIN_COMMITTED / HEAP_XMAX_COMMITTED bits in the new table:
* in the MVCC-safe approach, the transaction is already committed.
* in the non-MVCC-safe case, it isn't committed yet - but no one will examine those bits before it commits (though this approach does feel more fragile).

This could help avoid potential storms of full-page writes caused by SetHintBits after the table switch.

Best regards,
Mikhail.
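For illustration, here is a minimal sketch of the wait-and-retry pattern described above, modeled on RelationFindReplTupleByIndex() and check_exclusion_or_unique_constraint(). The find_latest_version() helper and the function name are hypothetical stand-ins, not code from the patch; only the dirty-snapshot/wait/retry shape is the point:

    #include "postgres.h"

    #include "access/transam.h"
    #include "executor/tuptable.h"
    #include "storage/lmgr.h"
    #include "utils/rel.h"
    #include "utils/snapmgr.h"

    static bool
    repack_find_tuple(Relation rel, Datum key, TupleTableSlot *outslot)
    {
        SnapshotData snap;
        TransactionId xwait;
        bool        found;

    retry:
        InitDirtySnapshot(snap);
        found = find_latest_version(rel, key, &snap, outslot);  /* hypothetical */

        /*
         * SnapshotDirty reports an in-progress inserter/deleter in snap.xmin
         * or snap.xmax.  This covers the race where a transaction has already
         * updated the commit log but is still listed in the proc array: wait
         * for it to finish, then redo the whole lookup.
         */
        xwait = TransactionIdIsValid(snap.xmin) ? snap.xmin : snap.xmax;
        if (found && TransactionIdIsValid(xwait))
        {
            XactLockTableWait(xwait, NULL, NULL, XLTW_None);
            goto retry;
        }

        return found;
    }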
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> Hello, Antonin!
>
> HeapTupleSatisfiesDirty is designed to respect even not-yet-committed
> transactions, and it provides additional related information:
>
>     else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
>     {
>         ...
>         snapshot->xmin = HeapTupleHeaderGetRawXmin(tuple);
>         /* XXX shouldn't we fall through to look at xmax? */
>         return true;            /* in insertion by other */
>     }
>
> So, it returns true when TransactionIdIsInProgress is true.
> However, that alone is not sufficient to trust the result in the common case.
>
> You may check check_exclusion_or_unique_constraint or
> RelationFindReplTupleByIndex for the pattern:
> if xmin is set in the snapshot (a special hack in SnapshotDirty to
> provide additional information from the check), we wait for the
> ongoing transaction (or one that is actually committed but not yet
> properly reflected in the proc array), and then retry the entire tuple
> search.
>
> So, the race condition you explained in [2] will be resolved by a
> retry, and the changes to TransactionIdIsCurrentTransactionId described
> in [1] are not necessary.

I insist that this is a misuse of TransactionIdIsInProgress(). When dealing with logical decoding, only WAL should tell whether a particular transaction is still running. AFAICS this is how reorderbuffer.c works.

A new kind of snapshot seems like a (much) cleaner solution at the moment.

> I'll try to make some kind of prototype this weekend and cover the race
> condition you mentioned in the specs.
> Maybe some corner cases will appear.

No rush. First, the MVCC safety is not likely to be included in v19 [1]. Second, I think it's good to let others propose their ideas before writing code.

> By the way, there's one more optimization we could apply in both the
> MVCC-safe and non-MVCC-safe cases: setting the HEAP_XMIN_COMMITTED /
> HEAP_XMAX_COMMITTED bits in the new table:
> * in the MVCC-safe approach, the transaction is already committed.
> * in the non-MVCC-safe case, it isn't committed yet - but no one will
>   examine those bits before it commits (though this approach does feel
>   more fragile).
>
> This could help avoid potential storms of full-page writes caused by
> SetHintBits after the table switch.

Good idea, thanks.

[1] https://www.postgresql.org/message-id/202504040733.ysuy5gad55md%40alvherre.pgsql

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
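A sketch of that hint-bit idea, assuming the tuple's final xmin/xmax are already known when it is built for the new heap. The function name is invented for illustration, and a real implementation would also have to consider multixact and lock-only xmax values:

    #include "postgres.h"

    #include "access/htup_details.h"
    #include "access/transam.h"

    /*
     * Pre-set hint bits on a tuple destined for the new heap, so readers
     * never need to set them later (avoiding post-switch hint-bit updates
     * and, with checksums/wal_log_hints, the full-page images those can
     * trigger).  Only safe because every xmin/xmax written to the new
     * table is committed by design.
     */
    static void
    preset_hint_bits(HeapTupleHeader htup)
    {
        htup->t_infomask |= HEAP_XMIN_COMMITTED;

        if (TransactionIdIsValid(HeapTupleHeaderGetRawXmax(htup)))
            htup->t_infomask |= HEAP_XMAX_COMMITTED;
        else
            htup->t_infomask |= HEAP_XMAX_INVALID;
    }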
Antonin Houska <ah@cybertec.at>:
> I insist that this is a misuse of TransactionIdIsInProgress(). When dealing
> with logical decoding, only WAL should tell whether a particular transaction
> is still running. AFAICS this is how reorderbuffer.c works.

Hm... Maybe, but at the same time we already have SnapshotDirty used in that way, and it even deals with the same race... But I agree - a special kind of snapshot is a more accurate solution.

> A new kind of snapshot seems like a (much) cleaner solution at the moment.

Do you mean some kind of snapshot which only uses TransactionIdDidCommit/Abort, ignoring TransactionIdIsCurrentTransactionId/TransactionIdIsInProgress? It actually behaves like SnapshotBelieveEverythingCommitted in that particular case, but TransactionIdDidCommit/Abort may be used as some kind of assert/error source to be sure everything is going as designed.

And, yes, for the new snapshot we need HeapTupleSatisfiesUpdate to be modified.

Also, to deal with that particular race we may just use XactLockTableWait(xid, NULL, NULL, XLTW_None) before starting the transaction replay.

> No rush. First, the MVCC safety is not likely to be included in v19 [1].

That worries me - it is not the behaviour someone expects from a database by default. At the very least, the warning should be much more visible and obvious. I think most users will expect the same guarantees as [CREATE|RE]INDEX CONCURRENTLY provides.
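The XactLockTableWait-before-replay idea might look roughly like this; apply_changes() is a hypothetical stand-in for the replay routine, and only the wait call is the point here:

    #include "postgres.h"

    #include "replication/reorderbuffer.h"
    #include "storage/lmgr.h"

    /*
     * Before applying a decoded transaction to the new heap, wait until its
     * XID has fully left the proc array.  The transaction is already
     * committed in WAL terms (we are replaying its changes), so this only
     * closes the window where TransactionIdIsInProgress() would still say
     * "running".
     */
    static void
    repack_apply_xact(ReorderBufferTXN *txn)
    {
        XactLockTableWait(txn->xid, NULL, NULL, XLTW_None);

        apply_changes(txn);     /* hypothetical replay of txn's changes */
    }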
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> > A new kind of snapshot seems like a (much) cleaner solution at the moment.
>
> Do you mean some kind of snapshot which only uses
> TransactionIdDidCommit/Abort, ignoring
> TransactionIdIsCurrentTransactionId/TransactionIdIsInProgress?
> It actually behaves like SnapshotBelieveEverythingCommitted in that
> particular case, but TransactionIdDidCommit/Abort may be used as some
> kind of assert/error source to be sure everything is going as
> designed.

Given that there should be no (sub)transaction aborts in the new table, I think you only need to check that the tuple has a valid xmin and an invalid xmax. I think the XID should be in the commit log at the time the transaction is being replayed, so it should be legal to use TransactionIdDidCommit/Abort in Assert() statements. (And as long as REPACK CONCURRENTLY uses ShareUpdateExclusiveLock, which conflicts with VACUUM, pg_class.relfrozenxid for the given table should not advance during the processing, and therefore the replayed XIDs should not be truncated from the commit log while REPACK CONCURRENTLY is running.)

> And, yes, for the new snapshot we need HeapTupleSatisfiesUpdate to be
> modified.
>
> Also, to deal with that particular race we may just use
> XactLockTableWait(xid, NULL, NULL, XLTW_None) before starting the
> transaction replay.

Do you mean the race related to TransactionIdIsInProgress()? Not sure I understand, as you suggested above that you no longer need the function.

> > No rush. First, the MVCC safety is not likely to be included in v19 [1].
>
> That worries me - it is not the behaviour someone expects from a
> database by default. At the very least, the warning should be much more
> visible and obvious.
> I think most users will expect the same guarantees as [CREATE|RE]INDEX
> CONCURRENTLY provides.

It does not really worry me. The pg_squeeze extension is not MVCC-safe, and I remember there were only 1 or 2 related complaints throughout its existence. (pg_repack isn't MVCC-safe either, but I don't keep track of its issues.) Of course, the user documentation should warn about the problem, the way it does for other commands (typically ALTER TABLE).

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
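A sketch of the check described above (the function name is invented, and a real version would also need to consider lock-only and multixact xmax values):

    #include "postgres.h"

    #include "access/htup_details.h"
    #include "access/transam.h"

    /*
     * In the new heap there are no aborted (sub)transactions, so "visible
     * to the replay" reduces to: valid xmin, no xmax.  The Asserts
     * double-check the invariant against the commit log, which is legal
     * here because the replayed XIDs cannot be truncated from it while
     * REPACK CONCURRENTLY holds its lock (see above).
     */
    static bool
    repack_tuple_is_latest(HeapTupleHeader htup)
    {
        TransactionId xmin = HeapTupleHeaderGetRawXmin(htup);
        TransactionId xmax = HeapTupleHeaderGetRawXmax(htup);

        Assert(TransactionIdIsValid(xmin));
        Assert(TransactionIdDidCommit(xmin));

        if (!TransactionIdIsValid(xmax) ||
            (htup->t_infomask & HEAP_XMAX_INVALID) != 0)
            return true;        /* no deleter: this is the latest version */

        Assert(TransactionIdDidCommit(xmax));
        return false;           /* deleted or updated away */
    }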
Antonin Houska <ah@cybertec.at>:
> Do you mean the race related to TransactionIdIsInProgress()? Not sure I
> understand, as you suggested above that you no longer need the function.

The "lightweight" approaches I see so far:
* XactLockTableWait before replay + SnapshotSelf (GetLatestSnapshot?)
* SnapshotDirty + retry logic
* SnapshotBelieveEverythingCommitted + modification of HeapTupleSatisfiesUpdate (because it is called by heap_update and looks at TransactionIdIsInProgress)

> It does not really worry me. The pg_squeeze extension is not MVCC-safe, and I
> remember there were only 1 or 2 related complaints throughout its
> existence. (pg_repack isn't MVCC-safe either, but I don't keep track of its
> issues.)

But pg_squeeze and pg_repack are extensions. If we are moving that mechanism into core, I'd expect some improvements over pg_squeeze. The MVCC-safety of REINDEX CONCURRENTLY makes it possible to run it on a regular basis as some kind of background job. It would be nice to have something like this for the heap.

I agree the initial approach is too invasive, complex and performance-heavy to push forward now. But any of the "lightweight" ones feels like a good candidate to be shipped with the feature itself - relatively easy and non-invasive.

Best regards,
Mikhail.
On 2025-Aug-21, Mihail Nikalayeu wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org>:
> > I proposed to leave this part out initially, which is why it hasn't been
> > reposted. We can review and discuss after the initial patches are in.
>
> I think it is worth pushing it at least in the same release cycle.

If others are motivated enough to certify it, maybe we can consider it. But I don't think I'm going to have time to get this part reviewed and committed in time for 19, so you might need another committer.

> > Because having an MVCC-safe mode has drawbacks, IMO we should make it
> > optional.
>
> Do you mean some option for the command? Like REPACK (CONCURRENTLY, SAFE)?

Yes, exactly that.

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Greatness is a transitory experience. It is never consistent. It depends
in large part on the myth-making human imagination" (Irulan)
Hello, Álvaro!

> If others are motivated enough to certify it, maybe we can consider it.
> But I don't think I'm going to have time to get this part reviewed and
> committed in time for 19, so you might need another committer.

I don't think it is realistic to involve another committer - it is just the well-known curse of all non-committers :)

> > > Because having an MVCC-safe mode has drawbacks, IMO we should make it
> > > optional.

As far as I can see, the proposed "lightweight" solutions don't introduce any drawbacks - unless something has been overlooked.

> > Do you mean some option for the command? Like REPACK (CONCURRENTLY, SAFE)?
>
> Yes, exactly that.

To be honest, that approach feels a little bit strange to me. I work in the database-consumer (not database-developer) industry, and 90% of DevOps engineers (or people in similar roles who deal with database maintenance these days) have no clue what MVCC is - and that is the industry standard nowadays. From my perspective, it is better to have a less performant but MVCC-safe approach by default, with some "more performance, less safety" flag for those who truly understand the trade-offs. All those kinds of safety and correctness are probably the main reason to use classic databases instead of storing data in S3/Elastic/Mongo/etc. these days.

In case of some incident related to this (in a large, well-known company), the typical takeaway for readers of tech blogs will simply be "some command in Postgres is broken". And maybe also "the database with a name starting with 'O' is not affected by that flaw".

Yes, some table-rewriting forms of ALTER TABLE are not MVCC-safe, but those typically block everything for hours - so they're avoided unless absolutely necessary, and usually performed during backend outages to prevent systems from getting stuck waiting on millions of locks. "CONCURRENTLY" is something that feels like a "working in the background, don't worry, do not stop your 100k RPS" thing, especially in Postgres because of CIC.

I also have personal experience debugging incidents caused by incorrect Postgres behavior - and it's absolute hell.

Sorry for the drama... I fully understand the resource limitations... Maybe Postgres 19 could introduce something like REPACK (CONCURRENTLY, UNSAFE) first, and later add a safer REPACK (CONCURRENTLY)?

Best regards,
Mikhail.
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> In case of some incident related to this (in a large, well-known
> company), the typical takeaway for readers of tech blogs will simply be
> "some command in Postgres is broken".

For _responsible_ users, the message will rather be that "some tech bloggers do not bother to read the user documentation".

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello,

Here's v19 of this patchset. This is mostly Antonin's v18.

I added a preparatory v19-0001 commit, which splits vacuumdb.c to create a new file, vacuuming.c (and its header file vacuuming.h). If you look at it under 'git show --color-moved=zebra' you should notice that most of it is just code movement; there are hardly any code changes.

v19-0002 has absorbed Antonin's v18-0005 (the pg_repackdb binary) together with the introduction of the REPACK command proper; but instead of using a symlink, I just created a separate pg_repackdb.c source file for it, and we compile that small new source file with vacuuming.c to create a regular binary. BTW, the meson.build changes look somewhat duplicative; maybe there's a less dumb way to go about this. (For instance, maybe just have libscripts.a include vacuuming.o, though it's not used by any of the other programs in that subdir.) I'm not wedded to the name "vacuuming.c"; happy to take suggestions.

After 0002, the pg_repackdb utility should be ready to take the place of clusterdb, and also of vacuumdb --full, with one gotcha: if you try to use pg_repackdb with an older server version, it will fail, claiming that REPACK is not supported. This is not ideal. Instead, we should make it run VACUUM FULL (or CLUSTER); that way, if you have a fleet including older servers, you can use the new utils there too.

All the logic for vacuumdb to select tables to operate on has been moved to vacuuming.c verbatim. This means the logic applies to pg_repackdb as well. As long as you stick to repacking a single table this is okay (read: it won't be used at all), but if you want to use parallel mode (say, to process multiple schemas), we might need to change it. For the same reason, I think we should add an option (--index[=indexname]) to select whether to use the USING INDEX clause or not, and optionally to indicate which index to use; right now there's no way to select which logic (cluster's or vacuum full's) to use.

Then v19-0003 through v19-0005 are Antonin's subsequent patches to add the CONCURRENTLY option; I have not reviewed these at all, so I'm including them here just for completeness. I also included v18-0006 as posted by Mihail previously, though I have little faith that we're going to include it in this release.

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"To think that the specter we see is illusory does not rob it of its terror;
it only adds the new terror of madness" (Perelandra, C.S. Lewis)
Attachments
- v19-0001-Split-vacuumdb-to-create-vacuuming.c-h.patch
- v19-0002-Add-REPACK-command.patch
- v19-0003-Refactor-index_concurrently_create_copy-for-use-.patch
- v19-0004-Move-conversion-of-a-historic-to-MVCC-snapshot-t.patch
- v19-0005-Add-CONCURRENTLY-option-to-REPACK-command.patch
- v19-0006-Preserve-visibility-information-of-the-concurren.patch
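To make the shape of 0002 concrete: the separate pg_repackdb.c can stay tiny if the shared driver exposes a single entry point. A minimal sketch follows; the vacuuming.h API, the MODE_REPACK value, and the vacuuming_main() signature are assumptions about the patch, not its actual code:

    /* pg_repackdb.c — hypothetical thin wrapper over the shared driver */
    #include "postgres_fe.h"

    #include "vacuuming.h"      /* assumed to declare vacuuming_main() */

    int
    main(int argc, char *argv[])
    {
        /*
         * Run the common vacuumdb/pg_repackdb implementation in REPACK
         * mode; the real entry point may take different arguments.
         */
        return vacuuming_main(argc, argv, "pg_repackdb", MODE_REPACK);
    }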
Apparently I mismerged src/bin/scripts/meson.build. This v20 is identical to v19, except that that mistake has been corrected.

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"In the beginning there was UNIX, and UNIX spoke and said: "Hello world\n".
It did not say "Hello New Jersey\n", nor "Hello USA\n"."
Attachments
- v20-0001-Split-vacuumdb-to-create-vacuuming.c-h.patch
- v20-0002-Add-REPACK-command.patch
- v20-0003-Refactor-index_concurrently_create_copy-for-use-.patch
- v20-0004-Move-conversion-of-a-historic-to-MVCC-snapshot-t.patch
- v20-0005-Add-CONCURRENTLY-option-to-REPACK-command.patch
- v20-0006-Preserve-visibility-information-of-the-concurren.patch
Hello!

I started an attempt to make a "lightweight" MVCC-safe prototype and got stuck on an "it is not working" issue. After some debugging, I realized that Antonin's variant (catalog-mode based) seems to be broken as well... And after a few more hours, I realized the non-MVCC one is broken too :)

Attached is a patch with a test to reproduce the issue, related to repack + concurrent modifications. It seems some updates may be lost.

I hope the patch logic is clear - but feel free to ask if not.

Best regards,
Mikhail.
Attachments
On 2025-Aug-31, Mihail Nikalayeu wrote:
> I started an attempt to make a "lightweight" MVCC-safe prototype and
> got stuck on an "it is not working" issue.
> After some debugging, I realized that Antonin's variant (catalog-mode
> based) seems to be broken as well...
>
> And after a few more hours, I realized the non-MVCC one is broken too :)

Ugh. Well, obviously we need to get this fixed if we want CONCURRENTLY at all :-)

Please don't post patches that aren't the commitfest item's main patch as attachments with a .patch extension. This confuses the CFbot into thinking your patch is the patch-of-record (which it isn't), and it reports that the patch fails CI. See here:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5117
(For the same reason, it isn't useful to number them as if they were part of the patch series.)

If you want to post secondary patches, please rename them to end in something like .txt or .nocfbot or whatever. See here:
https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches?

Thanks for your interest in this topic,

--
Álvaro Herrera        48°01'N 7°57'E — https://www.EnterpriseDB.com/
On Wed, Aug 27, 2025 at 10:22:24AM +0200, Mihail Nikalayeu wrote:
> That worries me - it is not the behaviour someone expects from a
> database by default. At the very least, the warning should be much more
> visible and obvious.
> I think most users will expect the same guarantees as [CREATE|RE]INDEX
> CONCURRENTLY provides.

Having a unified path for the handling of the waits and the locking sounds to me like a pretty good argument in favor of a basic implementation. In my experience, users do not really care about the time it takes to complete an operation involving CONCURRENTLY if we allow concurrent reads and writes in parallel with it. I have not looked at the proposal in detail, but before trying a more folkloric MVCC approach, relying on basics that we know have been working for some time seems like a good and sufficient initial step in terms of handling the waits and the locks with table AMs (aka heap or something else).

Just my 2c.
--
Michael
Attachments
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> Hello!
>
> I started an attempt to make a "lightweight" MVCC-safe prototype and
> got stuck on an "it is not working" issue.
> After some debugging, I realized that Antonin's variant (catalog-mode
> based) seems to be broken as well...
>
> And after a few more hours, I realized the non-MVCC one is broken too :)
>
> Attached is a patch with a test to reproduce the issue, related to
> repack + concurrent modifications.
> It seems some updates may be lost.
>
> I hope the patch logic is clear - but feel free to ask if not.

Are you sure the test is complete? I see no occurrence of the REPACK command in it.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello!

Antonin Houska <ah@cybertec.at>:
> Are you sure the test is complete? I see no occurrence of the REPACK
> command in it.

Oops, I sent an invalid file. The correct one is attached.
Attachments
Hello, Álvaro!

Alvaro Herrera <alvherre@alvh.no-ip.org>:
> If you want to post secondary patches, please rename them to end in
> something like .txt or .nocfbot or whatever. See here:
> https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches?

Sorry, I missed that. But note it is now possible to send ".patch" files without changing the extension [0]:

> It also ignores any files that start with "nocfbot".

[0]: https://discord.com/channels/1258108670710124574/1328362897189113867/1412021226528051250
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> Antonin Houska <ah@cybertec.at>:
> > Are you sure the test is complete? I see no occurrence of the REPACK
> > command in it.
>
> Oops, I sent an invalid file. The correct one is attached.

Thanks! The problem was that when removing the original "preserve visibility" patch v12-0005 [1] from the series, I forgot to change the value of the 'need_full_snapshot' argument of CreateInitDecodingContext(). v12 and earlier treated the repacked table like a system catalog, so it was OK to pass need_full_snapshot=false. However, it must be true now; otherwise, the snapshot created for the initial copy does not see commits of transactions that do not change regular catalogs.

The fix is as simple as

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index f481a3cec6d..7866ac01278 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -502,6 +502,7 @@ SnapBuildInitialSnapshotForRepack(SnapBuild *builder)
 	StringInfo	buf = makeStringInfo();
 
 	Assert(builder->state == SNAPBUILD_CONSISTENT);
+	Assert(builder->building_full_snapshot);
 
 	snap = SnapBuildBuildSnapshot(builder);

I'll apply it to the next version of the "Add CONCURRENTLY option to REPACK command" patch.

[1] https://www.postgresql.org/message-id/flat/CAFj8pRDK89FtY_yyGw7-MW-zTaHOCY4m6qfLRittdoPocz+dMQ@mail.gmail.com

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
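The caller-side counterpart of that fix is simply to create the decoding context with need_full_snapshot = true. A fragment as a sketch only: the XL_ROUTINE members below follow the pattern used by the in-core slot functions, but the actual call site in the patch may differ:

    ctx = CreateInitDecodingContext(plugin, NIL,
                                    true,   /* need_full_snapshot */
                                    InvalidXLogRecPtr,
                                    XL_ROUTINE(.page_read = read_local_xlog_page,
                                               .segment_open = wal_segment_open,
                                               .segment_close = wal_segment_close),
                                    NULL, NULL, NULL);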
Hello!

Antonin Houska <ah@cybertec.at>:
> I'll apply it to the next version of the "Add CONCURRENTLY option to
> REPACK command" patch.

I have added it to the v21 patchset. Also, I've updated the MVCC-safe patch:
* it uses the "XactLockTableWait before replay + SnapshotSelf" approach from [0]
* it includes a TAP test to ensure MVCC safety - not intended to be committed in its current form (too heavy)
* the documentation has been updated.

It's now much simpler and does not negatively impact performance. It is less aggressive in tuple freezing, but can be updated to match the non-MVCC-safe version if needed.

While stress-testing the MVCC-safe version with 007_repack_concurrently_mvcc.pl, I encountered some random crashes with logs like this:

2025-09-02 12:24:40.039 CEST client backend[261907] 007_repack_concurrently_mvcc.pl ERROR:  relcache reference 0x7715b9f394a8 is not owned by resource owner TopTransaction
2025-09-02 12:24:40.039 CEST client backend[261907] 007_repack_concurrently_mvcc.pl STATEMENT:  REPACK (CONCURRENTLY) tbl1 USING INDEX tbl1_pkey;
TRAP: failed Assert("rel->rd_refcnt > 0"), File: "../src/backend/utils/cache/relcache.c", Line: 6992, PID: 261907
postgres: CIC_test: nkey postgres [local] REPACK(ExceptionalCondition+0xbe)[0x5b7ac41d79f9]
postgres: CIC_test: nkey postgres [local] REPACK(+0x852d2e)[0x5b7ac41cbd2e]
postgres: CIC_test: nkey postgres [local] REPACK(+0x8aa4a6)[0x5b7ac42234a6]
postgres: CIC_test: nkey postgres [local] REPACK(+0x8aad3b)[0x5b7ac4223d3b]
postgres: CIC_test: nkey postgres [local] REPACK(+0x8aac69)[0x5b7ac4223c69]
postgres: CIC_test: nkey postgres [local] REPACK(ResourceOwnerRelease+0x32)[0x5b7ac4223c26]
postgres: CIC_test: nkey postgres [local] REPACK(+0x1f43bf)[0x5b7ac3b6d3bf]
postgres: CIC_test: nkey postgres [local] REPACK(+0x1f4dfa)[0x5b7ac3b6ddfa]
postgres: CIC_test: nkey postgres [local] REPACK(AbortCurrentTransaction+0xe)[0x5b7ac3b6dd6b]
postgres: CIC_test: nkey postgres [local] REPACK(PostgresMain+0x57d)[0x5b7ac3fd7238]
postgres: CIC_test: nkey postgres [local] REPACK(+0x654102)[0x5b7ac3fcd102]
postgres: CIC_test: nkey postgres [local] REPACK(postmaster_child_launch+0x191)[0x5b7ac3eceb7a]
postgres: CIC_test: nkey postgres [local] REPACK(+0x55c8c1)[0x5b7ac3ed58c1]
postgres: CIC_test: nkey postgres [local] REPACK(+0x559d1e)[0x5b7ac3ed2d1e]
postgres: CIC_test: nkey postgres [local] REPACK(PostmasterMain+0x168a)[0x5b7ac3ed25f8]
postgres: CIC_test: nkey postgres [local] REPACK(main+0x3a1)[0x5b7ac3da2bd6]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7715b962a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7715b962a28b]

This time I was clever and first tried to reproduce the issue on the non-MVCC-safe version - and it is reproducible there. Just comment out \if :p_t1 != :p_t2 (and its internals, because they catch non-MVCC behaviour, which is expected without the 0006 patch) and set '--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=25000'. It takes about a minute on my PC to get the crash.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwXCTXNdxK-XGTKmObvT%3D_QnaCviwgrcGtG9chsj5sYzrg%40mail.gmail.com#6d6ccf34e6debcd386d735db593379d8

Best regards,
Mikhail.
Attachments
- v21-0006-Preserve-visibility-information-of-the-concurren.patch
- v21-0003-Refactor-index_concurrently_create_copy-for-use-.patch
- v21-0005-Add-CONCURRENTLY-option-to-REPACK-command.patch
- v21-0002-Add-REPACK-command.patch
- v21-0004-Move-conversion-of-a-historic-to-MVCC-snapshot-t.patch
- v21-0001-Split-vacuumdb-to-create-vacuuming.c-h.patch
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
> While stress-testing the MVCC-safe version with
> 007_repack_concurrently_mvcc.pl, I encountered some random crashes with
> logs like this:
>
> 2025-09-02 12:24:40.039 CEST client backend[261907]
> 007_repack_concurrently_mvcc.pl ERROR:  relcache reference
> 0x7715b9f394a8 is not owned by resource owner TopTransaction
> ...
> This time I was clever and first tried to reproduce the issue on
> the non-MVCC-safe version - and it is reproducible there.

Thanks again for the thorough testing! I think this should be fixed separately [1].

[1] https://www.postgresql.org/message-id/119497.1756892972%40localhost

--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello,

Barring further commentary, I intend to get 0001 committed tomorrow, and 0002 some time later -- perhaps by the end of this week, or sometime next week.

Regards

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
After looking at this some more, I realized that 0001 had been written a bit too hastily and could use some more cleanup -- in particular, we don't need to export most of the function prototypes other than vacuuming_main() (and the trivial escape_quotes helper), so I made the other functions static.

Additionally, prepare_vacuum_command() needs the encoding in order to apply fmtIdEnc() to a given index name (for `pg_repackdb -t table --index=foobar`), so I changed it to take the PGconn instead of just the serverVersion.

I realized that it makes no sense for objfilter to be a global variable; it should instead live inside `main` and be passed as an argument where needed. (Heck, maybe it should be inside vacuumingOpts.) Lastly, it seemed like weird coding for the functions to sometimes exit(1) instead of returning a result code, so I made them return one and have the callers react appropriately. These are all fairly straightforward changes.

So here's v22 with those changes, rebased to current sources. Only the first two patches this time, which are the ones I would be glad to receive input on.

I also wonder if analyze_only and analyze_in_stages should be new values in RunMode rather than separate booleans ... I think that might make the code simpler. I didn't try, though.

--
Álvaro Herrera        Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The gods do not protect fools. Fools are protected
by more capable fools" (Louis Wu, Ringworld)
Attachments
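As an illustration of the PGconn change just described — a sketch only; prepare_repack_command() and its parameter list are invented, while fmtIdEnc() and PQclientEncoding() are the real APIs being referred to:

    #include "postgres_fe.h"

    #include "fe_utils/string_utils.h"
    #include "libpq-fe.h"
    #include "pqexpbuffer.h"

    /*
     * Build a REPACK command, quoting the index name with the connection's
     * client encoding.  This sketch assumes the table name arrives
     * pre-formatted, as in vacuumdb.
     */
    static void
    prepare_repack_command(PQExpBuffer sql, PGconn *conn,
                           const char *table, const char *indexname)
    {
        appendPQExpBuffer(sql, "REPACK %s", table);
        if (indexname != NULL)
            appendPQExpBuffer(sql, " USING INDEX %s",
                              fmtIdEnc(indexname, PQclientEncoding(conn)));
        appendPQExpBufferChar(sql, ';');
    }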
On Thu, Sep 25, 2025 at 3:12 PM, Álvaro Herrera <alvherre@kurilemu.de> wrote:
Some typos I've found in the usage output of pg_repackdb:
+ printf(_(" -n, --schema=SCHEMA repack tables in the specified schema(s) only\n"));
+ printf(_(" -N, --exclude-schema=SCHEMA do not repack tables in the specified schema(s)\n"));
+ printf(_(" -N, --exclude-schema=SCHEMA do not repack tables in the specified schema(s)\n"));
Both options can point to a single schema, so "(s)" should be removed:
"in the specified schema(s)" should be "in the specified schema"
The same occurs in this one, which should be "table", not "table(s)":
+ printf(_(" -t, --table='TABLE' repack specific table(s) only\n"));
regards
Marcos
On Thu, Sep 25, 2025 at 4:21 PM Marcos Pegoraro <marcos@f10.com.br> wrote:
> On Thu, Sep 25, 2025 at 3:12 PM, Álvaro Herrera <alvherre@kurilemu.de> wrote:
>
> Some typos I've found in the usage output of pg_repackdb:
>
> + printf(_(" -n, --schema=SCHEMA repack tables in the specified schema(s) only\n"));
> + printf(_(" -N, --exclude-schema=SCHEMA do not repack tables in the specified schema(s)\n"));
>
> Both options can point to a single schema, so "(s)" should be removed:
> "in the specified schema(s)" should be "in the specified schema".
>
> The same occurs in this one, which should be "table", not "table(s)":
> + printf(_(" -t, --table='TABLE' repack specific table(s) only\n"));

This pattern is used because you can pass more than one argument; for example, something like

pg_repackdb -d pagila -v -n public -n legacy

While I agree that the wording is a little awkward (I'd prefer "repack tables only in the specified schema(s)"), this follows the same pattern as pg_dump and friends.

Robert Treat
https://xzilla.net
On Thu, Sep 25, 2025 at 6:31 PM, Robert Treat <rob@xzilla.net> wrote:
> This pattern is used because you can pass more than one argument; for
> example, something like

I know that.

> While I agree that the wording is a little awkward, this follows the same
> pattern as pg_dump and friends.

Well, I think pg_dump looks wrong too. Because if you explain in the docs that it's a single table or a single schema, why do you write it in the plural in the usage output?
+ Repack or analyze all tables in
+ <replaceable class="parameter">schema</replaceable> only. Multiple
+ schemas can be repacked by writing multiple <option>-n</option>
+ switches.
instead of
+ printf(_(" -n, --schema=SCHEMA repack tables in the specified schema(s) only\n"));
maybe this?
+ printf(_(" -n, --schema=SCHEMA repack tables in the specified schema, can be used several times\n"));
regards
Marcos
Hello!

Álvaro Herrera <alvherre@kurilemu.de>:
> So here's v22 with those changes, rebased to current sources. Only the
> first two patches this time, which are the ones I would be glad to
> receive input on.

> get_tables_to_repack_partitioned(RepackCommand cmd, MemoryContext cluster_context,
>                                  Oid relid, bool rel_is_index)

Should we rename cluster_context to repack_context, to align it with the calling side?

---------

It is 'cmd' in

> static List *get_tables_to_repack(RepackCommand cmd, bool usingindex,
>                                   MemoryContext permcxt);

but 'command' in

> get_tables_to_repack(RepackCommand command, bool usingindex,
>                      MemoryContext permcxt)

---------

> cmd == REPACK_COMMAND_CLUSTER ? "CLUSTER" : "REPACK",

This could be changed to use RepackCommandAsString.

---------

    if (cmd == REPACK_COMMAND_REPACK)
        pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
                                     PROGRESS_REPACK_COMMAND_REPACK);
    else if (cmd == REPACK_COMMAND_CLUSTER)
    {
        pgstat_progress_update_param(PROGRESS_REPACK_COMMAND,
                                     PROGRESS_CLUSTER_COMMAND_CLUSTER);
    }
    else
    ....

The '{' and '}' look a little bit weird here.

---------

The documentation of pg_repackdb contains a lot of "analyze" and even an "--analyze" parameter - but I can't see anything related in the code.

Best regards,
Mikhail.
On Thu, Sep 25, 2025 at 2:12 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
> So here's v22 with those changes, rebased to current sources. Only the
> first two patches this time, which are the ones I would be glad to
> receive input on.

A number of small issues I noticed. I don't know that they all need addressing right now, but it seems worth asking the questions...

#1 "pg_repackdb --help" does not mention the --index option, although the flag is accepted. I'm not sure if this is meant to match clusterdb, but since we need the index option to invoke the clustering behavior, I think it needs to be there.

#2
[xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t customer --index=idx_last_name
pg_repackdb: repacking database "pagila"
INFO:  clustering "public.customer" using sequential scan and sort
[xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t customer
pg_repackdb: repacking database "pagila"
INFO:  vacuuming "public.customer"

This was less confusing once I figured out we could pass the --index option, but even with that it is a little confusing, I think mostly because it looks like we are "vacuuming" the table, which in a world of repack and vacuum (i.e., no VACUUM FULL) doesn't make sense. I think the right thing to do here would be to modify it to say "repacking %s" in both cases, with the "using sequential scan and sort" part as the means to understand which version of repack is being executed.

#3 pg_repackdb does not offer an --analyze option, which ISTM it should, to match the REPACK command.

#4 The SQL-level REPACK help shows:

where option can be one of:
    VERBOSE [ boolean ]
    ANALYSE | ANALYZE

but the SQL-level VACUUM help does:

    VERBOSE [ boolean ]
    ANALYZE [ boolean ]

These operate the same way, so I would expect REPACK to match the language in VACUUM.

#5
[xzilla@zebes] pgsql/bin/pg_repackdb -d pagila -v -t film --index
pg_repackdb: repacking database "pagila"

In the above scenario, I am repacking without having previously specified an index. At the SQL level this would throw an error; at the command line it gives me a heart attack. :-) It's actually not that bad, because we don't actually do anything, but maybe we should throw an error?

#6 On the individual command pages (like sql-repack.html), I think there should be more cross-linking, i.e. repack should probably say "see also cluster" and vice versa. Likely similarly with vacuum and repack.

#7 Is there some reason you chose to intermingle the repack regression tests with the existing tests? I feel like it'd be easier to differentiate potential regressions from new functionality if these were separated.

Robert Treat
https://xzilla.net