Обсуждение: REPACK and naming
I am starting to get worried about the confusing of adding a REPACK command. We already have a lot of confusion around vacuum and analyze: * autoanalyze does vacuum and analyze * VACUUM FULL is much different from VACUUM It seems if we add REPACK as a command, it is somewhere between VACUUM FULL and VACUUM in severity/impact. Should we be rethinking the naming in this area? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
Hi, On 2025-Sep-16, Bruce Momjian wrote: > I am starting to get worried about the confusing of adding a REPACK > command. We already have a lot of confusion around vacuum and analyze: > > * autoanalyze does vacuum and analyze > * VACUUM FULL is much different from VACUUM > > It seems if we add REPACK as a command, it is somewhere between VACUUM > FULL and VACUUM in severity/impact. No, REPACK is exactly where VACUUM FULL is in terms of impact and severity; it's not between anything. The confusion stems precisely from VACUUM being a thing that's a completely different one from VACUUM FULL, yet they have pretty much the same name. What I'm doing is give one of those things a different name, to reduce confusion. Note that there's no intention to add autorepack, because that would (IMO) make no sense. Another thing I'm doing with that patch is rename CLUSTER so that it is also REPACK. This also makes sense, because VACUUM FULL _is_ the same as CLUSTER, except that it follows current physical order instead of following a specific index's order. Peter E suggested that since we have REINDEX to rewrite indexes, then the command to rewrite tables should be RETABLE. I haven't been able to get myself to like that idea, and also I think that was a bit tongue-in-cheek, but if you like RETABLE better than REPACK, then maybe we can have a vote to decide which one of those names to use. However, I don't think that change would make a tremendous difference, and also I don't think RETABLE is enough of an English name to become a command name. > Should we be rethinking the naming in this area? I haven't seen anything that needs renaming TBH, but if you have specific proposals, feel free to air them. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Thou shalt not follow the NULL pointer, for chaos and madness await thee at its end." (2nd Commandment for C programmers)
On Tue, Sep 16, 2025 at 7:42 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Peter E suggested that since we have REINDEX to rewrite indexes, then > the command to rewrite tables should be RETABLE. I haven't been able to > get myself to like that idea, and also I think that was a bit > tongue-in-cheek, but if you like RETABLE better than REPACK, then maybe > we can have a vote to decide which one of those names to use. However, > I don't think that change would make a tremendous difference, and also I > don't think RETABLE is enough of an English name to become a command > name. I think RETABLE is not a proposal to be taken seriously. That's extremely confusing. I don't love the name REPACK, but I think it's good enough. If we come up with something better, great. I agree that having a single command that does both VACUUM FULL and CLUSTER makes a lot more sense than the status quo, which is a confusing historical accident. -- Robert Haas EDB: http://www.enterprisedb.com
Em ter., 16 de set. de 2025 às 23:01, Robert Haas <robertmhaas@gmail.com> escreveu:
I think RETABLE is not a proposal to be taken seriously. That's
extremely confusing.
This feature could be used in a future version to rearrange fields in a table, for better padding.
I don't think we have another one available for this purpose.
CREATE TABLE T(A text, B integer, C bigint, D integer);
We could have something like
RETABLE T USING(B, D, C, A)
So REPACK isn't the best for this, if this feature would exist some day.
regards
Marcos
On Wed, 17 Sept 2025, 03:01 Robert Haas, <robertmhaas@gmail.com> wrote:
On Tue, Sep 16, 2025 at 7:42 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Peter E suggested that since we have REINDEX to rewrite indexes, then
> the command to rewrite tables should be RETABLE. I haven't been able to
> get myself to like that idea, and also I think that was a bit
> tongue-in-cheek, but if you like RETABLE better than REPACK, then maybe
> we can have a vote to decide which one of those names to use. However,
> I don't think that change would make a tremendous difference, and also I
> don't think RETABLE is enough of an English name to become a command
> name.
I think RETABLE is not a proposal to be taken seriously. That's
extremely confusing.
I don't love the name REPACK, but I think it's good enough. If we come
up with something better, great.
COMPACT?
Thom
On Wed, Sep 17, 2025 at 01:42:29AM +0200, Álvaro Herrera wrote: > Hi, > > On 2025-Sep-16, Bruce Momjian wrote: > > > I am starting to get worried about the confusing of adding a REPACK > > command. We already have a lot of confusion around vacuum and analyze: > > > > * autoanalyze does vacuum and analyze > > * VACUUM FULL is much different from VACUUM > > > > It seems if we add REPACK as a command, it is somewhere between VACUUM > > FULL and VACUUM in severity/impact. > > No, REPACK is exactly where VACUUM FULL is in terms of impact and > severity; it's not between anything. The confusion stems precisely from > VACUUM being a thing that's a completely different one from VACUUM FULL, > yet they have pretty much the same name. What I'm doing is give one of > those things a different name, to reduce confusion. Note that there's > no intention to add autorepack, because that would (IMO) make no sense. > > Another thing I'm doing with that patch is rename CLUSTER so that it is > also REPACK. This also makes sense, because VACUUM FULL _is_ the same > as CLUSTER, except that it follows current physical order instead of > following a specific index's order. So the CLUSTER command is removed and people should use REPACK instead? And VACUUM FULL stays unchanged? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
Em ter., 16 de set. de 2025 às 13:40, Bruce Momjian <bruce@momjian.us> escreveu:
I am starting to get worried about the confusing of adding a REPACK
command. We already have a lot of confusion around vacuum and analyze:
* autoanalyze does vacuum and analyze
* VACUUM FULL is much different from VACUUM
It seems if we add REPACK as a command, it is somewhere between VACUUM
FULL and VACUUM in severity/impact. Should we be rethinking the naming
in this area?
SqlServer has similar feature.
SHRINK
best regards,
Ranier Vilela
On Wed, Sep 17, 2025 at 8:04 AM Marcos Pegoraro <marcos@f10.com.br> wrote: > Em ter., 16 de set. de 2025 às 23:01, Robert Haas <robertmhaas@gmail.com> escreveu: >> I think RETABLE is not a proposal to be taken seriously. That's >> extremely confusing. > > This feature could be used in a future version to rearrange fields in a table, for better padding. > I don't think we have another one available for this purpose. > > CREATE TABLE T(A text, B integer, C bigint, D integer); > > We could have something like > RETABLE T USING(B, D, C, A) > > So REPACK isn't the best for this, if this feature would exist some day. RETABLE just isn't a word. The code sometimes calls this a REWRITE of a table, which would be reasonable. I suspect, though, that changing the column order would end up being a form of ALTER TABLE. -- Robert Haas EDB: http://www.enterprisedb.com
On Wednesday, September 17, 2025, Marcos Pegoraro <marcos@f10.com.br> wrote:
Em ter., 16 de set. de 2025 às 23:01, Robert Haas <robertmhaas@gmail.com> escreveu:I think RETABLE is not a proposal to be taken seriously. That's
extremely confusing.This feature could be used in a future version to rearrange fields in a table, for better padding.I don't think we have another one available for this purpose.CREATE TABLE T(A text, B integer, C bigint, D integer);We could have something likeRETABLE T USING(B, D, C, A)
That changes logical aspects of a table and so would be done as part of alter table, IMO. “AT tbl Rearrange columns (names list) “
I’m not a fan of “retable” as a command keyword.
But this digresses from the topic at hand.
I’m fine with repack itself. Deprecating vacuum full would be nice - but actually renaming existing things is bound to just make matters worse, IMO.
Concretely, maybe we should remove vacuum full from the vacuum command page, and just call it out as compatibility spelling of repack on its page. Maybe do the same for cluster (I haven’t dived into the new feature enough to confidently describe all this yet though).
David J.
On Wed, Sep 17, 2025 at 9:01 PM Ranier Vilela <ranier.vf@gmail.com> wrote: > > > > Em ter., 16 de set. de 2025 às 13:40, Bruce Momjian <bruce@momjian.us> escreveu: >> >> I am starting to get worried about the confusing of adding a REPACK >> command. We already have a lot of confusion around vacuum and analyze: >> >> * autoanalyze does vacuum and analyze >> * VACUUM FULL is much different from VACUUM >> >> It seems if we add REPACK as a command, it is somewhere between VACUUM >> FULL and VACUUM in severity/impact. Should we be rethinking the naming >> in this area? > > SqlServer has similar feature. > SHRINK C++ vector has a shrink_to_fit method which seems to serve a similar purpose. REPACK or REBUILD looks good to me, COMPACT, on the other hand, feels more specific to the idea of consolidating free space within a page or block. > > best regards, > Ranier Vilela -- Regards Junwang Zhao
Em qua., 17 de set. de 2025 às 10:17, David G. Johnston <david.g.johnston@gmail.com> escreveu:
That changes logical aspects of a table and so would be done as part of alter table, IMO. “AT tbl Rearrange columns (names list) “
If this command recreates entirely that table, it is not only a logical aspect of that table.
REPACK/RETABLE recreates that table as a replacement for VACUUM FULL/CLUSTER
and ALTER TABLE REARRANGE COLUMNS would recreate that table too ?
and both would have USING INDEX to do what CLUSTER does today ? both would have CONCURRENTLY ?
Being named REPACK or RETABLE or RECREATE TABLE or COMPACT or anything else could do it all.
regards
Marcos
On 2025-Sep-17, David G. Johnston wrote: > That changes logical aspects of a table and so would be done as part of > alter table, IMO. “AT tbl Rearrange columns (names list) “ Yes. > Concretely, maybe we should remove vacuum full from the vacuum command > page, and just call it out as compatibility spelling of repack on its > page. Maybe do the same for cluster (I haven’t dived into the new feature > enough to confidently describe all this yet though). I think we should list VACUUM FULL as deprecated, document that feature in the REPACK documentation page, and leave VACUUM FULL in working state so as not to break existing scripts. Same for CLUSTER. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Those who use electric razors are infidels destined to burn in hell while we drink from rivers of beer, download free vids and mingle with naked well shaved babes." (http://slashdot.org/comments.pl?sid=44793&cid=4647152)
On 2025-Sep-17, Bruce Momjian wrote: > So the CLUSTER command is removed and people should use REPACK instead? > And VACUUM FULL stays unchanged? No, not removed. It's going to stay, to avoid breaking scripts. People should use REPACK on new code going forward, but existing code is not going to break. Same with VACUUM FULL. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Uno puede defenderse de los ataques; contra los elogios se esta indefenso"
On Wednesday, September 17, 2025, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Sep-17, David G. Johnston wrote:
> Concretely, maybe we should remove vacuum full from the vacuum command
> page, and just call it out as compatibility spelling of repack on its
> page. Maybe do the same for cluster (I haven’t dived into the new feature
> enough to confidently describe all this yet though).
I think we should list VACUUM FULL as deprecated, document that feature
in the REPACK documentation page, and leave VACUUM FULL in working state
so as not to break existing scripts. Same for CLUSTER.
I was unclear - this is indeed what I suggesting as well. Reframe the documentation but leave the commands functioning.
David J.
Ranier Vilela <ranier.vf@gmail.com>: > SqlServer has similar feature. > SHRINK MySQL/MariaDB OPTIMIZE TABLE table_name SQL Server ALTER TABLE table_name REBUILD DBCC SHRINKFILE DBCC SHRINKDATABASE Oracle ALTER TABLE table_name SHRINK SPACE SQLite VACUUM IBM DB2 REORG TABLE table_name Sybase ASE REORG REBUILD table_name Best regards, Mikhail.
=?utf-8?Q?=C3=81lvaro?= Herrera <alvherre@alvh.no-ip.org> writes: > On 2025-Sep-17, David G. Johnston wrote: >> Concretely, maybe we should remove vacuum full from the vacuum command >> page, and just call it out as compatibility spelling of repack on its >> page. Maybe do the same for cluster (I haven’t dived into the new feature >> enough to confidently describe all this yet though). > I think we should list VACUUM FULL as deprecated, document that feature > in the REPACK documentation page, and leave VACUUM FULL in working state > so as not to break existing scripts. Same for CLUSTER. I'm not at all in love with documenting VACUUM FULL and CLUSTER as being fundamentally the same thing. I think that is an implementation happenstance that could go away as easily as it appeared. Even if you think we'll never again rewrite it for heap, what of other table AMs? The underlying reality could be totally different for them. By and large, I don't think I like this renaming proposal. Maybe eventually it would reduce confusion, but there will be a long interval where it adds more. regards, tom lane
On Wed, Sep 17, 2025 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > By and large, I don't think I like this renaming proposal. > Maybe eventually it would reduce confusion, but there will be > a long interval where it adds more. I mean, it's PRETTY confusing that VACUUM FULL does something much more similar to CLUSTER than it is to VACUUM. We can't ever get out from under that confusion if we don't change something. I think it's more than fair to bikeshed what the verb should be that describes the action we currently describe by writing either VACUUM FULL or CLUSTER, but I agree with Álvaro that having one verb for both of those things makes a lot more sense than the status quo. -- Robert Haas EDB: http://www.enterprisedb.com
On 2025-Sep-17, Tom Lane wrote: > I'm not at all in love with documenting VACUUM FULL and CLUSTER as > being fundamentally the same thing. I think that is an implementation > happenstance that could go away as easily as it appeared. Even if you > think we'll never again rewrite it for heap, what of other table AMs? > The underlying reality could be totally different for them. So there two operations here. One is REPACK tab USING INDEX idx which we currently call CLUSTER, and there is also REPACK TAB (no index specified) which we currently call VACUUM FULL. These have the very specific charter of rewriting the table while removing bloat, the distinction being that they keep the rows ordered according to the index or not. Both these operations currently use the same implementation, yes; but if we were to reimplement one of them to use some completely different piece of code, then the new command name continues to work, it just calls the new different implementation, while the other command continues to call the other one. (Or maybe we decide reimplement both using different techniques, and we throw away cluster.c, but still the command names continue to be sensible and would continue to work.) Thinking about the other half of your argument: if we add new table AMs for which the cluster.c implementation doesn't work, then we'll have to wire the table AM support routines to call some different implementation into REPACK or REPACK USING INDEX. This is no different than if we keep these commands being VACUUM FULL or CLUSTER; we would still need a different implementation underneath, and we would still need to wire the table AM support routines to call that different implementation. So all things considered, I'm not seeing what aspect of the renaming exactly are you uncomfortable with. We're not making the situation any worse. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ <Schwern> It does it in a really, really complicated way <crab> why does it need to be complicated? <Schwern> Because it's MakeMaker.
On Thu, 18 Sept 2025 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote: > RETABLE just isn't a word. The code sometimes calls this a REWRITE of > a table, which would be reasonable. +1. I was reading this yesterday wondering why "REWRITE" didn't get a mention. The problem I have with REPACK is that "re" indicates that something is being re-done that's been done before. If you're calling REPACK for the first time on a table, that's not true. David J's "REBUILD" also seems ok. In a green field, you could then have "REBUILD TABLE ..." and "REBUILD INDEX ..." David
On Wednesday, September 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 18 Sept 2025 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:
> RETABLE just isn't a word. The code sometimes calls this a REWRITE of
> a table, which would be reasonable.
+1. I was reading this yesterday wondering why "REWRITE" didn't get a
mention.
Agreed.
The problem I have with REPACK is that "re" indicates that
something is being re-done that's been done before. If you're calling
REPACK for the first time on a table, that's not true.
As soon as you’ve written the first tuple you’ve begun “packing” the table - repack then is simply unpacking it and putting back the stuff you want to keep in possibly a structured way.
David J's "REBUILD" also seems ok. In a green field, you could then
have "REBUILD TABLE ..." and "REBUILD INDEX ..."
Rebuild has some prior art apparently, which makes it appealing. But I’m not a fan of the “shrink” usage the other products seem drawn to.
David J.
On Wed, Sep 17, 2025 at 10:59:55AM -0400, Robert Haas wrote: > On Wed, Sep 17, 2025 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > By and large, I don't think I like this renaming proposal. > > Maybe eventually it would reduce confusion, but there will be > > a long interval where it adds more. > > I mean, it's PRETTY confusing that VACUUM FULL does something much > more similar to CLUSTER than it is to VACUUM. We can't ever get out > from under that confusion if we don't change something. I think it's > more than fair to bikeshed what the verb should be that describes the > action we currently describe by writing either VACUUM FULL or CLUSTER, > but I agree with Álvaro that having one verb for both of those things > makes a lot more sense than the status quo. Yeah, I think we are in a bad naming place here with VACUUM FULL and CLUSTER, and avoiding clarifying it because of a risk of future change just seems unwise. We know it is confusing, and there are no plans to expand VACUUM FULL or CLUSTER, so let's improve the user interface and take the hit if things ever change in the future. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
On Wed, Sep 17, 2025 at 05:05:38PM -0400, Bruce Momjian wrote: > On Wed, Sep 17, 2025 at 10:59:55AM -0400, Robert Haas wrote: > > I mean, it's PRETTY confusing that VACUUM FULL does something much > > more similar to CLUSTER than it is to VACUUM. We can't ever get out > > from under that confusion if we don't change something. I think it's > > more than fair to bikeshed what the verb should be that describes the > > action we currently describe by writing either VACUUM FULL or CLUSTER, > > but I agree with Álvaro that having one verb for both of those things > > makes a lot more sense than the status quo. > > Yeah, I think we are in a bad naming place here with VACUUM FULL and > CLUSTER, and avoiding clarifying it because of a risk of future change > just seems unwise. We know it is confusing, and there are no plans to > expand VACUUM FULL or CLUSTER, so let's improve the user interface and > take the hit if things ever change in the future. Uh, if VACUUM FULL and CLUSTER functionality are going to eventually point to REPACK, REPACK is the wrong name. While I can see REPACK having a similar function to VACUUM FULL, the ordering idea of CLUSTER just doesn't fit into "repack". I am repacking in a specific order? What if I only want the order changed without repacking? When did I pack that I have to re-pack now? Of the names I have seen, I think REBUILD makes the most sense. I built the table --- I want it rebuilt now, and perhaps with a specific ordering. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
On 18/09/2025 16:56, Bruce Momjian wrote: > On Wed, Sep 17, 2025 at 05:05:38PM -0400, Bruce Momjian wrote: >> On Wed, Sep 17, 2025 at 10:59:55AM -0400, Robert Haas wrote: >>> I mean, it's PRETTY confusing that VACUUM FULL does something much >>> more similar to CLUSTER than it is to VACUUM. We can't ever get out >>> from under that confusion if we don't change something. I think it's >>> more than fair to bikeshed what the verb should be that describes the >>> action we currently describe by writing either VACUUM FULL or CLUSTER, >>> but I agree with Álvaro that having one verb for both of those things >>> makes a lot more sense than the status quo. >> Yeah, I think we are in a bad naming place here with VACUUM FULL and >> CLUSTER, and avoiding clarifying it because of a risk of future change >> just seems unwise. We know it is confusing, and there are no plans to >> expand VACUUM FULL or CLUSTER, so let's improve the user interface and >> take the hit if things ever change in the future. > Uh, if VACUUM FULL and CLUSTER functionality are going to eventually > point to REPACK, REPACK is the wrong name. While I can see REPACK > having a similar function to VACUUM FULL, the ordering idea of CLUSTER > just doesn't fit into "repack". I am repacking in a specific order? > What if I only want the order changed without repacking? When did I > pack that I have to re-pack now? > > Of the names I have seen, I think REBUILD makes the most sense. I built > the table --- I want it rebuilt now, and perhaps with a specific > ordering. +1 -- Vik Fearing
On 2025-Sep-18, Bruce Momjian wrote: > Uh, if VACUUM FULL and CLUSTER functionality are going to eventually > point to REPACK, REPACK is the wrong name. I obviously don't agree. > While I can see REPACK having a similar function to VACUUM FULL, the > ordering idea of CLUSTER just doesn't fit into "repack". In my opinion, it fits perfectly well. > I am repacking in a specific order? What if I only want the order > changed without repacking? Yes, you can ask the repacking process to write the tuples in a specific order -- that's a perfectly reasonable thing to ask for, and nobody would be confused about what that would do. You want the order changed but not repacking? I don't see why would anybody think that that's something to ask for. > When did I pack that I have to re-pack now? David Johnston said it here best: https://postgr.es/m/CAKFQuwYbZ2c7viape0B+TAoa_t8WteNfu+RF8+3i=D1ZQZQFAg@mail.gmail.com you packed when you started adding tuples to the table. Repack takes what you have "packed" and puts it in order again. > Of the names I have seen, I think REBUILD makes the most sense. I > built the table --- I want it rebuilt now, and perhaps with a specific > ordering. I frankly don't like REBUILD one bit. Nobody BUILDs tables either. One thing we do build is indexes, but a table is not an index. (I was not very impressed with David Rowley's idea that we could have REBUILD INDEX as a substitute for REINDEX, if we were inventing these in a green field. Two reasons for that: one is that REINDEX is not only a Postgres-established term, but also an industry-wide established term. The other is that REINDEX also supports REINDEX TABLE, which doesn't literally mean to reindex the table but instead of reindex all the _indexes_ of the table; if we wanted to use REBUILD for that we would have to do something like "REBUILD INDEXES OF TABLE xyz" which, as you see, gets unwieldy. And for what?) I highly doubt that anybody is going to be confused in the slightest about what the REPACK command does, or that they would start to think about when did they PACK anything before. That sort of argument seems entirely artificial, and that is not how people think. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "The ability of users to misuse tools is, of course, legendary" (David Steele) https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net
On 2025-Sep-17, David G. Johnston wrote: > On Wednesday, September 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote: > > > On Thu, 18 Sept 2025 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote: > > > RETABLE just isn't a word. The code sometimes calls this a REWRITE of > > > a table, which would be reasonable. > > > > +1. I was reading this yesterday wondering why "REWRITE" didn't get a > > mention. > > Agreed. That's true, but ALTER TABLE also rewrites the table in some cases, so the words used inside the source code shouldn't really be taken as guidance for what to call the user-visible features. Consider for instance that the code that implements this whole thing lives in cluster.c, and we're not going to change that file name, at least not yet. I'm not particularly excited about REWRITE, as it sounds a little bit too generic, plus we already have the "query rewriter" which rewrites, and I think it would be good that we stop overloading terms for completely different things. > > The problem I have with REPACK is that "re" indicates that something > > is being re-done that's been done before. If you're calling REPACK > > for the first time on a table, that's not true. > > As soon as you’ve written the first tuple you’ve begun “packing” the > table - repack then is simply unpacking it and putting back the stuff > you want to keep in possibly a structured way. Yeah, that's how I see it. > > David J's "REBUILD" also seems ok. In a green field, you could then > > have "REBUILD TABLE ..." and "REBUILD INDEX ..." Yeah, but then we have REINDEX TABLE or REINDEX SYSTEM, so if you wanted to do all of REINDEX with REPACK or REBUILD, you would need more syntax than just that. It had never occurred to me that we could use this new command (whatever its name) to rebuild indexes without rebuilding the table. I'm not proposing that. Do we actually want to entertain such a proposal? > Rebuild has some prior art apparently, which makes it appealing. Can you cite that? I've seen "ALTER TABLE/INDEX REBUILD", but not REBUILD as a standalone command. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On Thu, Sep 18, 2025 at 3:55 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Sep-17, David G. Johnston wrote:
> On Wednesday, September 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
>
> > On Thu, 18 Sept 2025 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:
> > > RETABLE just isn't a word. The code sometimes calls this a REWRITE of
> > > a table, which would be reasonable.
> >
> > +1. I was reading this yesterday wondering why "REWRITE" didn't get a
> > mention.
>
I'm not particularly excited about REWRITE, as it sounds a little bit
too generic, plus we already have the "query rewriter" which rewrites,
and I think it would be good that we stop overloading terms for
completely different things.
Agreed
> Rebuild has some prior art apparently, which makes it appealing.
Can you cite that? I've seen "ALTER TABLE/INDEX REBUILD", but not
REBUILD as a standalone command.
I was just skimming the list Mikhail provided.
It isn't standalone but the keyword REBUILD gets used versus our unlisted use of REPACK.
But there isn't enough consistency/similarity for me to want to avoid repack altogether.
David J.
On Thu, 18 Sept 2025 at 03:03, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > So there two operations here. One is > REPACK tab USING INDEX idx > which we currently call CLUSTER, and there is also > REPACK TAB > (no index specified) which we currently call VACUUM FULL. I was just thinking about how much of a heap-ism cluster using an index is. If we were to ever have an index organised table AM, what would it mean to REPACK tab USING INDEX idx? Would that "secondary" index then go away and the table would become that index? or would both continue to exist and the secondary index would be surplus? I do understand that heap is well ingrained in our code (still), but at least things like system catalogue tables/columns can evolve over time. e.g pg_index.indisclustered I could imagine evolving (or disappearing) if we had an IOT-AM. I do think locking in syntax is going to be quite a bit more permanent and needs to be considered very carefully. Something like REPACK tab ORDER BY col1; seems a bit more future proof. table_relation_copy_for_cluster() does support both use of an Index to get presorted results and sorting by the index's key columns, so it doesn't seem impossible that the ability to cluster a table *specifically* by an index couldn't easily go away at some point. Locking us deeper into a syntax for that, I do have concerns for. But maybe you've thought about all this already and I'm just not aware... I'm also trying to keep something like a column store in mind here where you might not have any indexes, and efficient filtering is done via the pruning of "chunks", which works by each chunk recording the min/max (or maybe a dictionary of) values it contains for the columns. I imagine something like that very much would want the ability to have something like REPACK tbl ORDER BY col; if you think how efficient run-length encoding would be for some orders and now inefficient it could be for other orders. Anyway, I'm not intentionally trying to make your job here any more complex. I'm just trying to help make sure we don't end up with some new syntax that also won't stand up to the test of time. David
On 2025-Sep-19, David Rowley wrote: > I was just thinking about how much of a heap-ism cluster using an > index is. If we were to ever have an index organised table AM, what > would it mean to REPACK tab USING INDEX idx? Would that "secondary" > index then go away and the table would become that index? or would > both continue to exist and the secondary index would be surplus? So, there's already an implementation of an index-organized table in OrioleDB, as I understand, so maybe we can ask Alexander K. about this. I suspect it's fine to say that if you have a table for which it makes no sense to use REPACK USING INDEX, then we just throw an error in that case (but I suppose plain REPACK continues to work, and it just recreates/compacts the primary index and rebuilds all secondary indexes, just like VACUUM FULL would presumably do.) > I do understand that heap is well ingrained in our code (still), but > at least things like system catalogue tables/columns can evolve over > time. e.g pg_index.indisclustered I could imagine evolving (or > disappearing) if we had an IOT-AM. I do think locking in syntax is > going to be quite a bit more permanent and needs to be considered very > carefully. Something like REPACK tab ORDER BY col1; seems a bit more > future proof. Oh, I think we can implement REPACK tab ORDER BY all right -- do note that the current syntax has mandatory USING INDEX keywords (unlike CLUSTER), so we can add that feature and others with no grammar problems. In fact even for current heaps it might make sense to allow an ORDER BY clause for which there's no index. I don't see us gratuituously removing the option of specifying just an index name (or indisclustered), though, because there are likely users that have been running that for years. > table_relation_copy_for_cluster() does support both use > of an Index to get presorted results and sorting by the index's key > columns, so it doesn't seem impossible that the ability to cluster a > table *specifically* by an index couldn't easily go away at some > point. Well, I hope you mean that clustering by an index would stop being the _only_ way, not that it would completely disappear as an option. > Locking us deeper into a syntax for that, I do have concerns for. But > maybe you've thought about all this already and I'm just not aware... At this point we're not *implementing* any of that, but it is possible to do so afterwards and we're not blocking that road. > I'm also trying to keep something like a column store in mind here > where you might not have any indexes, and efficient filtering is done > via the pruning of "chunks", which works by each chunk recording the > min/max (or maybe a dictionary of) values it contains for the columns. > I imagine something like that very much would want the ability to have > something like REPACK tbl ORDER BY col; if you think how efficient > run-length encoding would be for some orders and now inefficient it > could be for other orders. That makes sense, yes, and again, AFAICT it can easily be implemented on top of the current work. > Anyway, I'm not intentionally trying to make your job here any more > complex. I'm just trying to help make sure we don't end up with some > new syntax that also won't stand up to the test of time. The time you and others spend on this thread is much appreciated. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "If you want to have good ideas, you must have many ideas. Most of them will be wrong, and what you have to learn is which ones to throw away." (Linus Pauling)
On Fri, 19 Sept 2025 at 21:36, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2025-Sep-19, David Rowley wrote: > > > I was just thinking about how much of a heap-ism cluster using an > > index is. If we were to ever have an index organised table AM, what > > would it mean to REPACK tab USING INDEX idx? Would that "secondary" > > index then go away and the table would become that index? or would > > both continue to exist and the secondary index would be surplus? > > So, there's already an implementation of an index-organized table in > OrioleDB, as I understand, so maybe we can ask Alexander K. about this. > I suspect it's fine to say that if you have a table for which it makes > no sense to use REPACK USING INDEX, then we just throw an error in that > case (but I suppose plain REPACK continues to work, and it just > recreates/compacts the primary index and rebuilds all secondary indexes, > just like VACUUM FULL would presumably do.) What I was wondering was if REPACK tab ORDER BY col; would be the way to change the IOT's order. Otherwise, I'm not sure how you'd make that work aside from providing something like: SELECT pg_recluster_iot_by(....) > > table_relation_copy_for_cluster() does support both use > > of an Index to get presorted results and sorting by the index's key > > columns, so it doesn't seem impossible that the ability to cluster a > > table *specifically* by an index couldn't easily go away at some > > point. > > Well, I hope you mean that clustering by an index would stop being the > _only_ way, not that it would completely disappear as an option. I mean support clustered orders that don't exist in any defined index. David
Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > On 2025-Sep-19, David Rowley wrote: > > > I was just thinking about how much of a heap-ism cluster using an > > index is. If we were to ever have an index organised table AM, what > > would it mean to REPACK tab USING INDEX idx? Would that "secondary" > > index then go away and the table would become that index? or would > > both continue to exist and the secondary index would be surplus? > > So, there's already an implementation of an index-organized table in > OrioleDB, as I understand, so maybe we can ask Alexander K. about this. > I suspect it's fine to say that if you have a table for which it makes > no sense to use REPACK USING INDEX, then we just throw an error in that > case (but I suppose plain REPACK continues to work, and it just > recreates/compacts the primary index and rebuilds all secondary indexes, > just like VACUUM FULL would presumably do.) > > > I do understand that heap is well ingrained in our code (still), but > > at least things like system catalogue tables/columns can evolve over > > time. e.g pg_index.indisclustered I could imagine evolving (or > > disappearing) if we had an IOT-AM. I do think locking in syntax is > > going to be quite a bit more permanent and needs to be considered very > > carefully. Something like REPACK tab ORDER BY col1; seems a bit more > > future proof. > > Oh, I think we can implement REPACK tab ORDER BY all right -- do note > that the current syntax has mandatory USING INDEX keywords (unlike > CLUSTER), so we can add that feature and others with no grammar > problems. In fact even for current heaps it might make sense to allow > an ORDER BY clause for which there's no index. I don't see us > gratuituously removing the option of specifying just an index name (or > indisclustered), though, because there are likely users that have been > running that for years. > > > table_relation_copy_for_cluster() does support both use > > of an Index to get presorted results and sorting by the index's key > > columns, so it doesn't seem impossible that the ability to cluster a > > table *specifically* by an index couldn't easily go away at some > > point. > > Well, I hope you mean that clustering by an index would stop being the > _only_ way, not that it would completely disappear as an option. > > > Locking us deeper into a syntax for that, I do have concerns for. But > > maybe you've thought about all this already and I'm just not aware... > > At this point we're not *implementing* any of that, but it is possible > to do so afterwards and we're not blocking that road. > > > I'm also trying to keep something like a column store in mind here > > where you might not have any indexes, and efficient filtering is done > > via the pruning of "chunks", which works by each chunk recording the > > min/max (or maybe a dictionary of) values it contains for the columns. > > I imagine something like that very much would want the ability to have > > something like REPACK tbl ORDER BY col; if you think how efficient > > run-length encoding would be for some orders and now inefficient it > > could be for other orders. > > That makes sense, yes, and again, AFAICT it can easily be implemented on > top of the current work. Admittedly I haven't thought about clause like ORDER BY yet, but I wonder if it'd really be useful. My understanding is that the purpose of clustering is to make index scan more efficient: with a clustered table, the heap tuples pertaining to given index tuple should be located on the same page, so the heap access is not that random. If IOT-AM table does not have anything like index, I imagine it has some kind of ordering information in the system catalog. Without that the query planner can hardly utilize the ordering. In such case REPACK should use the catalog information on ordering rather than accept arbitrary ORDER BY clause. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Fri, 19 Sept 2025 at 23:58, Antonin Houska <ah@cybertec.at> wrote: > Admittedly I haven't thought about clause like ORDER BY yet, but I wonder if > it'd really be useful. My understanding is that the purpose of clustering is > to make index scan more efficient: with a clustered table, the heap tuples > pertaining to given index tuple should be located on the same page, so the > heap access is not that random. I imagine that's true most of the time, but it could also be so that fewer pages are dirtied when an UPDATE updates a set or rows with the same or similar clustered column values. > If IOT-AM table does not have anything like index, I imagine it has some kind > of ordering information in the system catalog. Without that the query planner > can hardly utilize the ordering. In such case REPACK should use the catalog > information on ordering rather than accept arbitrary ORDER BY clause. Well, it would be impossible to insert records without some metadata to indicate the IOT keys... You might assume that someone might change their mind one day about the chosen order and wish to change it. My point was about leaving the door open to support that by having some native syntax that could be used to trigger that change. David
David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 19 Sept 2025 at 23:58, Antonin Houska <ah@cybertec.at> wrote: > > Admittedly I haven't thought about clause like ORDER BY yet, but I wonder if > > it'd really be useful. My understanding is that the purpose of clustering is > > to make index scan more efficient: with a clustered table, the heap tuples > > pertaining to given index tuple should be located on the same page, so the > > heap access is not that random. > > I imagine that's true most of the time, but it could also be so that > fewer pages are dirtied when an UPDATE updates a set or rows with the > same or similar clustered column values. Good point. > > If IOT-AM table does not have anything like index, I imagine it has some kind > > of ordering information in the system catalog. Without that the query planner > > can hardly utilize the ordering. In such case REPACK should use the catalog > > information on ordering rather than accept arbitrary ORDER BY clause. > > Well, it would be impossible to insert records without some metadata > to indicate the IOT keys... > > You might assume that someone might change their mind one day about > the chosen order and wish to change it. My point was about leaving the > door open to support that by having some native syntax that could be > used to trigger that change. I doubted whether the current AM API is designed to do catalog changes, but then recalled that CLUSTER does set pg_index.indisclustered, and that it does so outside table_relation_copy_for_cluster(). So I can now imagine that REPACK ... ORDER BY can do something like that. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2025-Sep-19, David Rowley wrote: > On Fri, 19 Sept 2025 at 21:36, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > So, there's already an implementation of an index-organized table in > > OrioleDB, as I understand, so maybe we can ask Alexander K. about this. > > I suspect it's fine to say that if you have a table for which it makes > > no sense to use REPACK USING INDEX, then we just throw an error in that > > case (but I suppose plain REPACK continues to work, and it just > > recreates/compacts the primary index and rebuilds all secondary indexes, > > just like VACUUM FULL would presumably do.) > > What I was wondering was if REPACK tab ORDER BY col; would be the way > to change the IOT's order. Sure -- the IOT is the "primary index", so what you are saying is, I think, the same as what I was saying. > > Well, I hope you mean that clustering by an index would stop being the > > _only_ way, not that it would completely disappear as an option. > > I mean support clustered orders that don't exist in any defined index. Yes: I mean, I imagine that an implementation of REPACK tab ORDER BY col2 (on normal heap tables) when there's no index on col2 would just use a seqscan-and-sort to put the tably in that physical order, and potentially store somewhere that "col2" is the order that was used. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Selbst das größte Genie würde nicht weit kommen, wenn es alles seinem eigenen Innern verdanken wollte." (Johann Wolfgang von Goethe) Ni aún el genio más grande llegaría muy lejos si quisiera sacarlo todo de su propio interior.
On 2025-Sep-19, Antonin Houska wrote: > Admittedly I haven't thought about clause like ORDER BY yet, but I wonder if > it'd really be useful. My understanding is that the purpose of clustering is > to make index scan more efficient: Not necessarily. For some queries in some workloads, having tuples in a certain order for a seqscan might give considerable performance benefit also. Moreso with, say, BRIN indexes, where having one tuple in one page range or another could mean having to scan that page range or eliding it completely. > with a clustered table, the heap tuples > pertaining to given index tuple should be located on the same page, so the > heap access is not that random. Yes, I suppose this is the first-order reason, and probably why we currently only support basing clustering on an index. But I doubt it's the only one. (It's also worth pointing out that quite possibly having REPACK CONCURRENTLY is going to make clustering a lot more popular; without concurrency, clustering is practically useless.) > If IOT-AM table does not have anything like index, I imagine it has some kind > of ordering information in the system catalog. Without that the query planner > can hardly utilize the ordering. Sure. > In such case REPACK should use the catalog information on ordering > rather than accept arbitrary ORDER BY clause. ... but, as David said, it might be valuable to change that ordering for whatever reason. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
On 2025-09-18 Th 6:53 PM, Álvaro Herrera wrote:
I highly doubt that anybody is going to be confused in the slightest about what the REPACK command does, or that they would start to think about when did they PACK anything before. That sort of argument seems entirely artificial, and that is not how people think.
I agree.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com