Обсуждение: REPACK and naming

Поиск
Список
Период
Сортировка

REPACK and naming

От
Bruce Momjian
Дата:
I am starting to get worried about the confusing of adding a REPACK
command.  We already have a lot of confusion around vacuum and analyze:

*  autoanalyze does vacuum and analyze
*  VACUUM FULL is much different from VACUUM

It seems if we add REPACK as a command, it is somewhere between VACUUM
FULL and VACUUM in severity/impact.  Should we be rethinking the naming
in this area?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: REPACK and naming

От
Álvaro Herrera
Дата:
Hi,

On 2025-Sep-16, Bruce Momjian wrote:

> I am starting to get worried about the confusing of adding a REPACK
> command.  We already have a lot of confusion around vacuum and analyze:
> 
> *  autoanalyze does vacuum and analyze
> *  VACUUM FULL is much different from VACUUM
> 
> It seems if we add REPACK as a command, it is somewhere between VACUUM
> FULL and VACUUM in severity/impact.

No, REPACK is exactly where VACUUM FULL is in terms of impact and
severity; it's not between anything.  The confusion stems precisely from
VACUUM being a thing that's a completely different one from VACUUM FULL,
yet they have pretty much the same name.  What I'm doing is give one of
those things a different name, to reduce confusion.  Note that there's
no intention to add autorepack, because that would (IMO) make no sense.

Another thing I'm doing with that patch is rename CLUSTER so that it is
also REPACK.  This also makes sense, because VACUUM FULL _is_ the same
as CLUSTER, except that it follows current physical order instead of
following a specific index's order.


Peter E suggested that since we have REINDEX to rewrite indexes, then
the command to rewrite tables should be RETABLE.  I haven't been able to
get myself to like that idea, and also I think that was a bit
tongue-in-cheek, but if you like RETABLE better than REPACK, then maybe
we can have a vote to decide which one of those names to use.  However,
I don't think that change would make a tremendous difference, and also I
don't think RETABLE is enough of an English name to become a command
name.

> Should we be rethinking the naming in this area?

I haven't seen anything that needs renaming TBH, but if you have
specific proposals, feel free to air them.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Thou shalt not follow the NULL pointer, for chaos and madness await
thee at its end." (2nd Commandment for C programmers)



Re: REPACK and naming

От
Robert Haas
Дата:
On Tue, Sep 16, 2025 at 7:42 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Peter E suggested that since we have REINDEX to rewrite indexes, then
> the command to rewrite tables should be RETABLE.  I haven't been able to
> get myself to like that idea, and also I think that was a bit
> tongue-in-cheek, but if you like RETABLE better than REPACK, then maybe
> we can have a vote to decide which one of those names to use.  However,
> I don't think that change would make a tremendous difference, and also I
> don't think RETABLE is enough of an English name to become a command
> name.

I think RETABLE is not a proposal to be taken seriously. That's
extremely confusing.

I don't love the name REPACK, but I think it's good enough. If we come
up with something better, great.

I agree that having a single command that does both VACUUM FULL and
CLUSTER makes a lot more sense than the status quo, which is a
confusing historical accident.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: REPACK and naming

От
Marcos Pegoraro
Дата:
Em ter., 16 de set. de 2025 às 23:01, Robert Haas <robertmhaas@gmail.com> escreveu:
I think RETABLE is not a proposal to be taken seriously. That's
extremely confusing.

This feature could be used in a future version to rearrange fields in a table, for better padding.
I don't think we have another one available for this purpose.

CREATE TABLE T(A text, B integer, C bigint, D integer);

We could have something like
RETABLE T USING(B, D, C, A)

So REPACK isn't the best for this, if this feature would exist some day.

regards
Marcos

Re: REPACK and naming

От
Thom Brown
Дата:
On Wed, 17 Sept 2025, 03:01 Robert Haas, <robertmhaas@gmail.com> wrote:
On Tue, Sep 16, 2025 at 7:42 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Peter E suggested that since we have REINDEX to rewrite indexes, then
> the command to rewrite tables should be RETABLE.  I haven't been able to
> get myself to like that idea, and also I think that was a bit
> tongue-in-cheek, but if you like RETABLE better than REPACK, then maybe
> we can have a vote to decide which one of those names to use.  However,
> I don't think that change would make a tremendous difference, and also I
> don't think RETABLE is enough of an English name to become a command
> name.

I think RETABLE is not a proposal to be taken seriously. That's
extremely confusing.

I don't love the name REPACK, but I think it's good enough. If we come
up with something better, great.

COMPACT?

Thom

Re: REPACK and naming

От
Bruce Momjian
Дата:
On Wed, Sep 17, 2025 at 01:42:29AM +0200, Álvaro Herrera wrote:
> Hi,
> 
> On 2025-Sep-16, Bruce Momjian wrote:
> 
> > I am starting to get worried about the confusing of adding a REPACK
> > command.  We already have a lot of confusion around vacuum and analyze:
> > 
> > *  autoanalyze does vacuum and analyze
> > *  VACUUM FULL is much different from VACUUM
> > 
> > It seems if we add REPACK as a command, it is somewhere between VACUUM
> > FULL and VACUUM in severity/impact.
> 
> No, REPACK is exactly where VACUUM FULL is in terms of impact and
> severity; it's not between anything.  The confusion stems precisely from
> VACUUM being a thing that's a completely different one from VACUUM FULL,
> yet they have pretty much the same name.  What I'm doing is give one of
> those things a different name, to reduce confusion.  Note that there's
> no intention to add autorepack, because that would (IMO) make no sense.
> 
> Another thing I'm doing with that patch is rename CLUSTER so that it is
> also REPACK.  This also makes sense, because VACUUM FULL _is_ the same
> as CLUSTER, except that it follows current physical order instead of
> following a specific index's order.

So the CLUSTER command is removed and people should use REPACK instead? 
And VACUUM FULL stays unchanged?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: REPACK and naming

От
Ranier Vilela
Дата:


Em ter., 16 de set. de 2025 às 13:40, Bruce Momjian <bruce@momjian.us> escreveu:
I am starting to get worried about the confusing of adding a REPACK
command.  We already have a lot of confusion around vacuum and analyze:

*  autoanalyze does vacuum and analyze
*  VACUUM FULL is much different from VACUUM

It seems if we add REPACK as a command, it is somewhere between VACUUM
FULL and VACUUM in severity/impact.  Should we be rethinking the naming
in this area?
SqlServer has similar feature.
SHRINK

best regards,
Ranier Vilela

Re: REPACK and naming

От
Robert Haas
Дата:
On Wed, Sep 17, 2025 at 8:04 AM Marcos Pegoraro <marcos@f10.com.br> wrote:
> Em ter., 16 de set. de 2025 às 23:01, Robert Haas <robertmhaas@gmail.com> escreveu:
>> I think RETABLE is not a proposal to be taken seriously. That's
>> extremely confusing.
>
> This feature could be used in a future version to rearrange fields in a table, for better padding.
> I don't think we have another one available for this purpose.
>
> CREATE TABLE T(A text, B integer, C bigint, D integer);
>
> We could have something like
> RETABLE T USING(B, D, C, A)
>
> So REPACK isn't the best for this, if this feature would exist some day.

RETABLE just isn't a word. The code sometimes calls this a REWRITE of
a table, which would be reasonable. I suspect, though, that changing
the column order would end up being a form of ALTER TABLE.

--
Robert Haas
EDB: http://www.enterprisedb.com



REPACK and naming

От
"David G. Johnston"
Дата:
On Wednesday, September 17, 2025, Marcos Pegoraro <marcos@f10.com.br> wrote:
Em ter., 16 de set. de 2025 às 23:01, Robert Haas <robertmhaas@gmail.com> escreveu:
I think RETABLE is not a proposal to be taken seriously. That's
extremely confusing.

This feature could be used in a future version to rearrange fields in a table, for better padding.
I don't think we have another one available for this purpose.

CREATE TABLE T(A text, B integer, C bigint, D integer);

We could have something like
RETABLE T USING(B, D, C, A)

That changes logical aspects of a table and so would be done as part of alter table, IMO. “AT tbl Rearrange columns (names list) “

I’m not a fan of “retable” as a command keyword.

But this digresses from the topic at hand.

I’m fine with repack itself.  Deprecating vacuum full would be nice - but actually renaming existing things is bound to just make matters worse, IMO.

Concretely, maybe we should remove vacuum full from the vacuum command page, and just call it out as compatibility spelling of repack on its page.  Maybe do the same for cluster (I haven’t dived into the new feature enough to confidently describe all this yet though).

David J.

Re: REPACK and naming

От
Junwang Zhao
Дата:
On Wed, Sep 17, 2025 at 9:01 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
>
>
>
> Em ter., 16 de set. de 2025 às 13:40, Bruce Momjian <bruce@momjian.us> escreveu:
>>
>> I am starting to get worried about the confusing of adding a REPACK
>> command.  We already have a lot of confusion around vacuum and analyze:
>>
>> *  autoanalyze does vacuum and analyze
>> *  VACUUM FULL is much different from VACUUM
>>
>> It seems if we add REPACK as a command, it is somewhere between VACUUM
>> FULL and VACUUM in severity/impact.  Should we be rethinking the naming
>> in this area?
>
> SqlServer has similar feature.
> SHRINK

C++ vector has a shrink_to_fit method which seems to serve a similar purpose.

REPACK or REBUILD looks good to me, COMPACT, on the other hand, feels
more specific to the idea of consolidating free space within a page or block.

>
> best regards,
> Ranier Vilela



--
Regards
Junwang Zhao



Re: REPACK and naming

От
Marcos Pegoraro
Дата:
Em qua., 17 de set. de 2025 às 10:17, David G. Johnston <david.g.johnston@gmail.com> escreveu:
That changes logical aspects of a table and so would be done as part of alter table, IMO. “AT tbl Rearrange columns (names list) “

If this command recreates entirely that table, it is not only a logical aspect of that table.
REPACK/RETABLE recreates that table as a replacement for VACUUM FULL/CLUSTER
and ALTER TABLE REARRANGE COLUMNS would recreate that table too ?
and both would have USING INDEX to do what CLUSTER does today ? both would have CONCURRENTLY ?

Being named REPACK or RETABLE or RECREATE TABLE or COMPACT or anything else could do it all.

regards
Marcos

Re: REPACK and naming

От
Álvaro Herrera
Дата:
On 2025-Sep-17, David G. Johnston wrote:

> That changes logical aspects of a table and so would be done as part of
> alter table, IMO. “AT tbl Rearrange columns (names list) “

Yes.

> Concretely, maybe we should remove vacuum full from the vacuum command
> page, and just call it out as compatibility spelling of repack on its
> page.  Maybe do the same for cluster (I haven’t dived into the new feature
> enough to confidently describe all this yet though).

I think we should list VACUUM FULL as deprecated, document that feature
in the REPACK documentation page, and leave VACUUM FULL in working state
so as not to break existing scripts.  Same for CLUSTER.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Those who use electric razors are infidels destined to burn in hell while
we drink from rivers of beer, download free vids and mingle with naked
well shaved babes." (http://slashdot.org/comments.pl?sid=44793&cid=4647152)



Re: REPACK and naming

От
Álvaro Herrera
Дата:
On 2025-Sep-17, Bruce Momjian wrote:

> So the CLUSTER command is removed and people should use REPACK instead? 
> And VACUUM FULL stays unchanged?

No, not removed.  It's going to stay, to avoid breaking scripts.  People
should use REPACK on new code going forward, but existing code is not
going to break.  Same with VACUUM FULL.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Uno puede defenderse de los ataques; contra los elogios se esta indefenso"



Re: REPACK and naming

От
"David G. Johnston"
Дата:
On Wednesday, September 17, 2025, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Sep-17, David G. Johnston wrote:

> Concretely, maybe we should remove vacuum full from the vacuum command
> page, and just call it out as compatibility spelling of repack on its
> page.  Maybe do the same for cluster (I haven’t dived into the new feature
> enough to confidently describe all this yet though).

I think we should list VACUUM FULL as deprecated, document that feature
in the REPACK documentation page, and leave VACUUM FULL in working state
so as not to break existing scripts.  Same for CLUSTER.

I was unclear - this is indeed what I suggesting as well.  Reframe the documentation but leave the commands functioning.

David J.

Re: REPACK and naming

От
Mihail Nikalayeu
Дата:
Ranier Vilela <ranier.vf@gmail.com>:
> SqlServer has similar feature.
> SHRINK

MySQL/MariaDB
OPTIMIZE TABLE table_name

SQL Server
ALTER TABLE table_name REBUILD
DBCC SHRINKFILE
DBCC SHRINKDATABASE

Oracle
ALTER TABLE table_name SHRINK SPACE

SQLite
VACUUM

IBM DB2
REORG TABLE table_name

Sybase ASE
REORG REBUILD table_name

Best regards,
Mikhail.



Re: REPACK and naming

От
Tom Lane
Дата:
=?utf-8?Q?=C3=81lvaro?= Herrera <alvherre@alvh.no-ip.org> writes:
> On 2025-Sep-17, David G. Johnston wrote:
>> Concretely, maybe we should remove vacuum full from the vacuum command
>> page, and just call it out as compatibility spelling of repack on its
>> page.  Maybe do the same for cluster (I haven’t dived into the new feature
>> enough to confidently describe all this yet though).

> I think we should list VACUUM FULL as deprecated, document that feature
> in the REPACK documentation page, and leave VACUUM FULL in working state
> so as not to break existing scripts.  Same for CLUSTER.

I'm not at all in love with documenting VACUUM FULL and CLUSTER as
being fundamentally the same thing.  I think that is an implementation
happenstance that could go away as easily as it appeared.  Even if you
think we'll never again rewrite it for heap, what of other table AMs?
The underlying reality could be totally different for them.

By and large, I don't think I like this renaming proposal.
Maybe eventually it would reduce confusion, but there will be
a long interval where it adds more.

            regards, tom lane



Re: REPACK and naming

От
Robert Haas
Дата:
On Wed, Sep 17, 2025 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> By and large, I don't think I like this renaming proposal.
> Maybe eventually it would reduce confusion, but there will be
> a long interval where it adds more.

I mean, it's PRETTY confusing that VACUUM FULL does something much
more similar to CLUSTER than it is to VACUUM. We can't ever get out
from under that confusion if we don't change something. I think it's
more than fair to bikeshed what the verb should be that describes the
action we currently describe by writing either VACUUM FULL or CLUSTER,
but I agree with Álvaro that having one verb for both of those things
makes a lot more sense than the status quo.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: REPACK and naming

От
Álvaro Herrera
Дата:
On 2025-Sep-17, Tom Lane wrote:

> I'm not at all in love with documenting VACUUM FULL and CLUSTER as
> being fundamentally the same thing.  I think that is an implementation
> happenstance that could go away as easily as it appeared.  Even if you
> think we'll never again rewrite it for heap, what of other table AMs?
> The underlying reality could be totally different for them.

So there two operations here.  One is 
REPACK tab USING INDEX idx
which we currently call CLUSTER, and there is also
REPACK TAB
(no index specified) which we currently call VACUUM FULL.  These have
the very specific charter of rewriting the table while removing bloat,
the distinction being that they keep the rows ordered according to the
index or not.  Both these operations currently use the same
implementation, yes; but if we were to reimplement one of them to use
some completely different piece of code, then the new command name
continues to work, it just calls the new different implementation, while
the other command continues to call the other one.  (Or maybe we decide
reimplement both using different techniques, and we throw away
cluster.c, but still the command names continue to be sensible and would
continue to work.)

Thinking about the other half of your argument: if we add new table AMs
for which the cluster.c implementation doesn't work, then we'll have to
wire the table AM support routines to call some different implementation
into REPACK or REPACK USING INDEX.  This is no different than if we keep
these commands being VACUUM FULL or CLUSTER; we would still need a
different implementation underneath, and we would still need to wire the
table AM support routines to call that different implementation.

So all things considered, I'm not seeing what aspect of the renaming
exactly are you uncomfortable with.  We're not making the situation any
worse.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.



Re: REPACK and naming

От
David Rowley
Дата:
On Thu, 18 Sept 2025 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:
> RETABLE just isn't a word. The code sometimes calls this a REWRITE of
> a table, which would be reasonable.

+1. I was reading this yesterday wondering why "REWRITE" didn't get a
mention.  The problem I have with REPACK is that "re" indicates that
something is being re-done that's been done before. If you're calling
REPACK for the first time on a table, that's not true.

David J's "REBUILD" also seems ok.  In a green field, you could then
have "REBUILD TABLE ..." and "REBUILD INDEX ..."

David



REPACK and naming

От
"David G. Johnston"
Дата:
On Wednesday, September 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 18 Sept 2025 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:
> RETABLE just isn't a word. The code sometimes calls this a REWRITE of
> a table, which would be reasonable.

+1. I was reading this yesterday wondering why "REWRITE" didn't get a
mention.

Agreed.
 

The problem I have with REPACK is that "re" indicates that
something is being re-done that's been done before. If you're calling
REPACK for the first time on a table, that's not true.

As soon as you’ve written the first tuple you’ve begun “packing” the table - repack then is simply unpacking it and putting back the stuff you want to keep in possibly a structured way.

David J's "REBUILD" also seems ok.  In a green field, you could then
have "REBUILD TABLE ..." and "REBUILD INDEX ..."

Rebuild has some prior art apparently, which makes it appealing.  But I’m not a fan of the “shrink” usage the other products seem drawn to.

David J.

Re: REPACK and naming

От
Bruce Momjian
Дата:
On Wed, Sep 17, 2025 at 10:59:55AM -0400, Robert Haas wrote:
> On Wed, Sep 17, 2025 at 10:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > By and large, I don't think I like this renaming proposal.
> > Maybe eventually it would reduce confusion, but there will be
> > a long interval where it adds more.
> 
> I mean, it's PRETTY confusing that VACUUM FULL does something much
> more similar to CLUSTER than it is to VACUUM. We can't ever get out
> from under that confusion if we don't change something. I think it's
> more than fair to bikeshed what the verb should be that describes the
> action we currently describe by writing either VACUUM FULL or CLUSTER,
> but I agree with Álvaro that having one verb for both of those things
> makes a lot more sense than the status quo.

Yeah, I think we are in a bad naming place here with VACUUM FULL and
CLUSTER, and avoiding clarifying it because of a risk of future change
just seems unwise.  We know it is confusing, and there are no plans to
expand VACUUM FULL or CLUSTER, so let's improve the user interface and
take the hit if things ever change in the future.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: REPACK and naming

От
Bruce Momjian
Дата:
On Wed, Sep 17, 2025 at 05:05:38PM -0400, Bruce Momjian wrote:
> On Wed, Sep 17, 2025 at 10:59:55AM -0400, Robert Haas wrote:
> > I mean, it's PRETTY confusing that VACUUM FULL does something much
> > more similar to CLUSTER than it is to VACUUM. We can't ever get out
> > from under that confusion if we don't change something. I think it's
> > more than fair to bikeshed what the verb should be that describes the
> > action we currently describe by writing either VACUUM FULL or CLUSTER,
> > but I agree with Álvaro that having one verb for both of those things
> > makes a lot more sense than the status quo.
> 
> Yeah, I think we are in a bad naming place here with VACUUM FULL and
> CLUSTER, and avoiding clarifying it because of a risk of future change
> just seems unwise.  We know it is confusing, and there are no plans to
> expand VACUUM FULL or CLUSTER, so let's improve the user interface and
> take the hit if things ever change in the future.

Uh, if VACUUM FULL and CLUSTER functionality are going to eventually
point to REPACK, REPACK is the wrong name.  While I can see REPACK
having a similar function to VACUUM FULL, the ordering idea of CLUSTER
just doesn't fit into "repack".  I am repacking in a specific order? 
What if I only want the order changed without repacking?  When did I
pack that I have to re-pack now?

Of the names I have seen, I think REBUILD makes the most sense.  I built
the table --- I want it rebuilt now, and perhaps with a specific
ordering.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: REPACK and naming

От
Vik Fearing
Дата:
On 18/09/2025 16:56, Bruce Momjian wrote:
> On Wed, Sep 17, 2025 at 05:05:38PM -0400, Bruce Momjian wrote:
>> On Wed, Sep 17, 2025 at 10:59:55AM -0400, Robert Haas wrote:
>>> I mean, it's PRETTY confusing that VACUUM FULL does something much
>>> more similar to CLUSTER than it is to VACUUM. We can't ever get out
>>> from under that confusion if we don't change something. I think it's
>>> more than fair to bikeshed what the verb should be that describes the
>>> action we currently describe by writing either VACUUM FULL or CLUSTER,
>>> but I agree with Álvaro that having one verb for both of those things
>>> makes a lot more sense than the status quo.
>> Yeah, I think we are in a bad naming place here with VACUUM FULL and
>> CLUSTER, and avoiding clarifying it because of a risk of future change
>> just seems unwise.  We know it is confusing, and there are no plans to
>> expand VACUUM FULL or CLUSTER, so let's improve the user interface and
>> take the hit if things ever change in the future.
> Uh, if VACUUM FULL and CLUSTER functionality are going to eventually
> point to REPACK, REPACK is the wrong name.  While I can see REPACK
> having a similar function to VACUUM FULL, the ordering idea of CLUSTER
> just doesn't fit into "repack".  I am repacking in a specific order?
> What if I only want the order changed without repacking?  When did I
> pack that I have to re-pack now?
>
> Of the names I have seen, I think REBUILD makes the most sense.  I built
> the table --- I want it rebuilt now, and perhaps with a specific
> ordering.


+1

-- 

Vik Fearing




Re: REPACK and naming

От
Álvaro Herrera
Дата:
On 2025-Sep-18, Bruce Momjian wrote:

> Uh, if VACUUM FULL and CLUSTER functionality are going to eventually
> point to REPACK, REPACK is the wrong name.

I obviously don't agree.

> While I can see REPACK having a similar function to VACUUM FULL, the
> ordering idea of CLUSTER just doesn't fit into "repack".

In my opinion, it fits perfectly well.

> I am repacking in a specific order?  What if I only want the order
> changed without repacking?

Yes, you can ask the repacking process to write the tuples in a specific
order -- that's a perfectly reasonable thing to ask for, and nobody
would be confused about what that would do.  You want the order changed
but not repacking?  I don't see why would anybody think that that's
something to ask for.

> When did I pack that I have to re-pack now?

David Johnston said it here best:
https://postgr.es/m/CAKFQuwYbZ2c7viape0B+TAoa_t8WteNfu+RF8+3i=D1ZQZQFAg@mail.gmail.com
you packed when you started adding tuples to the table.  Repack takes
what you have "packed" and puts it in order again.

> Of the names I have seen, I think REBUILD makes the most sense.  I
> built the table --- I want it rebuilt now, and perhaps with a specific
> ordering.

I frankly don't like REBUILD one bit.  Nobody BUILDs tables either.
One thing we do build is indexes, but a table is not an index.

(I was not very impressed with David Rowley's idea that we could have
REBUILD INDEX as a substitute for REINDEX, if we were inventing these in
a green field.  Two reasons for that: one is that REINDEX is not only a
Postgres-established term, but also an industry-wide established term.
The other is that REINDEX also supports REINDEX TABLE, which doesn't
literally mean to reindex the table but instead of reindex all the
_indexes_ of the table; if we wanted to use REBUILD for that we would
have to do something like "REBUILD INDEXES OF TABLE xyz" which, as you
see, gets unwieldy.  And for what?)

I highly doubt that anybody is going to be confused in the slightest
about what the REPACK command does, or that they would start to think
about when did they PACK anything before.  That sort of argument seems
entirely artificial, and that is not how people think.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132797@pgmasters.net



Re: REPACK and naming

От
Álvaro Herrera
Дата:
On 2025-Sep-17, David G. Johnston wrote:

> On Wednesday, September 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
> 
> > On Thu, 18 Sept 2025 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:
> > > RETABLE just isn't a word. The code sometimes calls this a REWRITE of
> > > a table, which would be reasonable.
> >
> > +1. I was reading this yesterday wondering why "REWRITE" didn't get a
> > mention.
> 
> Agreed.

That's true, but ALTER TABLE also rewrites the table in some cases, so
the words used inside the source code shouldn't really be taken as
guidance for what to call the user-visible features.  Consider for
instance that the code that implements this whole thing lives in
cluster.c, and we're not going to change that file name, at least not
yet.

I'm not particularly excited about REWRITE, as it sounds a little bit
too generic, plus we already have the "query rewriter" which rewrites,
and I think it would be good that we stop overloading terms for
completely different things.

> > The problem I have with REPACK is that "re" indicates that something
> > is being re-done that's been done before. If you're calling REPACK
> > for the first time on a table, that's not true.
>
> As soon as you’ve written the first tuple you’ve begun “packing” the
> table - repack then is simply unpacking it and putting back the stuff
> you want to keep in possibly a structured way.

Yeah, that's how I see it.

> > David J's "REBUILD" also seems ok.  In a green field, you could then
> > have "REBUILD TABLE ..." and "REBUILD INDEX ..."

Yeah, but then we have REINDEX TABLE or REINDEX SYSTEM, so if you wanted
to do all of REINDEX with REPACK or REBUILD, you would need more syntax
than just that.

It had never occurred to me that we could use this new command (whatever
its name) to rebuild indexes without rebuilding the table.  I'm not
proposing that.  Do we actually want to entertain such a proposal?

> Rebuild has some prior art apparently, which makes it appealing.

Can you cite that?  I've seen "ALTER TABLE/INDEX REBUILD", but not
REBUILD as a standalone command.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: REPACK and naming

От
"David G. Johnston"
Дата:
On Thu, Sep 18, 2025 at 3:55 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2025-Sep-17, David G. Johnston wrote:

> On Wednesday, September 17, 2025, David Rowley <dgrowleyml@gmail.com> wrote:
>
> > On Thu, 18 Sept 2025 at 01:09, Robert Haas <robertmhaas@gmail.com> wrote:
> > > RETABLE just isn't a word. The code sometimes calls this a REWRITE of
> > > a table, which would be reasonable.
> >
> > +1. I was reading this yesterday wondering why "REWRITE" didn't get a
> > mention.
>

I'm not particularly excited about REWRITE, as it sounds a little bit
too generic, plus we already have the "query rewriter" which rewrites,
and I think it would be good that we stop overloading terms for
completely different things.

Agreed


> Rebuild has some prior art apparently, which makes it appealing.

Can you cite that?  I've seen "ALTER TABLE/INDEX REBUILD", but not
REBUILD as a standalone command.


I was just skimming the list Mikhail provided.


It isn't standalone but the keyword REBUILD gets used versus our unlisted use of REPACK.

But there isn't enough consistency/similarity for me to want to avoid repack altogether.

David J.

Re: REPACK and naming

От
David Rowley
Дата:
On Thu, 18 Sept 2025 at 03:03, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> So there two operations here.  One is
> REPACK tab USING INDEX idx
> which we currently call CLUSTER, and there is also
> REPACK TAB
> (no index specified) which we currently call VACUUM FULL.

I was just thinking about how much of a heap-ism cluster using an
index is. If we were to ever have an index organised table AM, what
would it mean to REPACK tab USING INDEX idx? Would that "secondary"
index then go away and the table would become that index? or would
both continue to exist and the secondary index would be surplus?

I do understand that heap is well ingrained in our code (still), but
at least things like system catalogue tables/columns can evolve over
time. e.g pg_index.indisclustered I could imagine evolving (or
disappearing) if we had an IOT-AM. I do think locking in syntax is
going to be quite a bit more permanent and needs to be considered very
carefully. Something like REPACK tab ORDER BY col1; seems a bit more
future proof. table_relation_copy_for_cluster() does support both use
of an Index to get presorted results and sorting by the index's key
columns, so it doesn't seem impossible that the ability to cluster a
table *specifically* by an index couldn't easily go away at some
point. Locking us deeper into a syntax for that, I do have concerns
for. But maybe you've thought about all this already and I'm just not
aware...

I'm also trying to keep something like a column store in mind here
where you might not have any indexes, and efficient filtering is done
via the pruning of "chunks", which works by each chunk recording the
min/max (or maybe a dictionary of) values it contains for the columns.
I imagine something like that very much would want the ability to have
something like REPACK tbl ORDER BY col; if you think how efficient
run-length encoding would be for some orders and now inefficient it
could be for other orders.

Anyway, I'm not intentionally trying to make your job here any more
complex. I'm just trying to help make sure we don't end up with some
new syntax that also won't stand up to the test of time.

David



Re: REPACK and naming

От
Álvaro Herrera
Дата:
On 2025-Sep-19, David Rowley wrote:

> I was just thinking about how much of a heap-ism cluster using an
> index is. If we were to ever have an index organised table AM, what
> would it mean to REPACK tab USING INDEX idx? Would that "secondary"
> index then go away and the table would become that index? or would
> both continue to exist and the secondary index would be surplus?

So, there's already an implementation of an index-organized table in
OrioleDB, as I understand, so maybe we can ask Alexander K. about this.
I suspect it's fine to say that if you have a table for which it makes
no sense to use REPACK USING INDEX, then we just throw an error in that
case (but I suppose plain REPACK continues to work, and it just
recreates/compacts the primary index and rebuilds all secondary indexes,
just like VACUUM FULL would presumably do.)

> I do understand that heap is well ingrained in our code (still), but
> at least things like system catalogue tables/columns can evolve over
> time. e.g pg_index.indisclustered I could imagine evolving (or
> disappearing) if we had an IOT-AM. I do think locking in syntax is
> going to be quite a bit more permanent and needs to be considered very
> carefully. Something like REPACK tab ORDER BY col1; seems a bit more
> future proof.

Oh, I think we can implement REPACK tab ORDER BY all right -- do note
that the current syntax has mandatory USING INDEX keywords (unlike
CLUSTER), so we can add that feature and others with no grammar
problems.  In fact even for current heaps it might make sense to allow
an ORDER BY clause for which there's no index.  I don't see us
gratuituously removing the option of specifying just an index name (or
indisclustered), though, because there are likely users that have been
running that for years.

> table_relation_copy_for_cluster() does support both use
> of an Index to get presorted results and sorting by the index's key
> columns, so it doesn't seem impossible that the ability to cluster a
> table *specifically* by an index couldn't easily go away at some
> point.

Well, I hope you mean that clustering by an index would stop being the
_only_ way, not that it would completely disappear as an option.

> Locking us deeper into a syntax for that, I do have concerns for. But
> maybe you've thought about all this already and I'm just not aware...

At this point we're not *implementing* any of that, but it is possible
to do so afterwards and we're not blocking that road.

> I'm also trying to keep something like a column store in mind here
> where you might not have any indexes, and efficient filtering is done
> via the pruning of "chunks", which works by each chunk recording the
> min/max (or maybe a dictionary of) values it contains for the columns.
> I imagine something like that very much would want the ability to have
> something like REPACK tbl ORDER BY col; if you think how efficient
> run-length encoding would be for some orders and now inefficient it
> could be for other orders.

That makes sense, yes, and again, AFAICT it can easily be implemented on
top of the current work.

> Anyway, I'm not intentionally trying to make your job here any more
> complex. I'm just trying to help make sure we don't end up with some
> new syntax that also won't stand up to the test of time.

The time you and others spend on this thread is much appreciated.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"If you want to have good ideas, you must have many ideas.  Most of them
will be wrong, and what you have to learn is which ones to throw away."
                                                         (Linus Pauling)



Re: REPACK and naming

От
David Rowley
Дата:
On Fri, 19 Sept 2025 at 21:36, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2025-Sep-19, David Rowley wrote:
>
> > I was just thinking about how much of a heap-ism cluster using an
> > index is. If we were to ever have an index organised table AM, what
> > would it mean to REPACK tab USING INDEX idx? Would that "secondary"
> > index then go away and the table would become that index? or would
> > both continue to exist and the secondary index would be surplus?
>
> So, there's already an implementation of an index-organized table in
> OrioleDB, as I understand, so maybe we can ask Alexander K. about this.
> I suspect it's fine to say that if you have a table for which it makes
> no sense to use REPACK USING INDEX, then we just throw an error in that
> case (but I suppose plain REPACK continues to work, and it just
> recreates/compacts the primary index and rebuilds all secondary indexes,
> just like VACUUM FULL would presumably do.)

What I was wondering was if REPACK tab ORDER BY col; would be the way
to change the IOT's order. Otherwise, I'm not sure how you'd make that
work aside from providing something like: SELECT
pg_recluster_iot_by(....)

> > table_relation_copy_for_cluster() does support both use
> > of an Index to get presorted results and sorting by the index's key
> > columns, so it doesn't seem impossible that the ability to cluster a
> > table *specifically* by an index couldn't easily go away at some
> > point.
>
> Well, I hope you mean that clustering by an index would stop being the
> _only_ way, not that it would completely disappear as an option.

I mean support clustered orders that don't exist in any defined index.

David



Re: REPACK and naming

От
Antonin Houska
Дата:
Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> On 2025-Sep-19, David Rowley wrote:
>
> > I was just thinking about how much of a heap-ism cluster using an
> > index is. If we were to ever have an index organised table AM, what
> > would it mean to REPACK tab USING INDEX idx? Would that "secondary"
> > index then go away and the table would become that index? or would
> > both continue to exist and the secondary index would be surplus?
>
> So, there's already an implementation of an index-organized table in
> OrioleDB, as I understand, so maybe we can ask Alexander K. about this.
> I suspect it's fine to say that if you have a table for which it makes
> no sense to use REPACK USING INDEX, then we just throw an error in that
> case (but I suppose plain REPACK continues to work, and it just
> recreates/compacts the primary index and rebuilds all secondary indexes,
> just like VACUUM FULL would presumably do.)
>
> > I do understand that heap is well ingrained in our code (still), but
> > at least things like system catalogue tables/columns can evolve over
> > time. e.g pg_index.indisclustered I could imagine evolving (or
> > disappearing) if we had an IOT-AM. I do think locking in syntax is
> > going to be quite a bit more permanent and needs to be considered very
> > carefully. Something like REPACK tab ORDER BY col1; seems a bit more
> > future proof.
>
> Oh, I think we can implement REPACK tab ORDER BY all right -- do note
> that the current syntax has mandatory USING INDEX keywords (unlike
> CLUSTER), so we can add that feature and others with no grammar
> problems.  In fact even for current heaps it might make sense to allow
> an ORDER BY clause for which there's no index.  I don't see us
> gratuituously removing the option of specifying just an index name (or
> indisclustered), though, because there are likely users that have been
> running that for years.
>
> > table_relation_copy_for_cluster() does support both use
> > of an Index to get presorted results and sorting by the index's key
> > columns, so it doesn't seem impossible that the ability to cluster a
> > table *specifically* by an index couldn't easily go away at some
> > point.
>
> Well, I hope you mean that clustering by an index would stop being the
> _only_ way, not that it would completely disappear as an option.
>
> > Locking us deeper into a syntax for that, I do have concerns for. But
> > maybe you've thought about all this already and I'm just not aware...
>
> At this point we're not *implementing* any of that, but it is possible
> to do so afterwards and we're not blocking that road.
>
> > I'm also trying to keep something like a column store in mind here
> > where you might not have any indexes, and efficient filtering is done
> > via the pruning of "chunks", which works by each chunk recording the
> > min/max (or maybe a dictionary of) values it contains for the columns.
> > I imagine something like that very much would want the ability to have
> > something like REPACK tbl ORDER BY col; if you think how efficient
> > run-length encoding would be for some orders and now inefficient it
> > could be for other orders.
>
> That makes sense, yes, and again, AFAICT it can easily be implemented on
> top of the current work.

Admittedly I haven't thought about clause like ORDER BY yet, but I wonder if
it'd really be useful. My understanding is that the purpose of clustering is
to make index scan more efficient: with a clustered table, the heap tuples
pertaining to given index tuple should be located on the same page, so the
heap access is not that random.

If IOT-AM table does not have anything like index, I imagine it has some kind
of ordering information in the system catalog. Without that the query planner
can hardly utilize the ordering. In such case REPACK should use the catalog
information on ordering rather than accept arbitrary ORDER BY clause.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: REPACK and naming

От
David Rowley
Дата:
On Fri, 19 Sept 2025 at 23:58, Antonin Houska <ah@cybertec.at> wrote:
> Admittedly I haven't thought about clause like ORDER BY yet, but I wonder if
> it'd really be useful. My understanding is that the purpose of clustering is
> to make index scan more efficient: with a clustered table, the heap tuples
> pertaining to given index tuple should be located on the same page, so the
> heap access is not that random.

I imagine that's true most of the time, but it could also be so that
fewer pages are dirtied when an UPDATE updates a set or rows with the
same or similar clustered column values.

> If IOT-AM table does not have anything like index, I imagine it has some kind
> of ordering information in the system catalog. Without that the query planner
> can hardly utilize the ordering. In such case REPACK should use the catalog
> information on ordering rather than accept arbitrary ORDER BY clause.

Well, it would be impossible to insert records without some metadata
to indicate the IOT keys...

You might assume that someone might change their mind one day about
the chosen order and wish to change it. My point was about leaving the
door open to support that by having some native syntax that could be
used to trigger that change.

David



Re: REPACK and naming

От
Antonin Houska
Дата:
David Rowley <dgrowleyml@gmail.com> wrote:

> On Fri, 19 Sept 2025 at 23:58, Antonin Houska <ah@cybertec.at> wrote:
> > Admittedly I haven't thought about clause like ORDER BY yet, but I wonder if
> > it'd really be useful. My understanding is that the purpose of clustering is
> > to make index scan more efficient: with a clustered table, the heap tuples
> > pertaining to given index tuple should be located on the same page, so the
> > heap access is not that random.
>
> I imagine that's true most of the time, but it could also be so that
> fewer pages are dirtied when an UPDATE updates a set or rows with the
> same or similar clustered column values.

Good point.

> > If IOT-AM table does not have anything like index, I imagine it has some kind
> > of ordering information in the system catalog. Without that the query planner
> > can hardly utilize the ordering. In such case REPACK should use the catalog
> > information on ordering rather than accept arbitrary ORDER BY clause.
>
> Well, it would be impossible to insert records without some metadata
> to indicate the IOT keys...
>
> You might assume that someone might change their mind one day about
> the chosen order and wish to change it. My point was about leaving the
> door open to support that by having some native syntax that could be
> used to trigger that change.

I doubted whether the current AM API is designed to do catalog changes, but
then recalled that CLUSTER does set pg_index.indisclustered, and that it does
so outside table_relation_copy_for_cluster(). So I can now imagine that REPACK
... ORDER BY can do something like that.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com



Re: REPACK and naming

От
Álvaro Herrera
Дата:
On 2025-Sep-19, David Rowley wrote:

> On Fri, 19 Sept 2025 at 21:36, Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > So, there's already an implementation of an index-organized table in
> > OrioleDB, as I understand, so maybe we can ask Alexander K. about this.
> > I suspect it's fine to say that if you have a table for which it makes
> > no sense to use REPACK USING INDEX, then we just throw an error in that
> > case (but I suppose plain REPACK continues to work, and it just
> > recreates/compacts the primary index and rebuilds all secondary indexes,
> > just like VACUUM FULL would presumably do.)
> 
> What I was wondering was if REPACK tab ORDER BY col; would be the way
> to change the IOT's order.

Sure -- the IOT is the "primary index", so what you are saying is, I
think, the same as what I was saying.

> > Well, I hope you mean that clustering by an index would stop being the
> > _only_ way, not that it would completely disappear as an option.
> 
> I mean support clustered orders that don't exist in any defined index.

Yes: I mean, I imagine that an implementation of
  REPACK tab ORDER BY col2
(on normal heap tables) when there's no index on col2 would just use a
seqscan-and-sort to put the tably in that physical order, and
potentially store somewhere that "col2" is the order that was used.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Selbst das größte Genie würde nicht weit kommen, wenn es
alles seinem eigenen Innern verdanken wollte." (Johann Wolfgang von Goethe)
               Ni aún el genio más grande llegaría muy lejos si
                    quisiera sacarlo todo de su propio interior.



Re: REPACK and naming

От
Álvaro Herrera
Дата:
On 2025-Sep-19, Antonin Houska wrote:

> Admittedly I haven't thought about clause like ORDER BY yet, but I wonder if
> it'd really be useful. My understanding is that the purpose of clustering is
> to make index scan more efficient:

Not necessarily.  For some queries in some workloads, having tuples in a
certain order for a seqscan might give considerable performance benefit
also.  Moreso with, say, BRIN indexes, where having one tuple in one
page range or another could mean having to scan that page range or
eliding it completely.

> with a clustered table, the heap tuples
> pertaining to given index tuple should be located on the same page, so the
> heap access is not that random.

Yes, I suppose this is the first-order reason, and probably why we
currently only support basing clustering on an index.  But I doubt it's
the only one.  (It's also worth pointing out that quite possibly having
REPACK CONCURRENTLY is going to make clustering a lot more popular;
without concurrency, clustering is practically useless.)

> If IOT-AM table does not have anything like index, I imagine it has some kind
> of ordering information in the system catalog. Without that the query planner
> can hardly utilize the ordering.

Sure.

> In such case REPACK should use the catalog information on ordering
> rather than accept arbitrary ORDER BY clause.

... but, as David said, it might be valuable to change that ordering for
whatever reason.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: REPACK and naming

От
Andrew Dunstan
Дата:


On 2025-09-18 Th 6:53 PM, Álvaro Herrera wrote:
I highly doubt that anybody is going to be confused in the slightest
about what the REPACK command does, or that they would start to think
about when did they PACK anything before.  That sort of argument seems
entirely artificial, and that is not how people think.



I agree.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com