Обсуждение: pg_upgrade failures with large partition definitions on upgrades from ~13 to 14~

Поиск
Список
Период
Сортировка

pg_upgrade failures with large partition definitions on upgrades from ~13 to 14~

От
Michael Paquier
Дата:
Hi all,

The following SQL sequence causes a failure of pg_upgrade when these
are executed on a cluster of ~13, doing an upgrade to 14~, assuming
that the relation page size is 8kB.  This creates a partition table
with a set of values large enough that it can be created in ~13:
CREATE TABLE parent_list (id int) PARTITION BY LIST (id);
CREATE OR REPLACE FUNCTION create_long_list(tabname text,
  tabparent text,
  num_vals int)
RETURNS VOID AS
$func$
DECLARE
  query text;
BEGIN
  query := 'CREATE TABLE ' || tabname ||
           ' PARTITION OF ' || tabparent || ' FOR VALUES IN (';
  FOR i IN 1..num_vals LOOP
    query := query || i;
    IF i != num_vals THEN
      query := query || ', ';
    END IF;
  END LOOP;
  query := query || ')';
  EXECUTE format(query);
END
$func$ LANGUAGE plpgsql;
-- Large enough to trigger pg_class failure in 14~
-- Use 953 to make it work in 14~
SELECT create_long_list('child_list_2', 'parent_list', 956);

However, pg_upgrade fails in the middle of processing when restoring
the objects in the new cluster, with the same error as one would get
because the row is too big and we have no toast tables in pg_class:
pg_restore: error: could not execute query: ERROR:  row is too big:
size 8184, maximum size 8160
Command was: ALTER TABLE ONLY "public"."parent_list" ATTACH PARTITION

Then, as of pg_upgrade_internal.log:
Restoring database schemas in the new cluster
*failure*
Consult the last few lines of "pg_upgrade_dump_13468.log" for
the probable cause of the failure.

No fields have been added to pg_class between 13 and 14, however the
amount of data stored in relpartbound got larger between these two
versions (just do a length() on it for example using what I posted
above).  Hence, if the original cluster has a version of pg_class
large enough to just fit into a single page without the need of
toasting, it may fail when created in the new cluster because it lacks
space to fit on a page because of this extra partition bound data.

In such cases, the usual recommendation would be to adjust the
partition layer so as the schema has smaller pg_node_trees for the
partition bounds.  Still, waiting for something to blow up in the
middle of pg_upgrade is very unfriendly, and a pg_upgrade --link would
mean the need to rollback to a previous snapshot, which can be
costly.

Adding a toast table to pg_class or even pg_attribute (because this
could also happen with a bunch of attribute-level ACLs) has been
proposed for some time, though there have always been concerns about
circling dependencies back to pg_class.  More toasting or a split of
relpartbound into a separate catalog (with toast in it) would solve
this issue at its root, but that's not something that would happen in
14~15 anyway.

Shouldn't we have a safeguard of some kind in the pre-check phase of
pg_upgrade at least?  I think that this comes down to checking
sum(pg_column_size(pg_class.*)), roughly, with alignment and page
header, and do the same for pg_attribute.

Thoughts?
--
Michael

Вложения

Re: pg_upgrade failures with large partition definitions on upgrades from ~13 to 14~

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> The following SQL sequence causes a failure of pg_upgrade when these
> are executed on a cluster of ~13, doing an upgrade to 14~, assuming
> that the relation page size is 8kB.
> ...
> No fields have been added to pg_class between 13 and 14, however the
> amount of data stored in relpartbound got larger between these two
> versions (just do a length() on it for example using what I posted
> above).  Hence, if the original cluster has a version of pg_class
> large enough to just fit into a single page without the need of
> toasting, it may fail when created in the new cluster because it lacks
> space to fit on a page because of this extra partition bound data.

Bleah.

> Shouldn't we have a safeguard of some kind in the pre-check phase of
> pg_upgrade at least?  I think that this comes down to checking
> sum(pg_column_size(pg_class.*)), roughly, with alignment and page
> header, and do the same for pg_attribute.

It might be worth expending a pre-check on, if only because the
check could offer some advice about fixing the problem.  But it
seems like quite a corner case --- what are the odds of hitting
this?

            regards, tom lane



Re: pg_upgrade failures with large partition definitions on upgrades from ~13 to 14~

От
Michael Paquier
Дата:
On Thu, Feb 09, 2023 at 12:33:06AM -0500, Tom Lane wrote:
> It might be worth expending a pre-check on, if only because the
> check could offer some advice about fixing the problem.

Based on the information coming from pg_class, yes, something could be
reported back.  Now things get more hairy if the oversized tuple has a
mix of long ACLs and a long partition bound.

> But it seems like quite a corner case --- what are the odds of
> hitting this?

Low, I guess, as you need a tuple small enough that it fits right into
a page in 13~, but large enough to hit the upper-bound on insert
because of the extra overhead of relpartbound (something like 20B, at
short glance, in my case).  Well, this would not be an issue if there
were more toasting done.  I agree that schemas with such long
definitions point out to deficiencies usually, but the user experience
is bad when once would expect an upgrade with no hiccups, then fails
on this stuff, delaying an upgrade longer because the instance
requires a rollback.
--
Michael

Вложения