Discussion: Test instability when pg_dump orders by OID
A 002_pg_upgrade.pl run got swapped order of tags "notnull_tbl1_upg nn" and "notnull_parent_upg nn" for the schema diff test that commit 172259afb563d35001410dc6daad78b250924038 added in v18:

@@ -436873,14 +436873,14 @@
 ALTER TABLE public.insert_tbl
     ADD CONSTRAINT ne_insert_tbl_con CHECK (((x + z) = 1)) NOT ENFORCED;
 --
--- Name: notnull_tbl1_upg nn; Type: CONSTRAINT; Schema: public; Owner: nm
+-- Name: notnull_parent_upg nn; Type: CONSTRAINT; Schema: public; Owner: nm
 --
-ALTER TABLE public.notnull_tbl1_upg
+ALTER TABLE public.notnull_parent_upg
     ADD CONSTRAINT nn NOT NULL a NOT VALID;
 --
--- Name: notnull_parent_upg nn; Type: CONSTRAINT; Schema: public; Owner: nm
+-- Name: notnull_tbl1_upg nn; Type: CONSTRAINT; Schema: public; Owner: nm
 --
-ALTER TABLE public.notnull_parent_upg
+ALTER TABLE public.notnull_tbl1_upg

pg_dump uses pg_constraint.oid as one sort key, and "pg_restore -j" opens the door for OID assignment order to vary. The first attached patch demonstrates this by simulation. It yields that diff and some operator order diffs.

Let's get rid of pg_dump's need to sort by OID, apart from catalog corruption scenarios. Adding an assert found a total of seven affected object types. See the second attached patch. The drawback is storing five more fields in pg_dump memory: oprleft, oprright, opcmethod, opfmethod, and collencoding. That seems minor relative to existing pg_dump memory efficiency. Since this is a source of test flakes in v18, I'd like to back-patch to v18. I'm not sure why the buildfarm hasn't seen the above diff, but I expect the diff could happen there. This is another nice win for the new test from commit 172259afb563d35001410dc6daad78b250924038. The order instability was always bad for users, but the test brought it to the forefront. One might argue for back-patching $SUBJECT further, too.

Thanks,
nm
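The tie behind that diff boils down to something like the following minimal sketch (illustrative only, reusing the table names from the diff rather than the actual regression-test setup, and assuming a v18 server since the named NOT NULL ... NOT VALID constraint syntax shown above is new in v18): two constraints that share the name "nn", which pg_dump could previously order only by pg_constraint.oid.

-- Illustrative sketch, not the regression-test objects: both constraints
-- tie on (namespace, name), so only pg_constraint.oid was left to decide
-- their dump order before this change.
CREATE TABLE notnull_parent_upg (a int);
CREATE TABLE notnull_tbl1_upg (a int);
ALTER TABLE notnull_parent_upg ADD CONSTRAINT nn NOT NULL a NOT VALID;
ALTER TABLE notnull_tbl1_upg ADD CONSTRAINT nn NOT NULL a NOT VALID;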
Attachments
On Mon, Jul 7, 2025 at 3:27 PM Noah Misch <noah@leadboat.com> wrote:
> Let's get rid of pg_dump's need to sort by OID, apart from catalog corruption
> scenarios.

+1. I had at one point believed that sorting by OID was a good way to make dumps stable, but this disproves that theory. Sorting by logical properties of the object is better.

> Adding an assert found a total of seven affected object types.
> See the second attached patch. The drawback is storing five more fields in
> pg_dump memory: oprleft, oprright, opcmethod, opfmethod, and collencoding.
> That seems minor relative to existing pg_dump memory efficiency. Since this
> is a source of test flakes in v18, I'd like to back-patch to v18. I'm not
> sure why the buildfarm hasn't seen the above diff, but I expect the diff could
> happen there. This is another nice win for the new test from commit
> 172259afb563d35001410dc6daad78b250924038. The order instability was always
> bad for users, but the test brought it to the forefront. One might argue for
> back-patching $SUBJECT further, too.

I agree with back-patching it at least as far as v18. I think it probably wouldn't hurt anything to back-patch further, and it might avoid future buildfarm failures. Against that, there's a remote possibility that someone who is currently saving pg_dump output for later comparison, say in a case where OIDs are always stable in practice, could be displeased to see the pg_dump order change in a minor release. But that seems like a very weak argument against back-patching. I can't see us ever deciding to put up with buildfarm instability on such grounds.

Reviewing:

+ * Sort by name. This differs from "Name:" in plain format output, which
+ * is a _tocEntry.tag. For example, DumpableObject.name of a constraint
+ * is pg_constraint.conname, but _tocEntry.tag of a constraint is relname
+ * and conname joined with a space.

This comment is useful, but if I were to be critical, it does a better job saying what this field isn't than what it is.

+ * Sort by encoding, per pg_collation_name_enc_nsp_index. This is
+ * mostly academic, because CREATE COLLATION has restrictions to make
+ * (nspname, collname) uniquely identify a collation within a given
+ * DatabaseEncoding. pg_import_system_collations() bypasses those
+ * restrictions, but pg_dump+restore fails after a
+ * pg_import_system_collations('my_schema') that creates collations
+ * for a blend of encodings.

This comment is also useful, but if I were to be critical again, it does a better job saying why we shouldn't do what the code then does than why we should.

Neither of those issues seems like a must-fix problem to me.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jul 17, 2025 at 09:24:02AM -0400, Robert Haas wrote:
> On Mon, Jul 7, 2025 at 3:27 PM Noah Misch <noah@leadboat.com> wrote:
> > Let's get rid of pg_dump's need to sort by OID, apart from catalog corruption
> > scenarios.
>
> +1. I had at one point believed that sorting by OID was a good way to
> make dumps stable, but this disproves that theory. Sorting by logical
> properties of the object is better.

Sorting by OID was a reasonable approximation, for its time.

> > Adding an assert found a total of seven affected object types.
> > See the second attached patch. The drawback is storing five more fields in
> > pg_dump memory: oprleft, oprright, opcmethod, opfmethod, and collencoding.
> > That seems minor relative to existing pg_dump memory efficiency. Since this
> > is a source of test flakes in v18, I'd like to back-patch to v18. I'm not
> > sure why the buildfarm hasn't seen the above diff, but I expect the diff could
> > happen there. This is another nice win for the new test from commit
> > 172259afb563d35001410dc6daad78b250924038. The order instability was always
> > bad for users, but the test brought it to the forefront. One might argue for
> > back-patching $SUBJECT further, too.
>
> I agree with back-patching it at least as far as v18. I think it
> probably wouldn't hurt anything to back-patch further, and it might
> avoid future buildfarm failures. Against that, there's a remote
> possibility that someone who is currently saving pg_dump output for
> later comparison, say in a case where OIDs are always stable in
> practice, could be displeased to see the pg_dump order change in a
> minor release. But that seems like a very weak argument against
> back-patching. I can't see us ever deciding to put up with buildfarm
> instability on such grounds.

Thanks for reviewing. I agree with those merits of back-patching further. A back-patch to v13 has no pg_dump_sort.c conflicts, while pg_dump.c has mechanical conflicts around retrieving the extra sort inputs. If there are no objections in the next 72h, I'll likely back-patch.

> Reviewing:
>
> + * Sort by name. This differs from "Name:" in plain format output, which
> + * is a _tocEntry.tag. For example, DumpableObject.name of a constraint
> + * is pg_constraint.conname, but _tocEntry.tag of a constraint is relname
> + * and conname joined with a space.
>
> This comment is useful, but if I were to be critical, it does a better
> job saying what this field isn't than what it is.

True. I've changed it to this:

 * Sort by name. With a few exceptions, names here are single catalog
 * columns. To get a fuller picture, grep pg_dump.c for "dobj.name = ".
 * Names here don't match "Name:" in plain format output, which is a
 * _tocEntry.tag. For example, DumpableObject.name of a constraint is
 * pg_constraint.conname, but _tocEntry.tag of a constraint is relname and
 * conname joined with a space.

The patch's original change to the comment was a reaction to my own surprise. Reading "pg_dump regression|grep Name:|sort|grep CONSTRAINT" I saw unique "Name:" output for constraints, which felt at odds with the instability in DOTypeNameCompare() sorting them. But it wasn't the name I was looking for:

- getConstraints() sets DumpableObject.name = conname
- DOTypeNameCompare() sorts by DumpableObject.name
- dumpConstraint() sets _tocEntry.tag = "relname conname"
- _tocEntry.tag becomes the "Name:" in pg_dump output

Long-term, in a post-scarcity world, I'd do one of these or similar:

a. Change what we store in DumpableObject.name so it matches _tocEntry.tag.
b. Rename field DumpableObject.name, so there's no longer a field called "name" with contents different from the "Name:" values in pg_dump output.

> + * Sort by encoding, per pg_collation_name_enc_nsp_index. This is
> + * mostly academic, because CREATE COLLATION has restrictions to make
> + * (nspname, collname) uniquely identify a collation within a given
> + * DatabaseEncoding. pg_import_system_collations() bypasses those
> + * restrictions, but pg_dump+restore fails after a
> + * pg_import_system_collations('my_schema') that creates collations
> + * for a blend of encodings.
>
> This comment is also useful, but if I were to be critical again, it
> does a better job saying why we shouldn't do what the code then does
> than why we should.

I've tried the following further refinement. If it's worse, I can go back to the last version.

diff --git a/src/bin/pg_dump/pg_dump_sort.c b/src/bin/pg_dump/pg_dump_sort.c
index ffae7b3..f7d6a03 100644
--- a/src/bin/pg_dump/pg_dump_sort.c
+++ b/src/bin/pg_dump/pg_dump_sort.c
@@ -248,7 +250,19 @@ DOTypeNameCompare(const void *p1, const void *p2)
 	if (cmpval != 0)
 		return cmpval;
 
-	/* To have a stable sort order, break ties for some object types */
+	/*
+	 * To have a stable sort order, break ties for some object types. Most
+	 * catalogs have a natural key, e.g. pg_proc_proname_args_nsp_index.
+	 * Where the above "namespace" and "name" comparisons don't cover all
+	 * natural key columns, compare the rest here.
+	 *
+	 * The natural key usually refers to other catalogs by surrogate keys.
+	 * Hence, this translates each of those references to the natural key of
+	 * the referenced catalog. That may descend through multiple levels of
+	 * catalog references. For example, to sort by pg_proc.proargtypes,
+	 * descend to each pg_type and then further to its pg_namespace, for an
+	 * overall sort by (nspname, typname).
+	 */
 	if (obj1->objType == DO_FUNC || obj1->objType == DO_AGG)
 	{
 		FuncInfo *fobj1 = *(FuncInfo *const *) p1;
@@ -312,13 +326,16 @@ DOTypeNameCompare(const void *p1, const void *p2)
 		CollInfo *cobj2 = *(CollInfo *const *) p2;
 
 		/*
-		 * Sort by encoding, per pg_collation_name_enc_nsp_index. This is
-		 * mostly academic, because CREATE COLLATION has restrictions to make
-		 * (nspname, collname) uniquely identify a collation within a given
-		 * DatabaseEncoding. pg_import_system_collations() bypasses those
-		 * restrictions, but pg_dump+restore fails after a
-		 * pg_import_system_collations('my_schema') that creates collations
-		 * for a blend of encodings.
+		 * Sort by encoding, per pg_collation_name_enc_nsp_index. Wherever
+		 * this changes dump order, restoring the dump fails anyway. CREATE
+		 * COLLATION can't create a tie for this to break, because it imposes
+		 * restrictions to make (nspname, collname) uniquely identify a
+		 * collation within a given DatabaseEncoding. While
+		 * pg_import_system_collations() can create a tie, pg_dump+restore
+		 * fails after pg_import_system_collations('my_schema') does so.
+		 * There's little to gain by ignoring one natural key column on the
+		 * basis of those limitations elsewhere, so respect the full natural
+		 * key like we do for other object types.
 		 */
 		cmpval = cobj1->collencoding - cobj2->collencoding;
 		if (cmpval != 0)
On Fri, Jul 18, 2025 at 3:17 PM Noah Misch <noah@leadboat.com> wrote:
> > This comment is useful, but if I were to be critical, it does a better
> > job saying what this field isn't than what it is.
>
> True. I've changed it to this:

That looks great.

> -	/* To have a stable sort order, break ties for some object types */
> +	/*
> +	 * To have a stable sort order, break ties for some object types. Most
> +	 * catalogs have a natural key, e.g. pg_proc_proname_args_nsp_index.
> +	 * Where the above "namespace" and "name" comparisons don't cover all
> +	 * natural key columns, compare the rest here.
> +	 *
> +	 * The natural key usually refers to other catalogs by surrogate keys.
> +	 * Hence, this translates each of those references to the natural key of
> +	 * the referenced catalog. That may descend through multiple levels of
> +	 * catalog references. For example, to sort by pg_proc.proargtypes,
> +	 * descend to each pg_type and then further to its pg_namespace, for an
> +	 * overall sort by (nspname, typname).
> +	 */

I really like this.

> +	 * Sort by encoding, per pg_collation_name_enc_nsp_index. Wherever
> +	 * this changes dump order, restoring the dump fails anyway. CREATE
> +	 * COLLATION can't create a tie for this to break, because it imposes
> +	 * restrictions to make (nspname, collname) uniquely identify a
> +	 * collation within a given DatabaseEncoding. While
> +	 * pg_import_system_collations() can create a tie, pg_dump+restore
> +	 * fails after pg_import_system_collations('my_schema') does so.
> +	 * There's little to gain by ignoring one natural key column on the
> +	 * basis of those limitations elsewhere, so respect the full natural
> +	 * key like we do for other object types.

This is also good. I suggest s/Wherever/Technically, this is not necessary, because wherever/ and s/There's/However, there's/.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Jul 21, 2025 at 09:40:02AM -0400, Robert Haas wrote:
> On Fri, Jul 18, 2025 at 3:17 PM Noah Misch <noah@leadboat.com> wrote:
> > +	 * Sort by encoding, per pg_collation_name_enc_nsp_index. Wherever
> > +	 * this changes dump order, restoring the dump fails anyway. CREATE
> > +	 * COLLATION can't create a tie for this to break, because it imposes
> > +	 * restrictions to make (nspname, collname) uniquely identify a
> > +	 * collation within a given DatabaseEncoding. While
> > +	 * pg_import_system_collations() can create a tie, pg_dump+restore
> > +	 * fails after pg_import_system_collations('my_schema') does so.
> > +	 * There's little to gain by ignoring one natural key column on the
> > +	 * basis of those limitations elsewhere, so respect the full natural
> > +	 * key like we do for other object types.
>
> This is also good. I suggest s/Wherever/Technically, this is not
> necessary, because wherever/ and s/There's/However, there's/.

I used that.

I started to prepare the back-branch versions, but that revealed three problems affecting the master patch:

(1) Sorting constraints segfaulted if either of a pair of equal-name constraints was a domain constraint. Fortunately, commit da71717 added a test case for that between when I mailed patch v1 and when I went to commit. One can reproduce it by dumping a database containing:

  CREATE DOMAIN d1 AS int CONSTRAINT dc CHECK (value > 0);
  CREATE DOMAIN d2 AS int CONSTRAINT dc CHECK (value > 0);

I made pg_dump sort domain constraints of a given name before table constraints of that name, for consistency with our decision to sort CREATE DOMAIN before CREATE TABLE. The main alternative was to sort by parent object name irrespective of parent object type, i.e. DOMAIN a < TABLE b < DOMAIN c. That alternative lacked a relevant precedent. I've now audited the natural keys of catalogs for which I'm changing sort order, and I think that was the only one I missed.

(2) Sorting opclasses failed a new assertion when dumping a v9.2 source (and likely 9.[345]), because getAccessMethods() doesn't read pg_am when dumping from a version predating CREATE ACCESS METHOD. findAccessMethodByOid() found no access methods, since pg_dump had read none. I've changed getAccessMethods() to always read pg_am. (For pre-v9.6 sources, I've kept the function's behavior of never marking an access method for dumping.) pg_am is small enough for this read to incur negligible cost. The main alternative was, for pre-v9.6, sorting access methods by pg_am.oid. That would have been less code, but dump order would have differed between pre-v9.6 and v9.6+.

(3) pgTypeNameCompare() implied postfix operators don't exist, but master pg_dump will support reading from pre-v14 clusters for several more years. The code behaved fine, but I've updated the comment.

I regret missing those in v1. I've attached v2, including branch-specific patches. I'll first need to back-patch 350e6b8, which fixed sorting of CREATE RULE, to v17 and earlier. Since 350e6b8 is conflict-free all the way back to v13, I'm not attaching it.

Thanks,
nm
Attachments
On Thu, Jul 24, 2025 at 10:27 PM Noah Misch <noah@leadboat.com> wrote:
> I regret missing those in v1. I've attached v2, including branch-specific
> patches. I'll first need to back-patch 350e6b8, which fixed sorting of CREATE
> RULE, to v17 and earlier. Since 350e6b8 is conflict-free all the way back to
> v13, I'm not attaching it.

Back-patching 350e6b8 to facilitate back-patching this seems OK. I did a read-through of dobjcomp20-disambiguate-v2.patch and have no further review comments. I did not read through the back-patched versions on the assumption that back-porting was straightforward enough that a separate review was not required.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hello Noah,
07.07.2025 22:26, Noah Misch wrote:
A 002_pg_upgrade.pl run got swapped order of tags "notnull_tbl1_upg nn" and "notnull_parent_upg nn" for the schema diff test that commit 172259afb563d35001410dc6daad78b250924038 added in v18:

@@ -436873,14 +436873,14 @@
 ALTER TABLE public.insert_tbl
     ADD CONSTRAINT ne_insert_tbl_con CHECK (((x + z) = 1)) NOT ENFORCED;
 --
--- Name: notnull_tbl1_upg nn; Type: CONSTRAINT; Schema: public; Owner: nm
+-- Name: notnull_parent_upg nn; Type: CONSTRAINT; Schema: public; Owner: nm
 --
-ALTER TABLE public.notnull_tbl1_upg
+ALTER TABLE public.notnull_parent_upg
     ADD CONSTRAINT nn NOT NULL a NOT VALID;
 --
--- Name: notnull_parent_upg nn; Type: CONSTRAINT; Schema: public; Owner: nm
+-- Name: notnull_tbl1_upg nn; Type: CONSTRAINT; Schema: public; Owner: nm
 --
-ALTER TABLE public.notnull_parent_upg
+ALTER TABLE public.notnull_tbl1_upg
It's rather funny that a few days before the fix is going to be pushed,
hippopotamus proved the need for it [1] (I saw no such failures on the
buildfarm before):
...
[17:09:56.372](2.577s) not ok 8 - dump outputs from original and restored regression databases match
[17:09:56.372](0.000s)
[17:09:56.372](0.000s) # Failed test 'dump outputs from original and restored regression databases match'
[17:09:56.372](0.000s) # at /home/buildfarm/hippopotamus/buildroot/REL_18_STABLE/pgsql.build/src/bin/pg_upgrade/../../../src/test/perl/PostgreSQL/Test/Utils.pm line 800.
[17:09:56.373](0.000s) # got: '1'
# expected: '0'
=== diff of /home/buildfarm/hippopotamus/buildroot/REL_18_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_v61D/src_dump.sql_adjusted and /home/buildfarm/hippopotamus/buildroot/REL_18_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_v61D/dest_dump.sql_adjusted
=== stdout ===
--- /home/buildfarm/hippopotamus/buildroot/REL_18_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_v61D/src_dump.sql_adjusted 2025-07-28 17:09:55.040029896 +0200
+++ /home/buildfarm/hippopotamus/buildroot/REL_18_STABLE/pgsql.build/src/bin/pg_upgrade/tmp_check/tmp_test_v61D/dest_dump.sql_adjusted 2025-07-28 17:09:56.208057237 +0200
@@ -436960,14 +436960,14 @@
ALTER TABLE public.insert_tbl
ADD CONSTRAINT ne_insert_tbl_con CHECK (((x + z) = 1)) NOT ENFORCED;
--
--- Name: notnull_tbl1_upg nn; Type: CONSTRAINT; Schema: public; Owner: buildfarm
+-- Name: notnull_parent_upg nn; Type: CONSTRAINT; Schema: public; Owner: buildfarm
--
-ALTER TABLE public.notnull_tbl1_upg
+ALTER TABLE public.notnull_parent_upg
ADD CONSTRAINT nn NOT NULL a NOT VALID;
--
--- Name: notnull_parent_upg nn; Type: CONSTRAINT; Schema: public; Owner: buildfarm
+-- Name: notnull_tbl1_upg nn; Type: CONSTRAINT; Schema: public; Owner: buildfarm
--
-ALTER TABLE public.notnull_parent_upg
+ALTER TABLE public.notnull_tbl1_upg
ADD CONSTRAINT nn NOT NULL a NOT VALID;
--
-- Name: notnul
Thank you for working on the fix!
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hippopotamus&dt=2025-07-28%2015%3A05%3A11
Best regards,
Alexander
On Fri, Jul 25, 2025 at 02:01:01PM -0400, Robert Haas wrote:
> On Thu, Jul 24, 2025 at 10:27 PM Noah Misch <noah@leadboat.com> wrote:
> > I regret missing those in v1. I've attached v2, including branch-specific
> > patches. I'll first need to back-patch 350e6b8, which fixed sorting of CREATE
> > RULE, to v17 and earlier. Since 350e6b8 is conflict-free all the way back to
> > v13, I'm not attaching it.
>
> Back-patching 350e6b8 to facilitate back-patching this seems OK. I did
> a read-through of dobjcomp20-disambiguate-v2.patch and have no further
> review comments. I did not read through the back-patched versions on
> the assumption that back-porting was straightforward enough that a
> separate review was not required.

Pushed as 0decd5e.

v14 supports binary-upgrade from v8.3 and regular dump from v8.0. That required two other changes. First, pg_opclass.opcmethod had name opcamid until v8.3 (commit a78fcfb). Accounting for both names was trivial. Second, pg_am first had fixed OID AccessMethodRelationId in v8.1 (commit 7c13781). The find*ByOid functions have been assuming fixed catalog OIDs since v15's commit 92316a4. The $SUBJECT commit added findAccessMethodByOid() to all branches, so I changed the v14/v13 findAccessMethodByOid() to be more like v14/v13 findOprByOid(), which doesn't assume AccessMethodRelationId. If folks want more details, let me know.
Hello Noah,

04.08.2025 03:03, Noah Misch wrote:
> Pushed as 0decd5e. ...

Please look at a new anomaly introduced with that commit. The following script:

createdb regression

echo "
CREATE USER u1;
ALTER DEFAULT PRIVILEGES FOR ROLE u1 REVOKE INSERT ON TABLES FROM u1;

CREATE USER u2;
ALTER DEFAULT PRIVILEGES FOR ROLE u2 REVOKE INSERT ON TABLES FROM u2;
" | psql regression

pg_dump regression

triggers:
pg_dump: pg_dump_sort.c:454: DOTypeNameCompare: Assertion `0' failed.

Reproduced on master and REL_13_STABLE.

Best regards,
Alexander
Hi all!

On Sun, 10 Aug 2025 at 12:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Noah,
>
> 04.08.2025 03:03, Noah Misch wrote:
> > Pushed as 0decd5e. ...
>
> Please look at a new anomaly introduced with that commit. The following
> script:
> createdb regression
>
> echo "
> CREATE USER u1;
> ALTER DEFAULT PRIVILEGES FOR ROLE u1 REVOKE INSERT ON TABLES FROM u1;
>
> CREATE USER u2;
> ALTER DEFAULT PRIVILEGES FOR ROLE u2 REVOKE INSERT ON TABLES FROM u2;
> " | psql regression
>
> pg_dump regression
>
> triggers:
> pg_dump: pg_dump_sort.c:454: DOTypeNameCompare: Assertion `0' failed.

I reproduced this. Indeed, in the case of default ACLs we happen to use the OID sort. PFA a patch that resolves this issue: I simply added a DEFAULT ACL-specific tiebreaker that resolves the object order.

--
Best regards,
Kirill Reshke
Attachments
On Sun, Aug 10, 2025 at 04:41:20PM +0500, Kirill Reshke wrote:
> On Sun, 10 Aug 2025 at 12:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> > 04.08.2025 03:03, Noah Misch wrote:
> > > Pushed as 0decd5e. ...
> >
> > Please look at a new anomaly introduced with that commit. The following
> > script:
> > createdb regression
> >
> > echo "
> > CREATE USER u1;
> > ALTER DEFAULT PRIVILEGES FOR ROLE u1 REVOKE INSERT ON TABLES FROM u1;
> >
> > CREATE USER u2;
> > ALTER DEFAULT PRIVILEGES FOR ROLE u2 REVOKE INSERT ON TABLES FROM u2;
> > " | psql regression
> >
> > pg_dump regression
> >
> > triggers:
> > pg_dump: pg_dump_sort.c:454: DOTypeNameCompare: Assertion `0' failed.
>
> I reproduced this. Indeed, in case of default acl we happen to use OID sort.

Thanks. Given the current state of freeze for tomorrow's release wrap, the decision is less obvious than usual. I'm seeing these options:

1. Remove the new assertion in v13-v18.
2. Push your proposed fix.
3. Change nothing. (This would be the choice if one is maximally concerned
   about deviating from the freeze and unconcerned about --enable-cassert
   builds of releases.)

I am inclined to make today's change be (1). A fresh audit of catalog PRIMARY KEY and UNIQUE constraints didn't find any more missed cases, but (1) still feels like the right level of cautiousness. If there are no objections in the next 3hr, I'll proceed with (1).

> PFA resolves this issue. I simply added DEFAULT ACL case-specific
> tiebreaker that resolves object order.

Thanks. Could you make src/test/regress create regression database objects so the code addition has coverage? Using pg_signal_backend and pg_read_all_settings as the default ACL role names should avoid that suite's limitations. (The suite must run under any role name and must drop any roles it creates, so it can't assume any particular non-system role name survives the suite.)
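Coverage along those lines could look something like the following sketch (the exact statements and their placement in privileges.sql are assumptions here, not the committed test): two default-ACL entries owned by always-present built-in roles, mirroring the u1/u2 reproducer above.

-- Hedged sketch, not the committed test: default ACLs for two built-in
-- roles. Both entries collide on the sort keys the old code compared
-- (hence the assertion failure above), so only the new tiebreaker, or
-- formerly OID order, separates them.
ALTER DEFAULT PRIVILEGES FOR ROLE pg_signal_backend
    REVOKE INSERT ON TABLES FROM pg_signal_backend;
ALTER DEFAULT PRIVILEGES FOR ROLE pg_read_all_settings
    REVOKE INSERT ON TABLES FROM pg_read_all_settings;

Since 002_pg_upgrade.pl dumps and restores the regression database, leaving such entries behind would exercise the new tiebreaker on every run.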
On Sun, Aug 10, 2025 at 12:37 PM Noah Misch <noah@leadboat.com> wrote:
> Thanks. Given the current state of freeze for tomorrow's release wrap, the
> decision is less obvious than usual. I'm seeing these options:
>
> 1. Remove the new assertion in v13-v18.
> 2. Push your proposed fix.
> 3. Change nothing. (This would be the choice if one is maximally concerned
>    about deviating from the freeze and unconcerned about --enable-cassert
>    builds of releases.)
>
> I am inclined to make today's change be (1).

Sounds right to me.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Aug 10, 2025 at 12:37 PM Noah Misch <noah@leadboat.com> wrote:
>> Thanks. Given the current state of freeze for tomorrow's release wrap, the
>> decision is less obvious than usual. I'm seeing these options:
>>
>> 1. Remove the new assertion in v13-v18.
>> 2. Push your proposed fix.
>> 3. Change nothing. (This would be the choice if one is maximally concerned
>>    about deviating from the freeze and unconcerned about --enable-cassert
>>    builds of releases.)
>>
>> I am inclined to make today's change be (1).

> Sounds right to me.

I agree. The fact that this case escaped notice suggests that there might be more. We don't want to ship a version of pg_dump that will assert if that happens. Keep the assert in HEAD, for sure, but it's uncomfortable having it in back branches.

As for the actual fix, push it after the freeze lifts. The fact that we didn't quite get there on making dump order stable isn't a freeze-break-worthy bug.

			regards, tom lane
On Sun, 10 Aug 2025 at 21:37, Noah Misch <noah@leadboat.com> wrote:
>
> Thanks. Could you make src/test/regress create regression database objects so
> the code addition has coverage? Using pg_signal_backend and
> pg_read_all_settings as the default ACL role names should avoid that suite's
> limitations. (The suite must run under any role name and must drop any roles
> it creates, so it can't assume any particular non-system role name survives
> the suite.)

Here is my attempt at implementing the necessary legwork. It's v3 because I accidentally cleared the CC list in my previous attempt. Noah kindly explained to me how additions to the regress test will cause pg_dump logic to be tested as well. TIL 002_pg_upgrade.pl runs a regression suite, so if we create any database objects in it, they will end up being dumped and restored in that test. So, I checked that without the changes in pg_dump_sort.c, 002_pg_upgrade fails, and with the changes it does not.

PFA. I am not entirely sure about my additions to the `src/test/regress/sql/privileges.sql` file; maybe appending SQL to the end of the file is not the best option and there is a better place.

--
Best regards,
Kirill Reshke
Attachments
On Sun, 10 Aug 2025 at 21:37, Noah Misch <noah@leadboat.com> wrote:
> Thanks. Given the current state of freeze for tomorrow's release wrap, the
> decision is less obvious than usual. I'm seeing these options:
>
> 1. Remove the new assertion in v13-v18.
> 2. Push your proposed fix.
> 3. Change nothing. (This would be the choice if one is maximally concerned
>    about deviating from the freeze and unconcerned about --enable-cassert
>    builds of releases.)
>
> I am inclined to make today's change be (1). A fresh audit of catalog PRIMARY
> KEY and UNIQUE constraints didn't find any more missed cases, but (1) still
> feels like the right level of cautiousness. If there are no objections in the
> next 3hr, I'll proceed with (1).
>

Hi! I can see we have option (1) (28e7252 etc). Can we now move forward with option (2) for HEAD?

--
Best regards,
Kirill Reshke
On Wed, Aug 20, 2025 at 10:11:15AM +0500, Kirill Reshke wrote:
> On Sun, 10 Aug 2025 at 21:37, Noah Misch <noah@leadboat.com> wrote:
> > Thanks. Given the current state of freeze for tomorrow's release wrap, the
> > decision is less obvious than usual. I'm seeing these options:
> >
> > 1. Remove the new assertion in v13-v18.
> > 2. Push your proposed fix.
> > 3. Change nothing. (This would be the choice if one is maximally concerned
> >    about deviating from the freeze and unconcerned about --enable-cassert
> >    builds of releases.)
> >
> > I am inclined to make today's change be (1). A fresh audit of catalog PRIMARY
> > KEY and UNIQUE constraints didn't find any more missed cases, but (1) still
> > feels like the right level of cautiousness. If there are no objections in the
> > next 3hr, I'll proceed with (1).
>
> Hi! I can see we have option (1) (28e7252 etc). Can we now move
> forward with option (2) for HEAD?

Yep. It's in my queue.
On Mon, Aug 11, 2025 at 12:20:09AM +0500, Kirill Reshke wrote:
> On Sun, 10 Aug 2025 at 21:37, Noah Misch <noah@leadboat.com> wrote:
> > Thanks. Could you make src/test/regress create regression database objects so
> > the code addition has coverage? Using pg_signal_backend and
> > pg_read_all_settings as the default ACL role names should avoid that suite's
> > limitations. (The suite must run under any role name and must drop any roles
> > it creates, so it can't assume any particular non-system role name survives
> > the suite.)
>
> Here is my attempt at implementing the necessary legwork. It's v3 because
> I accidentally cleared the CC list in my previous attempt. Noah kindly
> explained to me how additions to the regress test will cause pg_dump
> logic to be tested as well.
> TIL 002_pg_upgrade.pl runs a regression suite, so if we create any
> database objects in it, they will end up being dumped and restored in
> that test.
> So, I checked that without the changes in pg_dump_sort.c, 002_pg_upgrade
> fails, and with the changes it does not.

Great.

> PFA. I am not entirely sure about my additions to the
> `src/test/regress/sql/privileges.sql` file; maybe appending SQL to the
> end of the file is not the best option and there is a better place.

I like how src/test/regress/sql/collate.icu.utf8.sql puts that kind of thing just after cleanup, so I put it there. Pushed as b61a5c4 with a few other cosmetic changes. Thanks.
On Fri, Aug 22, 2025 at 10:20:19PM -0700, Noah Misch wrote:
> On Mon, Aug 11, 2025 at 12:20:09AM +0500, Kirill Reshke wrote:
> > On Sun, 10 Aug 2025 at 21:37, Noah Misch <noah@leadboat.com> wrote:
> > > Thanks. Could you make src/test/regress create regression database objects so
> > > the code addition has coverage? Using pg_signal_backend and
> > > pg_read_all_settings as the default ACL role names should avoid that suite's
> > > limitations. (The suite must run under any role name and must drop any roles
> > > it creates, so it can't assume any particular non-system role name survives
> > > the suite.)
> >
> > Here is my attempt at implementing the necessary legwork. It's v3 because
> > I accidentally cleared the CC list in my previous attempt. Noah kindly
> > explained to me how additions to the regress test will cause pg_dump
> > logic to be tested as well.
> > TIL 002_pg_upgrade.pl runs a regression suite, so if we create any
> > database objects in it, they will end up being dumped and restored in
> > that test.
> > So, I checked that without the changes in pg_dump_sort.c, 002_pg_upgrade
> > fails, and with the changes it does not.
>
> Great.
>
> > PFA. I am not entirely sure about my additions to the
> > `src/test/regress/sql/privileges.sql` file; maybe appending SQL to the
> > end of the file is not the best option and there is a better place.
>
> I like how src/test/regress/sql/collate.icu.utf8.sql puts that kind of thing
> just after cleanup, so I put it there. Pushed as b61a5c4 with a few other
> cosmetic changes. Thanks.

TestUpgradeXversion fails:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2025-08-23%2007%3A34%3A38

--- /home/pgbf/buildroot/upgrade.copperhead/REL_18_STABLE/origin-REL_16_STABLE.sql.fixed	2025-08-23 10:28:16.464887433 +0200
+++ /home/pgbf/buildroot/upgrade.copperhead/REL_18_STABLE/converted-REL_16_STABLE-to-REL_18_STABLE.sql.fixed	2025-08-23 10:28:16.508887289 +0200
@@ -606490,13 +606490,13 @@
 --
 -- Name: DEFAULT PRIVILEGES FOR TABLES; Type: DEFAULT ACL; Schema: -; Owner: pg_read_all_settings
 --
-ALTER DEFAULT PRIVILEGES FOR ROLE pg_read_all_settings REVOKE SELECT,INSERT,REFERENCES,DELETE,TRIGGER,TRUNCATE,UPDATE ON TABLES FROM pg_read_all_settings;
-ALTER DEFAULT PRIVILEGES FOR ROLE pg_read_all_settings GRANT SELECT,REFERENCES,DELETE,TRIGGER,TRUNCATE,UPDATE ON TABLES TO pg_read_all_settings;
+ALTER DEFAULT PRIVILEGES FOR ROLE pg_read_all_settings REVOKE ALL ON TABLES FROM pg_read_all_settings;
+ALTER DEFAULT PRIVILEGES FOR ROLE pg_read_all_settings GRANT SELECT,REFERENCES,DELETE,TRIGGER,TRUNCATE,MAINTAIN,UPDATE ON TABLES TO pg_read_all_settings;

Crossing the boundary of the MAINTAIN privilege existing seems relevant. Will fix. (My checklist did tell me to do a local run of TestUpgradeXversion. I skipped it, betting this patch wouldn't break that test. I lost that bet.)
On Sat, Aug 23, 2025 at 07:45:05AM -0700, Noah Misch wrote:
> On Fri, Aug 22, 2025 at 10:20:19PM -0700, Noah Misch wrote:
> > Pushed as b61a5c4 with a few other
> > cosmetic changes. Thanks.
>
> TestUpgradeXversion fails:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2025-08-23%2007%3A34%3A38
>
> --- /home/pgbf/buildroot/upgrade.copperhead/REL_18_STABLE/origin-REL_16_STABLE.sql.fixed	2025-08-23 10:28:16.464887433 +0200
> +++ /home/pgbf/buildroot/upgrade.copperhead/REL_18_STABLE/converted-REL_16_STABLE-to-REL_18_STABLE.sql.fixed	2025-08-23 10:28:16.508887289 +0200
> @@ -606490,13 +606490,13 @@
>  --
>  -- Name: DEFAULT PRIVILEGES FOR TABLES; Type: DEFAULT ACL; Schema: -; Owner: pg_read_all_settings
>  --
> -ALTER DEFAULT PRIVILEGES FOR ROLE pg_read_all_settings REVOKE SELECT,INSERT,REFERENCES,DELETE,TRIGGER,TRUNCATE,UPDATE ON TABLES FROM pg_read_all_settings;
> -ALTER DEFAULT PRIVILEGES FOR ROLE pg_read_all_settings GRANT SELECT,REFERENCES,DELETE,TRIGGER,TRUNCATE,UPDATE ON TABLES TO pg_read_all_settings;
> +ALTER DEFAULT PRIVILEGES FOR ROLE pg_read_all_settings REVOKE ALL ON TABLES FROM pg_read_all_settings;
> +ALTER DEFAULT PRIVILEGES FOR ROLE pg_read_all_settings GRANT SELECT,REFERENCES,DELETE,TRIGGER,TRUNCATE,MAINTAIN,UPDATE ON TABLES TO pg_read_all_settings;
>
> Crossing the boundary of the MAINTAIN privilege existing seems relevant. Will
> fix. (My checklist did tell me to do a local run of TestUpgradeXversion. I
> skipped it, betting this patch wouldn't break that test. I lost that bet.)

I've pushed commit ad44124 to fix this by changing "REVOKE INSERT ON TABLES" to "REVOKE USAGE ON TYPES". Types have the same one privilege in all supported versions, so they avoid the problem. An alternative was to GRANT to a different role, not REVOKE from the owner role.

I noticed that the buildfarm diff revealed suboptimal pg_dump behaviors involving v17+ pg_dump dumping a v16- origin server. I failed to resist studying that. While I plan not to work on these myself, see below for what I found. If someone works on these, that probably deserves its own thread:

==== Dump+reload+dump sees a schema diff

Consider the sequence:

1. start a v16 server
2. run the commands commit b61a5c4 added to privileges.sql
3. dump the v16 server w/ v17 pg_dump
   includes: REVOKE (all but MAINTAIN); GRANT (all but INSERT and MAINTAIN)
4. start a v17 server
5. restore the dump to the v17 server
6. dump the v17 server
   includes: REVOKE ALL; GRANT (all but INSERT)

The dump from (3) differs from the dump in (6), hence the buildfarm failure. Both dumps correctly reproduce catalog state, but it's not great to vary output like this.

buildACLCommands() operates at the aclitem level. If it gets these args:

  acls = defaclacl = {pg_signal_backend=rwdDxt/pg_signal_backend}
  baseacls = acldefault = {pg_signal_backend=arwdDxt/pg_signal_backend}

then it revokes the 7 "baseacls" privileges and grants the 6 "acls" privileges. By diffing individual ai_privs bits instead of whole aclitem values, it could deduce that REVOKE INSERT suffices. In other words, do the equivalent of exploding each aclitem into individual privilege records before diffing to decide what to GRANT and what to REVOKE:

# defaclacl
(pg_signal_backend,r,$grantor)
...

# acldefault
(pg_signal_backend,a,$grantor)
(pg_signal_backend,r,$grantor)
...

I think that would also make the dump output do the right thing in e.g. v17 pg_dump migrating a v17 origin to a v16 destination. (That's not guaranteed to work, but we don't offer a better-supported downgrade path.) We could then re-add the test from commit b61a5c4. There could easily be corner cases I'm not considering; the strategy may be a dead end.

==== Never emits {REVOKE|GRANT} ALL ON {TABLE|TABLES}

Tactically, that's because v17 parseACLItem() never finds the MAINTAIN bit in a v16- aclitem. A v16 GRANT ALL ON TABLE becomes a GRANT (all but MAINTAIN) ON TABLE when migrated to v17.

The prior debut of new relation ai_privs bits, REFERENCES etc., came in v7.2. Unlike MAINTAIN, v7.2 commit 1b68bcfad checked the v7.2+ bits only for v7.2+ origin servers. A v7.1 aclitem that had every v7.1-era privilege (arwR = SELECT,UPDATE/DELETE,INSERT,RULE) would also get v7.2 REFERENCES. Lack of any v7.1-era privilege would mean no REFERENCES.

The v7.2 approach would be suboptimal if a new bit is more powerful than all existing bits for the object type. For example, FOREIGN SERVER, LANGUAGE, TYPE, etc. have just USAGE. If v19 were to introduce any other privilege for those, v18->v19 restore shouldn't emit GRANT ALL for those. (v7.2 was okay, because v7.1 RULE was roughly as powerful as v7.2 REFERENCES.)

Given the 11 months without complaints, changing the MAINTAIN treatment now is likely riskier than leaving it as-is. If I had realized this outcome before the v17 release, though, I would have recommended the v7.2 approach instead. (MAINTAIN is strictly less powerful than REFERENCES, so the last paragraph's caveat does not apply.)

==== Outdated comment in buildACLCommands

(This is not specific to v17+.)

	/*
	 * At this point we have issued REVOKE statements for all initial and
	 * default privileges that are no longer present on the object, so we are
	 * almost ready to GRANT the privileges listed in grantitems[].
	 *
	 * We still need some hacking though to cover the case where new default
	 * public privileges are added in new versions: the REVOKE ALL will revoke
	 * them, leading to behavior different from what the old version had,
	 * which is generally not what's wanted. So add back default privs if the
	 * source database is too old to have had that particular priv. (As of
	 * right now, no such cases exist in supported versions.)
	 */

This originated in commit 2ee56b6a3 (2007-01), to handle v8.2 adding privilege CONNECT ON DATABASE. It's no longer just "new default public privileges" that could need code here. Any version-dependent "baseacls" (world_default, owner_default, or pg_init_privs) could entail code here. For example, if MAINTAIN privilege had used the v7.2 pg_dump approach (see heading "Never emits {REVOKE|GRANT} ALL ON {TABLE|TABLES}"), then a GRANT MAINTAIN for table owners would belong here, to compensate for the v17 owner_default change. The "diffing individual privileges" approach potentially removes the need for code here.

If anyone seeks more details on the above, let me know.
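The aclitem-diffing idea in the preceding message can be illustrated from SQL with the server's aclexplode() function, using the same aclitem values quoted there (this is only an illustration of the set difference; pg_dump itself would have to do the equivalent client-side in buildACLCommands(), and the query assumes a server on which the pg_signal_backend role exists, i.e. any supported version):

SELECT privilege_type
FROM aclexplode('{pg_signal_backend=arwdDxt/pg_signal_backend}'::aclitem[])  -- baseacls / acldefault
EXCEPT
SELECT privilege_type
FROM aclexplode('{pg_signal_backend=rwdDxt/pg_signal_backend}'::aclitem[]);  -- acls / defaclacl

The single resulting row is INSERT, matching the observation above that a bare REVOKE INSERT would reproduce the catalog state that the current aclitem-level comparison expresses as a full REVOKE followed by a re-GRANT.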
Hi,

On 2025-08-23 07:45:05 -0700, Noah Misch wrote:
> Crossing the boundary of the MAINTAIN privilege existing seems relevant. Will
> fix. (My checklist did tell me to do a local run of TestUpgradeXversion. I
> skipped it, betting this patch wouldn't break that test. I lost that bet.)

I wonder if it's worth adding support to CI to perform the cross-version upgrade test. It'd be pretty easy to install all pgdg apt postgres packages to the debian image, which then could be used as the source version...

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> I wonder if it's worth adding support to CI to perform the cross-version
> upgrade test. It'd be pretty easy to install all pgdg apt postgres packages to
> the debian image, which then could be used as the source version...

I feel that that's the wrong tradeoff. CI should be expected to be fairly cheap, not to catch everything the buildfarm could catch.

			regards, tom lane
On Sun, Aug 24, 2025 at 11:50:01AM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I wonder if it's worth adding support to CI to perform the cross-version
> > upgrade test. It'd be pretty easy to install all pgdg apt postgres packages to
> > the debian image, which then could be used as the source version...

I think catching this particular case would take more than that. It entails running the latest v16 src/test/regress suite, capturing the dump of that into $animal_root/upgrade.$animal/REL_16_STABLE/*.sql, and seeing the upgrade failure of that dump having the latest v16 regression objects. I don't know how to get there without a v16 source tree.

> I feel that that's the wrong tradeoff. CI should be expected to be
> fairly cheap, not to catch everything the buildfarm could catch.

It can always be non-default, like the mingw test.
Hi,

On 2025-08-24 09:08:16 -0700, Noah Misch wrote:
> On Sun, Aug 24, 2025 at 11:50:01AM -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > I wonder if it's worth adding support to CI to perform the cross-version
> > > upgrade test. It'd be pretty easy to install all pgdg apt postgres packages to
> > > the debian image, which then could be used as the source version...
>
> I think catching this particular case would take more than that. It entails
> running the latest v16 src/test/regress suite, capturing the dump of that into
> $animal_root/upgrade.$animal/REL_16_STABLE/*.sql, and seeing the upgrade
> failure of that dump having the latest v16 regression objects. I don't know
> how to get there without a v16 source tree.

Ah, ok, that does make it less worthwhile to go after.

> > I feel that that's the wrong tradeoff. CI should be expected to be
> > fairly cheap, not to catch everything the buildfarm could catch.

I think it's also about removing painful manual testing - and imo manually running cross-version pg_upgrade tests is really rather painful.

> It can always be non-default, like the mingw test.

Indeed. We now have the infrastructure to enable such tests for cfbot while not running by default in user's repositories (which will commonly be more compute constrained).

Greetings,

Andres Freund
On Mon, Aug 25, 2025 at 05:15:55PM -0400, Andres Freund wrote:
> On 2025-08-24 09:08:16 -0700, Noah Misch wrote:
> > On Sun, Aug 24, 2025 at 11:50:01AM -0400, Tom Lane wrote:
> > > Andres Freund <andres@anarazel.de> writes:
> > > > I wonder if it's worth adding support to CI to perform the cross-version
> > > > upgrade test. It'd be pretty easy to install all pgdg apt postgres packages to
> > > > the debian image, which then could be used as the source version...
> >
> > I think catching this particular case would take more than that. It entails
> > running the latest v16 src/test/regress suite, capturing the dump of that into
> > $animal_root/upgrade.$animal/REL_16_STABLE/*.sql, and seeing the upgrade
> > failure of that dump having the latest v16 regression objects. I don't know
> > how to get there without a v16 source tree.
>
> Ah, ok, that does make it less worthwhile to go after.
>
> > > I feel that that's the wrong tradeoff. CI should be expected to be
> > > fairly cheap, not to catch everything the buildfarm could catch.
>
> I think it's also about removing painful manual testing - and imo manually
> running cross-version pg_upgrade tests is really rather painful.

I make the buildfarm client drive it. That was painful to set up the first time[1], but the per-run manual pain isn't bad. A run of all supported branches takes hours of wall time, though. There are some optimization opportunities, but it hasn't come up often enough to make those compelling for me to implement.

[1] For example, there's no one OpenSSL version compatible with all of v9.2 - v19. Disabling SSL doesn't solve that: some versions then disable pgcrypto, and the upgrade test fails for pgcrypto being absent on one side of the upgrade. I settled on SSL only for the versions where pgcrypto requires it. Version-dependent LD_LIBRARY_PATH etc. likely would have been an alternative.
Noah Misch <noah@leadboat.com> writes:
> On Mon, Aug 25, 2025 at 05:15:55PM -0400, Andres Freund wrote:
>> I think it's also about removing painful manual testing - and imo manually
>> running cross-version pg_upgrade tests is really rather painful.

> I make the buildfarm client drive it.

Yeah, same here. I have a BF instance lying around that's configured to do the cross-version pg_upgrade tests. It's not registered with the BF server, I just run it manually when I need to test that.

			regards, tom lane