Обсуждение: Can we let extensions change their dumped catalog schemas?
Hi all, I've been talking to other Timescale devs about a requested change to pg_dump, and there's been quite a bit of back-and-forth to figure out what, exactly, we want. Any mistakes here are mine, but I think we've been able to distill it down to the following request: We'd like to be allowed to change the schema for a table that's been marked in the past with pg_extension_config_dump(). Unless I'm missing something obvious (please, let it be that) there's no way to do this safely. Once you've marked an internal table as dumpable, its schema is effectively frozen if you want your dumps to work across versions, because otherwise you'll try to restore that "catalog" data into a table that has different columns. And while sometimes you can make that work, it doesn't in the general case. We (Timescale) do already change the schemas today, but we pay the associated costs in that dump/restore doesn't work without manual version bookkeeping and user fiddling -- and in the worst cases, it appears to "work" across versions but leaves our catalog tables in an inconsistent state. So the request is to come up with a way to support this case. Some options that have been proposed so far: 1) Don't ask for a new feature, and instead try to ensure infinite backwards compatibility for those tables. For extension authors who have already done this -- and have likely done some heavy architectural lifting to make it work -- this is probably the first thing that will come to mind, and it was the first thing I said, too. But the more I say it, the less acceptable it feels. Not even Postgres is expected to maintain infinite catalog compatibility into the future. We need to evolve our catalogs, too -- and we already provide the standard update scripts to perform migrations of those tables, but a dump/restore doesn't have any way to use them today. 2) Provide a way to record the exact version of an extension in a dump. Brute-force, but pretty much guaranteed to fix the cross-version problem, because the dump can't be accidentally restored to an extension version with a different catalog schema. Users then manually ALTER EXTENSION ... UPDATE (or we could even include that in the dump itself, as the final action). Doing this by default would punish extensions that don't have this problem, so it would have to be opt-in in some way. It's also unnecessarily strict IMO -- even if we don't have a config table change in a new version, we'll still require the old extension version to be available alongside the new version during a restore. Maybe a tweak on this idea would be to introduce a catversion for extensions. 3) Provide a way to record the entire internal state of an extension in a dump. Every extension is already expected to handle the case where the internal state is at version X but the installed extension is at version X+N, and the update scripts we provide will perform the necessary migrations. But there's no way to reproduce this case using dump/restore, because dumping an extension omits its internals. If a dump could instead include the entire internal state of an extension, then we'd be guaranteed to reproduce the exact situation that we already have to support for an in-place upgrade. After a restore, the SQL is at version X, the installed extension is some equal or later version, and all that remains is to run the update scripts, either manually or within the dump itself. Like (2), I think there's no way you'd all accept this cost for every extension. It'd have to be opt-in. -- Hopefully that makes a certain amount of sense. Does it seem like a reasonable thing to ask? I'm happy to clarify anything above, and if you know of an obvious solution I'm missing, I would love to be corrected. :D Thanks, --Jacob
Jacob Champion <jchampion@timescale.com> writes: > We'd like to be allowed to change the schema for a table that's been > marked in the past with pg_extension_config_dump(). > Unless I'm missing something obvious (please, let it be that) there's no > way to do this safely. Once you've marked an internal table as dumpable, > its schema is effectively frozen if you want your dumps to work across > versions, because otherwise you'll try to restore that "catalog" data > into a table that has different columns. And while sometimes you can > make that work, it doesn't in the general case. I agree that's a problem, but it's not that we're arbitrarily prohibiting something that would work. What, concretely, do you think could be done to improve the situation? > 2) Provide a way to record the exact version of an extension in a dump. Don't really see how that helps? I also fear that it will break a bunch of use-cases that work fine today, which are exactly the ones for which we originally defined pg_dump as *not* committing to a particular extension version. It feels like what we really need here is some way to mutate the old format of an extension config table into the new format. Simple addition of new columns shouldn't be a problem (in fact, I think that works already, or could easily be made to). If you want some ETL processing then it's harder :-(. Could an ON INSERT trigger on an old config table transpose converted data into a newer config table? Another point that ought to be made here is that pg_dump is not the only outside consumer of extension config data. You're likely to break some applications if you change a config table too much. That's not an argument that we shouldn't try to make pg_dump more forgiving, but I'm not sure that we need to move heaven and earth. regards, tom lane
On Tue, Jan 10, 2023 at 7:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jacob Champion <jchampion@timescale.com> writes: > > Unless I'm missing something obvious (please, let it be that) there's no > > way to do this safely. Once you've marked an internal table as dumpable, > > its schema is effectively frozen if you want your dumps to work across > > versions, because otherwise you'll try to restore that "catalog" data > > into a table that has different columns. And while sometimes you can > > make that work, it doesn't in the general case. > > I agree that's a problem, but it's not that we're arbitrarily prohibiting > something that would work. What, concretely, do you think could be > done to improve the situation? Concretely, I think extensions should be able to invoke their update scripts at some point after a dump/restore cycle, whether automatically or manually. > > 2) Provide a way to record the exact version of an extension in a dump. > > Don't really see how that helps? If pg_dump recorded our extension using CREATE EXTENSION timescaledb VERSION <original version> then we'd be able to migrate the changed catalog post-restore, using a standard ALTER EXTENSION ... UPDATE. > I also fear that it will break > a bunch of use-cases that work fine today, which are exactly the > ones for which we originally defined pg_dump as *not* committing > to a particular extension version. Right, I think it would have to be opt-in. Say, a new control file option dump_version or some such. > It feels like what we really need here is some way to mutate the > old format of an extension config table into the new format. Agreed. We already provide mutation functions via the update scripts, so I think both proposal 2 and 3 do that. I'm curious about your opinion on option 3, since it would naively seem to make pg_dump do _less_ work for a given extension. > Simple addition of new columns shouldn't be a problem (in fact, > I think that works already, or could easily be made to). If you > want some ETL processing then it's harder :-(. One sharp edge for the add-a-new-column case is where you give the new column a default, and you want all of the old migrated rows to have some non-default value to handle backwards compatibility. (But that case is handled trivially if you _know_ that you're performing a migration.) > Could an ON INSERT > trigger on an old config table transpose converted data into a > newer config table? You mean something like, introduce table catalog_v2, and have all INSERTs to catalog_v1 migrate and redirect the rows? That seems like it could work today, though it would mean maintaining two different upgrade paths for the same table, migrating all users of the catalog to the new name, and needing to drop the old table at... some point after the restore? I don't know if there would be performance concerns with larger catalogs (in fact I'm not sure how big these catalogs get). > Another point that ought to be made here is that pg_dump is not > the only outside consumer of extension config data. You're likely > to break some applications if you change a config table too much. Such as? We don't really want applications to be coupled against our internals by accident, but we have to dump the internals to be able to reproduce the state of the system. > That's not an argument that we shouldn't try to make pg_dump more > forgiving, but I'm not sure that we need to move heaven and earth. Agreed. Hopefully we can find something that just moves a little earth. :D Thanks! --Jacob
Jacob Champion <jchampion@timescale.com> writes: > On Tue, Jan 10, 2023 at 7:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I also fear that it will break >> a bunch of use-cases that work fine today, which are exactly the >> ones for which we originally defined pg_dump as *not* committing >> to a particular extension version. > Right, I think it would have to be opt-in. Say, a new control file > option dump_version or some such. That would require all the installed extensions to cope with this the same way, which does not seem like a great assumption. regards, tom lane
On Wed, Jan 11, 2023 at 1:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jacob Champion <jchampion@timescale.com> writes: > > Right, I think it would have to be opt-in. Say, a new control file > > option dump_version or some such. > > That would require all the installed extensions to cope with this > the same way, which does not seem like a great assumption. How so? Most installed extensions would not opt into a version dump, I'd imagine. Or do you mean that the version dump would apply retroactively to older versions of the extension, even if it wasn't needed in the past? --Jacob
On 1/12/23 11:04, Jacob Champion wrote: > On Wed, Jan 11, 2023 at 1:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Jacob Champion <jchampion@timescale.com> writes: >>> Right, I think it would have to be opt-in. Say, a new control file >>> option dump_version or some such. >> >> That would require all the installed extensions to cope with this >> the same way, which does not seem like a great assumption. > > How so? Most installed extensions would not opt into a version dump, > I'd imagine. As a concrete example, Timescale's extension control file could look like this: default_version = '2.x.y' module_pathname = '$libdir/timescaledb-2.x.y' ... dump_version = true which would then cause pg_dump to issue a VERSION for its CREATE EXTENSION line. Other extensions would remain with the default (dump_version = false), so they'd be dumped without an explicit VERSION. (And in the case of option 3, the name of the control file option changes -- dump_internals, maybe? -- but it still doesn't affect other installed extensions.) --Jacob
On Tue, Jan 17, 2023 at 3:18 PM Jacob Champion <jchampion@timescale.com> wrote: > As a concrete example, Timescale's extension control file could look > like this: > > default_version = '2.x.y' > module_pathname = '$libdir/timescaledb-2.x.y' > ... > dump_version = true > > which would then cause pg_dump to issue a VERSION for its CREATE > EXTENSION line. Other extensions would remain with the default > (dump_version = false), so they'd be dumped without an explicit VERSION. Even more concretely, here's a draft patch. There's no nuance -- setting dump_version affects all past versions of the extension, and it can't be turned off at restore time. So extensions opting in would have to be written to be side-by-side installable. (Ours is, in its own way, but the PGDG installers don't allow it -- which maybe highlights a weakness in this approach.) I'm still not sure if this addresses Tom's concerns, or just adds new ones. We could maybe give the user more control by overriding the default version for an extension in a different TOC entry, and then add options to ignore or include those version numbers during restore. That doesn't address the side-by-side problem directly but gives an escape hatch. Earlier I wrote: > I'm curious about your > opinion on option 3, since it would naively seem to make pg_dump do > _less_ work for a given extension. This was definitely naive :D We can't just make use of the binary-upgrade machinery to dump extension internals, because it pins OIDs. So that might still be a valid approach, but it's not "less work." Thanks, --Jacob
Вложения
On Tue, Feb 7, 2023 at 10:16 AM Jacob Champion <jchampion@timescale.com> wrote: > Even more concretely, here's a draft patch. There's no nuance -- > setting dump_version affects all past versions of the extension, and > it can't be turned off at restore time. So extensions opting in would > have to be written to be side-by-side installable. (Ours is, in its > own way, but the PGDG installers don't allow it -- which maybe > highlights a weakness in this approach.) I'm still not sure if this > addresses Tom's concerns, or just adds new ones. Any general thoughts on this approach? I don't think it's baked enough for registration yet, but I also don't know what approach would be better. Given the recent chatter around extension versions in other threads [1, 2], I feel like there is a big gap between the Postgres core expectations and what extension authors are actually doing when it comes to handling version upgrades. I'd like to chip away at that, somehow. Thanks, --Jacob [1] https://www.postgresql.org/message-id/212074.1678301349%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/CA%2BTgmoYqK6nfP15SjuyO6t5jOmymG%3DqO7JOOVJdTOj96L0XJ1Q%40mail.gmail.com