Discussion: Re: pgsql: Teach DSM registry to ERROR if attaching to an uninitialized entry
On Wed, Nov 12, 2025 at 3:31 PM Nathan Bossart <nathan@postgresql.org> wrote:
> Teach DSM registry to ERROR if attaching to an uninitialized entry.
>
> If DSM entry initialization fails, backends could try to use an
> uninitialized DSM segment, DSA, or dshash table (since the entry is
> still added to the registry). To fix, keep track of whether
> initialization completed, and ERROR if a backend tries to attach to
> an uninitialized entry. We could instead retry initialization as
> needed, but that seemed complicated, error prone, and unlikely to
> help most cases. Furthermore, such problems probably indicate a
> coding error.

Having read the thread that led to this commit, I agree that this is an
improvement, but it still doesn't seem like a great situation to me.
Maybe I'm misunderstanding, but it seems like once the initialization
has failed, there's no way to retry: you can't retry the initialization
on the existing segment, and you can't delete it and create a new
segment. If so, that means your only option for restoring proper
function after an initialization failure is to bounce the entire
server.

Now, perhaps the chances of an initialization failure in practice are
quite low. Maybe a typical initialization function won't do anything
that could fail. But it just seems weird to me to design something that
thinks errors are likely enough that it's worth having some amount of
mechanism to deal with them, but unlikely enough that it's not worth
making that mechanism more robust.

--
Robert Haas
EDB: http://www.enterprisedb.com
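For concreteness, the committed behavior amounts to roughly the
following sketch. The struct layout, field names, and error text are
illustrative stand-ins, not the actual dsm_registry.c definitions:

    /* Each registry entry records whether its init callback finished. */
    typedef struct DSMRegistryEntry
    {
        char        name[64];       /* registry key (illustrative size) */
        dsm_handle  handle;         /* segment for attachers */
        size_t      size;           /* segment size */
        bool        initialized;    /* did init_callback run to completion? */
    } DSMRegistryEntry;

    /* creator path: the flag is set only after the callback succeeds */
    entry->initialized = false;
    init_callback(dsm_segment_address(seg), arg);   /* may ERROR */
    entry->initialized = true;

    /* attacher path: refuse to use a half-built entry */
    if (!entry->initialized)
        ereport(ERROR,
                (errmsg("requested DSM segment \"%s\" failed initialization",
                        name)));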
On Thu, Nov 20, 2025 at 10:44:31AM -0500, Robert Haas wrote:
> Now, perhaps the chances of an initialization failure in practice are
> quite low. Maybe a typical initialization function won't do anything
> that could fail. But it just seems weird to me to design something
> that thinks errors are likely enough that it's worth having some
> amount of mechanism to deal with them, but unlikely enough that it's
> not worth making that mechanism more robust.

Well, I will attempt to convince you that is indeed the case, although
I'm not sure I will succeed...

ISTM that besides pure coding errors, the most likely reasons for an
ERROR during DSM segment initialization are things like out-of-memory,
too many LWLock tranches registered, etc. In those cases, do we want
every single backend that tries to attach to the segment to retry and
most likely fail again? And what if the initialization callback
completed halfway before failing? Do we expect every extension author
to carefully craft their callbacks to handle that?

On the other hand, if we don't handle ERRORs and someone does run into
an OOM or whatnot, do we want other backends to attach and use the
uninitialized segment (leading to things like seg-faults)? Tracking
whether an entry was initialized seems like cheap insurance against
these problems.

Finally, the only report from the field about any of this involves
SQLsmith allocating tons of LWLock tranches via test_dsa, which has
been fixed independently. So I have no evidence to suggest that we need
to make it more robust. In fact, without this one arguably contrived
report, I probably wouldn't have done anything at all.

I'll grant you that this is a judgment call. I tried to find a good
middle ground, but I would be unsurprised to learn that other
committers would have done things differently.

--
nathan
On Thu, Nov 20, 2025 at 11:12 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> ISTM that besides pure coding errors, the most likely reasons for an
> ERROR during DSM segment initialization are things like out-of-memory,
> too many LWLock tranches registered, etc. In those cases, do we want
> every single backend that tries to attach to the segment to retry and
> most likely fail again? And what if the initialization callback
> completed halfway before failing? Do we expect every extension author
> to carefully craft their callbacks to handle that?

Well, I don't want to pretend like I know all the answers here. I do
think it's pretty clear that making it the initialization callback's
job to deal with a partially-failed initialization would be crazy.

What I'd be inclined to do after a failure is tear down the entire
segment. Even if there's no mechanism to retry, you're saving the
memory that the segment would have consumed. And maybe there is a
mechanism to retry. If these initialization callbacks are cheap enough,
maybe having every backend try is not really a big deal -- maybe it's
good, insofar as an interactive session would be more likely to see an
error indicating the real cause of failure than a fake error that
basically tells you that at some point in the past something went
wrong. If repeated failures are too painful, another idea could be to
just mark the entry as being in a failed state (maybe with a timestamp)
and let the extension decide when it wants to press the reset button.

Or maybe that's all too complicated for too little utility. I'm not
sure. My perception is that this is setting a significantly lower
standard for error tolerance than what we typically seek to achieve,
but sometimes my perceptions are wrong, and sometimes better options
are hard to come by.

> On the other hand, if we don't handle ERRORs and someone does run into
> an OOM or whatnot, do we want other backends to attach and use the
> uninitialized segment (leading to things like seg-faults)? Tracking
> whether an entry was initialized seems like cheap insurance against
> these problems.

That's fair.

> Finally, the only report from the field about any of this involves
> SQLsmith allocating tons of LWLock tranches via test_dsa, which has
> been fixed independently. So I have no evidence to suggest that we
> need to make it more robust. In fact, without this one arguably
> contrived report, I probably wouldn't have done anything at all.

Yeah, I'm not clear who is using this or how, so that makes it hard to
judge.

> I'll grant you that this is a judgment call. I tried to find a good
> middle ground, but I would be unsurprised to learn that other
> committers would have done things differently.

Such is life!

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Nov 20, 2025 at 01:18:56PM -0500, Robert Haas wrote:
> What I'd be inclined to do after a failure is tear down the entire
> segment. Even if there's no mechanism to retry, you're saving the
> memory that the segment would have consumed.

Unpinning/detaching the segment/DSA/dshash table and deleting the DSM
registry entry in a PG_CATCH block scares me a little, but it might be
doable.

> Or maybe that's all too complicated for too little utility. I'm not
> sure.

That's ultimately where I landed.

> My perception is that this is setting a significantly lower standard
> for error tolerance than what we typically seek to achieve, but
> sometimes my perceptions are wrong, and sometimes better options are
> hard to come by.

Another thing that might be subconsciously guiding my decisions here is
the existing behavior when a shmem request/startup hook ERRORs (server
startup fails). I'd expect DSM registry users to be doing similar
things in their initialization callbacks, and AFAIK this behavior
hasn't been a source of complaints.

--
nathan
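For reference, the hook behavior Nathan alludes to: a typical shmem
request hook looks roughly like the sketch below (my_shmem_request and
MySharedState are hypothetical names), and any ERROR thrown from it
fails server startup outright:

    static shmem_request_hook_type prev_shmem_request_hook = NULL;
    static void my_shmem_request(void);

    void
    _PG_init(void)
    {
        prev_shmem_request_hook = shmem_request_hook;
        shmem_request_hook = my_shmem_request;
    }

    static void
    my_shmem_request(void)
    {
        if (prev_shmem_request_hook)
            prev_shmem_request_hook();

        /* Either call here can ERROR, which aborts server startup. */
        RequestAddinShmemSpace(MAXALIGN(sizeof(MySharedState)));
        RequestNamedLWLockTranche("my_extension", 1);
    }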
On Thu, Nov 20, 2025 at 6:09 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Unpinning/detaching the segment/DSA/dshash table and deleting the DSM
> registry entry in a PG_CATCH block scares me a little, but it might be
> doable.

It seems a bit weird to be doing explicit unpinning in a PG_CATCH
block. Ideally you'd want to postpone the pinning until initialization
has succeeded, so that if you fail before that, transaction cleanup
takes care of it automatically. Alternatively, destroying what existed
before could be deferred until later, when an as-yet-unfailed
transaction stumbles across the tombstone.

> Another thing that might be subconsciously guiding my decisions here
> is the existing behavior when a shmem request/startup hook ERRORs
> (server startup fails). I'd expect DSM registry users to be doing
> similar things in their initialization callbacks, and AFAIK this
> behavior hasn't been a source of complaints.

What bugs me here is the fact that you have a perfectly good working
server that you can't "salvage". If the server had failed at startup,
that would have sucked, but you would have been forced into correcting
the problem (or giving up on the extension) and restarting, so once the
server gets up and running, it's always in a good state. With this
mechanism, you can get a running server that's stuck in this
failed-initialization state, and there's no way for the DBA to do
anything except by bouncing the entire server. That seems like it could
be really frustrating.

Now, if it's the case that resetting this mechanism wouldn't
fundamentally fix anything because the reinitialization attempt would
inevitably fail for the same reason, then I suppose it's not so bad,
but I'm not sure that would always be true. I just feel like one-way
state transitions from good->bad are undesirable. One way of viewing
log levels like ERROR, FATAL, and PANIC is that they force you to do a
reset of the transaction, session, or entire server to get back to a
good state, but here you just get stuck in the bad one.

Am I worrying too much? Possibly! But as I said to David on another
thread this morning, it's better to worry on pgsql-hackers before any
problem happens than to start worrying after something bad happens in a
customer situation.

--
Robert Haas
EDB: http://www.enterprisedb.com
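The pin-ordering idea can be sketched like this (a rough illustration
under assumed names, not the eventual patch): an unpinned DSM segment
created in the current transaction is released automatically by
resource-owner cleanup on abort, so deferring the pins until after the
init callback means an ERROR needs no explicit undo.

    dsm_segment *seg;

    seg = dsm_create(size, 0);          /* tracked by the resource owner */

    /* If this ERRORs, abort processing detaches and destroys seg. */
    init_callback(dsm_segment_address(seg), arg);

    /* Only pin once initialization is known to have succeeded. */
    dsm_pin_segment(seg);               /* keep segment past last detach */
    dsm_pin_mapping(seg);               /* keep our mapping past this xact */
    entry->handle = dsm_segment_handle(seg);
    entry->initialized = true;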
On Sat, Nov 22, 2025 at 10:25:48AM -0500, Robert Haas wrote:
> On Thu, Nov 20, 2025 at 6:09 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Unpinning/detaching the segment/DSA/dshash table and deleting the DSM
>> registry entry in a PG_CATCH block scares me a little, but it might be
>> doable.
>
> It seems a bit weird to be doing explicit unpinning in a PG_CATCH
> block. Ideally you'd want to postpone the pinning until initialization
> has succeeded, so that if you fail before that, transaction cleanup
> takes care of it automatically. Alternatively, destroying what existed
> before could be deferred until later, when an as-yet-unfailed
> transaction stumbles across the tombstone.

Oh, yeah, good idea.

> Am I worrying too much? Possibly! But as I said to David on another
> thread this morning, it's better to worry on pgsql-hackers before any
> problem happens than to start worrying after something bad happens in
> a customer situation.

I'll give what you suggested a try. It seems reasonable enough.

--
nathan
On Sat, Nov 22, 2025 at 01:41:21PM -0600, Nathan Bossart wrote:
> I'll give what you suggested a try. It seems reasonable enough.

Here is a mostly-untested first try.

--
nathan
Attachments
On Mon, Nov 24, 2025 at 12:36:25PM -0600, Nathan Bossart wrote:
> On Sat, Nov 22, 2025 at 01:41:21PM -0600, Nathan Bossart wrote:
>> I'll give what you suggested a try. It seems reasonable enough.
>
> Here is a mostly-untested first try.

I think 0002 was more complicated than necessary. Here's a second try.

--
nathan
Attachments
On Mon, Nov 24, 2025 at 2:03 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I think 0002 was more complicated than necessary. Here's a second try.

I haven't done a full review of this and I'm not sure whether you want
me to spend more time on it, but something I notice about this version
is that the PG_CATCH() blocks only contain dshash_delete_entry() calls.
That seems good, because that means that all the other cleanup is being
handled by transaction abort (assuming that the patch isn't buggy,
which I haven't attempted to verify). However, it does make me wonder
if we could also find a way to postpone adding the dshash entries so
that we don't need to do anything in PG_CATCH() blocks at all. I'm
guessing that the reason why that doesn't easily work is because you're
relying on those locks to prevent multiple backends from doing the same
initialization?

--
Robert Haas
EDB: http://www.enterprisedb.com
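The shape Robert is describing is roughly the following
(initialize_entry is a hypothetical stand-in for the patch's per-entry
setup): the catch block does nothing but remove the registry entry so a
later backend can start over, while everything else is unwound by
normal transaction abort.

    DSMRegistryEntry *entry;
    bool        found;

    entry = dshash_find_or_insert(dsm_registry_table, name, &found);
    if (!found)
    {
        PG_TRY();
        {
            initialize_entry(entry);    /* may ERROR partway through */
        }
        PG_CATCH();
        {
            /* Leave no half-built entry behind; allow a later retry. */
            dshash_delete_entry(dsm_registry_table, entry);
            PG_RE_THROW();
        }
        PG_END_TRY();
    }
    dshash_release_lock(dsm_registry_table, entry);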
On Mon, Nov 24, 2025 at 04:02:18PM -0500, Robert Haas wrote:
> I haven't done a full review of this and I'm not sure whether you want
> me to spend more time on it...

I'd appreciate an eyeball check if you have the time.

> I'm guessing that the reason why that doesn't easily work is because
> you're relying on those locks to prevent multiple backends from doing
> the same initialization?

For GetNamedDSMSegment(), I bet we could avoid needing a PG_CATCH by
taking the DSMRegistryLock exclusively when accessing the registry. But
for GetNamedDSA() and GetNamedDSHash(), we want to keep the
half-initialized entry so that we don't leak LWLock tranche IDs. I
initially thought that might be okay, but if every backend is retrying,
you could quickly run out of tranche IDs.

--
nathan
On Mon, Nov 24, 2025 at 4:25 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Mon, Nov 24, 2025 at 04:02:18PM -0500, Robert Haas wrote:
> > I haven't done a full review of this and I'm not sure whether you
> > want me to spend more time on it...
>
> I'd appreciate an eyeball check if you have the time.

OK. I am not too familiar with the current code but I will have a look.

> > I'm guessing that the reason why that doesn't easily work is
> > because you're relying on those locks to prevent multiple backends
> > from doing the same initialization?
>
> For GetNamedDSMSegment(), I bet we could avoid needing a PG_CATCH by
> taking the DSMRegistryLock exclusively when accessing the registry.
> But for GetNamedDSA() and GetNamedDSHash(), we want to keep the
> half-initialized entry so that we don't leak LWLock tranche IDs. I
> initially thought that might be okay, but if every backend is
> retrying, you could quickly run out of tranche IDs.

I think that leaking tranche IDs is definitely a bad plan, even if it
didn't cause us to run out. Ideally, I think the design goal should be
to set up the structure of things so that we never need to do anything
beyond error cleanup if we fail. If we can't achieve that, well then
fine, but that is what I would view as most desirable. For instance, I
would be inclined to set up GetNamedDSA like this:

1. First, find or create the dshash entry. If that fails, we haven't
done anything yet, so no problem.

2. Next, if the dshash entry doesn't have an assigned tranche ID yet,
then try to assign one. This probably won't fail. But if it does, it
will produce a meaningful error, and every future attempt will likely
produce the same, still-meaningful error. I think this is a significant
improvement over having every attempt but the first return "requested
DSA \"%s\" failed initialization", which hides the real cause of
failure.

3. Next, if the segment doesn't yet have a DSA, then try to create one.
This could fail due to lack of memory; if so, future attempts might
succeed, or might continue to fail with out-of-memory errors. Assuming
that creating the DSA is successful, pin the DSA and store the handle
into the dshash entry. On the other hand, if we don't create a DSA here
because one already exists, try to attach to it. That could also fail,
but if it does, we can still retry later.

4. Pin the mapping for the DSA that we created or attached to in the
previous step.

5. dshash_release_lock.

In other words, I'd try to create a system that always attempts to make
forward progress, but when that's not possible, fails without
permanently leaking any resources or spoiling anything for the next
backend that wants to make an attempt. One worry that we've mentioned
previously is that this might lead to every backend retrying. But as
the code is currently written, that basically happens anyway, except
all of them but one produce a generic, uninformative error. Now, if we
do what I'm proposing here, we'll repeatedly try to create DSM, DSA,
and DSH objects in a way that actually allocates memory, so the retries
get a little bit more expensive. At the moment, I'm inclined to believe
this doesn't really matter. If it's true that failures are very rare,
and I suspect it is, then it doesn't really matter if the cost of
retries goes up a bit. If hypothetically they are common enough to
matter, then we definitely need a mechanism that can't get permanently
stuck in a half-initialized state.

If we were to find that retrying too many times too quickly creates
other problems, my suggested response would be to add a timestamp to
the dshash entry and limit retries to once per minute or something.
However, in the absence of evidence that we need such a mechanism, I'd
be inclined to guess that we don't.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
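A condensed sketch of those five steps might look like this. The
variable and field names, the zero/DSA_HANDLE_INVALID "unset"
sentinels, and the zero-argument LWLockNewTrancheId() call (its
signature may differ across branches) are illustrative assumptions,
not the committed code:

    DSMRegistryEntry *entry;
    dsa_area   *dsa;
    bool        found;

    /* Step 1: find or create the registry entry, exclusively locked. */
    entry = dshash_find_or_insert(dsm_registry_table, name, &found);

    /* Step 2: assign a tranche ID once; kept even if a later step fails. */
    if (entry->tranche == 0)
        entry->tranche = LWLockNewTrancheId();

    /* Step 3: create the DSA on first success, else attach to it. */
    if (entry->handle == DSA_HANDLE_INVALID)
    {
        dsa = dsa_create(entry->tranche);   /* may ERROR (e.g., OOM) */
        dsa_pin(dsa);                       /* keep it past last detach */
        entry->handle = dsa_get_handle(dsa);
    }
    else
        dsa = dsa_attach(entry->handle);    /* may ERROR; retryable later */

    /* Step 4: keep our mapping for the rest of this backend's life. */
    dsa_pin_mapping(dsa);

    /* Step 5: done; let other backends in. */
    dshash_release_lock(dsm_registry_table, entry);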
On Tue, Nov 25, 2025 at 10:37:50AM -0500, Robert Haas wrote:
> 1. First, find or create the dshash entry. If that fails, we haven't
> done anything yet, so no problem.
>
> 2. Next, if the dshash entry doesn't have an assigned tranche ID yet,
> then try to assign one. This probably won't fail. But if it does, it
> will produce a meaningful error, and every future attempt will likely
> produce the same, still-meaningful error. I think this is a
> significant improvement over having every attempt but the first
> return "requested DSA \"%s\" failed initialization", which hides the
> real cause of failure.
>
> 3. Next, if the segment doesn't yet have a DSA, then try to create
> one. This could fail due to lack of memory; if so, future attempts
> might succeed, or might continue to fail with out-of-memory errors.
> Assuming that creating the DSA is successful, pin the DSA and store
> the handle into the dshash entry. On the other hand, if we don't
> create a DSA here because one already exists, try to attach to it.
> That could also fail, but if it does, we can still retry later.
>
> 4. Pin the mapping for the DSA that we created or attached to in the
> previous step.
>
> 5. dshash_release_lock.

I think the only difference between this and 0002 is in step 2. In
0002, if allocating a tranche fails, we remove the entry from the
registry so that another backend can try again later. Otherwise, we
retain the entry with just the tranche ID set, even if later setup
fails. I believe that achieves basically the same thing, and is perhaps
a tad simpler to code.

> In other words, I'd try to create a system that always attempts to
> make forward progress, but when that's not possible, fails without
> permanently leaking any resources or spoiling anything for the next
> backend that wants to make an attempt. One worry that we've mentioned
> previously is that this might lead to every backend retrying. But as
> the code is currently written, that basically happens anyway, except
> all of them but one produce a generic, uninformative error. Now, if
> we do what I'm proposing here, we'll repeatedly try to create DSM,
> DSA, and DSH objects in a way that actually allocates memory, so the
> retries get a little bit more expensive. At the moment, I'm inclined
> to believe this doesn't really matter. If it's true that failures are
> very rare, and I suspect it is, then it doesn't really matter if the
> cost of retries goes up a bit. If hypothetically they are common
> enough to matter, then we definitely need a mechanism that can't get
> permanently stuck in a half-initialized state. If we were to find
> that retrying too many times too quickly creates other problems, my
> suggested response would be to add a timestamp to the dshash entry
> and limit retries to once per minute or something. However, in the
> absence of evidence that we need such a mechanism, I'd be inclined to
> guess that we don't.
>
> Thoughts?

Yeah, IMHO this is fine. I expect most of these ERRORs to be reached
during extension development or out-of-memory scenarios, and I don't
see how slightly more expensive retry logic would hurt anything there.

--
nathan
On Tue, Nov 25, 2025 at 10:50 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I think the only difference between this and 0002 is in step 2. In
> 0002, if allocating a tranche fails, we remove the entry from the
> registry so that another backend can try again later. Otherwise, we
> retain the entry with just the tranche ID set, even if later setup
> fails. I believe that achieves basically the same thing, and is
> perhaps a tad simpler to code.
>
> [...] I don't see how slightly more expensive retry logic would hurt
> anything there.

The downside is that then we have to rely on PG_CATCH() to make things
whole. I am not sure that there's any problem with that, but I'm also
not sure that there isn't. The timing of PG_CATCH() block execution
often creates bugs, because it runs before (sub)transaction abort. That
means that there's a real risk that you try to acquire an LWLock you
already hold, for example. It's a lot easier to be confident that
cleanup actions will reliably succeed when they run inside the
transaction abort path that knows the order in which various resources
should be released.

Generally, I think PG_CATCH() is fine if you're releasing a resource
that is decoupled from everything else, like using a third-party
library's free function to free memory allocated by that library. But
if you're releasing PostgreSQL resources that are layered on top of
other PostgreSQL resources, like a DSA that depends on DSM and LWLock,
I think it's a lot more difficult to be certain that you aren't going
to end up trying to release the same stuff multiple times or in the
wrong order. I'm not saying you can't make it work, but I've banged my
head on this particular doorframe enough times that my reflex is to
duck.

--
Robert Haas
EDB: http://www.enterprisedb.com
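A minimal illustration of the lock hazard, as deliberately wrong code
(initialize_entry is again a hypothetical name): because PG_CATCH()
runs before transaction abort releases LWLocks, cleanup that
re-acquires a lock the failing path still holds will self-deadlock,
since LWLocks are not re-entrant.

    LWLockAcquire(DSMRegistryLock, LW_EXCLUSIVE);
    PG_TRY();
    {
        initialize_entry(entry);    /* ERRORs while the lock is held */
    }
    PG_CATCH();
    {
        /*
         * WRONG: transaction abort has not run yet, so this backend
         * still holds DSMRegistryLock; re-acquiring it blocks forever.
         */
        LWLockAcquire(DSMRegistryLock, LW_EXCLUSIVE);
        PG_RE_THROW();
    }
    PG_END_TRY();
    LWLockRelease(DSMRegistryLock);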
On Tue, Nov 25, 2025 at 11:25:09AM -0500, Robert Haas wrote:
> The downside is that then we have to rely on PG_CATCH() to make things
> whole. I am not sure that there's any problem with that, but I'm also
> not sure that there isn't. The timing of PG_CATCH() block execution
> often creates bugs, because it runs before (sub)transaction abort.
> That means that there's a real risk that you try to acquire an LWLock
> you already hold, for example. It's a lot easier to be confident that
> cleanup actions will reliably succeed when they run inside the
> transaction abort path that knows the order in which various resources
> should be released. Generally, I think PG_CATCH() is fine if you're
> releasing a resource that is decoupled from everything else, like
> using a third-party library's free function to free memory allocated
> by that library. But if you're releasing PostgreSQL resources that are
> layered on top of other PostgreSQL resources, like a DSA that depends
> on DSM and LWLock, I think it's a lot more difficult to be certain
> that you aren't going to end up trying to release the same stuff
> multiple times or in the wrong order. I'm not saying you can't make it
> work, but I've banged my head on this particular doorframe enough
> times that my reflex is to duck.

I gave your idea a try, and I like the result a lot. IMHO it makes the
code much easier to reason about.

--
nathan
Attachments
On Tue, Nov 25, 2025 at 12:45 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I gave your idea a try, and I like the result a lot. IMHO it makes the
> code much easier to reason about.

I looked through this and I agree this looks good. Thanks for working
on it.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 25, 2025 at 03:58:47PM -0500, Robert Haas wrote:
> On Tue, Nov 25, 2025 at 12:45 PM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> I gave your idea a try, and I like the result a lot. IMHO it makes
>> the code much easier to reason about.
>
> I looked through this and I agree this looks good. Thanks for working
> on it.

My tests seem happy, so I will plan on committing these patches
tomorrow. The only difference in v4 is that
pg_get_dsm_registry_allocations() will no longer show
partially-initialized entries. I thought that was the best option in
this case because such entries shouldn't have any allocated memory, and
retrying GetNamed*() would set *found to false.

--
nathan
Attachments
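The v4 view change presumably reduces to a filter along these lines in
the function's scan loop (a sketch with assumed field names and a
hypothetical tuple-building helper, not the actual patch):

    dshash_seq_status status;
    DSMRegistryEntry *entry;

    dshash_seq_init(&status, dsm_registry_table, false);
    while ((entry = dshash_seq_next(&status)) != NULL)
    {
        /*
         * Hide entries whose initialization never completed; they
         * should own no memory, and a retried GetNamed*() call will
         * report *found = false for them anyway.
         */
        if (!entry->initialized)
            continue;

        emit_tuple_for_entry(entry);    /* hypothetical helper */
    }
    dshash_seq_term(&status);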
On Tue, Nov 25, 2025 at 4:16 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> My tests seem happy, so I will plan on committing these patches
> tomorrow. The only difference in v4 is that
> pg_get_dsm_registry_allocations() will no longer show
> partially-initialized entries. I thought that was the best option in
> this case because such entries shouldn't have any allocated memory,
> and retrying GetNamed*() would set *found to false.

Without actually opening the new patch files, that seems OK to me. I
think the important thing about such a view is that it shouldn't make
any state changes; it should just show how things are. In my ideal
world, it would probably show partially-initialized entries in some
distinguishable way, like with a null size. But leaving them out is
also defensible. The thing that I think really matters for failure
cases is the ability to retry whatever's gone wrong, either to get it
fixed or to see the error, and the other changes already accomplish
that goal, so I'm happy.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 25, 2025 at 07:13:03PM -0500, Robert Haas wrote:
> On Tue, Nov 25, 2025 at 4:16 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> My tests seem happy, so I will plan on committing these patches
>> tomorrow. The only difference in v4 is that
>> pg_get_dsm_registry_allocations() will no longer show
>> partially-initialized entries. I thought that was the best option in
>> this case because such entries shouldn't have any allocated memory,
>> and retrying GetNamed*() would set *found to false.
>
> Without actually opening the new patch files, that seems OK to me. I
> think the important thing about such a view is that it shouldn't make
> any state changes; it should just show how things are. In my ideal
> world, it would probably show partially-initialized entries in some
> distinguishable way, like with a null size. But leaving them out is
> also defensible. The thing that I think really matters for failure
> cases is the ability to retry whatever's gone wrong, either to get it
> fixed or to see the error, and the other changes already accomplish
> that goal, so I'm happy.

Committed, thanks for reviewing.

--
nathan