Обсуждение: Replace O_EXCL with O_TRUNC for creation of state.tmp in SaveSlotToPath
Hi,
We have encountered a few instances where logical replication errors out during SaveSlotToPath() after creating the state.tmp file, but before it was renamed (due to ENOSPC, for example). In these cases, since state.tmp is not cleaned up and is created with the O_EXCL flag, further invocations of SaveSlotToPath() for this slot will error out on OpenTransientFile() with EEXIST, completely blocking slot metadata persistence. The only explicit cleanup for state.tmp occurs during server startup as part of RestoreSlotFromDisk().
It doesn't seem that this function relies on data written to state.tmp previously, so O_EXCL is unnecessary. Attaching a patch that swaps O_EXCL for O_TRUNC, ensuring a fresh state.tmp is available for writing.
Thanks,
Kevin
We have encountered a few instances where logical replication errors out during SaveSlotToPath() after creating the state.tmp file, but before it was renamed (due to ENOSPC, for example). In these cases, since state.tmp is not cleaned up and is created with the O_EXCL flag, further invocations of SaveSlotToPath() for this slot will error out on OpenTransientFile() with EEXIST, completely blocking slot metadata persistence. The only explicit cleanup for state.tmp occurs during server startup as part of RestoreSlotFromDisk().
It doesn't seem that this function relies on data written to state.tmp previously, so O_EXCL is unnecessary. Attaching a patch that swaps O_EXCL for O_TRUNC, ensuring a fresh state.tmp is available for writing.
Thanks,
Kevin
Вложения
Re: Replace O_EXCL with O_TRUNC for creation of state.tmp in SaveSlotToPath
От
Michael Paquier
Дата:
On Tue, Sep 30, 2025 at 05:21:05PM +0530, Kevin K Biju wrote: > We have encountered a few instances where logical replication errors out > during SaveSlotToPath() after creating the state.tmp file, but before it > was renamed (due to ENOSPC, for example). In these cases, since state.tmp > is not cleaned up and is created with the O_EXCL flag, further invocations > of SaveSlotToPath() for this slot will error out on OpenTransientFile() > with EEXIST, completely blocking slot metadata persistence. The only > explicit cleanup for state.tmp occurs during server startup as part of > RestoreSlotFromDisk(). Ah, you are referring to the window between a CloseTransientFile() completing and the rename(). It's not the first time this report pops up. I have found two references, for the same error as yours, with one referring to a discussion about O_EXCL vs O_TRUNC: https://www.postgresql.org/message-id/08bbfab1-a61d-3750-fc18-4ab2c1aa7f09@postgrespro.ru https://www.postgresql.org/message-id/3559061693910326@qy4q4a6esb2lebnz.sas.yp-c.yandex.net > It doesn't seem that this function relies on data written to state.tmp > previously, so O_EXCL is unnecessary. Attaching a patch that swaps O_EXCL > for O_TRUNC, ensuring a fresh state.tmp is available for writing. Using O_TRUNC has been discussed and discarded because O_EXCL is more protective in this specific code path, see the argument here: https://www.postgresql.org/message-id/20191202161222.sazl2omhhk5pl3nl@alap3.anarazel.de An alternative fix that we can do here instead is to unlink() the temporary file when reaching on these error code paths, allowing future accesses to work correctly. This was suggested as a second solution, other than the O_TRUNC objected to. One thing is to make sure that the unlinks are done while holding the lwlock for the IO in progress. So, something like the attached should also solve your problem. Any thoughts or comments from the others? I'd like to backpatch that all the way down, 6 years too late. But later is better than never, right? -- Michael
Вложения
Re: Replace O_EXCL with O_TRUNC for creation of state.tmp in SaveSlotToPath
От
Michael Paquier
Дата:
On Thu, Oct 09, 2025 at 05:02:12PM +0900, Michael Paquier wrote: > An alternative fix that we can do here instead is to unlink() the > temporary file when reaching on these error code paths, allowing > future accesses to work correctly. This was suggested as a second > solution, other than the O_TRUNC objected to. One thing is to make > sure that the unlinks are done while holding the lwlock for the IO in > progress. So, something like the attached should also solve your > problem. I have been playing a bit more with that, and applied 912af1c7e9c9 to do the unlink() calls down to v13. While on it, I have also played with hard crashes timed while we are in the middle of SaveSlotToPath() with state.tmp still around at restart (injection point wait just before the rename(), for example), and double-checked that recovery is able to do a correct cleanup job. I didn't fully recall this last part, as it has been a couple of years since the last report. There may be an argument for having an automated test, like: - Physical slot creation. - Use a bit the slot. - Injection point wait before the rename() of SaveSlotToPath(). - Checkpoint, that would not be able to finish. - Hard crash. - Restart, check that the slot state is correct after recovery. However, I am not sure that this would be worth the cycles spent on. There are a lot more interesting scenarios for write failures in the tree than this one. -- Michael
Вложения
Hi Michael,
Thanks for the context. I missed the earlier discussions about the same issue. Using unlink in the error paths makes sense to me. There is an edge case in my mind, in case unlink fails as well, and we end up in the same condition; however, the chance of that occurring is sufficiently low.
It looks like this change was already committed in https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0adf424b490b701ae8eeeca3d9630773f18cd937. Thanks for the quick response!
Kevin
On Fri, Oct 10, 2025 at 6:22 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Oct 09, 2025 at 05:02:12PM +0900, Michael Paquier wrote:
> An alternative fix that we can do here instead is to unlink() the
> temporary file when reaching on these error code paths, allowing
> future accesses to work correctly. This was suggested as a second
> solution, other than the O_TRUNC objected to. One thing is to make
> sure that the unlinks are done while holding the lwlock for the IO in
> progress. So, something like the attached should also solve your
> problem.
I have been playing a bit more with that, and applied 912af1c7e9c9 to
do the unlink() calls down to v13. While on it, I have also played
with hard crashes timed while we are in the middle of SaveSlotToPath()
with state.tmp still around at restart (injection point wait just
before the rename(), for example), and double-checked that recovery is
able to do a correct cleanup job. I didn't fully recall this last
part, as it has been a couple of years since the last report.
There may be an argument for having an automated test, like:
- Physical slot creation.
- Use a bit the slot.
- Injection point wait before the rename() of SaveSlotToPath().
- Checkpoint, that would not be able to finish.
- Hard crash.
- Restart, check that the slot state is correct after recovery.
However, I am not sure that this would be worth the cycles spent on.
There are a lot more interesting scenarios for write failures in the
tree than this one.
--
Michael
Re: Replace O_EXCL with O_TRUNC for creation of state.tmp in SaveSlotToPath
От
Michael Paquier
Дата:
On Wed, Oct 15, 2025 at 07:30:39PM +0530, Kevin K Biju wrote: > Thanks for the context. I missed the earlier discussions about the same > issue. Using unlink in the error paths makes sense to me. There is an edge > case in my mind, in case unlink fails as well, and we end up in the same > condition; however, the chance of that occurring is sufficiently low. My suspicion is that an unlink() failure would link to much more problems, including what could be crashes in critical sections. That would mean a crash, where the state.tmp file would get cleaned up by recovery. What we have now in the tree should be (I hope!) a good balance. That was 6 years ago, and it is very easy to miss, so no worries. Even in my case, I've recalled the previous discussion after one night of sleep, only because I've participated in it and because the automated truncation you have suggested was itching me. Somewhat. > Thanks for the quick response! No problem. -- Michael