Обсуждение: Non-blocking archiver process

Поиск
Список
Период
Сортировка

Non-blocking archiver process

От
Ronan Dunklau
Дата:
Hello,

We've noticed a behavior that seems surprising to us. 
Since DROP DATABASE now waits for a ProcSignalBarrier, it can hang up 
indefinitely if the archive_command hangs. 

The reason for this is that the builtin archive module doesn't process any 
interrupts while the archiving command is running, as it's run with a system() 
call, blocking undefintely. 

Before rushing on to implement a non-blocking archive library (perhaps using 
popen or posix_spawn, while keeping other systems in mind), what unintended 
consequences would it have to actually run the archive_command in a non-
blocking way, and checking interrupts while it runs ?

Thanks !

--
Ronan Dunklau





Re: Non-blocking archiver process

От
Noah Misch
Дата:
On Fri, Jul 04, 2025 at 08:46:08AM +0200, Ronan Dunklau wrote:
> We've noticed a behavior that seems surprising to us. 
> Since DROP DATABASE now waits for a ProcSignalBarrier, it can hang up 
> indefinitely if the archive_command hangs. 
> 
> The reason for this is that the builtin archive module doesn't process any 
> interrupts while the archiving command is running, as it's run with a system() 
> call, blocking undefintely. 
> 
> Before rushing on to implement a non-blocking archive library (perhaps using 
> popen or posix_spawn, while keeping other systems in mind), what unintended 
> consequences would it have to actually run the archive_command in a non-
> blocking way, and checking interrupts while it runs ?

I can't think of any unintended consequences.  I think we just missed this
when adding the first use of ProcSignalBarrier (v15).  Making this easier to
miss, archiver spent most of its history not connecting to shared memory.  Its
shared memory connection appeared in v14.



Re: Non-blocking archiver process

От
Patrick Stählin
Дата:
On 05.07.25 05:01, Noah Misch wrote:
> On Fri, Jul 04, 2025 at 08:46:08AM +0200, Ronan Dunklau wrote:
>> We've noticed a behavior that seems surprising to us.
>> Since DROP DATABASE now waits for a ProcSignalBarrier, it can hang up
>> indefinitely if the archive_command hangs.
>>
>> The reason for this is that the builtin archive module doesn't process any
>> interrupts while the archiving command is running, as it's run with a system()
>> call, blocking undefintely.
>>
>> Before rushing on to implement a non-blocking archive library (perhaps using
>> popen or posix_spawn, while keeping other systems in mind), what unintended
>> consequences would it have to actually run the archive_command in a non-
>> blocking way, and checking interrupts while it runs ?
> 
> I can't think of any unintended consequences.  I think we just missed this
> when adding the first use of ProcSignalBarrier (v15).  Making this easier to
> miss, archiver spent most of its history not connecting to shared memory.  Its
> shared memory connection appeared in v14.

I've taken some time, mostly for WIN32, to implement an interruptible 
version of archive_command. My WIN32 days are long behind me, so it's 
quite possible that this has some faults I'm not seeing. Then again, it 
passes CI.
I failed to make it work in WIN32 with popen since the handles it 
returns can't be made non-blocking so this change is a bit bigger.

@Ronan: Let me now if you'd like to be attributed more, I took some 
inspiration from a private repos with your prototype.

I don't know if I should add that to the running commitfest for PG19 or 
if this is something that would need to be backported. Just let me know.

Thanks,
Patrick
Вложения

Re: Non-blocking archiver process

От
Artem Gavrilov
Дата:
Hello Patrick

I did a review of your patch. 

Initial Run
===========
The patch applies cleanly to HEAD (196063d6761). All tests successfully pass on MacOS 15.7.

Comments
===========
1) Instead of `malloc` and `free` it should be `palloc` and `pfree`.

2) In fact `archiveFileno` is a file descriptor, and the first variable one is not. I think it's worth to rename them to `archiveFile` and `archiveFd`. Or maybe even just `file` and `fd` as the scope there is not so big.
FILE   *archiveFd = NULL;
int archiveFileno;

3)  Variable name `bytesRead` is rare in PG code base. It is used only two times, while `readBytes` is used four times. Other variants, like `nbytes` are more common. Let's pick some popular name.

4) Variable name `dwRc` is unique for the PG codebase and not really clear as for me. How about name it just `exitcode`?

5) `return` is redundant here as it will never be reached. 

ereport(FATAL,
        (errmsg_internal("Failed to malloc win32cmd %m")));
return false;

6) Same here, `free` and `return` are unreachable due ereport with fatal level.
ereport(FATAL,
        (errmsg("CreateProcess() call failed: %m (error code %lu)",
                GetLastError())));
free(win32cmd);
return false;

7) This loop can be stuck forever as `WaitForSingleObject` may return `WAIT_FAILEDand it's not always retriable.
while (true)
{
    CHECK_FOR_INTERRUPTS();
    if (WaitForSingleObject(pi.hProcess, POLL_TIMEOUT_MSEC) == WAIT_OBJECT_0)
        break;
}
--

Artem Gavrilov
Senior Software Engineer, Percona

artem.gavrilov@percona.com