Discussion: [PATCH] Fix orphaned backend processes on Windows using Job Objects
Greetings,

When the postmaster exits unexpectedly on Windows (crash, kill, debugger abort), backend processes continue running. Windows lacks any equivalent to Unix's getppid() orphan detection. These orphaned backends hold locks and shared memory, preventing clean restart. This leads to a delay in restarts and manual killing of orphans.

The problem is easy to reproduce. Start postgres, open a transaction with LOCK TABLE, then kill the postmaster with taskkill /F. The backend continues running and restart fails. Manual cleanup is required.

Current approaches (inherited event handles, shared memory flags) depend on the postmaster running code during exit. A segfault or kill bypasses all of that.

My proposed solution is to use Windows Job Objects with KILL_ON_JOB_CLOSE. We just need to call CreateJobObject() in PostmasterMain(), configure with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster. Children inherit membership automatically. When the job handle closes on postmaster exit, the kernel terminates all children atomically. This is kernel-enforced with no polling and no race conditions.

Job creation can fail if postgres runs under an existing job (service managers, debuggers). Windows 7 disallows nested jobs. We detect this with IsProcessInJob(), and if AssignProcessToJobObject() returns ERROR_ACCESS_DENIED, we log and continue without orphan protection.

KILL_ON_JOB_CLOSE doesn't interfere with clean shutdown. Normal shutdown signals backends via SetEvent, they exit, postmaster exits, job closes. Nothing left to kill. The flag only fires during crashes when backends are still running - exactly when forced termination is correct.

The code is ~200 lines in pg_job_object.c, less than win32/signal.c (~500 lines). It fails gracefully and works regardless of how postgres is started, unlike service manager approaches. This avoids polling unreliability.

The patch has been tested on Windows 10/11 with both MSVC and MinGW builds. Nested jobs fail gracefully as expected. Clean shutdown is unaffected. Crash tests with taskkill /F, debugger abort, and access violations all correctly terminate children immediately with zero orphans.

This patch does not include automated tests because the core functionality (orphan prevention on crash) requires simulating process termination, which is difficult to test reliably in CI.

Patch attached. Can add documentation if this approach is approved.

Thoughts?

Bryan Green

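For readers following the thread, the setup described above (CreateJobObject(), JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, assigning the postmaster, logging and continuing on failure) might look roughly like the sketch below. This is not the code from the attached patch: the function name sketch_setup_postmaster_job and the JobSetupResult enum are hypothetical, logging goes to stderr rather than through ereport(), and the non-Windows branch exists only so the sketch compiles elsewhere.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical result codes for this sketch; not names from the patch. */
typedef enum
{
    JOB_SETUP_OK,           /* process is now in a kill-on-close job */
    JOB_SETUP_SKIPPED,      /* setup failed; running without protection */
    JOB_SETUP_UNSUPPORTED   /* not a Windows build */
} JobSetupResult;

JobSetupResult
sketch_setup_postmaster_job(void)
{
#ifdef _WIN32
    /* Static: the handle must stay open for the life of the process. */
    static HANDLE job = NULL;
    JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
    BOOL        in_job = FALSE;

    /* Passing NULL as the job handle asks "is this process in any job?" */
    if (IsProcessInJob(GetCurrentProcess(), NULL, &in_job) && in_job)
        fprintf(stderr, "already in a job; assignment may be refused\n");

    /*
     * NULL security attributes make the handle non-inheritable, so only
     * this process holds it and it closes when this process exits.
     */
    job = CreateJobObject(NULL, NULL);
    if (job == NULL)
        return JOB_SETUP_SKIPPED;

    /* KILL_ON_JOB_CLOSE is set through the extended limit structure. */
    ZeroMemory(&limits, sizeof(limits));
    limits.BasicLimitInformation.LimitFlags =
        JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;

    if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                 &limits, sizeof(limits)) ||
        !AssignProcessToJobObject(job, GetCurrentProcess()))
    {
        /* e.g. ERROR_ACCESS_DENIED when nesting is not allowed */
        fprintf(stderr,
                "job setup failed (error %lu); no orphan protection\n",
                (unsigned long) GetLastError());
        CloseHandle(job);   /* job is empty here, so nothing is killed */
        job = NULL;
        return JOB_SETUP_SKIPPED;
    }

    return JOB_SETUP_OK;
#else
    return JOB_SETUP_UNSUPPORTED;
#endif
}
```

The handle is deliberately never closed on the success path: the whole point is that the kernel closes it when the process goes away, however that happens. Processes started afterwards join the job by default, which is what "children inherit membership automatically" refers to.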
Hi,

On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
> We just need to call CreateJobObject() in PostmasterMain(), configure
> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
> Children inherit membership automatically. When the job handle closes on
> postmaster exit, the kernel terminates all children atomically. This is
> kernel-enforced with no polling and no race conditions.

What happens if a postmaster child exits irregularly? Is postmaster terminated
as well?

> The patch has been tested on Windows 10/11 with both MSVC and MinGW
> builds. Nested jobs fail gracefully as expected. Clean shutdown is
> unaffected. Crash tests with taskkill /F, debugger abort, and access
> violations all correctly terminate children immediately with zero orphans.
>
> This patch does not include automated tests because the core
> functionality (orphan prevention on crash) requires simulating process
> termination, which is difficult to test reliably in CI.

Why is it difficult to test in CI? We do some related tests in
013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
tests for postmaster?

Greetings,

Andres Freund

On 11/3/2025 9:19 AM, Andres Freund wrote:
> Hi,
>
> On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
>> We just need to call CreateJobObject() in PostmasterMain(), configure
>> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
>> Children inherit membership automatically. When the job handle closes on
>> postmaster exit, the kernel terminates all children atomically. This is
>> kernel-enforced with no polling and no race conditions.
>
> What happens if a postmaster child exits irregularly? Is postmaster terminated
> as well?
>

No, Job Objects are unidirectional. KILL_ON_JOB_CLOSE only acts when the postmaster (which holds the job handle) exits. Backend crashes are handled through PostgreSQL's existing crash recovery mechanism - the postmaster detects the crash via WaitForMultipleObjects() and initiates recovery as normal.

The Job Object only takes action when the job handle closes, which happens when the postmaster exits. It's analogous to a Unix process group - sending SIGTERM to the group leader kills the group, but children dying doesn't affect the parent.

>> The patch has been tested on Windows 10/11 with both MSVC and MinGW
>> builds. Nested jobs fail gracefully as expected. Clean shutdown is
>> unaffected. Crash tests with taskkill /F, debugger abort, and access
>> violations all correctly terminate children immediately with zero orphans.
>>
>> This patch does not include automated tests because the core
>> functionality (orphan prevention on crash) requires simulating process
>> termination, which is difficult to test reliably in CI.
>
> Why is it difficult to test in CI? We do some related tests in
> 013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
> tests for postmaster?
>

Fair point. I was hesitant because testing the actual orphan prevention requires killing the postmaster while backends are active, which seemed fragile. But you're right that we already test similar scenarios.

I can add a test to 013_crash_restart.pl (or a new Windows-specific test file) that:
1. Starts server with active backend
2. Kills postmaster ungracefully (taskkill /F)
3. Verifies backend process terminates automatically
4. Confirms clean restart

Would that be sufficient, or do you have other test scenarios in mind?

> Greetings,
>
> Andres Freund

On 2025-11-03 09:25:11 -0600, Bryan Green wrote:
> On 11/3/2025 9:19 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
> >> We just need to call CreateJobObject() in PostmasterMain(), configure
> >> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
> >> Children inherit membership automatically. When the job handle closes on
> >> postmaster exit, the kernel terminates all children atomically. This is
> >> kernel-enforced with no polling and no race conditions.
> >
> > What happens if a postmaster child exits irregularly? Is postmaster terminated
> > as well?
> >
>
> No, Job Objects are unidirectional.

Great.

> >> The patch has been tested on Windows 10/11 with both MSVC and MinGW
> >> builds. Nested jobs fail gracefully as expected. Clean shutdown is
> >> unaffected. Crash tests with taskkill /F, debugger abort, and access
> >> violations all correctly terminate children immediately with zero orphans.
> >>
> >> This patch does not include automated tests because the core
> >> functionality (orphan prevention on crash) requires simulating process
> >> termination, which is difficult to test reliably in CI.
> >
> > Why is it difficult to test in CI? We do some related tests in
> > 013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
> > tests for postmaster?
> >
>
> Fair point. I was hesitant because testing the actual orphan prevention
> requires killing the postmaster while backends are active, which seemed
> fragile. But you're right that we already test similar scenarios.
>
> I can add a test to 013_crash_restart.pl (or a new Windows-specific test
> file) that:
> 1. Starts server with active backend
> 2. Kills postmaster ungracefully (taskkill /F)
> 3. Verifies backend process terminates automatically
> 4. Confirms clean restart
>
> Would that be sufficient, or do you have other test scenarios in mind?

That's pretty much what I had in mind.

Greetings,

Andres Freund

On 11/3/2025 9:29 AM, Andres Freund wrote:
> On 2025-11-03 09:25:11 -0600, Bryan Green wrote:
>> On 11/3/2025 9:19 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
>>>> We just need to call CreateJobObject() in PostmasterMain(), configure
>>>> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
>>>> Children inherit membership automatically. When the job handle closes on
>>>> postmaster exit, the kernel terminates all children atomically. This is
>>>> kernel-enforced with no polling and no race conditions.
>>>
>>> What happens if a postmaster child exits irregularly? Is postmaster terminated
>>> as well?
>>>
>>
>> No, Job Objects are unidirectional.
>
> Great.
>
>
>>>> The patch has been tested on Windows 10/11 with both MSVC and MinGW
>>>> builds. Nested jobs fail gracefully as expected. Clean shutdown is
>>>> unaffected. Crash tests with taskkill /F, debugger abort, and access
>>>> violations all correctly terminate children immediately with zero orphans.
>>>>
>>>> This patch does not include automated tests because the core
>>>> functionality (orphan prevention on crash) requires simulating process
>>>> termination, which is difficult to test reliably in CI.
>>>
>>> Why is it difficult to test in CI? We do some related tests in
>>> 013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
>>> tests for postmaster?
>>>
>>
>> Fair point. I was hesitant because testing the actual orphan prevention
>> requires killing the postmaster while backends are active, which seemed
>> fragile. But you're right that we already test similar scenarios.
>>
>> I can add a test to 013_crash_restart.pl (or a new Windows-specific test
>> file) that:
>> 1. Starts server with active backend
>> 2. Kills postmaster ungracefully (taskkill /F)
>> 3. Verifies backend process terminates automatically
>> 4. Confirms clean restart
>>
>> Would that be sufficient, or do you have other test scenarios in mind?
>
> That's pretty much what I had in mind.
>
> Greetings,
>
> Andres Freund

I've implemented the test in 013_crash_restart.pl. The test passes on Windows 10/11 with both MSVC and MinGW builds. Backends are typically terminated within 100-200ms after postmaster kill, confirming the Job Object KILL_ON_JOB_CLOSE mechanism works as intended.

Updated patch (v2) attached.

--
Bryan