Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
От | Tom Lane |
---|---|
Тема | Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f |
Дата | |
Msg-id | 926839.1642708567@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f (Andres Freund <andres@anarazel.de>) |
Ответы |
Re: pgsql: TAP tests: check for postmaster.pid anyway when "pg_ctl start" f
|
Список | pgsql-committers |
Andres Freund <andres@anarazel.de> writes: > IOW, the test doesn't actually quite seem to be working in the "local" case? > Hence not seeing the problem of picking up some wrong pid. It looks like there are two different failure cases: 1. The symptom shown by skink: # pg_ctl start failed; logfile: ... 2022-01-20 12:36:22.021 UTC [4023581][postmaster][:0] FATAL: pre-existing shared memory block (key 16253309, ID 1602617349)is still in use 2022-01-20 12:36:22.021 UTC [4023581][postmaster][:0] HINT: Terminate any old server processes associated with data directory"/mnt/resource/bf/build/skink-master/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata". I can reproduce this locally by using a valgrind wrapper around the postmaster. What is evidently happening is that some postmaster child is slow to exit after the postmaster is kill -9'd (thanks to being valgrind'd) and so the would-be new postmaster sees the shmem block as still active. 2. The symptom shown on Noah's AIX flotilla: # pg_ctl start failed; logfile: ... 2022-01-20 03:08:56.200 UTC [11796728:1] FATAL: lock file "postmaster.pid" already exists 2022-01-20 03:08:56.200 UTC [11796728:2] HINT: Is another postmaster (PID 6750300) running in data directory "/home/nm/farm/xlc64/HEAD/pgsql.build/src/test/recovery/tmp_check/t_017_shm_gnat_data/pgdata"? Looking at the code in miscinit.c that emits that message, it seems that kill(6750300, 0) must have succeeded, even though nontrivial time has passed since we did the kill -9. Alternatively, AIX returned some errno other than the two we ignore --- but I checked its manpages and it alleges just the same set of error conditions as Linux. The first of these scenarios could be dealt with easily if we make Cluster::start validate the PID it gets out of the pidfile. The second one seems problematic though: if the killed postmaster is not-dead-yet when the replacement postmaster looks, it might still be that way when Cluster::start checks, a few microseconds later, so that we'd "validate" a PID that's not going to be around much longer. What I'm thinking of doing is inventing a "soft_stop" variant of Cluster::stop that won't complain if pg_ctl stop fails, and then having 017_shm's poll_start() call that before retrying the start call. In this way, if we did manage to start a postmaster then we'll shut it down before trying again, while if we didn't successfully start one then soft_stop will end with _pid unset, allowing Cluster::start to succeed. It seems like it'd also be a good idea if the END block used soft_stop instead of regular stop. As is, if we have multiple postmasters and the first one gives us trouble (say it's dead already), I think we'll likely fail to stop the rest. Not sure what happens if we BAIL_OUT from an END block, but it seems unlikely to be good. regards, tom lane
В списке pgsql-committers по дате отправления: