Thread: auto removing stale pid for postmaster NT service
Hi,

I've seen this question a couple of times in the archives, but I wasn't able to find a solution. Please advise if you know of a workaround.

I have postmaster installed as a service through cygrunsrv on my win2k machine. The postmaster service starts and stops fine most of the time. But if the server crashes without a proper shutdown, postmaster.pid is left behind and the postmaster service fails to start at the next boot.

Is there a way to delete a stale postmaster.pid on boot-up, before the system attempts to start the postmaster service?

Thanks
-Tony
Tony_Chao@putnam.com writes:
> I have postmaster installed as a service through cygrunsrv on my win2k
> machine. The postmaster service starts and stops fine most of the time.
> But if the server crashes without a proper shutdown, the postmaster.pid
> is left behind and the postmaster service fails to start at the next boot.

It should manage to start anyway --- why exactly does it refuse to start?

regards, tom lane
On Mon, 16 Sep 2002 09:23:49 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
TL> > But if the server crashes without a proper shutdown, the postmaster.pid
TL> > is left behind and the postmaster service fails to start at the next boot.
TL>
TL> It should manage to start anyway --- why exactly does it refuse to
TL> start?

It happens on Linux as well: if there's a stale file at boot, it refuses to start, saying that it's already running.

--
Simone Tellini
E-mail: tellini@areabusiness.it
http://www.areabusiness.it
On Mon, Sep 16, 2002 at 04:56:26PM +0200, Simone Tellini wrote:
>
> it happens on linux as well: if there's a stale file at boot, it refuses
> to start saying that it's already running.

Not exactly. If there is a stale pid file, it looks to see if a process with that pid exists. _Then_ it refuses to start.

This is because there is a process with the same pid as the postmaster. This will happen in cases where the machine crashes and starts up again -- something else happens to get the (former) postgres pid at startup, and so when postgres checks for a process with that pid, one exists. And kerplooey.

I seem to recall that someone (maybe Tom Lane?) suggested an extension to the current pidfile check, so that it will also check to see if the process really is PostgreSQL. But I don't know if it was implemented.

A

--
----
Andrew Sullivan
Liberty RMS
<andrew@libertyrms.info>
204-4141 Yonge Street
Toronto, Ontario Canada M2P 2A8
+1 416 646 3304 x110
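The check described in this message can be sketched roughly as follows. This is a minimal illustration, not the postmaster's actual code: the helper names `read_pid` and `pid_in_use` are hypothetical, the sketch assumes the PID is on the first line of the file, and it is POSIX-only (on Windows, `os.kill` behaves differently and must not be used this way).

```python
import errno
import os


def read_pid(pidfile_path):
    """Return the PID from the first line of the file, or None if unusable."""
    try:
        with open(pidfile_path) as f:
            pid = int(f.readline().strip())
    except (OSError, ValueError):
        return None
    return pid if pid > 0 else None


def pid_in_use(pid):
    """True if some process, of any kind, currently holds this PID."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not a PID
    try:
        os.kill(pid, 0)  # signal 0 only probes; nothing is delivered
    except OSError as e:
        # ESRCH: no such process. Anything else (e.g. EPERM) means it exists.
        return e.errno != errno.ESRCH
    return True
```

Note that `pid_in_use` cannot tell what kind of process holds the PID, which is exactly why an unrelated process that picked up the old PID after a reboot blocks startup.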
Andrew Sullivan <andrew@libertyrms.info> writes:
> This is because there is a process with the same pid as the
> postmaster. This will happen in cases where the machine crashes and
> starts up again -- something else happens to get the (former)
> postgres pid at startup, and so when postgres checks for a process
> with that pid, one exists. And kerplooey.

FYI, sendmail has the same restart failure mode; I imagine a lot of other Unix daemons do too.

> I seem to recall that someone (maybe Tom Lane?) suggested an
> extension to the current pidfile check, so that it will also check to
> see if the process really is PostgreSQL. But I don't know if it was
> implemented.

It wasn't yet, mainly because it's not obvious how to tell reliably whether some other process is a postmaster or not. I think I had suggested distinguishing EPERM from other kill() errors, which would tell us whether the other process is under the same userid as us or not; if not, we could perhaps safely assume that it's not a postmaster (or at least not one likely to be using our data directory).

Unfortunately, that doesn't really improve the odds very much. The typical scenario for this problem is that the PID we get assigned will wobble around by one or two counts from one boot cycle to the next, depending on just how fast other startup processes manage to finish. (If we get the exact same PID as before, there's no problem; the code is smart enough to notice that case.) But the PID(s) adjacent to the postmaster's will likely also belong to the postgres user --- consider the shell that launched us, for example. The shell, or whatever it might launch right after the postmaster, would look enough like a postmaster to fool this simplistic test.

So I'm at a loss how the postmaster can improve the reliability of this check, without throwing the baby out with the bathwater by making a check that might fail to recognize a conflicting postmaster. The consequences of that would be *dire*.
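To make the EPERM idea concrete, here is a small sketch of how the outcomes of kill() with signal 0 could be told apart. It only illustrates the suggestion; per the message it was not implemented, and the names `PidOwner` and `classify_pid` are hypothetical. It assumes a POSIX system and a non-root caller (root can signal any process, so EPERM would never be seen).

```python
import os
from enum import Enum


class PidOwner(Enum):
    OURSELVES = "same PID as this process"   # the lock file is certainly stale
    NOBODY = "no such process"               # kill() failed with ESRCH
    OTHER_USER = "exists, other userid"      # kill() failed with EPERM
    SIGNALABLE = "exists, we may signal it"  # kill() succeeded


def classify_pid(pid):
    """Classify the process (if any) holding a PID read from a lock file."""
    if pid <= 0:
        raise ValueError("not a single-process PID")
    if pid == os.getpid():
        return PidOwner.OURSELVES
    try:
        os.kill(pid, 0)  # signal 0 only probes; nothing is delivered
    except ProcessLookupError:   # errno ESRCH
        return PidOwner.NOBODY
    except PermissionError:      # errno EPERM
        return PidOwner.OTHER_USER
    # Still ambiguous: this could be a postmaster, or just a shell owned
    # by the same user, which is the weakness described in the message.
    return PidOwner.SIGNALABLE
```

Only `OURSELVES` and `NOBODY` are safe to treat as stale. `OTHER_USER` is the case the suggestion would additionally accept, and `SIGNALABLE` remains unresolved for the reason given above.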
The best solution is probably to forcibly unlink the postmaster.pid file in some startup script --- but it has to be a script that is *only* run during boot, never anytime later. The postgres start script is not the place for this.

regards, tom lane
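A boot-only cleanup step of the kind suggested here might look like the sketch below. The function name `remove_pidfile_at_boot` is hypothetical, and the data directory location is site-specific. The code cannot enforce the crucial condition itself: it is the administrator's job to hook it into a stage that runs once at boot, before the postmaster is started, and never afterwards.

```python
import os


def remove_pidfile_at_boot(data_dir):
    """Unconditionally unlink data_dir/postmaster.pid.

    Returns True if a file was removed, False if none was present.
    Safe ONLY at boot time, when no postmaster can possibly be running;
    calling it later could remove the lock of a live postmaster.
    """
    pidfile = os.path.join(data_dir, "postmaster.pid")
    try:
        os.unlink(pidfile)
    except FileNotFoundError:
        return False
    return True
```

The removal is deliberately unconditional: at boot no process from the previous session can still be alive, so no PID check is needed, and skipping the check avoids the false positive discussed in this thread.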
On Mon, Sep 16, 2002 at 05:27:38PM -0400, Tom Lane wrote:
> FYI, sendmail has the same restart failure mode; I imagine a lot of
> other Unix daemons do too.

Yes, as far as I know atd, klogd, and ypbind also fail this way, at least on some flavours of Linux (where I've had it happen). And ISTR that some bit of NFS didn't recover correctly under Solaris 2.6, but I can't recall for sure now.

> It wasn't yet, mainly because it's not obvious how to tell reliably
> whether some other process is a postmaster or not.

I had a feeling this might be the case. I think the suggestion of a boot-time "cleaning script" is a good idea -- something run by root before switching runlevels is the obvious answer -- but that's the sort of thing that probably should be hand-crafted by a competent sysadmin for each case. In some environments, there are good reasons not to restart things in case of crash. (If the hardware is flakey, for instance, you might not want the service to be going up and down.)

A

--
----
Andrew Sullivan
Liberty RMS
<andrew@libertyrms.info>
204-4141 Yonge Street
Toronto, Ontario Canada M2P 2A8
+1 416 646 3304 x110