Thread: Nonexistent pid in pg_locks
I have an 8.3.6 postgres database running on Arch Linux (2.6.28 kernel) with the following entry in pg_locks:

 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------
 relation |    16385 |    16427 |      |       |            |               |         |       |          | 54/0               | 10453 | ShareUpdateExclusiveLock | t

That pid is 10453. The weird part is that there is no process running on this machine with that pid.

I had to bounce an OpenMQ broker this morning (this database is the DB for an OpenMQ HA setup) and couldn't get it to reconnect to postgres. On inspecting the database I found dozens of vacuum processes waiting (I have a cron job that vacuums each night) and chewing up connection slots. Killing those left a few autovacuum worker processes waiting. Killing those left just this one orphaned pid apparently holding a lock. Presumably they were all waiting on the lock "held" by 10453.

The database continues to function normally, but when I (or the autovacuum process) attempt to vacuum or analyze I get this message:

WARNING:  could not send signal to process 10453: No such process

It can't kill it because that process/pid does not exist. Additionally, if I "vacuum verbose" the vacuum does seem to run to completion before waiting forever after issuing that warning. I have tried killing the autovacuum launcher process and letting it restart, but as soon as the next vacuum is issued it gets blocked and waits.

Is there any way for me to clear that orphaned entry out of pg_locks? What could I have done to cause this? I have used postgres for everything, including our main product database (hundreds of transactions/sec, hundreds of GB of data), for years and have never seen this scenario.

Any help is appreciated; I can easily provide any additional information that may be helpful.

Joe Uhl
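As a quick sanity check that a pid reported by pg_locks really has no matching OS process, signal 0 can be used: it performs the existence/permission check without actually sending a signal. A minimal shell sketch (the pid value is just the one from this thread, used as a hypothetical example):

```shell
# pid_alive: succeeds iff the given pid is a live OS process.
# kill -0 sends no signal; it only checks existence/permission.
pid_alive() {
  kill -0 "$1" 2>/dev/null
}

# Hypothetical pid copied out of pg_locks.
pid=10453
if pid_alive "$pid"; then
  echo "process $pid exists"
else
  echo "process $pid is gone; its pg_locks entry is orphaned"
fi
```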
Joe Uhl <joeuhl@gmail.com> writes:
> I had to bounce an OpenMQ broker this morning (this database is the DB
> for an OpenMQ HA setup) and couldn't get it to reconnect to postgres.
> On inspecting the database I found dozens of vacuum processes waiting
> (I have a cron job that vacuums each night) and chewing up connection
> slots. Killing those left a few autovacuum worker processes waiting.
> Killing those left just this one orphaned pid apparently holding a
> lock. Presumably they were all waiting on the lock "held" by 10453.

What exactly did you do to "kill" those processes? Do you remember
whether any of them happened to have PID 10453?

> Is there any way for me to clear that orphaned entry out of pg_locks?

Restarting the database should take care of this, I think.

            regards, tom lane
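For anyone hitting the same symptom, one way to list lock entries whose pid no longer corresponds to a backend is to join pg_locks against pg_stat_activity. A hedged sketch, not from the thread; it assumes the 8.3-era column name procpid (renamed to pid in 9.2), and needs a live server to run:

```sql
-- pg_locks rows held by a pid with no matching backend
SELECT l.locktype, l.relation::regclass AS relation,
       l.pid, l.mode, l.granted
FROM pg_locks l
LEFT JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE a.procpid IS NULL;
```

An empty result means every lock holder is a live backend; any row returned is a candidate orphan like the 10453 entry above.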
On Jul 8, 2009, at 2:41 PM, Tom Lane wrote:
> Joe Uhl <joeuhl@gmail.com> writes:
>> I had to bounce an OpenMQ broker this morning (this database is the DB
>> for an OpenMQ HA setup) and couldn't get it to reconnect to postgres.
>> On inspecting the database I found dozens of vacuum processes waiting
>> (I have a cron job that vacuums each night) and chewing up connection
>> slots. Killing those left a few autovacuum worker processes waiting.
>> Killing those left just this one orphaned pid apparently holding a
>> lock. Presumably they were all waiting on the lock "held" by 10453.
>
> What exactly did you do to "kill" those processes? Do you remember
> whether any of them happened to have PID 10453?

I used "kill pid1 pid2 pid3 ..." (no -9) as root. Unfortunately I do not recall whether that pid was one of the processes I killed, and there is not enough scrollback in this screen to see. It is a ShareUpdateExclusiveLock, though, and I definitely only killed vacuum/analyze pids, so I think there is a very high chance of 10453 being one of them.

>> Is there any way for me to clear that orphaned entry out of pg_locks?
>
> Restarting the database should take care of this, I think.
>
>             regards, tom lane

I've got a block of time scheduled for tonight to restart; will give that a shot.

Thanks for the response,

Joe
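For reference, stuck backends can also be cancelled from inside postgres rather than with an OS-level kill, which keeps the signalling on the server's own terms. A sketch under 8.3 assumptions (pg_cancel_backend sends SIGINT to cancel the current query and exists in 8.3; pg_terminate_backend only arrived in 8.4; the pid is the hypothetical one from this thread):

```sql
-- Cancel the current query of one stuck backend (superuser only)
SELECT pg_cancel_backend(10453);
```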
Joe Uhl <joeuhl@gmail.com> writes:
> On Jul 8, 2009, at 2:41 PM, Tom Lane wrote:
>> What exactly did you do to "kill" those processes? Do you remember
>> whether any of them happened to have PID 10453?

> I used "kill pid1 pid2 pid3 ..." (no -9) as root. Unfortunately I do
> not recall whether that pid was one of the processes I killed, and
> there is not enough scrollback in this screen to see. It is a
> ShareUpdateExclusiveLock, though, and I definitely only killed
> vacuum/analyze pids, so I think there is a very high chance of 10453
> being one of them.

Hmm. In any case that shouldn't have led to a lock left hanging. Assuming that it was a regular VACUUM and not autovacuum, do you know what the exact command would have been? (In particular, FULL, ANALYZE, etc options)

            regards, tom lane
On Jul 8, 2009, at 3:00 PM, Tom Lane wrote:
> Joe Uhl <joeuhl@gmail.com> writes:
>> On Jul 8, 2009, at 2:41 PM, Tom Lane wrote:
>>> What exactly did you do to "kill" those processes? Do you remember
>>> whether any of them happened to have PID 10453?
>
>> I used "kill pid1 pid2 pid3 ..." (no -9) as root. Unfortunately I do
>> not recall whether that pid was one of the processes I killed, and
>> there is not enough scrollback in this screen to see. It is a
>> ShareUpdateExclusiveLock, though, and I definitely only killed
>> vacuum/analyze pids, so I think there is a very high chance of 10453
>> being one of them.
>
> Hmm. In any case that shouldn't have led to a lock left hanging.
> Assuming that it was a regular VACUUM and not autovacuum, do you know
> what the exact command would have been? (In particular, FULL, ANALYZE,
> etc options)
>
>             regards, tom lane

They were VACUUM VERBOSE ANALYZE. Specifically, run with "/usr/bin/vacuumdb -v --analyze $DB_NAME" in the cron job each night.
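For completeness, a hypothetical crontab entry matching that description (the schedule, DB_NAME value, and log path are assumptions, not from the thread); capturing the verbose output to a log makes it easier to see which table a stuck VACUUM was working on:

```
# Nightly vacuum/analyze at 02:30, verbose output logged
30 2 * * *  /usr/bin/vacuumdb -v --analyze $DB_NAME >> /var/log/nightly_vacuum.log 2>&1
```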
Joe Uhl <joeuhl@gmail.com> writes:
> On Jul 8, 2009, at 3:00 PM, Tom Lane wrote:
>> Hmm. In any case that shouldn't have led to a lock left hanging.
>> Assuming that it was a regular VACUUM and not autovacuum, do you know
>> what the exact command would have been? (In particular, FULL, ANALYZE,
>> etc options)

> They were VACUUM VERBOSE ANALYZE. Specifically, run with
> "/usr/bin/vacuumdb -v --analyze $DB_NAME" in the cron job each night.

We had bugs long ago (8.0 or before) where SIGTERM'ing a backend could result in lock entries not getting cleaned up, but that's all fixed now AFAIK. I tried some simple experiments with SIGTERM'ing a VACUUM in 8.3.x and couldn't reproduce a problem. So there must have been some other contributing factor here, but without any idea what it was, it's hard to investigate further :-(

            regards, tom lane
On Jul 8, 2009, at 3:42 PM, Tom Lane wrote:
> Joe Uhl <joeuhl@gmail.com> writes:
>> On Jul 8, 2009, at 3:00 PM, Tom Lane wrote:
>>> Hmm. In any case that shouldn't have led to a lock left hanging.
>>> Assuming that it was a regular VACUUM and not autovacuum, do you know
>>> what the exact command would have been? (In particular, FULL, ANALYZE,
>>> etc options)
>
>> They were VACUUM VERBOSE ANALYZE. Specifically, run with
>> "/usr/bin/vacuumdb -v --analyze $DB_NAME" in the cron job each night.
>
> We had bugs long ago (8.0 or before) where SIGTERM'ing a backend could
> result in lock entries not getting cleaned up, but that's all fixed now
> AFAIK. I tried some simple experiments with SIGTERM'ing a VACUUM in
> 8.3.x and couldn't reproduce a problem. So there must have been some
> other contributing factor here, but without any idea what it was, it's
> hard to investigate further :-(
>
>             regards, tom lane

No worries, I appreciate the response. I should have collected more information at the initial state before diving in with the mass kill.

I'll restart tonight and get a zabbix script monitoring the vacuums. If they start piling up again I'll be able to gather more useful information.
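A monitoring script of the kind mentioned above could be driven by a query like the following. This is a sketch, not the actual zabbix check; it assumes 8.3-era pg_stat_activity columns (current_query and the boolean waiting, which were later renamed/replaced), and needs a live server:

```sql
-- Number of VACUUM backends currently blocked on a lock
SELECT count(*)
FROM pg_stat_activity
WHERE current_query ILIKE 'vacuum%'
  AND waiting;
```

A value that stays above zero across successive checks would signal the pile-up described in this thread before connection slots run out.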