Discussion: Nonexistent pid in pg_locks


Nonexistent pid in pg_locks

From:
Joe Uhl
Date:
I have an 8.3.6 postgres database running on Arch Linux (2.6.28 kernel)
with the following entry in pg_locks:

 locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |           mode           | granted
----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+--------------------------+---------
 relation |    16385 |    16427 |      |       |            |               |         |       |          | 54/0               | 10453 | ShareUpdateExclusiveLock | t

That pid is 10453.  The weird part is that there is no process running
on this machine with that pid.
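The OS-level check behind that observation can be sketched like this (`kill -0` delivers no signal; it only tests whether the pid exists and whether we may signal it):

```shell
pid=10453   # the pid reported in pg_locks

# kill -0 sends no signal; it only checks whether the pid exists
# (and whether we have permission to signal it).
if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid exists"
else
    echo "no such process"   # matches the WARNING postgres prints
fi

# Equivalent check via /proc on Linux:
[ -d "/proc/$pid" ] && echo "exists" || echo "no such process"
```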

I had to bounce an OpenMQ broker this morning (this database is the DB
for an OpenMQ HA setup) and couldn't get it to reconnect to postgres.
On inspecting the database I found dozens of vacuum processes waiting
(I have a cron job that vacuums each night) and chewing up connection
slots.  Killing those left a few autovacuum worker processes waiting.
Killing those left just this one orphaned pid apparently holding a
lock.  Presumably they were all waiting on the lock "held" by 10453.

The database continues to function normally, but when I (or the
autovacuum process) attempt to vacuum or analyze, I get this message:

"WARNING:  could not send signal to process 10453: No such process"

It can't kill it because that process/pid does not exist.
Additionally, if I run "vacuum verbose", the vacuum does seem to run to
completion, issues that warning, and then waits forever.  I have tried
killing the autovacuum launcher process and letting it restart, but as
soon as the next vacuum is issued it gets blocked and waits.

Is there any way for me to clear that orphaned entry out of pg_locks?
What could I have done to cause this?
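For reference, lock entries with no matching backend can be listed with something along these lines (a sketch; procpid is the 8.3-era pg_stat_activity column name):

```sql
-- List pg_locks rows whose pid has no corresponding backend.
-- NULL-pid rows (e.g. prepared transactions) are excluded.
SELECT l.pid, l.locktype, l.relation, l.mode, l.granted
FROM pg_locks l
LEFT JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE l.pid IS NOT NULL
  AND a.procpid IS NULL;
```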

I have used postgres for everything, including our main product
database (hundreds of transactions/sec, hundreds of GB of data), for
years and have never seen this scenario.

Any help is appreciated, I can easily provide any additional
information that may be helpful.

Joe Uhl

Re: Nonexistent pid in pg_locks

From:
Tom Lane
Date:
Joe Uhl <joeuhl@gmail.com> writes:
> I had to bounce an OpenMQ broker this morning (this database is the DB
> for an OpenMQ HA setup) and couldn't get it to reconnect to postgres.
> On inspecting the database I found dozens of vacuum processes waiting
> (I have a cron job that vacuums each night) and chewing up connection
> slots.  Killing those left a few autovacuum worker processes waiting.
> Killing those left just this one orphaned pid apparently holding a
> lock.  Presumably they were all waiting on the lock "held" by 10453.

What exactly did you do to "kill" those processes?  Do you remember
whether any of them happened to have PID 10453?

> Is there any way for me to clear that orphaned entry out of pg_locks?

Restarting the database should take care of this, I think.

            regards, tom lane

Re: Nonexistent pid in pg_locks

From:
Joe Uhl
Date:
On Jul 8, 2009, at 2:41 PM, Tom Lane wrote:

> Joe Uhl <joeuhl@gmail.com> writes:
>> I had to bounce an OpenMQ broker this morning (this database is the DB
>> for an OpenMQ HA setup) and couldn't get it to reconnect to postgres.
>> On inspecting the database I found dozens of vacuum processes waiting
>> (I have a cron job that vacuums each night) and chewing up connection
>> slots.  Killing those left a few autovacuum worker processes waiting.
>> Killing those left just this one orphaned pid apparently holding a
>> lock.  Presumably they were all waiting on the lock "held" by 10453.
>
> What exactly did you do to "kill" those processes?  Do you remember
> whether any of them happened to have PID 10453?

I used "kill pid1 pid2 pid3 ..." (no -9) as root.  Unfortunately I do
not recall whether that pid was one of the processes I killed, and
there is not enough scrollback in this screen to check.  It is a
ShareUpdateExclusiveLock though, and I definitely only killed
vacuum/analyze pids, so I think there is a very high chance that 10453
was one of them.

>
>> Is there any way for me to clear that orphaned entry out of pg_locks?
>
> Restarting the database should take care of this, I think.
>
>             regards, tom lane

I've got a block of time scheduled for tonight to restart, will give
that a shot.  Thanks for the response,

Joe

Re: Nonexistent pid in pg_locks

From:
Tom Lane
Date:
Joe Uhl <joeuhl@gmail.com> writes:
> On Jul 8, 2009, at 2:41 PM, Tom Lane wrote:
>> What exactly did you do to "kill" those processes?  Do you remember
>> whether any of them happened to have PID 10453?

> I used "kill pid1 pid2 pid3 ..." (no -9) as root.  Unfortunately I do
> not recall whether that pid was one of the processes I killed, and
> there is not enough scrollback in this screen to check.  It is a
> ShareUpdateExclusiveLock though, and I definitely only killed
> vacuum/analyze pids, so I think there is a very high chance that 10453
> was one of them.

Hmm.  In any case that shouldn't have led to a lock left hanging.
Assuming that it was a regular vacuum and not autovacuum, do you know
what the exact command would have been?  (In particular, FULL, ANALYZE,
etc. options)

            regards, tom lane

Re: Nonexistent pid in pg_locks

From:
Joe Uhl
Date:
On Jul 8, 2009, at 3:00 PM, Tom Lane wrote:

> Joe Uhl <joeuhl@gmail.com> writes:
>> On Jul 8, 2009, at 2:41 PM, Tom Lane wrote:
>>> What exactly did you do to "kill" those processes?  Do you remember
>>> whether any of them happened to have PID 10453?
>
>> I used "kill pid1 pid2 pid3 ..." (no -9) as root.  Unfortunately I do
>> not recall whether that pid was one of the processes I killed, and
>> there is not enough scrollback in this screen to check.  It is a
>> ShareUpdateExclusiveLock though, and I definitely only killed
>> vacuum/analyze pids, so I think there is a very high chance that 10453
>> was one of them.
>
> Hmm.  In any case that shouldn't have led to a lock left hanging.
> Assuming that it was a regular vacuum and not autovacuum, do you know
> what the exact command would have been?  (In particular, FULL, ANALYZE,
> etc. options)
>
>             regards, tom lane

They were VACUUM VERBOSE ANALYZE, run each night from cron as
"/usr/bin/vacuumdb -v --analyze $DB_NAME".
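Spelled out as a crontab entry, that would look roughly like this (the 2 a.m. schedule and log path here are illustrative, not from the actual setup):

```shell
# m h dom mon dow  command
# DB_NAME must be defined in the crontab environment, or replaced
# with the literal database name.
0 2 * * *  /usr/bin/vacuumdb -v --analyze "$DB_NAME" >> /var/log/pg_vacuum.log 2>&1
```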

Re: Nonexistent pid in pg_locks

From:
Tom Lane
Date:
Joe Uhl <joeuhl@gmail.com> writes:
> On Jul 8, 2009, at 3:00 PM, Tom Lane wrote:
>> Hmm.  In any case that shouldn't have led to a lock left hanging.
>> Assuming that it was a regular vacuum and not autovacuum, do you know
>> what the exact command would have been?  (In particular, FULL, ANALYZE,
>> etc. options)

> They were VACUUM VERBOSE ANALYZE, run each night from cron as
> "/usr/bin/vacuumdb -v --analyze $DB_NAME".

We had bugs long ago (8.0 or before) where SIGTERM'ing a backend could
result in lock entries not getting cleaned up, but that's all fixed now
AFAIK.  I tried some simple experiments with SIGTERM'ing a VACUUM in
8.3.x and couldn't reproduce a problem.  So there must have been some
other contributing factor here, but without any idea what it was, it's
hard to investigate further :-(

            regards, tom lane

Re: Nonexistent pid in pg_locks

From:
Joe Uhl
Date:
On Jul 8, 2009, at 3:42 PM, Tom Lane wrote:

> Joe Uhl <joeuhl@gmail.com> writes:
>> On Jul 8, 2009, at 3:00 PM, Tom Lane wrote:
>>> Hmm.  In any case that shouldn't have led to a lock left hanging.
>>> Assuming that it was a regular vacuum and not autovacuum, do you know
>>> what the exact command would have been?  (In particular, FULL,
>>> ANALYZE, etc. options)
>
>> They were VACUUM VERBOSE ANALYZE, run each night from cron as
>> "/usr/bin/vacuumdb -v --analyze $DB_NAME".
>
> We had bugs long ago (8.0 or before) where SIGTERM'ing a backend could
> result in lock entries not getting cleaned up, but that's all fixed now
> AFAIK.  I tried some simple experiments with SIGTERM'ing a VACUUM in
> 8.3.x and couldn't reproduce a problem.  So there must have been some
> other contributing factor here, but without any idea what it was, it's
> hard to investigate further :-(
>
>             regards, tom lane

No worries, appreciate the response.  I should have collected more
information about the initial state before diving in with the mass kill.
I'll restart tonight and set up a zabbix script to monitor the vacuums.
If they start piling up again I'll be able to gather more useful
information.