Thread: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`

BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`

From
brian@pukkasoft.com
Date:
The following bug has been logged on the website:

Bug reference:      13970
Logged by:          Brian Ghidinelli
Email address:      brian@pukkasoft.com
PostgreSQL version: 9.4.6
Operating system:   Linux (RHEL 5.11)
Description:

Hi Pg team - I've been running a 9.4.1 server for the last year+. In the
past few months I've had a couple of instances of the server locking up.
I've done more troubleshooting into this last event and uncovered what
appears to be a bug.  My situation is much like these:

http://comments.gmane.org/gmane.comp.db.postgresql.admin/40587
http://postgresql.nabble.com/VACUUM-hanging-on-PostgreSQL-8-3-1-for-larger-tables-td1898438.html

The former claims lightweight locks had a bug up thru 9.4.5 but I'm running
the latest 9.4.6 and still experiencing this issue with one table.  Here's
the scenario:

* Either autovacuum OR manual vacuum on a single table hangs
* There is no cpu or i/o usage; top -p <pid> shows the vacuum process is
sleeping
* strace of the process id shows rapidly scrolling screenfuls of `select(0,
NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)`
* Running a query against pg_locks shows the vacuum has been granted a lock
but it is not fast path.
* There are no other queries running... I can trigger this behavior after a
fresh reboot and no other users by issuing a simple vacuum.
* I have reindex'd the table as well as dropped all but the primary key in
case there were issues with the index - still hung when vacuum was
attempted
* Interestingly when I check the last autovacuum/autoanalyze report, they
are all blank, even though I have autovacuum on
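
The lock and vacuum-history checks listed above could be scripted roughly as follows. This is only a sketch: `locks_sql` and `vacuum_stats_sql` are hypothetical helper names, not part of PostgreSQL, and they simply print SQL against the standard `pg_locks` and `pg_stat_user_tables` views.

```shell
#!/bin/sh
# Hypothetical helpers (not part of PostgreSQL): each prints SQL on stdout.

# Locks held or awaited by one backend, including the fastpath flag.
locks_sql() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: locks_sql PID" >&2; return 1 ;;
    esac
    cat <<EOF
SELECT locktype, relation::regclass AS relation, mode, granted, fastpath
FROM pg_locks
WHERE pid = $1;
EOF
}

# Last manual and automatic vacuum/analyze times for one table.
# Only plain identifiers are accepted, to keep the quoting trivial.
vacuum_stats_sql() {
    case "$1" in
        ''|*[!A-Za-z0-9_]*) echo "usage: vacuum_stats_sql TABLE" >&2; return 1 ;;
    esac
    cat <<EOF
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = '$1';
EOF
}
```

Assuming `psql` can connect to the affected database, the output can be piped straight in, for example `locks_sql 12345 | psql -X -d mydb`, where `12345` and `mydb` are placeholders for the vacuum backend's pid and the database name.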

It scares me a lot that pg_cancel_backend and pg_terminate_backend don't
work. It requires a kill -QUIT to break out of this process.

What can I investigate to help add more information?


Brian

Re: BUG #13970: Vacuum hangs on particular table; cannot be terminated - requires `kill -QUIT pid`

From
Alvaro Herrera
Date:
brian@pukkasoft.com wrote:

> Hi Pg team - I've been running a 9.4.1 server for the last year+. In the
> past few months I've had a couple of instances of the server locking up.
> I've done more troubleshooting into this last event and uncovered what
> appears to be a bug.  My situation is much like these:
>
> http://comments.gmane.org/gmane.comp.db.postgresql.admin/40587
> http://postgresql.nabble.com/VACUUM-hanging-on-PostgreSQL-8-3-1-for-larger-tables-td1898438.html
>
> The former claims lightweight locks had a bug up thru 9.4.5 but I'm running
> the latest 9.4.6 and still experiencing this issue with one table.  Here's
> the scenario:
>
> * Either autovacuum OR manual vacuum on a single table hangs
> * There is no cpu or i/o usage; top -p <pid> shows the vacuum process is
> sleeping
> * strace of the process id shows rapidly scrolling screenfuls of `select(0,
> NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)`

This smells like it's looping while waiting for a multixact to be fully
written out ... except that the uninterruptibility part of that was fixed in
time for 9.4.0:

Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master Release: REL9_5_BR [51f9ea25d] 2014-11-14 15:14:01 -0300
Branch: REL9_4_STABLE Release: REL9_4_0 [137e4da6d] 2014-11-14 15:14:02 -0300
Branch: REL9_3_STABLE Release: REL9_3_6 [d45e8dc52] 2014-11-14 15:14:02 -0300
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=137e4da6d

Can you attach to the looping process with gdb when it's doing the
select() dance, and obtain a backtrace?  You need debug symbols
installed.
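
One possible way to grab such a backtrace non-interactively is sketched below. `pg_backtrace` is a hypothetical helper name, and the `GDB` override is an assumption added for convenience; the gdb options themselves (`-p`, `-batch`, `-ex`) are standard.

```shell
#!/bin/sh
# Hypothetical helper: attach gdb to a backend, print a backtrace, then detach.
# GDB may be overridden to point at a different debugger binary.
pg_backtrace() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: pg_backtrace PID" >&2; return 1 ;;
    esac
    # -p attaches to the running process, -batch exits after the commands,
    # -ex bt runs gdb's backtrace command once attached.
    "${GDB:-gdb}" -p "$1" -batch -ex bt
}
```

Run it as the postgres OS user (or root) with the pid of the looping vacuum process, e.g. `pg_backtrace 12345`, where `12345` is a placeholder. Taking two or three backtraces a few seconds apart helps show whether the process is stuck in the same place.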

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services