BUG #14960: Queries stuck on BtreePage lock on parallel index scan
From:    tlw@monsido.com
Subject: BUG #14960: Queries stuck on BtreePage lock on parallel index scan
Date:
Msg-id:  20171211092249.20146.60649@wrigleys.postgresql.org
Replies: Re: BUG #14960: Queries stuck on BtreePage lock on parallel index scan
List:    pgsql-bugs
The following bug has been logged on the website:

Bug reference:      14960
Logged by:          Tim Warberg
Email address:      tlw@monsido.com
PostgreSQL version: 10.1
Operating system:   Ubuntu 16.04 LTS

Description:

Hi,

We ran into an issue on our PostgreSQL 10.1 cluster that appears to be a parallel index scan bug. Our queue system scheduled 30 almost identical concurrent queries; some of them were executed as parallel queries, and all of them became stuck on the IPC wait event BtreePage. We discovered this after they had been stuck like that for about 10 hours [1]. At the same time, an autovacuum was processing one of the queried tables and had become stuck on the buffer_content LWLock while vacuuming indexes.

None of the query processes responded to pg_cancel_backend or pg_terminate_backend, and neither did the autovacuum. We eventually saw no other option than restarting PostgreSQL. After the restart, the queue system immediately restarted the queries and they all became stuck on BtreePage once more. We then disabled parallel queries (max_parallel_workers_per_gather=0), restarted PostgreSQL again, and this time the queries completed without issue.

The system had been running 10.1 with parallel queries for about 19 days before this without problems, including at least one earlier run of the same query set. We have also been running the same queries at the same concurrency on PostgreSQL 9.5 for months without issues.

We don't know whether it is possible to recreate the same condition in an isolated test, but we could try to reproduce it on a snapshot if that would help. If so, are there any debug flags we should set or things we should look for?

The server runs on Google Cloud with 18 vCPUs, 118 GB of memory, and a ZFS filesystem with lz4 compression; the database takes up 4.69 TB [2].

pg_stat_activity before restart: https://gist.github.com/twarberg/514f382af7adec19248eba55f94a4e6a
pg_stat_activity after restart: https://gist.github.com/twarberg/20443600478d8396e76ae1a699298217

[1] We had seen no errors, and long-running queue jobs are not uncommon.
[2] 2 TB compressed; we had just done some cleanup and a lot of the freed space had still not been reclaimed by autovacuum when the queries ran.
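For reference, a minimal sketch of the kind of diagnostics and workaround described above, using only standard PostgreSQL 10 views and functions (pg_stat_activity, pg_cancel_backend, pg_terminate_backend, max_parallel_workers_per_gather); the PID 12345 is a placeholder, not one taken from the actual incident:

    -- List backends waiting on the BtreePage IPC wait event.
    SELECT pid, backend_type, wait_event_type, wait_event, state,
           now() - query_start AS running_for,
           left(query, 80) AS query
    FROM pg_stat_activity
    WHERE wait_event = 'BtreePage'
    ORDER BY query_start;

    -- Try to cancel, then terminate, one of the stuck backends
    -- (12345 is a hypothetical pid; in this incident neither call helped).
    SELECT pg_cancel_backend(12345);
    SELECT pg_terminate_backend(12345);

    -- Disable parallel query cluster-wide as a workaround.
    ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
    SELECT pg_reload_conf();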