BUG #16582: Logical index corruption leading to apparent index scan infinite loop
От | PG Bug reporting form |
---|---|
Тема | BUG #16582: Logical index corruption leading to apparent index scan infinite loop |
Дата | |
Msg-id | 16582-4945b19d95e41c46@postgresql.org обсуждение исходный текст |
Ответы |
Re: BUG #16582: Logical index corruption leading to apparent index scan infinite loop
|
Список | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 16582 Logged by: James Lucas Email address: jlucasdba@gmail.com PostgreSQL version: 11.5 Operating system: Centos 7 Description: I have an 11.5 database with a declaratively partitioned table. The table has around 40 partitions. We began observing a number of long running sessions (over a week), all executing the same select query against this table. Attempting to kill the sessions with pg_terminate_backend has no effect. I pulled several stack traces against the backend processes executing this query and traced the issue back to the _bt_moveright function in src/backend/access/nbtree/nbtsearch.c. This function appears to be implemented as an infinite loop, which breaks when it finds a page that matches what it's looking for. So my working theory is that I have a corrupt index where that condition is never properly satisfied, resulting in an endless loop. I found a thread in the mailing list archive about a similar issue in one of the other functions in this file (https://www.postgresql.org/message-id/flat/CAM3SWZT7dCes%3DuOA3NAHYBA1kth%3Db4pXkszNLMPVtNAAYUp_wg%40mail.gmail.com#e9bf163da8f50f0f2845b59250da6ab5). The _bt_moveright function does not have CHECK_FOR_INTERRUPTS in its loop, which would explain why the sessions don't respond to pg_terminate_backend. The table has two indexes, so I decided to scan both indexes on all partitions with the bt_index_check function from the amcheck extension. I identified one partition where both indexes throw the following result: ERROR: cross page item order invariant violated for index "xxxxx" DETAIL: Last item on page tid(xx,xx) page lsn=xxxxxxxxxx Since both indexes appear to have the same issue, it suggests to me something about this table/data may be triggering a bug. I have not yet tried reindexing. I did pg_dump/pg_restore the entire table (all partitions) to another system (also 11.5), and the indexes built there test okay. Other possibly relevant information: The table is range partitioned on an insert timestamp (timestamp(0) without timezone). The two indexes involved are both composite indexes, over columns of types (numeric, numeric, numeric) and (numeric, timestamp(6) without time zone). The partition key is not covered by either index. There is also no primary key. I realize it may not be possible to remote diagnose the root cause of the index issues, but figured it was worth reporting. My next steps would probably be to try reindexing the affected partition. If that doesn't work, it seems like either a vacuum full or pg_dump/pg_restore would be the next step. Any other suggestions appreciated. Thanks, James
В списке pgsql-bugs по дате отправления: