BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From:
maxim.boguk@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      13657
Logged by:          Maksym Boguk
Email address:      maxim.boguk@gmail.com
PostgreSQL version: 9.4.4
Operating system:   Linux

Description:

Hi,

On one of our 9.4.4 replica databases I found a strange issue: a stuck query and a stuck WAL replay process waiting on each other, with no detectable deadlock condition.

The stuck query is:

SELECT schemaname, relname,
       pg_relation_size(relid) as table_size,
       pg_indexes_size(relid) as table_indexes_size,
       seq_scan, coalesce(idx_scan,0) as idx_scan,
       seq_tup_read, coalesce(idx_tup_fetch,0) as idx_tup_fetch,
       n_tup_ins as inserted, n_tup_upd as updated,
       n_tup_del as deleted, n_tup_hot_upd as hot_updated
FROM pg_stat_user_tables

In pg_stat_activity it is in the waiting state. Locks for this query:

postgres=# select * from pg_locks where pid=862 order by granted, relation;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid |      mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+-----------------+---------+----------
 relation   |    16424 |    17987 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | f       | f
 relation   |    16424 |     1259 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |     2610 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |     2615 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |     2662 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |     2663 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |     2678 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |     2679 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |     2684 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |     2685 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |     3455 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |    11221 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 relation   |    16424 |    11236 |      |       |            |               |         |       |          | 7/924904           | 862 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 7/924904   |               |         |       |          | 7/924904           | 862 | ExclusiveLock   | t       | t

So the only lock not granted is on relation 17987 (an ordinary table inside the database). Now let's see what is going on with WAL replay:

ps -axuww | grep recover
postgres 26786  2.5  0.0 177620 76784 ?  Ss  Sep30  48:27 postgres: startup process   recovering 00000001000061A0000000AE waiting

So WAL replay is waiting for something. Let's see what locks the WAL replay process holds:

postgres=# select * from pg_locks where pid=26786 order by granted, relation;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |        mode         | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------------+---------+----------
 relation   |    16424 |    17987 |      |       |            |               |         |       |          | 1/0                | 26786 | AccessExclusiveLock | t       | f
 virtualxid |          |          |      |       | 1/1        |               |         |       |          | 1/0                | 26786 | ExclusiveLock       | t       | t
(2 rows)

So WAL replay holds an AccessExclusiveLock on relation 17987 and is waiting for something, while the query is waiting for an AccessShareLock on the same relation. Once I killed the waiting query, WAL replay resumed with no issues. So it seems there is some kind of undetected deadlock condition between the query and WAL replay.

PS: The situation reproduces every few hours, so it is possible to debug, but I have no good idea where to start.

Kind Regards,
Maksym
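Not part of the report, but a diagnostic aside: 9.4-era servers have no built-in "who blocks whom" view (pg_blocking_pids() only arrived in 9.6), so a pg_locks self-join is a common way to pair each ungranted lock with the granted lock that blocks it. A minimal sketch (column choices are assumptions, not from the thread):

```sql
-- Sketch: pair each ungranted relation-level lock with the PID that holds
-- a granted lock on the same object. Assumes 9.4-era catalogs.
SELECT waiting.pid                AS waiting_pid,
       waiting.mode               AS wanted_mode,
       holder.pid                 AS blocking_pid,
       holder.mode                AS held_mode,
       waiting.relation::regclass AS rel
FROM pg_locks waiting
JOIN pg_locks holder
  ON  holder.locktype = waiting.locktype
  AND holder.database IS NOT DISTINCT FROM waiting.database
  AND holder.relation IS NOT DISTINCT FROM waiting.relation
  AND holder.pid <> waiting.pid
  AND holder.granted
WHERE NOT waiting.granted;
```

In the situation above this should show PID 862 waiting on relation 17987 behind PID 26786, the startup process.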
> So wal replay got AccessExclusiveLock on relation 17987 and waiting for
> something.
> And query waiting for AccessShareLock on the same relation.

gdb backtrace from the stuck startup process:

(gdb) bt
#0  0x00007f04ad862633 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f04af86488e in pg_usleep (microsec=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/port/pgsleep.c:53
#2  0x00007f04af7328ac in WaitExceedsMaxStandbyDelay () at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:171
#3  ResolveRecoveryConflictWithVirtualXIDs (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=0x7f04b13ba2f0) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:232
#4  ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x7f04b13ba2f0, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:191
#5  0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80, lsn=107351881751648) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329
#6  heap2_redo (lsn=107351881751648, record=0x7f04b1395b80) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:8488
#7  0x00007f04af580648 in StartupXLOG () at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:6920
#8  0x00007f04af6f5d8f in StartupProcessMain () at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/startup.c:224
#9  0x00007f04af58c579 in AuxiliaryProcessMain (argc=2, argv=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/bootstrap/bootstrap.c:422
#10 0x00007f04af6f235b in StartChildProcess (type=StartupProcess) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:5155
#11 0x00007f04af6f4847 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/postmaster/postmaster.c:1237
#12 0x00007f04af513a5d in main (argc=5, argv=0x7f04b13731a0) at /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/main/main.c:228

--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ <http://www.postgresql-consulting.com/>
Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com
МойКруг: http://mboguk.moikrug.ru/
"People problems are solved with people. If people cannot solve the problem, try technology. People will then wish they'd listened at the first stage."
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From:
Michael Paquier
Date:
On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
>> So wal replay got AccessExclusiveLock on relation 17987 and waiting for
>> something.
>> And query waiting for AccessShareLock on the same relation.
>
> gdb backtrace from stuck startup process:
> [...]
> #5 0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80,
> lsn=107351881751648) at
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329

This backtrace does not indicate that the process is waiting on a relation lock: it is resolving a recovery conflict while removing tuples, and it will kill the conflicting virtual transactions once the conflict has lasted longer than max_standby_streaming_delay or max_standby_archive_delay allows. Did you change the default of those parameters, which is 30s, to -1? That would mean that the standby waits indefinitely.
--
Michael
> This backtrace is not indicating that this process is waiting on a
> relation lock, it is resolving a recovery conflict while removing tuples
> [...] Did you change the default of those parameters, which is 30s, to
> -1? This would mean that the standby waits indefinitely.

The problem is that the startup process has a conflict with a query which is itself blocked by the startup process: the query cannot proceed because it is waiting for a lock held by the startup process, while the startup process is waiting for this query to finish. So, as I understand the situation, there is an undetected deadlock condition here.

PS: there is no other activity on the database during the problem except the blocked query.

--
Maxim Boguk
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From:
Michael Paquier
Date:
On Fri, Oct 2, 2015 at 2:14 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> Problem that startup process have conflict with a query, which blocked
> (waiting for) on the startup process itself [...] So it's an undetected
> deadlock condition here (as I understand situation).
>
> PS: there are no other activity on the database during that problem
> except blocked query.

Don't you have other queries running in parallel with the one you describe as "stuck" on the standby, preventing replay from moving on? A long-running transaction working on the relation involved, for example? Are you sure that you did not set max_standby_streaming_delay to -1?
--
Michael
On Fri, Oct 2, 2015 at 4:58 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Don't you have other queries running in parallel of the one you are
> defining as "stuck" on the standby that prevent replay to move on? Like a
> long-running transaction working on the relation involved? Are you sure
> that you did not set up max_standby_streaming_delay to -1?

During the problem period only one query was running on the database (the one listed in the initial report) and nothing more, and that query was in the waiting state according to pg_stat_activity. pg_locks shows the query waiting for an AccessShareLock on relation 17987 while, at the same time, the startup process holds an AccessExclusiveLock on the same relation and is waiting for something. No other activity was going on on the replica.

And yes, max_standby_streaming_delay is set to -1; as a result the replication process was stuck forever behind a query from an external monitoring tool, until I killed that query, but the situation repeated a few hours later.

--
Maxim Boguk
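For reference, the standby behaviour discussed here is governed by the following postgresql.conf settings on the replica; -1 disables conflict cancellation entirely, which is what allowed the replay stall to persist. The values below are illustrative, not a recommendation:

```
# postgresql.conf on the standby (illustrative values)
max_standby_streaming_delay = -1    # -1 = wait forever; the setting used in this report
max_standby_archive_delay   = 30s   # default; bounds waits when replaying from archived WAL
hot_standby_feedback        = off   # when on, avoids some cleanup conflicts at the cost of master bloat
```

With either delay set to a finite value, the startup process cancels conflicting standby queries after the timeout instead of waiting on them indefinitely.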
On Fri, Oct 2, 2015 at 6:52 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> During the problem period on the database had runned only one query
> (listed in initial report) and nothing more [...]
> And yes, the max_standby_streaming_delay to -1, as a result the
> replication process had been stuck on query from external monitoring tool
> forever until I killed that query, but situation repeated in few hours
> again.

One last addition to the report. The same database replica (server), the same monitoring queries, everything else the same, but running PostgreSQL 9.2.13, does not have this problem. It was used for a few years without such issues; the problem only arose after the whole cluster was upgraded to version 9.4.4.

--
Maxim Boguk
On Thu, Oct 1, 2015 at 9:48 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> [...]
> This backtrace is not indicating that this process is waiting on a
> relation lock, it is resolving a recovery conflict while removing tuples,
> killing the virtual transaction depending on if max_standby_streaming_delay
> or max_standby_archive_delay are set if the conflict gets longer. Did you
> change the default of those parameters, which is 30s, to -1? This would
> mean that the standby waits indefinitely.

While setting it to -1 gives the startup process permission to wait indefinitely for another backend which is doing something, I don't think that means it should have permission to deadlock indefinitely. A deadlock is a different kind of thing than "someone started a transaction and then left on vacation for a month without closing it".

Cheers,
Jeff
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From:
Michael Paquier
Date:
On Sun, Oct 4, 2015 at 4:26 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> While setting it to -1 gives the startup process permission to wait
> indefinitely for another back-end which is doing something, I don't think
> that that means it should have permission to deadlock indefinitely. A
> deadlock is a different kind of thing than "someone started a transaction
> and then left on vacation for a month without closing it"

Looking at that more closely... We know that the startup process is the only process in a standby that can take an exclusive lock on a relation, per LockAcquireExtended in lock.c, and that this lock is taken when replaying XLOG_STANDBY_LOCK. The deadlock reported by Maxim shows up in heap_xlog_clean, which may happen in the case of a VACUUM for example. It seems to me that we could get a deadlock if some WAL records had their order swapped: for example, the exclusive lock is replayed first and the cleanup of relation pages happens before the lock has been released, freezing the other backend using pg_relation_size on the standby, which waits for the access share lock indefinitely. This would indeed prevent recovery conflict resolution from working properly, because we should not hold an exclusive lock on the relation while replaying heap_xlog_clean. Hence it seems to me that we may have an issue with 68a2e52 (XLogInsert multi-slot), which was introduced in 9.4.

We could perhaps think about adding some safeguards in ResolveRecoveryConflictWithSnapshot.

Maxim, are you using commit_delay? It may be possible that this is fixed by bab9599, which was pushed after 9.4.4 was released. Could you also check whether HEAD or the soon-to-be-released 9.4.5 fixes your problem?

Regards,
--
Michael
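The master-side sequence suspected in this analysis can be sketched as follows. When VACUUM finds empty pages at the end of a table it takes an AccessExclusiveLock to truncate them, and that lock (an XLOG_STANDBY_LOCK record) plus the heap-clean records are what the standby later replays. The table and workload here are hypothetical, a sketch of a pattern that could emit such WAL:

```sql
-- Hypothetical master-side workload: a delete-heavy table that VACUUM
-- repeatedly truncates back to zero blocks. VACUUM's truncation phase
-- takes an AccessExclusiveLock, which is WAL-logged for the standby,
-- alongside heap-clean records for the pages it emptied.
CREATE TABLE queue_like (id bigint, payload text);

INSERT INTO queue_like SELECT g, 'x' FROM generate_series(1, 1000) g;
DELETE FROM queue_like;   -- every tuple becomes dead
VACUUM queue_like;        -- cleans pages, then truncates the file to 0 blocks
```

If, during replay, a standby query is already waiting behind the replayed AccessExclusiveLock while the startup process waits on that same query's snapshot in heap_xlog_clean, the two wait on each other exactly as reported.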
On Mon, Oct 5, 2015 at 7:15 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> [...]
> Maxim, are you using commit_delay? It may be possible that this is
> fixed by bab9599 that has been pushed after 9.4.4 was released. Could
> you also check if HEAD or the soon-to-be-released 9.4.5 fix your
> problem?

Hi Michael,

commit_delay isn't used on the master server or the replicas.

I would like to check 9.4.5, but I prefer to wait for the release rather than build it by hand; in my experience building from the git tree can be tricky, as I don't have root access to install additional packages on the server. However, I will try it in the next few days (if I get permission).

--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ <http://www.postgresql-consulting.com/>
On Mon, Oct 5, 2015 at 1:15 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> [...]
> It seems to me that we could have a deadlock if some WAL records had
> their order swapped. For example, the EXCLUSIVE lock is replayed first
> and the cleanup of relation pages happens before the lock has been
> released, freezing the other backend using pg_relation_size on a standby
> waiting for the access share lock indefinitely. [...] Hence it seems to
> me that we may have an issue with 68a2e52 (XLogInsert multi-slot), and
> it has been introduced in 9.4.

I was assuming that the table being AccessExclusive-locked and the table being VACUUMed were two different tables, handled by two different processes on the master, which would mean that no reordering of WAL records is necessary to hit this. Although I also haven't yet been able to reproduce the deadlock under this assumption.

Cheers,
Jeff
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From:
Michael Paquier
Date:
On Tue, Oct 6, 2015 at 2:10 AM, Jeff Janes wrote:
> I was assuming that the table being AccessExclusive locked and the table
> being VACUUMed were two different tables, which were done in two different
> processes on the master, which means that no re-ordering of WAL records
> would be necessary to hit this.

Right. This may be possible as well.

> Although I also haven't yet been able to reproduce the deadlock under this assumption.

Yeah, we are a bit blind here. Maxim, did you perform an operation on the master that took an exclusive lock on the relation involved here? Was this relation a table or an index? If an index, what kind of index? I think that such information may help us narrow down a test case.
--
Michael
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From:
Michael Paquier
Date:
On Mon, Oct 5, 2015 at 5:47 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> I would like to check 9.4.5 but I prefer wait for release and not build it
> by hands, and from my experience building from git tree could be tricky as I
> don't have root access to install additional packages on the server.
> However, I will try it in next few days (if I get an permission).

Maxim, did you get the occasion to test your setup with 9.4.5? It seems that your environment is more likely to reproduce this deadlock.
--
Michael
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From:
Sergey Burladyan
Date:
On Wed, Oct 21, 2015 at 5:42 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Maxim, did you get the occasion to test your setup with 9.4.5? It
> seems that your environment is more likely to reproduce this deadlock.

I think I also have this issue, but with 9.2.13. Today I see dozens of waiting requests at the standby (via pgAdmin, Tools → Server Status). All of them are waiting for PID 48205, which is the startup process:

48205 ?  Ss  6289:29 postgres: startup process   recovering 0000001000024C5E000000DF waiting

and the startup process holds an AccessExclusiveLock on a table:

'virtualxid' <NULL> <NULL> <NULL> <NULL> '1/1' <NULL> <NULL> <NULL> <NULL> '1/0' 48205 'ExclusiveLock' t t
'relation' 16444 16993 <NULL> <NULL> '<NULL>' <NULL> <NULL> <NULL> <NULL> '1/0' 48205 'AccessExclusiveLock' t f

I ran 'ALTER tblname RENAME xxx TO yyy' on the master before this lock appeared. And I am using a restore command (archive recovery), without streaming replication.

PostgreSQL 9.2.13 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.7.2-5) 4.7.2, 64-bit

--
Sergey Burladyan
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From
Michael Paquier
Date:
On Fri, Oct 23, 2015 at 11:36 PM, Sergey Burladyan <eshkinkot@gmail.com> wrote:
> On Wed, Oct 21, 2015 at 5:42 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Maxim, did you get the occasion to test your setup with 9.4.5? It
>> seems that your environment is more likely to reproduce this deadlock.
>
> I think I also have this issue, but with 9.2.13. Today I see dozens of
> waiting requests on the standby (via pgadmin Tools-Server Status). All
> of them are waiting for PID 48205, the startup process:
>
> 48205 ? Ss 6289:29 postgres: startup process recovering
> 0000001000024C5E000000DF waiting
>
> and the startup process holds an AccessExclusiveLock on a table:
> 'virtualxid' <NULL> <NULL> <NULL> <NULL> '1/1' <NULL> <NULL> <NULL>
> <NULL> '1/0' 48205 'ExclusiveLock' t t
> 'relation' 16444 16993 <NULL> <NULL> '<NULL>' <NULL> <NULL> <NULL>
> <NULL> '1/0' 48205 'AccessExclusiveLock' t f
>
> I ran 'ALTER TABLE tblname RENAME xxx TO yyy' on the master before this
> lock appeared. And I am using a recovery command, without streaming
> replication.

Thanks! This blows away my previous assumption that this was limited to
9.4~. If you still have the standby in a frozen state, where is the
startup process stuck? Could you get a backtrace from it? Perhaps it is
replaying a XLOG_HEAP2_CLEAN record. Did you run any queries on the
standby that took locks on the renamed relation 16993?

Also, segment 0000001000024C5E000000DF may be an important piece of the
puzzle, as the records replayed may be in an unexpected order. Could it
be possible to get a dump of it using pg_xlogdump (included in Postgres
core from 9.3~) around the point where the startup process has been
waiting?
--
Michael
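As an aside, the 24-character segment name quoted above packs three hex fields (timeline, "xlogid", and segment number). A small sketch of the decoding, shown here purely for illustration of the naming scheme:

```python
# Decode a pre-9.3-style WAL segment file name into its components.
# The name is 24 hex digits: 8 for the timeline, 8 for the xlogid
# (high bits of the WAL position), and 8 for the segment within it.
def parse_wal_filename(name):
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log, seg

# The segment the startup process was stuck on:
print(parse_wal_filename("0000001000024C5E000000DF"))  # (16, 150622, 223)
```

So the standby above is on timeline 16, which is consistent with a cluster that has been through multiple recoveries.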
On Wed, Oct 21, 2015 at 4:42 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Oct 5, 2015 at 5:47 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> > I would like to check 9.4.5 but I prefer to wait for the release and not
> > build it by hand, and from my experience building from the git tree could
> > be tricky, as I don't have root access to install additional packages on
> > the server. However, I will try it in the next few days (if I get
> > permission).
>
> Maxim, did you get the occasion to test your setup with 9.4.5? It
> seems that your environment is more likely to reproduce this deadlock.

Sorry for the late response. After the upgrade to 9.4.5 this issue never
happened again (zero occurrences during the last week).

Before that, all cases of the deadlock happened during WAL replay for 2
specific tables (and only them) out of the few thousand in the database.
These tables are somewhat special in the sense that they have a medium
insert/delete load, but most of the time they have zero live tuples, and
their size changes from 0 blocks to 1-2 blocks and back quite a lot.

I suppose the deadlock happened when the replay process replayed the
truncation to zero pages after a vacuum on the master db (but it's mostly
a theory with no good evidence).

--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ <http://www.postgresql-consulting.com/>

Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678

LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com

"People problems are solved with people. If people cannot solve the
problem, try technology. People will then wish they'd listened at the
first stage."
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From
Michael Paquier
Date:
On Sat, Oct 24, 2015 at 4:16 PM, Maxim Boguk wrote:
> On Wed, Oct 21, 2015 at 4:42 AM, Michael Paquier wrote:
>> Maxim, did you get the occasion to test your setup with 9.4.5? It
>> seems that your environment is more likely to reproduce this deadlock.
>
> Sorry for the late response. After the upgrade to 9.4.5 this issue never
> happened again (zero occurrences during the last week).

Good to know. If you see any problems, feel free to report them on this
thread of course.

> Before that, all cases of the deadlock happened during WAL replay for 2
> specific tables (and only them) out of the few thousand in the database.
> These tables are somewhat special in the sense that they have a medium
> insert/delete load, but most of the time they have zero live tuples, and
> their size changes from 0 blocks to 1-2 blocks and back quite a lot.
>
> I suppose the deadlock happened when the replay process replayed the
> truncation to zero pages after a vacuum on the master db (but it's mostly
> a theory with no good evidence).

By what was the AccessExclusiveLock created on the master? Did you run
TRUNCATE on those two relations from time to time?
--
Michael
On Sat, Oct 24, 2015 at 4:08 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Oct 24, 2015 at 4:16 PM, Maxim Boguk wrote:
>> I suppose the deadlock happened when the replay process replayed the
>> truncation to zero pages after a vacuum on the master db (but it's mostly
>> a theory with no good evidence).
>
> By what was the AccessExclusiveLock created on the master? Did you run
> TRUNCATE on those two relations from time to time?

AccessExclusiveLock is automatically taken by vacuum when it thinks it
can truncate pages off the end of the relation.

Cheers,

Jeff
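Jeff's point matches Maxim's workload description exactly: tables that swing between a few blocks and zero live tuples are prime candidates for vacuum's tail truncation. A minimal sketch of the pattern on the master (table name is hypothetical; whether truncation actually happens depends on vacuum's internal thresholds):

```sql
-- Sketch: a table that repeatedly empties out, as in Maxim's report.
-- After the DELETE, all pages at the end of the relation are empty, so
-- VACUUM may take an AccessExclusiveLock and truncate them away; it is
-- the replay of that truncation which conflicts with standby readers.
CREATE TABLE trunc_demo (id int, payload text);
INSERT INTO trunc_demo SELECT g, repeat('x', 100) FROM generate_series(1, 10000) g;
DELETE FROM trunc_demo;
VACUUM trunc_demo;  -- may truncate the now-empty tail under AccessExclusiveLock
```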
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
From
Michael Paquier
Date:
On Sun, Oct 25, 2015 at 2:28 AM, Jeff Janes wrote:
> On Sat, Oct 24, 2015 at 4:08 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Sat, Oct 24, 2015 at 4:16 PM, Maxim Boguk wrote:
>>> I suppose the deadlock happened when the replay process replayed the
>>> truncation to zero pages after a vacuum on the master db (but it's
>>> mostly a theory with no good evidence).
>>
>> By what was the AccessExclusiveLock created on the master? Did you run
>> TRUNCATE on those two relations from time to time?
>
> AccessExclusiveLock is automatically taken by vacuum when it thinks it
> can truncate pages off the end of the relation.

Thanks, I didn't know about this one; that's in lazy_truncate_heap. Still,
I don't think this would normally deadlock unless the order of the records
was somehow swapped: if the truncation actually happens, what is generated
is a XLOG_SMGR_TRUNCATE record, and this code path does not perform
recovery conflict checks. A XLOG_HEAP2_CLEAN record is generated before
the exclusive lock is taken for truncation.
--
Michael
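The bug being discussed throughout this thread is, in essence, a wait-for cycle that PostgreSQL's deadlock detector cannot see, because the startup process is not a regular backend and its recovery-conflict waits are not registered in the lock manager's wait graph. An illustrative sketch (not PostgreSQL code; the process names are just labels for the reported scenario):

```python
# Illustrative wait-for-graph cycle check. In the reported scenario the
# query is blocked on the startup process's AccessExclusiveLock, while
# the startup process is waiting out a recovery conflict against that
# same query -- a cycle that goes undetected because only one of its
# edges is visible to the regular deadlock detector.
def has_cycle(waits_for):
    """Detect a cycle in a dict mapping each waiter to what it waits on."""
    for start in waits_for:
        seen = set()
        node = start
        while node in waits_for:
            if node in seen:
                return True
            seen.add(node)
            node = waits_for[node]
    return False

edges = {
    "query_862": "startup_process",   # blocked on AccessExclusiveLock
    "startup_process": "query_862",   # waiting on a recovery conflict
}
print(has_cycle(edges))  # True: a deadlock, if only both edges were visible
```

If both edges lived in the lock manager, the deadlock detector would break the cycle; because the recovery-conflict edge does not, both processes wait forever, which is the symptom reported at the top of this thread.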