Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas
От | Tomas Vondra |
---|---|
Тема | Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas |
Дата | |
Msg-id | 35291484-f007-52c5-18a6-80c345f5c2ed@enterprisedb.com обсуждение исходный текст |
Ответ на | Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas ("Bossart, Nathan" <bossartn@amazon.com>) |
Ответы |
Re: Bug in ProcArrayApplyRecoveryInfo for snapshots crossing 4B, breaking replicas
|
Список | pgsql-hackers |
On 1/24/22 22:28, Bossart, Nathan wrote: > On 1/22/22, 4:43 PM, "Tomas Vondra" <tomas.vondra@enterprisedb.com> wrote: >> There's a bug in ProcArrayApplyRecoveryInfo, introduced by 8431e296ea, >> which may cause failures when starting a replica, making it unusable. >> The commit message for 8431e296ea is not very clear about what exactly >> is being done and why, but the root cause is that at while processing >> RUNNING_XACTS, the XIDs are sorted like this: >> >> /* >> * Sort the array so that we can add them safely into >> * KnownAssignedXids. >> */ >> qsort(xids, nxids, sizeof(TransactionId), xidComparator); >> >> where "safely" likely means "not violating the ordering expected by >> KnownAssignedXidsAdd". Unfortunately, xidComparator compares the values >> as plain uint32 values, while KnownAssignedXidsAdd actually calls >> TransactionIdFollowsOrEquals() and compares the logical XIDs :-( > > Wow, nice find. > >> This likely explains why we never got any reports about this - most >> systems probably don't leave transactions running for this long, so the >> probability is much lower. And replica restarts are generally not that >> common events either. > > I'm aware of one report with the same message [0], but I haven't read > closely enough to determine whether it is the same issue. It looks > like that particular report was attributed to backup_label being > removed. > Yeah, I saw that thread too, and I don't think it's the same issue. As you say, it seems to be caused by the backup_label shenanigans, and there's also the RUNNING_XACTS message: Sep 20 15:00:27 ... CONTEXT: xlog redo Standby/RUNNING_XACTS: nextXid 38585 latestCompletedXid 38571 oldestRunningXid 38572; 14 xacts: 38573 38575 38579 38578 38574 38581 38580 38576 38577 38572 38582 38584 38583 38583 The XIDs don't cross the 4B boundary at all, so this seems unrelated. >> Attached patch is fixing this by just sorting the XIDs logically. The >> xidComparator is meant for places that can't do logical ordering. But >> these XIDs come from RUNNING_XACTS, so they actually come from the same >> wraparound epoch (so sorting logically seems perfectly fine). > > The patch looks reasonable to me. > Thanks! regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
В списке pgsql-hackers по дате отправления: