Re: txid failed epoch increment, again, aka 6291
| From | Daniel Farina |
|---|---|
| Subject | Re: txid failed epoch increment, again, aka 6291 |
| Date | |
| Msg-id | CAAZKuFbDRuvL7i5_wheWYud7yFf69Nmnq+0XTBfTCFyR0B_gAw@mail.gmail.com |
| In response to | Re: txid failed epoch increment, again, aka 6291 (Noah Misch <noah@leadboat.com>) |
| Responses | Re: txid failed epoch increment, again, aka 6291 |
| List | pgsql-hackers |

On Thu, Sep 6, 2012 at 3:04 AM, Noah Misch <noah@leadboat.com> wrote:
> On Tue, Sep 04, 2012 at 09:46:58AM -0700, Daniel Farina wrote:
>> I might try to find the segments leading up to the overflow point and
>> try xlogdumping them to see what we can see.
>
> That would be helpful to see.
>
> Just to grasp at yet-flimsier straws, could you post (URL preferred, else
> private mail) the output of "objdump -dS" on your "postgres" executable?

https://dl.dropbox.com/s/444ktxbrimaguxu/txid-wrap-objdump-dS-postgres.txt.gz

Sure, it's a 9.0.6 with pg_cancel_backend by-same-role backported
along with the standard Debian changes, so nothing all that
interesting should be going on that isn't going on normally with
compilers on this platform. I am also starting to grovel through this
assembly, although I don't have a ton of experience finding problems
this way.

To save you a tiny bit of time aligning the assembly with the C, this line

    c797f:  e8 7c c9 17 00  callq  244300 <LWLockAcquire>

seems to be the beginning of:

    LWLockAcquire(XidGenLock, LW_SHARED);
    checkPoint.nextXid = ShmemVariableCache->nextXid;
    checkPoint.oldestXid = ShmemVariableCache->oldestXid;
    checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
    LWLockRelease(XidGenLock);

>> If there's anything to note about the workload, I'd say that it does
>> tend to make fairly pervasive use of long running transactions which
>> can span probably more than one checkpoint, and the txid reporting
>> functions, and a concurrency level of about 300 or so backends ... but
>> per my reading of the mechanism so far, it doesn't seem like any of
>> this should matter.
>
> Thanks for the details; I agree none of that sounds suspicious.
>
> After some further pondering and testing, this remains a mystery to me. These
> symptoms imply a proper update of ControlFile->checkPointCopy.nextXid without
> having properly updated ControlFile->checkPointCopy.nextXidEpoch. After
> recovery, only CreateCheckPoint() updates ControlFile->checkPointCopy at all.
> Its logic for doing so looks simple and correct.

Yeah. I'm pretty flabbergasted that so much seems to be going right
while this goes wrong.

--
fdr
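
For readers following the nextXid / nextXidEpoch discussion, here is a minimal sketch of the symptom described above. It is a simplified model, not the PostgreSQL source: it assumes the epoch is bumped at checkpoint time whenever the new nextXid is numerically lower than the one recorded by the previous checkpoint, and that a 64-bit txid is simply the epoch in the high 32 bits and the XID in the low 32 bits. The type `SketchCheckPoint` and the functions `sketch_create_checkpoint` and `sketch_txid` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for PostgreSQL's 32-bit transaction ID. */
typedef uint32_t TransactionId;

/* Hypothetical, reduced version of the checkpoint fields under discussion. */
typedef struct SketchCheckPoint
{
    uint32_t      nextXidEpoch; /* high-order 32 bits of the 64-bit txid */
    TransactionId nextXid;      /* next XID to assign; wraps at 2^32 */
} SketchCheckPoint;

/*
 * Assumed model of the checkpoint-time update: copy the previous epoch,
 * and increment it if the XID counter went numerically backwards, which
 * is taken to mean it wrapped around since the previous checkpoint.
 */
static SketchCheckPoint
sketch_create_checkpoint(SketchCheckPoint prev, TransactionId current_next_xid)
{
    SketchCheckPoint cp;

    /* This sketch does not model overflow of the epoch itself. */
    assert(prev.nextXidEpoch != UINT32_MAX);

    cp.nextXid = current_next_xid;
    cp.nextXidEpoch = prev.nextXidEpoch;
    if (cp.nextXid < prev.nextXid)
        cp.nextXidEpoch++;
    return cp;
}

/* Assumed composition of a 64-bit txid: epoch in the high half, XID in the low. */
static uint64_t
sketch_txid(SketchCheckPoint cp)
{
    return ((uint64_t) cp.nextXidEpoch << 32) | cp.nextXid;
}
```

Under these assumptions, a checkpoint taken after a wraparound yields a larger 64-bit value than the previous one. If nextXid were stored without the matching epoch increment, which is what the reported symptoms imply, the composed value would instead drop by roughly 2^32.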