Thread: Re: [HACKERS] ERROR: btree scan list trashed ??
Adriaan Joubert <a.joubert@albourne.com> writes:
>> What Postgres version are you using, and on what platform? If it's
>> anything older than 6.5.1, an upgrade would probably be a good idea.

> Sorry, I should have mentioned that. I'm using 6.5.0 on DEC Alpha
> (Digital Unix, compiled with cc).

Alpha, eh? We have some known porting problems on 64-bit architectures,
and I wonder whether this is one of them. Going to be hard to nail that
down until we can reproduce the error, however.

After some digging around in backend/access/nbtree/nbtscan.c, which is
producing the error, I notice that the routine in question is searching
a list that does not get cleared properly at transaction abort. It's
not clear that that's the cause of the error message, though. What
I suggest at this point is that you pay more attention to what happens
just before the transaction in which you get the "btree scan list
trashed" message. In particular, are there any commands that abort
with errors a little bit earlier in the same backend? It might take
the combination of an error in a btree-index-using command and then
another btree index access to provoke the "trashed" symptom.

regards, tom lane

Tom Lane wrote:
>
> Adriaan Joubert <a.joubert@albourne.com> writes:
> >> What Postgres version are you using, and on what platform? If it's
> >> anything older than 6.5.1, an upgrade would probably be a good idea.
>
> > Sorry, I should have mentioned that. I'm using 6.5.0 on DEC Alpha
> > (Digital Unix, compiled with cc).
>
> Alpha, eh? We have some known porting problems on 64-bit architectures,
> and I wonder whether this is one of them. Going to be hard to nail that
> down until we can reproduce the error, however.
>
> After some digging around in backend/access/nbtree/nbtscan.c, which is
> producing the error, I notice that the routine in question is searching
> a list that does not get cleared properly at transaction abort. It's
> not clear that that's the cause of the error message, though. What
> I suggest at this point is that you pay more attention to what happens
> just before the transaction in which you get the "btree scan list
> trashed" message. In particular, are there any commands that abort
> with errors a little bit earlier in the same backend? It might take
> the combination of an error in a btree-index-using command and then
> another btree index access to provoke the "trashed" symptom.

That may be it. I have some PL routines that raise an exception if an
operation could lead to an inconsistency in my database tables. This is
not really an error, but I do want to abort the transaction in that
case, so I raise an exception. I'll continue trying to nail it down and
then I can look with the debugger at what happens.

BTW, I've installed 6.5.1 and still have the same problems. Vacuuming
hung up everything, and I had to shut the whole thing down and restart
it to get it working again. Dropping the indices and rebuilding them all
fixed the problem.

How difficult is it to clear the list at transaction abort? Is this
something I could patch and try out?

Thanks a lot for looking at this, much appreciated!

Adriaan

Adriaan Joubert <a.joubert@albourne.com> writes:
>> After some digging around in backend/access/nbtree/nbtscan.c, which is
>> producing the error, I notice that the routine in question is searching
>> a list that does not get cleared properly at transaction abort. It's
>> not clear that that's the cause of the error message, though.

> BTW, I've installed 6.5.1 and still have the same problems.

No surprise, really.

> Vacuuming hung up everything, and I had to shut the whole thing down
> and restart it to get it working again. Dropping the indices and
> rebuilding them all fixed the problem.

Hmm, that suggests that your indexes are actually getting corrupted.

> How difficult is it to clear the list at transaction abort? Is this
> something I could patch and try out?

The BTScans variable in nbtscan.c needs to be reset to NULL during
xact abort. I don't see how this would *directly* cause the
observed symptom, but failing to do it should lead to misbehavior in
_bt_adjscans() during later transactions, so it might be related
somehow. If you want to patch it, make a subroutine that clears the
variable (no need to free the list; since it's palloc'd it'll go
away anyway) and call it from transaction cleanup in
backend/access/transam/xact.c.

regards, tom lane

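The patch Tom describes can be sketched as a small standalone model. This
is not the PostgreSQL source: the fixed-size pool only stands in for
palloc'd memory that goes away at transaction end, and every name except
BTScans (ScanDesc, ScanListNode, reg_scan, drop_scan, listed_scans,
release_xact_memory, reset_scan_list, abort_transaction) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 16

/* Simplified stand-in for an index scan descriptor (hypothetical type). */
typedef struct ScanDesc { int id; } ScanDesc;

/* One entry in the per-backend list of active btree scans. */
typedef struct ScanListNode {
    ScanDesc *scan;
    struct ScanListNode *next;
} ScanListNode;

/* Pool modelling transaction-lifetime (palloc-style) memory. */
static ScanListNode pool[POOL_SIZE];
static size_t pool_used = 0;

/* Models the file-level BTScans variable: head of the active-scan list. */
static ScanListNode *BTScans = NULL;

/* Add a scan at the head of the list; returns -1 if the pool is full. */
int reg_scan(ScanDesc *scan)
{
    assert(scan != NULL);
    if (pool_used >= POOL_SIZE)
        return -1;
    ScanListNode *node = &pool[pool_used++];
    node->scan = scan;
    node->next = BTScans;
    BTScans = node;
    return 0;
}

/* Unlink a scan; -1 means "not found", where the real code raises an error. */
int drop_scan(ScanDesc *scan)
{
    ScanListNode **link = &BTScans;
    while (*link != NULL && (*link)->scan != scan)
        link = &(*link)->next;
    if (*link == NULL)
        return -1;
    *link = (*link)->next;
    return 0;
}

/* Count the entries currently reachable from BTScans. */
size_t listed_scans(void)
{
    size_t n = 0;
    for (ScanListNode *p = BTScans; p != NULL; p = p->next)
        n++;
    return n;
}

/* Models transaction-end memory release: all list nodes become reusable. */
void release_xact_memory(void) { pool_used = 0; }

/* The suggested fix: forget the list; its nodes are released separately. */
void reset_scan_list(void) { BTScans = NULL; }

/* Models abort cleanup, with or without the fix applied. */
void abort_transaction(int with_fix)
{
    if (with_fix)
        reset_scan_list();
    release_xact_memory();
}
```

In this toy model, calling abort_transaction(0) with a scan still
registered leaves listed_scans() reporting an entry whose storage has
already been handed back, while abort_transaction(1) leaves the list
empty. Whether the real backend misbehaves in the same way is exactly
what the thread leaves open.
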
Tom Lane wrote:
>
> The BTScans variable in nbtscan.c needs to be reset to NULL during
> xact abort. I don't see how this would *directly* cause the
> observed symptom, but failing to do it should lead to misbehavior in
> _bt_adjscans() during later transactions, so it might be related
> somehow. If you want to patch it, make a subroutine that clears the
> variable (no need to free the list; since it's palloc'd it'll go
> away anyway) and call it from transaction cleanup in
> backend/access/transam/xact.c.

This should be fixed in CVS too.

Vadim

On Thu, 5 Aug 1999, Vadim Mikheev wrote:

> Tom Lane wrote:
> >
> > The BTScans variable in nbtscan.c needs to be reset to NULL during
> > xact abort. I don't see how this would *directly* cause the
> > observed symptom, but failing to do it should lead to misbehavior in
> > _bt_adjscans() during later transactions, so it might be related
> > somehow. If you want to patch it, make a subroutine that clears the
> > variable (no need to free the list; since it's palloc'd it'll go
> > away anyway) and call it from transaction cleanup in
> > backend/access/transam/xact.c.
>
> This should be fixed in CVS too.

Is this something that can be easily back-patched for v6.5.2?

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org

The Hermit Hacker <scrappy@hub.org> writes:
> On Thu, 5 Aug 1999, Vadim Mikheev wrote:
>> Tom Lane wrote:
>>>> The BTScans variable in nbtscan.c needs to be reset to NULL during
>>>> xact abort.
>>
>> This should be fixed in CVS too.

Yes, absolutely.

> Is this something that can be easily back-patched for v6.5.2?

I will patch this in both current and REL6_5. But, although this
is clearly a bug, I am not at all convinced that it explains
Adriaan's problem. I think more creepie-crawlies lurk nearby :-(

regards, tom lane

> I will patch this in both current and REL6_5. But, although this
> is clearly a bug, I am not at all convinced that it explains
> Adriaan's problem. I think more creepie-crawlies lurk nearby :-(

Hmm, I made the changes and I only got three errors out of the system
today. So it is not fixed, although perhaps improved (or was I just
lucky?). I've been locking tables more restrictively, so this may have
helped as well. I definitely think this has something to do with
concurrent accesses to the same index. It always seems to start
happening as the tables start getting updates more rapidly.

Another thought: one of the indexes on a table that is sometimes updated
through a PL trigger is on a user-defined type (the bitmask type I
posted a while ago). Could this have something to do with a btree index
on a user-defined type? I'll drop that index and see whether it makes a
difference. All indexes on other tables that are touched are int4.

Thanks for all the help, Tom!

Adriaan

OK, I've dropped my user-defined type index and it hasn't made any
difference. I've had quite a few of the following again:

UPDATE TasksIds SET qstate=8::bit1 where task=358 and id=5654
ERROR: btree scan list trashed; can't find 0x1401744a0

I've got a lot of logging switched on, and these do not seem to be
preceded by errors. Since patching it the system seems to recover ok, so
I'm wondering whether this could be a caching issue. I think I will just
lock all tables in their entirety now, and see whether that fixes it
(there goes my MVCC performance boost 8-(). I still think it has
something to do with concurrent access to the indices.

If anybody has any more suggestions of what I could try, please let me
know.

Cheers,

Adriaan

Adriaan Joubert <a.joubert@albourne.com> writes:
> OK, I've dropped my user-defined type index and it hasn't made any
> difference. I've had quite a few of the following again:

> UPDATE TasksIds SET qstate=8::bit1 where task=358 and id=5654
> ERROR: btree scan list trashed; can't find 0x1401744a0

> I've got a lot of logging switched on, and these do not seem to be
> preceded by errors. Since patching it the system seems to recover ok, so
> I'm wondering whether this could be a caching issue. I think I will just
> lock all tables in their entirety now, and see whether that fixes it
> (there goes my MVCC performance boost 8-(). I still think it has
> something to do with concurrent access to the indices.

Let us know whether going to full locking makes any difference.

I am currently wondering whether this is a porting issue (64-bit vs
32-bit pointers). If it only happens on 64-bit platforms, that would
explain why we haven't seen many similar reports. Unfortunately, that
theory provides little useful guidance about where to look :-(

regards, tom lane
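
As an illustration of the class of porting bug Tom is wondering about
(not a diagnosis of this report), the sketch below shows what happens
when a 64-bit address is squeezed through a 32-bit integer somewhere
along the way. The address in the error message, 0x1401744a0, needs more
than 32 bits, so such a round trip would change it. The function names
are hypothetical, and plain integers are used in place of real pointers
so that the behaviour is the same on any platform.

```c
#include <assert.h>
#include <stdint.h>

/* Models code that stores an address in a 32-bit integer and reads it back. */
uint64_t round_trip_through_32_bits(uint64_t addr)
{
    uint32_t narrowed = (uint32_t) addr;    /* high 32 bits are discarded */
    return (uint64_t) narrowed;
}

/* Returns 1 if the address is unchanged by the 32-bit round trip. */
int survives_32_bit_round_trip(uint64_t addr)
{
    return round_trip_through_32_bits(addr) == addr;
}

/* Usage: the address from the error message does not survive narrowing. */
void check_reported_address(void)
{
    uint64_t reported = UINT64_C(0x1401744a0);
    assert(!survives_32_bit_round_trip(reported));
    assert(round_trip_through_32_bits(reported) == UINT64_C(0x401744a0));
}
```

A list search that compares a truncated value against the original
pointer would never find its entry, which is the kind of symptom a
32-bit-only test base would not expose.
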