Обсуждение: 8.1 index corruption woes

Поиск
Список
Период
Сортировка

8.1 index corruption woes

От
Alvaro Herrera
Дата:
Hi,

We've detected what I think is some sort of index corruption in 8.1.
The server is running 8.1.11, so AFAICT the problem with truncated pages
in vacuum is already patched and accounted for (i.e. we reindexed, and a
bit later the problem presented itself again).  There haven't been any
relevant fixes after that AFAICT.

What we see is that after a bit of updating the index, it starts having
tuples that poing to heap entries which are marked unused.

I detected one of these almost by accident, and then built a tool to
discover them by examining pg_filedump output.  Basically what it does
is scan the heap, note which heap tuples are marked "unused", and then
scan the index and for each index tuple in leaf pages, see if it points
to an unused heap tuple.  The number of occurrences is amazingly high.
Right after a reindex there isn't any occurrence; but after a while of
application load, a lot of them appear.

I catalogued this as index corruption: the theory is that as soon as the
unused heap tuple is reused, the index will have a pointer with the
wrong index key pointing to a live heap tuple.

(We saw an occurrence of this too, but I wasn't motivated enough to
write a tool to verify the data in index vs. heap tuples.)

However, seeing the high prevalence of the problem, I started
considering whether the tool is misreading the output anyway -- i.e.
maybe it's a bug in the tool, or a misconception on my part.

The tool output looks like this:

INDEX PTR TO UNUSED HEAP: index 273375 (78,18) -> (5530, 17)
INDEX PTR TO UNUSED HEAP: index 273375 (96,84) -> (5436, 3)
INDEX PTR TO UNUSED HEAP: index 273375 (111,1) -> (1317, 26)
INDEX PTR TO UNUSED HEAP: index 273375 (145,1) -> (1665, 26)
INDEX PTR TO UNUSED HEAP: index 273375 (174,1) -> (2656, 17)
INDEX PTR TO UNUSED HEAP: index 273375 (199,1) -> (1745, 21)
INDEX PTR TO UNUSED HEAP: index 273375 (207,1) -> (358, 26)
INDEX PTR TO UNUSED HEAP: index 273375 (214,1) -> (2800, 17)
(many more lines of the same stuff)

What this means (first line) is that on index 273375, page 78, offset
18, there's a pointer to heap page 5530, offset 17; but that heap offset
shows up as unused.  This is from the heap:

Block 5530 ********************************************************
<Header> -----
 Block Offset: 0x02b34000         Offsets: Lower     136 (0x0088)
 Block: Size 8192  Version    3            Upper     992 (0x03e0)
 LSN:  logid    130 recoff 0xd53b3090      Special  8192 (0x2000)
 Items:   29                   Free Space:  856
 Length (including item array): 140

[ ... snip other items ...]

 Item  17 -- Length:    0  Offset:  504 (0x01f8)  Flags: 0x00



This is from the index:

Block   78 ********************************************************
<Header> -----
 Block Offset: 0x0009c000         Offsets: Lower     788 (0x0314)
 Block: Size 8192  Version    3            Upper    5104 (0x13f0)
 LSN:  logid    131 recoff 0x01b3c6e0      Special  8176 (0x1ff0)
 Items:  192                   Free Space: 4316
 Length (including item array): 792

[ ... snip other items ... ]

 Item  18 -- Length:   16  Offset: 5184 (0x1440)  Flags: USED
  Block Id: 5530  linp Index: 17  Size: 16
  Has Nulls: 0  Has Varwidths: 0


Of course, we're very concerned about this problem.  Personally I am
prepared to believe that this could be a hardware problem, because no
one else seems to be reporting this kind of thing, but I don't see how
it could cause this particular issue and not more wide-ranging data
corruption.

I would like comments on whether this is really a problem or I am just
misreading pg_filedump output.  If we confirm this, we can try to
proceed to investigate it further.

I attach the checker tool in case it is of any interest.

Thanks,

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Вложения

Re: 8.1 index corruption woes

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> We've detected what I think is some sort of index corruption in 8.1.
> The server is running 8.1.11, so AFAICT the problem with truncated pages
> in vacuum is already patched and accounted for (i.e. we reindexed, and a
> bit later the problem presented itself again).  There haven't been any
> relevant fixes after that AFAICT.

> What we see is that after a bit of updating the index, it starts having
> tuples that poing to heap entries which are marked unused.

Do you actually see any observed problem, or is this conclusion based
entirely on your pg_filedump analysis?

I suspect a problem with your analysis script, although a quick scan
of the code didn't find an issue.

Another point to keep in mind, if you are trying to analyze files
belonging to a live database, is that what you can see in the filesystem
may not be the "current" contents of every page.  For typical access
patterns it'd be unsurprising for the visible index pages to lag behind
those of the heap, since they'd be "hotter" and tend to stay in shared
buffers longer.
        regards, tom lane


Re: 8.1 index corruption woes

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > We've detected what I think is some sort of index corruption in 8.1.
> > The server is running 8.1.11, so AFAICT the problem with truncated pages
> > in vacuum is already patched and accounted for (i.e. we reindexed, and a
> > bit later the problem presented itself again).  There haven't been any
> > relevant fixes after that AFAICT.
> 
> > What we see is that after a bit of updating the index, it starts having
> > tuples that poing to heap entries which are marked unused.
> 
> Do you actually see any observed problem, or is this conclusion based
> entirely on your pg_filedump analysis?

Well, yeah, this all started because the guys started getting weird
results in queries, and found out that disabling index scans returned
different results.

> I suspect a problem with your analysis script, although a quick scan
> of the code didn't find an issue.

Well, that's why I posted the corresponding pg_filedump output -- manual
examination shows of that shows that the script is drawing the correct
conclusions.

> Another point to keep in mind, if you are trying to analyze files
> belonging to a live database, is that what you can see in the filesystem
> may not be the "current" contents of every page.  For typical access
> patterns it'd be unsurprising for the visible index pages to lag behind
> those of the heap, since they'd be "hotter" and tend to stay in shared
> buffers longer.

Hmm, I think the files come from a PITR slave that's not online.  I'll
ask to be sure.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: 8.1 index corruption woes

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Another point to keep in mind, if you are trying to analyze files
>> belonging to a live database, is that what you can see in the filesystem
>> may not be the "current" contents of every page.  For typical access
>> patterns it'd be unsurprising for the visible index pages to lag behind
>> those of the heap, since they'd be "hotter" and tend to stay in shared
>> buffers longer.

> Hmm, I think the files come from a PITR slave that's not online.  I'll
> ask to be sure.

8.1 didn't have restartpoint code, and of course bgwriter isn't running;
so I believe filesystem pages on a PITR slave could be arbitrarily far
out of date if the corresponding shared buffer got touched regularly.
Try doing the analysis on the master immediately after a CHECKPOINT
command.
        regards, tom lane


Re: 8.1 index corruption woes

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> Another point to keep in mind, if you are trying to analyze files
> >> belonging to a live database, is that what you can see in the filesystem
> >> may not be the "current" contents of every page.  For typical access
> >> patterns it'd be unsurprising for the visible index pages to lag behind
> >> those of the heap, since they'd be "hotter" and tend to stay in shared
> >> buffers longer.
> 
> > Hmm, I think the files come from a PITR slave that's not online.  I'll
> > ask to be sure.
> 
> 8.1 didn't have restartpoint code, and of course bgwriter isn't running;
> so I believe filesystem pages on a PITR slave could be arbitrarily far
> out of date if the corresponding shared buffer got touched regularly.
> Try doing the analysis on the master immediately after a CHECKPOINT
> command.

Okay.

I'll also check out tomorrow the LSN on the involved pages to see if
there's anything obviously bogus.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: 8.1 index corruption woes

От
Decibel!
Дата:
On Jul 7, 2008, at 7:39 PM, Tom Lane wrote:
> Another point to keep in mind, if you are trying to analyze files
> belonging to a live database, is that what you can see in the  
> filesystem
> may not be the "current" contents of every page.  For typical access
> patterns it'd be unsurprising for the visible index pages to lag  
> behind
> those of the heap, since they'd be "hotter" and tend to stay in shared
> buffers longer.


None of the tests were done on the production database. Most of the  
checks were not done on a PITR restore; they were done on a SAN-level  
snapshot that had been run through the recovery process (startup  
postmaster on snapshot, let it recover, shut down).

I hadn't thought about checkpointing; I'll make sure to do that next  
time we take a snapshot.

We also analyzed a single table from a completely different (much  
larger) database. In that case the analysis was done on a PITR- 
recovered "slave" that was up and running, but nothing should have  
been writing to the table at all, and it would have been up long  
enough that it would have checkpointed after exiting PITR recovery  
(though IIRC there's a manual checkpoint done at exit of PITR  
recovery anyway). That check didn't show as many questionable index  
pointers, but there were some (again, the bulk of them were index  
pointers that were using the first line pointer slot in the index page).
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: 8.1 index corruption woes

От
Alvaro Herrera
Дата:
Alvaro Herrera wrote:
> Tom Lane wrote:
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> > > We've detected what I think is some sort of index corruption in 8.1.
> > > The server is running 8.1.11, so AFAICT the problem with truncated pages
> > > in vacuum is already patched and accounted for (i.e. we reindexed, and a
> > > bit later the problem presented itself again).  There haven't been any
> > > relevant fixes after that AFAICT.
> > 
> > > What we see is that after a bit of updating the index, it starts having
> > > tuples that poing to heap entries which are marked unused.
> > 
> > Do you actually see any observed problem, or is this conclusion based
> > entirely on your pg_filedump analysis?
> 
> Well, yeah, this all started because the guys started getting weird
> results in queries, and found out that disabling index scans returned
> different results.

I neglected to mention that further analysis of the failed index scans
showed that index entries were pointing to heap tuples with completely
different data.

I was tracking one of those, when I noticed that in the same index page
was an entry pointing to a heap entry marked as unused.

> > I suspect a problem with your analysis script, although a quick scan
> > of the code didn't find an issue.

I don't want to discard this idea, because we're getting a very
unusually high number of bogus entries.  However, they are all (or a
very high percentage of them) the very first entry on each index page.
I want to confirm that the leftmost on a leaf btree page is a valid
item, and not something like the lower bound value?  (I think we only
store high bounds on internal pages, not leaf pages, but I'm not sure).


> > Another point to keep in mind, if you are trying to analyze files
> > belonging to a live database, is that what you can see in the filesystem
> > may not be the "current" contents of every page.  For typical access
> > patterns it'd be unsurprising for the visible index pages to lag behind
> > those of the heap, since they'd be "hotter" and tend to stay in shared
> > buffers longer.

It was confirmed that the servers are not live, and a checkpoint has
been executed by the recovery code.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: 8.1 index corruption woes

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I don't want to discard this idea, because we're getting a very
> unusually high number of bogus entries.  However, they are all (or a
> very high percentage of them) the very first entry on each index page.
> I want to confirm that the leftmost on a leaf btree page is a valid
> item, and not something like the lower bound value?  (I think we only
> store high bounds on internal pages, not leaf pages, but I'm not sure).

Er ... no.  Per nbtree/README:

: On a page that is not rightmost in its tree level, the "high key" is
: kept in the page's first item, and real data items start at item 2.
: The link portion of the "high key" item goes unused.  A page that is
: rightmost has no "high key", so data items start with the first item.
: Putting the high key at the left, rather than the right, may seem odd,
: but it avoids moving the high key as we add data items.

(Right offhand, it looks like _bt_split just copies the item that it's
cloning the index key of.  Maybe it would be worth setting the item
pointer invalid, to prevent confusion in future?)
        regards, tom lane


Re: 8.1 index corruption woes

От
Decibel!
Дата:
On Jul 11, 2008, at 2:17 PM, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> I don't want to discard this idea, because we're getting a very
>> unusually high number of bogus entries.  However, they are all (or a
>> very high percentage of them) the very first entry on each index  
>> page.
>> I want to confirm that the leftmost on a leaf btree page is a valid
>> item, and not something like the lower bound value?  (I think we only
>> store high bounds on internal pages, not leaf pages, but I'm not  
>> sure).
>
> Er ... no.  Per nbtree/README:
>
> : On a page that is not rightmost in its tree level, the "high key" is
> : kept in the page's first item, and real data items start at item 2.
> : The link portion of the "high key" item goes unused.  A page that is
> : rightmost has no "high key", so data items start with the first  
> item.
> : Putting the high key at the left, rather than the right, may seem  
> odd,
> : but it avoids moving the high key as we add data items.
>
> (Right offhand, it looks like _bt_split just copies the item that it's
> cloning the index key of.  Maybe it would be worth setting the item
> pointer invalid, to prevent confusion in future?)

Another option would be to change pg_filedump to understand what the  
first item in the index is.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828