Thread: The lightbulb just went on...

The lightbulb just went on...

From
Tom Lane
Date:
... with a blinding flash ...

The VACUUM funnies I was complaining about before may or may not be real
bugs, but they are not what's biting Alfred.  None of them can lead to
the observed crashes AFAICT.

What's biting Alfred is the code that moves a tuple update chain, lines
1541 ff in REL7_0_PATCHES.  This sets up a pointer to a source tuple in
"tuple".  Then it gets the destination page it plans to move the tuple
to, and applies vc_vacpage to that page if it hasn't been done already.
But when we're moving a tuple chain, *it is possible for the destination
page to be the same as the source page*.  Since vc_vacpage applies
PageRepairFragmentation, all the live tuples on the page may get moved.
Afterwards, tuple.t_data is out of date and pointing at some random
chunk of some other tuple.  The subsequent copy of the tuple copies
garbage, which explains Alfred's several crashes in constructing index
entries for the copied tuple (all of which bombed out from the
index-build calls at lines 1634 ff, ie, for tuples being moved as part
of a chain).  Once in a while, the obsolete pointer will be pointing at
the real header of a different tuple --- perhaps even the place where we
are about to put the copy.  This improbable case explains the one
observed Assert crash in which a copied tuple's HEAP_MOVED_IN bit
mysteriously got turned off.  Reason: it was cleared through the
old-tuple pointer just after being set via the new-tuple one.

Proof that this is happening can be seen in the core dumps for Alfred's
index-construction-crash cases: tuple.t_data does not point at the same
place that the tuple.ip_posid'th page line item points at.  This could
only happen if the page was reshuffled since the tuple pointer was set
up.  The explanation for the Assert crash is a bit of a leap of faith,
but I feel confident that it's right.

The solution is to do everything we're going to do with the source
tuple, especially copying it and updating its state, *before* we apply
vc_vacpage to the destination page.  Then we don't care if the source
gets moved during vc_vacpage.
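
A minimal sketch of the hazard and of the reordering, using a toy slotted
page rather than the real heap page code.  Everything named toy_* or
move_tuple_* is hypothetical and invented for illustration; toy_compact
only stands in for the effect of PageRepairFragmentation (live tuples get
repacked and their line items updated), and the cached src pointer plays
the role of tuple.t_data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 64
#define MAX_ITEMS 4
#define TUPLE_LEN 4             /* every toy tuple is 4 bytes, for brevity */

/* Hypothetical line item: where a tuple lives on the page. */
typedef struct {
    size_t off;                 /* byte offset of the tuple in data[] */
    int    live;                /* 0 once the tuple is dead */
} ToyItem;

typedef struct {
    ToyItem items[MAX_ITEMS];
    int     nitems;
    char    data[PAGE_SIZE];    /* tuples are packed against the end */
} ToyPage;

/* Add a tuple below the existing ones; returns its slot number. */
int toy_add(ToyPage *p, const char *bytes)
{
    int slot = p->nitems++;

    assert(slot < MAX_ITEMS);
    p->items[slot].off = PAGE_SIZE - (size_t) (slot + 1) * TUPLE_LEN;
    p->items[slot].live = 1;
    memcpy(p->data + p->items[slot].off, bytes, TUPLE_LEN);
    return slot;
}

/* Page holding A, B, C with A dead; returns B's slot number. */
int toy_setup(ToyPage *p)
{
    int a, b;

    memset(p, 0, sizeof *p);
    a = toy_add(p, "AAAA");
    b = toy_add(p, "BBBB");
    toy_add(p, "CCCC");
    p->items[a].live = 0;
    return b;
}

/* Stand-in for PageRepairFragmentation: repack the live tuples against
 * the end of the page and update their line items. */
void toy_compact(ToyPage *p)
{
    char   old[PAGE_SIZE];
    size_t upper = PAGE_SIZE;
    int    i;

    memcpy(old, p->data, PAGE_SIZE);
    for (i = 0; i < p->nitems; i++)
    {
        if (!p->items[i].live)
            continue;
        upper -= TUPLE_LEN;
        memcpy(p->data + upper, old + p->items[i].off, TUPLE_LEN);
        p->items[i].off = upper;
    }
}

/* The core-dump check: does a cached pointer still match the line item? */
int toy_pointer_is_current(const ToyPage *p, int slot, const char *ptr)
{
    return ptr == p->data + p->items[slot].off;
}

/* Buggy order: cache the pointer, compact the same page, then copy. */
void move_tuple_buggy(ToyPage *p, int slot, char copy[TUPLE_LEN + 1])
{
    const char *src = p->data + p->items[slot].off;

    toy_compact(p);             /* src may now point into another tuple */
    memcpy(copy, src, TUPLE_LEN);
    copy[TUPLE_LEN] = '\0';
}

/* Fixed order: finish with the source tuple before compacting. */
void move_tuple_fixed(ToyPage *p, int slot, char copy[TUPLE_LEN + 1])
{
    const char *src = p->data + p->items[slot].off;

    memcpy(copy, src, TUPLE_LEN);
    copy[TUPLE_LEN] = '\0';
    toy_compact(p);             /* source may move now; we no longer care */
}
```

On the page built by toy_setup, move_tuple_buggy on B's slot comes back
with C's bytes, because compaction slid C into the spot B used to occupy;
move_tuple_fixed comes back with B's.  toy_pointer_is_current is the same
comparison made on the core dumps: the cached pointer versus what the line
item says.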

I will prepare a patch along this line and send it to Alfred for
testing.
        regards, tom lane


Re: The lightbulb just went on...

From
The Hermit Hacker
Date:
Something to force a v7.0.3 ... ?

On Mon, 16 Oct 2000, Tom Lane wrote:

> ... with a blinding flash ...
> 
> The VACUUM funnies I was complaining about before may or may not be real
> bugs, but they are not what's biting Alfred.  None of them can lead to
> the observed crashes AFAICT.
> 
> What's biting Alfred is the code that moves a tuple update chain, lines
> 1541 ff in REL7_0_PATCHES.  This sets up a pointer to a source tuple in
> "tuple".  Then it gets the destination page it plans to move the tuple
> to, and applies vc_vacpage to that page if it hasn't been done already.
> But when we're moving a tuple chain, *it is possible for the destination
> page to be the same as the source page*.  Since vc_vacpage applies
> PageRepairFragmentation, all the live tuples on the page may get moved.
> Afterwards, tuple.t_data is out of date and pointing at some random
> chunk of some other tuple.  The subsequent copy of the tuple copies
> garbage, which explains Alfred's several crashes in constructing index
> entries for the copied tuple (all of which bombed out from the
> index-build calls at lines 1634 ff, ie, for tuples being moved as part
> of a chain).  Once in a while, the obsolete pointer will be pointing at
> the real header of a different tuple --- perhaps even the place where we
> are about to put the copy.  This improbable case explains the one
> observed Assert crash in which a copied tuple's HEAP_MOVED_IN bit
> mysteriously got turned off.  Reason: it was cleared through the
> old-tuple pointer just after being set via the new-tuple one.
> 
> Proof that this is happening can be seen in the core dumps for Alfred's
> index-construction-crash cases: tuple.t_data does not point at the same
> place that the tuple.ip_posid'th page line item points at.  This could
> only happen if the page was reshuffled since the tuple pointer was set
> up.  The explanation for the Assert crash is a bit of a leap of faith,
> but I feel confident that it's right.
> 
> The solution is to do everything we're going to do with the source
> tuple, especially copying it and updating its state, *before* we apply
> vc_vacpage to the destination page.  Then we don't care if the source
> gets moved during vc_vacpage.
> 
> I will prepare a patch along this line and send it to Alfred for
> testing.
> 
>             regards, tom lane
> 
> 

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: The lightbulb just went on...

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
> Something to force a v7.0.3 ... ?

Yes.  We had plenty to force a 7.0.3 already, actually, but I was
holding off recommending a release in hopes of finding Alfred's
problem.

I will get this patch made up tonight for REL7_0; if Alfred doesn't
see more failures after running it for a few days, then let's move
forward on a 7.0.3 release.
        regards, tom lane


Re: The lightbulb just went on...

From
The Hermit Hacker
Date:
On Mon, 16 Oct 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
> > Something to force a v7.0.3 ... ?
> 
> Yes.  We had plenty to force a 7.0.3 already, actually, but I was
> holding off recommending a release in hopes of finding Alfred's
> problem.

I thought so, about having plenty, but when I asked before SF, it sort of
fell on deaf ears, so figured you weren't ready yet :)

> I will get this patch made up tonight for REL7_0; if Alfred doesn't
> see more failures after running it for a few days, then let's move
> forward on a 7.0.3 release.

that works for me ... I'm in Montreal for the weekend, so if we can get it
out before Thursday, great, else we'll do it on Monday, 'k? 




Re: The lightbulb just went on...

From
Tom Lane
Date:
The Hermit Hacker <scrappy@hub.org> writes:
>> I will get this patch made up tonight for REL7_0; if Alfred doesn't
>> see more failures after running it for a few days, then let's move
>> forward on a 7.0.3 release.

> that works for me ... I'm in Montreal for the weekend, so if we can get it
> out before Thursday, great, else we'll do it on Monday, 'k? 

I think he was seeing MTBF of several days anyway, so we won't have any
confidence that the problem is gone before next week.
        regards, tom lane


Re: The lightbulb just went on...

From
Michael J Schout
Date:
Tom:

I think I may have been seeing this problem as well.  We were getting
crashes very often with 7.0.2 during VACUUM's if activity was going
on to our database during the vacuum (even though the activity was 
light).  Our solution in the meantime was to simply disable the
applications during a vacuum to avoid any activity during the vacuum,
and we have not had a crash on vacuum since that happened.  If this
sounds consistent with the problem you think Alfred is having, then
I would be willing to test your patch on our system as well.

If you think it would help, feel free to send me the patch and I will
do some testing on it for you.

Thanks.
Mike

On Mon, 16 Oct 2000, Tom Lane wrote:

> ... with a blinding flash ...
> 
> The VACUUM funnies I was complaining about before may or may not be real
> bugs, but they are not what's biting Alfred.  None of them can lead to
> the observed crashes AFAICT.
...



Re: The lightbulb just went on...

From
Tom Lane
Date:
Michael J Schout <mschout@gkg.net> writes:
> I think I may have been seeing this problem as well.  We were getting
> crashes very often with 7.0.2 during VACUUM's if activity was going
> on to our database during the vacuum (even though the activity was 
> light).  Our solution in the meantime was to simply disable the
> applications during a vacuum to avoid any activity during the vacuum,
> and we have not had a crash on vacuum since that happened.  If this
> sounds consistent with the problem you think Alfred is having,

Yes, it sure does.

The patch I have applies atop a previous change in the REL7_0_PATCHES
branch, so what I would recommend is that you pull the current state of
the REL7_0_PATCHES branch from our CVS server, and then you can test
what will shortly become 7.0.3.  There are several other critical bug
fixes in there since 7.0.2.

Dunno if you know how to use cvs, but the critical steps are explained
at http://www.postgresql.org/docs/postgres/x28786.htm.  Note that the
given recipe will pull current development tip, which is NOT what you
want.  In step 3, instead of doing... co -P pgsql
do... co -P -r REL7_0_PATCHES pgsql

Then configure and build as usual.
        regards, tom lane


Re: The lightbulb just went on...

From
Alfred Perlstein
Date:
* Michael J Schout <mschout@gkg.net> [001017 08:50] wrote:
> Tom:
> 
> I think I may have been seeing this problem as well.  We were getting
> crashes very often with 7.0.2 during VACUUM's if activity was going
> on to our database during the vacuum (even though the activity was 
> light).  Our solution in the meantime was to simply disable the
> applications during a vacuum to avoid any activity during the vacuum,
> and we have not had a crash on vacuum since that happened.  If this
> sounds consistent with the problem you think Alfred is having, then
> I would be willing to test your patch on our system as well.
> 
> If you think it would help, feel free to send me the patch and I will
> do some testing on it for you.

I'm not sure if you've been subscribed to this list for long, but
it would have been nice if you had spoken up when I initially
reported the problems so that the developers realized this wasn't
a completely isolated incident.

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."


Re: The lightbulb just went on...

From
Michael J Schout
Date:
On Tue, 17 Oct 2000, Tom Lane wrote:

> > and we have not had a crash on vacuum since that happened.  If this
> > sounds consistent with the problem you think Alfred is having,
> 
> Yes, it sure does.
> 
> The patch I have applies atop a previous change in the REL7_0_PATCHES
> branch, so what I would recommend is that you pull the current state of
> the REL7_0_PATCHES branch from our CVS server, and then you can test
> what will shortly become 7.0.3.  There are several other critical bug
> fixes in there since 7.0.2.

Hi Tom.

I built from the REL7_0_PATCHES tree yesterday and did some testing on the
database.  So far no crashes during vacuum like I had been seeing with 7.0.2
:).

I am seeing a different problem (and I have seen this with 7.0.2 as well).  If
I run vacuum, sometimes this error pops up in the client application during the
vacuum:

ERROR:  RelationClearRelation: relation 1668325 modified while in use

relation 1668325 is a view named "sessions".

What the client does with sessions is:

SELECT session_data, id 
FROM   sessions
WHERE  id = ?
FOR UPDATE

.... client does some processing ...

UPDATE sessions SET session_data = ? WHERE id = ?;

(this is where the error happens)

I think part of my problem might be that sessions is a view and not a table,
but it is probably a bug that needs to be noted nonetheless.  I am going to try
converting "sessions" to a table and see if I can reproduce it that way.

Mike



Re: The lightbulb just went on...

From
Tom Lane
Date:
Michael J Schout <mschout@gkg.net> writes:
> ERROR:  RelationClearRelation: relation 1668325 modified while in use
> relation 1668325 is a view named "sessions".

Hm.  This message is coming out of the relation cache code when it sees
an invalidate-your-cache-for-this-relation message from another backend
and the relation in question has already been locked during the current
transaction.  Probably, what is happening is that the vacuum process is
vacuuming the view (not too much to do there ;-) but it does it anyway)
and sending out the cache inval message for it after the other client
process has already started parsing of a query using the view.
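
As a rough illustration only (a toy model, not the actual relation cache
code), the check can be pictured like this.  ToyRelCacheEntry,
toy_handle_inval and the in_use flag are hypothetical names standing in
for "this backend has the relation cached" and "the relation is already
locked in the current transaction".

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    INVAL_NOT_CACHED,           /* message is about some other relation */
    INVAL_FLUSHED,              /* cached entry dropped; rebuilt on next use */
    INVAL_ERROR_IN_USE          /* "relation ... modified while in use" */
} InvalResult;

/* Hypothetical per-backend cache entry for one relation. */
typedef struct {
    unsigned relid;
    int      cached;            /* entry is present in this backend's cache */
    int      in_use;            /* already locked by the current transaction */
} ToyRelCacheEntry;

/* React to an invalidate-your-cache message sent by another backend. */
InvalResult toy_handle_inval(ToyRelCacheEntry *e, unsigned relid)
{
    assert(e != NULL);
    if (!e->cached || e->relid != relid)
        return INVAL_NOT_CACHED;
    if (e->in_use)
        return INVAL_ERROR_IN_USE;  /* can't safely discard it mid-query */
    e->cached = 0;
    return INVAL_FLUSHED;
}
```

In this model the vacuum of the view is what sends the message, and the
client that has already started parsing a query on "sessions" is the
backend that finds in_use set.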

This is a fairly subtle problem that I don't think we will be able to
fix as a backpatch for 7.0.*.  It's on the to-fix list for 7.1 though.
        regards, tom lane