Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"
From | Jeff Janes |
---|---|
Subject | Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx" |
Date | |
Msg-id | CAMkU=1wKFSCByKkhdbPPD49ENJQJ9NrXDkDHZyBdqiL1KGdWTA@mail.gmail.com |
In response to | 9.3: more problems with "Could not open file "pg_multixact/members/xxxx" (Jeff Janes <jeff.janes@gmail.com>) |
Responses | Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx" |
List | pgsql-hackers |
On Tue, Jul 15, 2014 at 3:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Jun 27, 2014 at 11:51 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Jeff Janes wrote:
>>> This problem was initially fairly easy to reproduce, but since I
>>> started adding instrumentation specifically to catch it, it has become
>>> devilishly hard to reproduce.
>>>
>>> I think my next step will be to also log each of the values which goes
>>> into the complex if (...) expression that decides on the deletion.
>>
>> Could you please try to reproduce it after updating to latest? I pushed
>> fixes that should close these issues. Maybe you want to remove the
>> instrumentation you added, to make failures more likely.
>
> There are still some problems in 9.4, but I haven't been able to diagnose them and wanted to do more research on it. The announcement of upcoming back-branches for 9.3 spurred me to try it there, and I have problems with 9.3 (12c5bbdcbaa292b2a4b09d298786) as well. The move of truncation to the checkpoint seems to have made the problem easier to reproduce. On an 8 core machine, this test fell over after about 20 minutes, which is much faster than it usually reproduces.
>
> This is the error I get:
>
> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:ERROR: could not access status of transaction 85837221
> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:DETAIL: Could not open file "pg_multixact/members/14031": No such file or directory.
> 2084 UPDATE 2014-07-15 15:26:20.608 PDT:CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."foo_parent" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
>
> The testing harness is attached as 3 patches that must be made to the test server, and 2 scripts. The script do.sh sets up the database (using fixed paths, so be careful) and then invokes count.pl in a loop to do the actual work.

Sorry, after a long time when I couldn't do much testing on this, I've now been able to get back to it.
It looks like what is happening is that checkPoint.nextMultiOffset wraps around from 2^32 to 0, even if 0 is still being used. At that point it starts deleting member files that are still needed.
Is there some interlock which is supposed to prevent checkPoint.nextMultiOffset from lapping itself? I haven't been able to find it. It seems like the interlock applies only to the MultiXid, not the Offset.
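
As a rough illustration of the failure mode described above, here is a toy model in Python. It is not PostgreSQL code: every name in it (advance, is_kept, would_lap, and so on) is hypothetical, and the "keep only offsets between the oldest and the next one" rule is an assumed simplification of truncation. It only shows how a 32-bit offset counter that laps the oldest live offset makes recently written offsets look removable, and what a guard against lapping could check.

```python
# Toy model only; none of these names come from the PostgreSQL source.
UINT32_MOD = 1 << 32  # the offset counter is assumed to be 32 bits wide


def advance(next_offset, nmembers):
    """Advance the offset counter, wrapping modulo 2^32."""
    return (next_offset + nmembers) % UINT32_MOD


def live_distance(oldest, next_offset):
    """Size of the range [oldest, next_offset) as seen modulo 2^32."""
    return (next_offset - oldest) % UINT32_MOD


def is_kept(offset, oldest, next_offset):
    """Assumed truncation rule: keep only offsets inside [oldest, next_offset)."""
    return (offset - oldest) % UINT32_MOD < live_distance(oldest, next_offset)


def would_lap(oldest, next_offset, nmembers):
    """Guard sketch: True if allocating nmembers would reach or pass oldest."""
    return live_distance(oldest, next_offset) + nmembers >= UINT32_MOD


oldest = 100                     # oldest offset still in use
next_offset = UINT32_MOD - 10    # counter is about to wrap

# Before the wrap, a recently written offset is inside the live range.
kept_before = is_kept(UINT32_MOD - 20, oldest, next_offset)

# Allocate 200 members with no guard: the counter wraps past oldest.
lapped_next = advance(next_offset, 200)

# After lapping, an offset written just before the wrap looks removable.
kept_after = is_kept(UINT32_MOD - 5, oldest, lapped_next)

print(kept_before, lapped_next, kept_after)
```

In this model, a guard such as would_lap(oldest, next_offset, 200) would flag the allocation before the counter advances, which is the kind of check the question above is asking about for the offset counter.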
Thanks,
Jeff