Thread: failure with pg_dump

failure with pg_dump

From
Mija Lee
Date:
Hi:

I have a script that I use to do regular dumps of my database. Over the
weekend it failed, and produced the following error message. I'm not
sure why this would have happened, how I would find out which index is
referenced by 136451098, or where this select came from.

pg_dump.sqlhost: Error message from server: ERROR:  cache lookup failed
for index 136451098
pg_dump.sqlhost: The command was: SELECT t.tableoid, t.oid, t.relname as
indexname, pg_catalog.pg_get_indexdef(i.indexrelid) as indexdef,
t.relnatts as indnkeys, i.indkey, i.indisclustered, c.contype,
c.conname, c.tableoid as contableoid, c.oid as conoid, (SELECT spcname
FROM pg_catalog.pg_tablespace s WHERE s.oid = t.reltablespace) as
tablespace, array_to_string(t.reloptions, ', ') as options FROM
pg_catalog.pg_index i JOIN pg_catalog.pg_class t ON (t.oid =
i.indexrelid) LEFT JOIN pg_catalog.pg_depend d ON (d.classid =
t.tableoid AND d.objid = t.oid AND d.deptype = 'i') LEFT JOIN
pg_catalog.pg_constraint c ON (d.refclassid = c.tableoid AND d.refobjid
= c.oid) WHERE i.indrelid = '136451090'::pg_catalog.oid ORDER BY indexname


Any help would be greatly appreciated.

Mija

Re: failure with pg_dump

From
Tom Lane
Date:
Mija Lee <mija@scharp.org> writes:
> I have a script that I use to do regular dumps of my database. Over the
> weekend it failed, and produced the following error message. I'm not
> sure why this would have happened, how I would find out which index is
> referenced by 136451098, or where this select came from.

It sounds like system catalog corruption, which is not good :-(.

> pg_dump.sqlhost: Error message from server: ERROR:  cache lookup failed
> for index 136451098
> pg_dump.sqlhost: The command was: SELECT t.tableoid, t.oid, t.relname as
> indexname, pg_catalog.pg_get_indexdef(i.indexrelid) as indexdef,
> t.relnatts as indnkeys, i.indkey, i.indisclustered, c.contype,
> c.conname, c.tableoid as contableoid, c.oid as conoid, (SELECT spcname
> FROM pg_catalog.pg_tablespace s WHERE s.oid = t.reltablespace) as
> tablespace, array_to_string(t.reloptions, ', ') as options FROM
> pg_catalog.pg_index i JOIN pg_catalog.pg_class t ON (t.oid =
> i.indexrelid) LEFT JOIN pg_catalog.pg_depend d ON (d.classid =
> t.tableoid AND d.objid = t.oid AND d.deptype = 'i') LEFT JOIN
> pg_catalog.pg_constraint c ON (d.refclassid = c.tableoid AND d.refobjid
> = c.oid) WHERE i.indrelid = '136451090'::pg_catalog.oid ORDER BY indexname

That looks like pg_dump's query to get information about the indexes of
a particular table.  So apparently the problem index is one of the ones
for the table with OID 136451090.  The easiest way to find out which one
that is is
    select '136451090'::regclass;
Trying \d on each of that table's indexes in succession would tell you
which one is trashed.
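
The same check can be sketched as a catalog query instead of running \d on each index by hand. This is only an illustration: it assumes the OIDs from the error message above, and it reads the same catalogs that may be damaged, so its output should be treated as a hint rather than proof.

```sql
-- Resolve the table OID from the failing query to a table name.
SELECT '136451090'::regclass AS table_name;

-- List every index recorded in pg_index for that table. A NULL index_name
-- marks a pg_index row whose matching pg_class row was not found.
SELECT i.indexrelid,
       c.relname AS index_name
FROM pg_catalog.pg_index i
LEFT JOIN pg_catalog.pg_class c ON c.oid = i.indexrelid
WHERE i.indrelid = '136451090'::pg_catalog.oid
ORDER BY i.indexrelid;
```

If 136451098 shows up with a NULL name, that is the entry pg_get_indexdef() could not look up.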

As for fixing it, the $64 question is how extensive is the catalog
corruption.  I see no very good reason to hope that only this one index
is affected :-(.  What you probably want to do is try to get a clean
pg_dump then initdb and reload --- at least that's how I'd approach it,
rather than hoping that there's no lurking problems remaining after you
hack your way around the one you can see.

What I'd try first is a REINDEX on pg_class.  If that doesn't help,
try to delete the pg_index row linking 136451098 and 136451090.
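
Those two steps might look roughly like the following. It is a sketch under stated assumptions: a superuser session, the OIDs taken from the error message above, and a file-level backup made beforehand, since editing a system catalog by hand is risky on a cluster that is already suspect.

```sql
-- Step 1: rebuild the indexes on pg_class.
REINDEX TABLE pg_catalog.pg_class;

-- Step 2, only if the error persists: inspect the suspect pg_index row,
-- then delete it inside a transaction so the effect can be checked first.
BEGIN;

SELECT indexrelid, indrelid
FROM pg_catalog.pg_index
WHERE indexrelid = '136451098'::pg_catalog.oid
  AND indrelid   = '136451090'::pg_catalog.oid;

DELETE FROM pg_catalog.pg_index
WHERE indexrelid = '136451098'::pg_catalog.oid
  AND indrelid   = '136451090'::pg_catalog.oid;

-- Exactly one row should have been deleted. Replace ROLLBACK with COMMIT
-- only after confirming that.
ROLLBACK;
```

Ending the transaction with ROLLBACK by default means a mistyped OID cannot change the catalog by accident.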

What PG version is this, anyway, and did anything weird happen on your
system that might explain data corruption?

            regards, tom lane

Re: failure with pg_dump

From
Mija Lee
Date:
We've had a number of odd things that have been going on that I can't
really explain, and that don't seem to result in log entries. Here's
some info:

- this is running 8.2.4 on a solaris 10 machine
- I reran the dump after posting and these problems did not reoccur
- We have a number of replicated schemas and tables on this server.
There were other problems with the replication that happened earlier in
the evening.
- we have been having some very odd problems where our replication
scripts hang intermittently. For the life of me I can't figure out why,
but when this happens, I look for processes that are idle in transaction
that are more than one day old and kill them. That seems to allow the
replication to finish.  I have a few users that use a variety of
products to view and manipulate the data in these tables (tableau,
access, excel, ems, phppgadmin, dbvisualizer) and it seems like some
connections/transactions never terminate, but I can't figure out which
ones or why. I've been struggling with this problem for some time, but
have never had an issue with the stalled replication affecting the dump.
I was actually hoping that this error would help shed light on the
replication problem.

Mija
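
For reference, the manual hunt for stale sessions described above can be sketched as a query. It assumes the 8.2-era pg_stat_activity columns (procpid, current_query, query_start) and that stats_command_string is enabled; later releases renamed these columns. The one-day threshold simply mirrors the "more than one day old" rule of thumb.

```sql
-- Sessions sitting idle inside an open transaction for more than a day.
-- In 8.2, query_start is the start time of the session's last statement.
SELECT procpid,
       usename,
       client_addr,
       backend_start,
       query_start
FROM pg_catalog.pg_stat_activity
WHERE current_query = '<IDLE> in transaction'
  AND query_start < now() - interval '1 day'
ORDER BY query_start;
```

The usename and client_addr columns may help tie each stale session back to one of the client tools listed above.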

Re: failure with pg_dump

From
Tom Lane
Date:
Mija Lee <mija@scharp.org> writes:
> We've had a number of odd things that have been going on that I can't
> really explain, and that don't seem to result in log entries. Here's
> some info:

> - this is running 8.2.4 on a solaris 10 machine
> - I reran the dump after posting and these problems did not reoccur
> - We have a number of replicated schemas and tables on this server.
> There were other problems with the replication that happened earlier in
> the evening.

Hmm, are you using Slony?  If so, you ought to take this to the Slony
mailing lists.  There are various restrictions in Slony on what it
assumes can be done to a replicated table, and I believe that things
like what you are seeing are one of the possible consequences of
breaking Slony's expectations.  That's about as far as my knowledge
goes though.

            regards, tom lane

Re: failure with pg_dump

From
Mija Lee
Date:
We are not using Slony - we are replicating from SQL Server using Perl.