Thread: cache lookup failed for index
Hi Everyone,
So I have a small web cluster that I'm running PSQL on. Today, Apache failed which caused PSQL to fail which at some point caused some data in the database to become corrupt. So now, when I try to connect to the data in the database in question, I get: FATAL: cache lookup failed for index 2662. I get this when I try to connect on the standalone console or when I try to connect to the running server. This is going to sound terrible, but there is no current backup of this database. As the admin this is really embarrassing and I know I messed up.
That said, could someone point me in the right direction for restoring this? I've tried to reindex, which doesn't work because I can't even connect to the database in question. If I try to connect through the standalone console and reindex the entire system, the same error message pops up in the list.
What information can I provide that would help?
CentOS 5.5
PostgreSQL 8.1.21
This is the information from postgresql-Tue.log:
LOG: database system was interrupted at 2010-06-29 17:08:49 EDT
LOG: checkpoint record is at 0/3788A208
LOG: redo record is at 0/3788A208; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 610809; next OID: 28328
LOG: next MultiXactId: 8; next MultiXactOffset: 15
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/3788A258
LOG: redo is not required
LOG: database system is ready
LOG: transaction ID wrap limit is 2147484146, limited by database "postgres"
LOG: database system was interrupted at 2010-06-29 19:52:08 EDT
LOG: checkpoint record is at 0/3788A258
LOG: redo record is at 0/3788A258; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 610809; next OID: 28328
LOG: next MultiXactId: 8; next MultiXactOffset: 15
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/3788A2A8
LOG: redo is not required
LOG: database system is ready
LOG: transaction ID wrap limit is 2147484146, limited by database "postgres"
FATAL: cache lookup failed for index 2662
Any help in this matter would be GREATLY appreciated.
Nate Robertson
Nathan Robertson <nathan.robertson@gmail.com> wrote:
> Apache failed which caused PSQL to fail which at some point caused
> some data in the database to become corrupt.

What? What does Apache have to do with psql (a command-line client for PostgreSQL), and how would either of those failing cause a database corruption? Could you give more details? Did the OS lock up at some point? Was there any power loss while PostgreSQL was running?

Without an event like that, database corruption is often the result of hardware problems. Testing your RAM and making sure you have SMART monitoring working on the drives, with due attention to any reported problems, might be a good idea; otherwise, whatever you recover may be further damaged.

Also, it would help a lot to know what your postgresql.conf file contains (excluding all comments).

But first and foremost, you should make a file-copy backup of your entire PostgreSQL data directory tree with the PostgreSQL server stopped, if you haven't done that already. Any attempt at recovery may misfire, and you might want to get back to what you have now.

-Kevin
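(A minimal sketch of the file-copy backup described above, assuming a stock CentOS init script and a data directory of /var/lib/pgsql/data; both are placeholders, so substitute the actual service name and data directory.)

# run as root; stop the server first so the copied files are consistent
service postgresql stop
# cp -a preserves ownership, permissions, and timestamps; the destination path is just an example
cp -a /var/lib/pgsql/data /root/pgdata-backup-$(date +%Y%m%d)
service postgresql start   # or leave it stopped while investigating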
There was a cascade effect. Apache failed which caused the server overall to fail. The data is stored on an iSCSI drive and the mount of the iSCSI drive became corrupt when everything failed. I was able to remount the drive and get access to the data; now I have this index error.
So, this is where I'm at. If anyone could help resolve the index cache error I would be eternally grateful.
Nathan Robertson <nathan.robertson@gmail.com> wrote:
> There was a cascade effect. Apache failed which caused the server
> overall to fail. The data is stored on an iSCSI drive and the
> mount of the iSCSI drive became corrupt when everything failed. I
> was able to remount the drive and get access to the data; now I
> have this index error.

Now we're getting somewhere. The disk drive "became corrupt" while PostgreSQL was running? Was the drive unmounted or remounted while PostgreSQL was running, or did you stop PostgreSQL first? Do you have any errors in the PostgreSQL log from the time this was all going on?

Also, how confident are you that the Apache failure caused the drive to be corrupted? That sounds *much* less likely than the other way around. Without understanding that better, fixing one particular problem in the database on this machine might be like rearranging deck chairs on a sinking ship.

> So, this is where I'm at. If anyone could help resolve the index
> cache error I would be eternally grateful.

We'd like to help, and perhaps someone else can suggest something on the basis of information you've provided so far, but I'm not comfortable suggesting something without a little more of a sense of what happened and what your configuration is.

>> Also, it would help a lot to know what your postgresql.conf file
>> contains (excluding all comments).

This would still be useful.

>> But first and foremost, you should make a file-copy backup of
>> your entire PostgreSQL data directory tree with the PostgreSQL
>> server stopped, if you haven't done that already. Any attempt at
>> recovery may misfire, and you might want to get back to what you
>> have now.

I can't, in good conscience, recommend any recovery attempts until you confirm that you have a copy to restore if the cleanup effort misfires.

One more question occurs to me -- it seems unusual for someone to be running on a single disk with no RAID and no backup, but to be running with a version of PostgreSQL which is only about a month old. Was 8.1.21 the version you were running at the time of the failure, or have you upgraded during the recovery attempt? If you've upgraded, the version in use when the corruption occurred could be relevant.

-Kevin
Hi Kevin,
Thanks for the response.
> Now we're getting somewhere. The disk drive "became corrupt" while
> PostgreSQL was running? Was the drive unmounted or remounted while
> PostgreSQL was running, or did you stop PostgreSQL first? Do you
> have any errors in the PostgreSQL log from the time this was all
> going on?
The failure basically happened because the Django webapp we're running isn't effectively closing database connections. So, memory is completely filling up and causing the server to hang. Yesterday, when this happened it caused the entire network interface to become inoperable which meant that the iSCSI connection to the shared drive stopped working and data became corrupt.
I stopped the postgresql service before unmounting and remounting the target.
My first concern is restoring the database. I'll fix the problems with Django and Apache later; I can deal with those. I'm also going to set up a series of database backups that can be used to quickly restore data if this happens again. My concern right now is simply getting this back to baseline.
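(For the recurring backups mentioned above, a minimal sketch using pg_dump from cron; the database name webapp, the schedule, and the /backups path are all assumptions to adapt.)

# /etc/cron.d/pg-backup (hypothetical file): nightly logical dump at 02:30, run as the postgres user
30 2 * * * postgres pg_dump webapp | gzip > /backups/webapp-$(date +\%Y\%m\%d).sql.gz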
> One more question occurs to me -- it seems unusual for someone to be
> running on a single disk with no RAID and no backup, but to be
> running with a version of PostgreSQL which is only about a month old.
> Was 8.1.21 the version you were running at the time of the failure,
> or have you upgraded during the recovery attempt? If you've
> upgraded, the version in use when the corruption occurred could be relevant.
This storage server has RAID and there are backups; it just so happens that the most recent usable backup is from June 20th. I completely forgot to configure the backups on this server. I normally wouldn't make this mistake, but I did this time.
On the version, this is the version that comes standard with CentOS 5.5. This was a clean CentOS 5.5 install and it's been live for about a month.
>> Also, it would help a lot to know what your postgresql.conf file
>> contains (excluding all comments).
The only uncommented lines are:
max_connections = 500
shared_buffers = 4000
redirect_stderr = on
log_directory = 'pg_log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1440
log_rotation_size = 0
redirect_stderr = on
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
> I can't, in good conscience, recommend any recovery attempts until
> you confirm that you have a copy to restore if the cleanup effort
> misfires.
I have a full backup of the entire directory structure I took shortly after the database became unusable.
Nathan Robertson <nathan.robertson@gmail.com> wrote:
> The failure basically happened because the Django webapp we're
> running isn't effectively closing database connections. So, memory
> is completely filling up and causing the server to hang.
> Yesterday, when this happened it caused the entire network
> interface to become inoperable which meant that the iSCSI
> connection to the shared drive stopped working and data became
> corrupt.
>
> I stopped the postgresql service before unmounting and remounting
> the target.

OK, I think the appropriate next step would be to try to run the PostgreSQL cluster in single-user mode:

http://www.postgresql.org/docs/8.1/interactive/app-postgres.html

Try to REINDEX pg_class_oid_index in that mode. If that fails, it might possibly help to run these statements and try the REINDEX command again:

set enable_indexscan = off;
set enable_bitmapscan = off;

I hope this helps.

-Kevin
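(A sketch of what that single-user session might look like, assuming the regular server is stopped, the data directory is /var/lib/pgsql/data, and the damaged database is called mydb; all three names are placeholders.)

# run as the postgres user, with the regular server shut down
postgres -O -P -D /var/lib/pgsql/data mydb
backend> REINDEX INDEX pg_class_oid_index;
(type Ctrl-D at the prompt to exit)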
Hi Kevin,
Thanks for this. I've found a lot of information on this online but I'm a little unclear about how exactly I should connect and run the reindex.
My thinking based on the documentation is I run (as postgres user):
postgres -O -P -D /dbcluster/location
Then I run:
REINDEX TABLE pg_class_oid_in;
Is this correct?
Nate
Nathan Robertson <nathan.robertson@gmail.com> wrote:
> My thinking based on the documentation is I run (as postgres
> user):
> postgres -O -P -D /dbcluster/location

Looks good to me. In fact, I hadn't remembered the -P option; definitely a good choice here, and it should obviate the need to try to disable the index usage using the commands I showed.

> Then I run:
> REINDEX TABLE pg_class_oid_in;

You either need to specify the INDEX keyword or a table name. Perhaps this would be best:

REINDEX TABLE pg_class;

-Kevin
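(To spell out the two accepted forms, purely as a syntax illustration:)

REINDEX INDEX pg_class_oid_index;   -- rebuild just that one index
REINDEX TABLE pg_class;             -- rebuild every index on pg_class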
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > OK, I think the appropriate next step would be to try to run the > PostgreSQL cluster in single-user mode: > http://www.postgresql.org/docs/8.1/interactive/app-postgres.html > Try to REINDEX pg_class_oid_index in that mode. If that fails, it > might possibly help to run these statements and try the REINDEX > command again: > set enable_indexscan = off; > set enable_bitmapscan = off; Those won't help. What you *will* need, in order to even start the single-user backend, is to tell it to disregard system indexes (-P command line option). I wouldn't be too surprised if the corruption extends a lot further than the one index :-( but maybe you will be able to extract something after reindexing. regards, tom lane
Thanks Tom.
OK, I ran:
postgres -O -P -D /cluster/location
reindex table pg_class;
backend> reindex table pg_class;
And then nothing returns. Nothing stating whether it was successful or a failure.
And then if I do:
bash-3.2$ postgres -O -P -D /shared/webapp/database webapp
I still get:
FATAL: cache lookup failed for index 2662
Nathan Robertson <nathan.robertson@gmail.com> writes:
> OK, I ran:
> postgres -O -P -D /cluster/location

This probably connected to the postgres database, not webapp which is where your problem is.

> backend> reindex table pg_class;
> And then nothing returns. Nothing stating whether it was successful or a
> failure.

The standalone mode isn't very verbose. If you got a prompt back without any error then it's OK. Just type ^D at the prompt to exit.

regards, tom lane
> This probably connected to the postgres database, not webapp which is
> where your problem is.

OK, is there any way I can force a connection to this database so I can get a look at the tables (webapp)? To be honest, if there is corrupt data in there, that is fine. I'll manually pull it out. I just need to be able to figure out what uncorrupted data is in there so I can determine if it's even worth saving.
Nathan Robertson <nathan.robertson@gmail.com> writes:
> This probably connected to the postgres database, not webapp which is
> where your problem is.
>
> OK, is there any way I can force a connection to this database

postgres -O -P -D /cluster/location webapp

(You don't really need the -O, but it probably doesn't hurt either.)

regards, tom lane
OK, as postgres user:
-bash-3.2$ postgres -O -P -D /tmp/database webapp
FATAL: cache lookup failed for index 2662
Nathan Robertson <nathan.robertson@gmail.com> writes:
> OK, as postgres user:
> -bash-3.2$ postgres -O -P -D /tmp/database webapp
> FATAL: cache lookup failed for index 2662

Even with -P? Wow, that's bad. This DB may be just toast I'm afraid. But try it like this:

postgres -O -P -D /tmp/database -c log_error_verbosity=verbose webapp

This should give you the exact location of the error, which will give us a bit better clue what's failing.

regards, tom lane
Nathan Robertson <nathan.robertson@gmail.com> writes:
> -bash-3.2$ postgres -O -P -D /tmp/database -c log_error_verbosity=verbose webapp
> FATAL: XX000: cache lookup failed for index 2662
> LOCATION: RelationInitIndexAccessInfo, relcache.c:841

Um ... and you said this was current 8.1.x ... so it's dying here:

    tuple = SearchSysCache(INDEXRELID,
                           ObjectIdGetDatum(RelationGetRelid(relation)),
                           0, 0, 0);
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for index %u",
             RelationGetRelid(relation));

I interpret this to mean that it can't find the pg_index row for pg_class_oid_index --- and since you're using -P, that doesn't just mean corruption in pg_index's indexes, but that the tuple can't be found even by seqscanning the whole catalog. Probably the whole page it's in got wiped out by your filesystem-level failure.

I hate to be the bearer of bad news, but I think this DB may be beyond recovery. It's very unlikely that there's just the one tuple gone. If you're willing to throw money at the problem, there are various people who offer consulting services that include trying to reconstruct broken Postgres databases; but you might be best advised to just take your lumps and go back to your last good backup. At this point you're looking at a significant investment of time with no guarantee of being able to extract anything very useful.

What I'm taking from this is another horror story about the risks of mounting databases across networks instead of locally :-(. Postgres is only as reliable as the storage it's using.

regards, tom lane
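(If it does come down to falling back to the last good backup, one rough sketch of a rebuild, assuming the server is stopped, the backup is a plain pg_dump SQL file, and every path and filename below is a placeholder rather than anything confirmed in the thread.)

# run as the postgres user
mv /var/lib/pgsql/data /var/lib/pgsql/data.corrupt   # keep the damaged cluster for later salvage attempts
initdb -D /var/lib/pgsql/data                        # create a fresh, empty cluster
pg_ctl -D /var/lib/pgsql/data -l /tmp/pg.log start
createdb webapp
psql -d webapp -f /backups/webapp-20100620.sql       # restore the last good dump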