Thread: cache lookup failed for index
Hi Everyone,
So I have a small web cluster that I'm running PSQL on. Today, Apache failed which caused PSQL to fail which at some point caused some data in the database to become corrupt. So now, when I try to connect to the data in the database in question, I get: FATAL: cache lookup failed for index 2662. I get this when I try to connect on the standalone console or when I try to connect to the running server. This is going to sound terrible, but there is no current backup of this database. As the admin this is really embarrassing and I know I messed up.
That said, could someone point me in the right direction for restoring this? I've tried to reindex, which doesn't work because I can't even connect to the database in question. If I try to connect through the standalone console and reindex the entire system, the same error message pops up in the list.
What information can I provide that would help?
CentOS 5.5
PostgreSQL 8.1.21
This is the information from postgresql-Tue.log:
LOG: database system was interrupted at 2010-06-29 17:08:49 EDT
LOG: checkpoint record is at 0/3788A208
LOG: redo record is at 0/3788A208; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 610809; next OID: 28328
LOG: next MultiXactId: 8; next MultiXactOffset: 15
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/3788A258
LOG: redo is not required
LOG: database system is ready
LOG: transaction ID wrap limit is 2147484146, limited by database "postgres"
LOG: database system was interrupted at 2010-06-29 19:52:08 EDT
LOG: checkpoint record is at 0/3788A258
LOG: redo record is at 0/3788A258; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 610809; next OID: 28328
LOG: next MultiXactId: 8; next MultiXactOffset: 15
LOG: database system was not properly shut down; automatic recovery in progress
LOG: record with zero length at 0/3788A2A8
LOG: redo is not required
LOG: database system is ready
LOG: transaction ID wrap limit is 2147484146, limited by database "postgres"
FATAL: cache lookup failed for index 2662
Any help in this matter would be GREATLY appreciated.
Nate Robertson
Nathan Robertson <nathan.robertson@gmail.com> wrote:
> Apache failed which caused PSQL to fail which at some point caused
> some data in the database to become corrupt.

What? What does Apache have to do with psql (a command-line client for PostgreSQL), and how would either of those failing cause a database corruption? Could you give more details? Did the OS lock up at some point? Was there any power loss while PostgreSQL was running?

Without an event like that, database corruption is often the result of hardware problems. Testing your RAM and making sure you have SMART monitoring working on the drives, with due attention to any reported problems, might be a good idea; otherwise, whatever you recover may be further damaged.

Also, it would help a lot to know what your postgresql.conf file contains (excluding all comments).

But first and foremost, you should make a file-copy backup of your entire PostgreSQL data directory tree with the PostgreSQL server stopped, if you haven't done that already. Any attempt at recovery may misfire, and you might want to get back to what you have now.

-Kevin
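(A minimal sketch of the file-copy backup described above, assuming a stock CentOS init script and a data directory of /var/lib/pgsql/data; both are placeholders, so substitute the actual service name and data directory.)

# run as root; stop the server first so the copied files are consistent
service postgresql stop
# cp -a preserves ownership, permissions, and timestamps; the destination path is just an example
cp -a /var/lib/pgsql/data /root/pgdata-backup-$(date +%Y%m%d)
service postgresql start   # or leave it stopped while investigating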
There was a cascade effect. Apache failed which caused the server overall to fail. The data is stored on an iSCSI drive and the mount of the iSCSI drive became corrupt when everything failed. I was able to remount the drive and get access to the data; now I have this index error.
So, this is where I'm at. If anyone could help resolve the index cache error I would be eternally grateful.
Nathan Robertson <nathan.robertson@gmail.com> wrote:
> There was a cascade effect. Apache failed which caused the server
> overall to fail. The data is stored on an iSCSI drive and the
> mount of the iSCSI drive became corrupt when everything failed. I
> was able to remount the drive and get access to the data; now I
> have this index error.

Now we're getting somewhere. The disk drive "became corrupt" while PostgreSQL was running? Was the drive unmounted or remounted while PostgreSQL was running, or did you stop PostgreSQL first? Do you have any errors in the PostgreSQL log from the time this was all going on?

Also, how confident are you that the Apache failure caused the drive to be corrupted? That sounds *much* less likely than the other way around. Without understanding that better, fixing one particular problem in the database on this machine might be like rearranging deck chairs on a sinking ship.

> So, this is where I'm at. If anyone could help resolve the index
> cache error I would be eternally grateful.

We'd like to help, and perhaps someone else can suggest something on the basis of information you've provided so far, but I'm not comfortable suggesting something without a little more of a sense of what happened and what your configuration is.

>> Also, it would help a lot to know what your postgresql.conf file
>> contains (excluding all comments).

This would still be useful.

>> But first and foremost, you should make a file-copy backup of
>> your entire PostgreSQL data directory tree with the PostgreSQL
>> server stopped, if you haven't done that already. Any attempt at
>> recovery may misfire, and you might want to get back to what you
>> have now.

I can't, in good conscience, recommend any recovery attempts until you confirm that you have a copy to restore if the cleanup effort misfires.

One more question occurs to me -- it seems unusual for someone to be running on a single disk with no RAID and no backup, but to be running with a version of PostgreSQL which is only about a month old. Was 8.1.21 the version you were running at the time of the failure, or have you upgraded during the recovery attempt? If you've upgraded, the version in use when the corruption occurred could be relevant.

-Kevin
Hi Kevin,
Thanks for the response.
> Now we're getting somewhere. The disk drive "became corrupt" while
> PostgreSQL was running? Was the drive unmounted or remounted while
> PostgreSQL was running, or did you stop PostgreSQL first? Do you
> have any errors in the PostgreSQL log from the time this was all
> going on?
The failure basically happened because the Django webapp we're running isn't effectively closing database connections. So, memory is completely filling up and causing the server to hang. Yesterday, when this happened it caused the entire network interface to become inoperable which meant that the iSCSI connection to the shared drive stopped working and data became corrupt.
I stopped the postgresql service before unmounting and remounting the target.
My first concern is restoring the database. I'll fix the problems with Django and Apache later; I can deal with those. I'm also going to set up a series of database backups that can be used to quickly restore data if this happens again. My concern right now is simply getting this back to baseline.
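(For the recurring backups mentioned above, a minimal sketch using pg_dump from cron; the database name webapp, the schedule, and the /backups path are all assumptions to adapt.)

# /etc/cron.d/pg-backup (hypothetical file): nightly logical dump at 02:30, run as the postgres user
30 2 * * * postgres pg_dump webapp | gzip > /backups/webapp-$(date +\%Y\%m\%d).sql.gz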
> One more question occurs to me -- it seems unusual for someone to be
> running on a single disk with no RAID and no backup, but to be
> running with a version of PostgreSQL which is only about a month old.
> Was 8.1.21 the version you were running at the time of the failure,
> or have you upgraded during the recovery attempt? If you've
> upgraded, the version in use when the corruption occurred could be relevant.
This storage server has RAID and there are backups; it just so happens that the most recent usable backup is from June 20th. I completely forgot to configure the backups on this server. I normally wouldn't make this mistake, but I did this time.
On the version, this is the version that comes standard with CentOS 5.5. This was a clean CentOS 5.5 install and it's been live for about a month.
>> Also, it would help a lot to know what your postgresql.conf file
>> contains (excluding all comments).
The only uncommented lines are:
max_connections = 500
shared_buffers = 4000
redirect_stderr = on
log_directory = 'pg_log'
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1440
log_rotation_size = 0
redirect_stderr = on
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
> I can't, in good conscience, recommend any recovery attempts until
> you confirm that you have a copy to restore if the cleanup effort
> misfires.
I have a full backup of the entire directory structure I took shortly after the database became unusable.
Nathan Robertson <nathan.robertson@gmail.com> wrote:
> The failure basically happened because the Django webapp we're
> running isn't effectively closing database connections. So, memory
> is completely filling up and causing the server to hang.
> Yesterday, when this happened it caused the entire network
> interface to become inoperable which meant that the iSCSI
> connection to the shared drive stopped working and data became
> corrupt.
>
> I stopped the postgresql service before unmounting and remounting
> the target.

OK, I think the appropriate next step would be to try to run the PostgreSQL cluster in single-user mode:

http://www.postgresql.org/docs/8.1/interactive/app-postgres.html

Try to REINDEX pg_class_oid_index in that mode. If that fails, it might possibly help to run these statements and try the REINDEX command again:

set enable_indexscan = off;
set enable_bitmapscan = off;

I hope this helps.

-Kevin
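(A sketch of what that single-user session might look like, assuming the regular server is stopped, the data directory is /var/lib/pgsql/data, and the damaged database is called mydb; all three names are placeholders.)

# run as the postgres user, with the regular server shut down
postgres -O -P -D /var/lib/pgsql/data mydb
backend> REINDEX INDEX pg_class_oid_index;
(type Ctrl-D at the prompt to exit)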
Hi Kevin,
Thanks for this. I've found a lot of information on this online but I'm a little unclear about how exactly I should connect and run the reindex.
My thinking based on the documentation is I run (as postgres user):
postgres -O -P -D /dbcluster/location
Then I run:
REINDEX TABLE pg_class_oid_in;
Is this correct?
Nate
Nathan Robertson <nathan.robertson@gmail.com> wrote:
> My thinking based on the documentation is I run (as postgres
> user):
> postgres -O -P -D /dbcluster/location

Looks good to me. In fact, I hadn't remembered the -P option; definitely a good choice here, and it should obviate the need to try to disable the index usage using the commands I showed.

> Then I run:
> REINDEX TABLE pg_class_oid_in;

You either need to specify the INDEX keyword or a table name. Perhaps this would be best:

REINDEX TABLE pg_class;

-Kevin
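(To spell out the two accepted forms, purely as a syntax illustration:)

REINDEX INDEX pg_class_oid_index;   -- rebuild just that one index
REINDEX TABLE pg_class;             -- rebuild every index on pg_class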
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > OK, I think the appropriate next step would be to try to run the > PostgreSQL cluster in single-user mode: > http://www.postgresql.org/docs/8.1/interactive/app-postgres.html > Try to REINDEX pg_class_oid_index in that mode. If that fails, it > might possibly help to run these statements and try the REINDEX > command again: > set enable_indexscan = off; > set enable_bitmapscan = off; Those won't help. What you *will* need, in order to even start the single-user backend, is to tell it to disregard system indexes (-P command line option). I wouldn't be too surprised if the corruption extends a lot further than the one index :-( but maybe you will be able to extract something after reindexing. regards, tom lane
Thanks Tom.
OK, I ran:
postgres -O -P -D /cluster/location
reindex table pg_class;
backend> reindex table pg_class;
And then nothing returns. Nothing stating whether it was successful or a failure.
And then if I do:
bash-3.2$ postgres -O -P -D /shared/webapp/database webapp
I still get:
FATAL: cache lookup failed for index 2662
Nathan Robertson <nathan.robertson@gmail.com> writes:
> OK, I ran:
> postgres -O -P -D /cluster/location

This probably connected to the postgres database, not webapp which is where your problem is.

> backend> reindex table pg_class;
> And then nothing returns. Nothing stating whether it was successful or a
> failure.

The standalone mode isn't very verbose. If you got a prompt back without any error then it's OK. Just type ^D at the prompt to exit.

regards, tom lane
> This probably connected to the postgres database, not webapp which is
> where your problem is.

OK, is there any way I can force a connection to this database so I can get a look at the tables (webapp)? To be honest, if there is corrupt data in there, that is fine. I'll manually pull it out. I just need to be able to figure out what uncorrupted data is in there so I can determine if it's even worth saving.
Nathan Robertson <nathan.robertson@gmail.com> writes:
> This probably connected to the postgres database, not webapp which is
> where your problem is.
>
> OK, is there any way I can force a connection to this database

postgres -O -P -D /cluster/location webapp

(You don't really need the -O, but it probably doesn't hurt either.)

regards, tom lane
OK, as postgres user:
-bash-3.2$ postgres -O -P -D /tmp/database webapp
FATAL: cache lookup failed for index 2662
Nathan Robertson <nathan.robertson@gmail.com> writes:
> OK, as postgres user:
> -bash-3.2$ postgres -O -P -D /tmp/database webapp
> FATAL: cache lookup failed for index 2662

Even with -P? Wow, that's bad. This DB may be just toast I'm afraid. But try it like this:

postgres -O -P -D /tmp/database -c log_error_verbosity=verbose webapp

This should give you the exact location of the error, which will give us a bit better clue what's failing.

regards, tom lane
Nathan Robertson <nathan.robertson@gmail.com> writes:
> -bash-3.2$ postgres -O -P -D /tmp/database -c log_error_verbosity=verbose webapp
> FATAL: XX000: cache lookup failed for index 2662
> LOCATION: RelationInitIndexAccessInfo, relcache.c:841

Um ... and you said this was current 8.1.x ... so it's dying here:

    tuple = SearchSysCache(INDEXRELID,
                           ObjectIdGetDatum(RelationGetRelid(relation)),
                           0, 0, 0);
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for index %u",
             RelationGetRelid(relation));

I interpret this to mean that it can't find the pg_index row for pg_class_oid_index --- and since you're using -P, that doesn't just mean corruption in pg_index's indexes, but that the tuple can't be found even by seqscanning the whole catalog. Probably the whole page it's in got wiped out by your filesystem-level failure.

I hate to be the bearer of bad news, but I think this DB may be beyond recovery. It's very unlikely that there's just the one tuple gone. If you're willing to throw money at the problem, there are various people who offer consulting services that include trying to reconstruct broken Postgres databases; but you might be best advised to just take your lumps and go back to your last good backup. At this point you're looking at a significant investment of time with no guarantee of being able to extract anything very useful.

What I'm taking from this is another horror story about the risks of mounting databases across networks instead of locally :-(. Postgres is only as reliable as the storage it's using.

regards, tom lane
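(If it does come down to falling back to the last good backup, one rough sketch of a rebuild, assuming the server is stopped, the backup is a plain pg_dump SQL file, and every path and filename below is a placeholder rather than anything confirmed in the thread.)

# run as the postgres user
mv /var/lib/pgsql/data /var/lib/pgsql/data.corrupt   # keep the damaged cluster for later salvage attempts
initdb -D /var/lib/pgsql/data                        # create a fresh, empty cluster
pg_ctl -D /var/lib/pgsql/data -l /tmp/pg.log start
createdb webapp
psql -d webapp -f /backups/webapp-20100620.sql       # restore the last good dump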