Discussion: FATAL: could not open relation xxx: No such file or directory
Hello all
my struggle with the database continues (see earlier thread titled "too many trigger records found for relation xyz").
Today I created yet another table in the same database. Everything went OK, no errors or anything, but when I checked the pg_tables view I saw two tables with the same name. I immediately queried pg_class, and yes, there were again two tables with the same oid. I dropped the table before anything more serious could happen, but then postgres started complaining about "cache lookup failed for relation ...". I disconnected my psql session and tried to reconnect, but failed to do so:
2008-04-09 16:39:25 EEST [18984]: [1-1] FATAL: could not open relation 1663/16386/544592: No such file or directory
Indeed, there is no such file in that directory. I'm guessing that file is connected to the table I just dropped. Now, is there anything to do to get the database back online? I can still connect to other databases in the same instance.
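In case it helps anyone else, the checks I ran were roughly along these lines (the table name here is just an example, not the real one):
# select oid, relname from pg_class where relname = 'my_table';
# select oid, count(*) from pg_class group by oid having count(*) > 1;
The first showed two rows with the same relname, and the second flagged the shared oid.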
Regards
Mikko
On Wed, Apr 9, 2008 at 4:47 PM, Mikko Partio <mpartio@gmail.com> wrote:
The cure was to create the file 1663/16386/544592, 8K in size, with dd. The file in question was in fact the oid index on pg_class -- I had issued a REINDEX on pg_class just a moment before, and apparently something went wrong and the system lost track of the index. There were also two entries in pg_index for the index pg_class_oid_index. After I removed the extra entry and reindexed pg_class and pg_index, everything seems to be working OK. All the symptoms indicate that perhaps an xid wraparound had happened, but there is no such warning in the logs, and age(datfrozenxid) never went higher than, say, 250,000,000. Does anybody have a clue what might have happened?
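Roughly, the recovery went along these lines -- a sketch of the idea rather than the exact commands I typed (the ctid value is just an example for whichever pg_index row turns out to be the extra one):
$ dd if=/dev/zero of=$PGDATA/base/16386/544592 bs=8192 count=1
# select ctid, indexrelid, indrelid from pg_index where indexrelid = 'pg_class_oid_index'::regclass;
# delete from pg_index where ctid = '(0,2)';
# reindex table pg_class;
# reindex table pg_index;
The dd step just gives the backend an all-zeros 8K file to open, which is enough to let connections succeed so the catalogs can be repaired.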
Regards
Mikko
On Tue, Apr 15, 2008 at 9:36 AM, Mikko Partio <mpartio@gmail.com> wrote:
And now it has happened again. A CLUSTER operation completed successfully on a table; afterwards, when trying to access the table, I get the error
2008-04-17 13:05:30 EEST [8435]: [32-1] ERROR: could not open relation 1663/16386/359232: No such file or directory
Seems to me like VACUUM FULL, REINDEX and CLUSTER change the file name of a table and/or index and then fail to record the new name in the system catalogues. Is this a known deficiency, and what can I do to stop this behaviour?
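For what it's worth, the mismatch can be seen by comparing the catalog's idea of the file with what is actually on disk, with something like this (the table name is hypothetical):
# select relname, relfilenode from pg_class where relname = 'my_table';
and then checking whether a file with that relfilenode actually exists under $PGDATA/base/16386/.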
Regards
Mikko
On Thu, Apr 17, 2008 at 3:38 PM, Mikko Partio <mpartio@gmail.com> wrote:
> 2008-04-17 13:05:30 EEST [8435]: [32-1] ERROR: could not open relation
> 1663/16386/359232: No such file or directory

Looks like a corrupt index to me. Did you try REINDEX on the table?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Thu, Apr 17, 2008 at 1:36 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>> 2008-04-17 13:05:30 EEST [8435]: [32-1] ERROR: could not open relation
>> 1663/16386/359232: No such file or directory
>
> Looks like a corrupt index to me. Did you try REINDEX on the table?
Hi Pavan and thanks for your reply.
I tried to reindex the individual indexes in the table:
# reindex index xxx_idx;
ERROR: could not open relation 1663/16386/359232: No such file or directory
Since I thought the trouble might lie in the system catalogue indexes, I issued REINDEX SYSTEM on the database, which went through with no errors. After that I tried to drop the indexes on the table in question:
# drop index xxx_idx;
ERROR: could not read block 0 of relation 1663/16386/2673: read only 0 of 8192 bytes
Hmm... this is a different oid:
# select 2673::regclass;
regclass
--------------------------
pg_depend_depender_index
(1 row)
But I just reindexed it!
# reindex table pg_depend;
WARNING: could not remove relation 1663/16386/2673: No such file or directory
REINDEX
When I fire up pg_dump to take a last-minute backup, I see this error:
pg_dump: Error message from server: ERROR: could not open relation 1663/16386/544529: No such file or directory
pg_dump: The command was: SELECT tgname, tgfoid::pg_catalog.regproc as tgfname, tgtype, tgnargs, tgargs, tgenabled, tgisconstraint, tgconstrname, tgdeferrable, tgconstrrelid, tginitdeferred, tableoid, oid, tgconstrrelid::pg_catalog.regclass as tgconstrrelname from pg_catalog.pg_trigger t where tgrelid = '294134'::pg_catalog.oid and tgconstraint = 0
# reindex table pg_catalog.pg_trigger;
WARNING: could not remove relation 1663/16386/544529: No such file or directory
REINDEX
Seems like the whole db is falling apart.
Regards
Mikko
"Mikko Partio" <mpartio@gmail.com> writes: > Seems like the whole db is falling apart. I think you've got really serious filesystem-level problems. Have you tried running any hardware diagnostics? Are you sure you're using a stable kernel version? regards, tom lane
On Thu, Apr 17, 2008 at 6:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I ran fsck on the filesystem (gfs) -- no problems found. The disks are from a SAN, and the diagnostic programs say there's nothing wrong. I also have other db clusters running on different filesystems (also gfs), and I have never had any problems with them.
Regards
Mikko
"Mikko Partio" <mpartio@gmail.com> writes: > On Thu, Apr 17, 2008 at 6:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think you've got really serious filesystem-level problems. Have you >> tried running any hardware diagnostics? Are you sure you're using a >> stable kernel version? > I run fsck on the filesystem (gfs) -- no problems found. The disks are from > a san and the diagnostic programs say there's nothing wrong. I also have > other db clusters running on different filesystems (also gfs) and I have > never had any problems with them. Some RAM checks wouldn't be out of place either. regards, tom lane
On Thu, Apr 17, 2008 at 6:59 PM, Mikko Partio <mpartio@gmail.com> wrote:
Oh yeah and the kernel version is 2.6.18-53.1.14.el5.
Regards
Mikko
On Thu, Apr 17, 2008 at 7:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Hmm, I didn't think of that; will do that ASAP (tomorrow). Thanks for your help.
Regards
Mikko
"Mikko Partio" <mpartio@gmail.com> writes:
> On Thu, Apr 17, 2008 at 6:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:>> I think you've got really serious filesystem-level problems. Have youSome RAM checks wouldn't be out of place either.
>> tried running any hardware diagnostics? Are you sure you're using a
>> stable kernel version?
> I run fsck on the filesystem (gfs) -- no problems found. The disks are from
> a san and the diagnostic programs say there's nothing wrong. I also have
> other db clusters running on different filesystems (also gfs) and I have
> never had any problems with them.
Hmm didn't think of that, will do that asap (tomorrow). Thanks for your help.
Regards
Mikko
On Thu, Apr 17, 2008 at 7:08 PM, Mikko Partio <mpartio@gmail.com> wrote:
Memtest86+ has now been running for 20+ hours and no errors have been found. I was also unable to reproduce this problem, but it only happened after a few days of constant activity anyway, so I guess it's not easy to replicate. Any other pointers on where to look? Your help is much appreciated.
Regards
Mikko
On Thursday, 17 April 2008, Mikko Partio wrote:
> I ran fsck on the filesystem (gfs) -- no problems found. The disks
> are from a SAN, and the diagnostic programs say there's nothing wrong.
> I also have other db clusters running on different filesystems (also
> gfs), and I have never had any problems with them.

A bit OT, but maybe related: I have similar strangeness on a Linux box with an Areca controller. On this box, the reiserfs filesystem starts getting seriously damaged after some time. Memtest showed no problems, and everything looks fine. Today we will replace the mainboard; it could have an internal problem (transport from memory to controller broken?).

What I have seen twice (at different customers, once SCSI, once SATA) is a broken hard disk that reports no errors but delivers different data than what was written before. Very nasty, as the RAID controller doesn't see any problem, and it even destroys the good hard disk's data on the next write, because the data it read back is already broken.

HTH, good luck.

regards, zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0676/846 914 666 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: www.keyserver.net Key-ID: 1C1209B4
On Tue, Apr 22, 2008 at 12:02 PM, Michael Monnerie <michael.monnerie@it-management.at> wrote:
> What I have seen twice (at different customers, once SCSI, once SATA) is a
> broken hard disk that reports no errors but delivers different data than
> what was written before. Very nasty, as the RAID controller doesn't see
> any problem, and it even destroys the good hard disk's data on the next
> write, because the data it read back is already broken.

How did you recognize such a hard disk?
Regards
Mikko
On Tuesday, 22 April 2008, Mikko Partio wrote:
> How did you recognize such a hard disk?

With "badblocks", which writes some patterns and re-reads them. But it is of course annoyingly slow. At those servers I was lucky: both were "only" 73GB disks used in a RAID-1, so there were only 2 small drives to check. With a RAID of 8x750GB disks, it will take a *long* time to check if you cannot simply replace all the disks at once. At today's customer I would have to take one drive out, check it, put it back, let the RAID rebuild, take the next... a new mainboard is less work, so I'll try that first.

regards, zmi
--
// Michael Monnerie, Ing.BSc ----- http://it-management.at
// Tel: 0676/846 914 666 .network.your.ideas.
// PGP Key: "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38 500E CE14 91F7 1C12 09B4
// Keyserver: www.keyserver.net Key-ID: 1C1209B4
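For reference, the destructive write-mode test is invoked along these lines -- note that -w overwrites the whole disk, and the device name here is only an example:
$ badblocks -wsv /dev/sdb
-w runs the write-mode test (each pattern is written out and read back), -s shows progress, and -v is verbose.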