Обсуждение: Could not open file "pg_subtrans/01EB"
Hi,
I'm trying to investigate a database of one of our clients. The database version is 9.2.5. The client tried to dump one of its databases and got the following error:
pg_dump: query returned 2 rows instead of one: SELECT tableoid, oid, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = datdba) AS dba, pg_encoding_to_char(encoding) AS encoding, datcollate, datctype, datfrozenxid, (SELECT spcname FROM pg_tablespace t WHERE t.oid = dattablespace) AS tablespace, shobj_description(oid, 'pg_database') AS description FROM pg_database WHERE datname = 'db1'
So I queried pg_database and saw that there are duplicate rows in that table:
postgres=# select xmin,xmax,datname,datfrozenxid from pg_database order by datname;
 xmin  |   xmax   |  datname   | datfrozenxid
-------+----------+------------+--------------
  2351 |        0 | db1        |         1798
  1809 | 21093518 | db1        |         1798
  1806 |        0 | postgres   |         1798
 12594 |        0 | db2        |         1798
  1803 |        0 | template0  |         1798
  1802 |        0 | template1  |         1798
  3590 |        0 | db4        |         1798
  3592 |        0 | db3        |         1798
  1811 | 21077312 | db3        |         1798
(9 rows)
fsync and full_page_writes are set to on.
I changed the database names, but as you can see db1/db3 have duplicate records. I tried to dump the postgres database and it worked. I ran VACUUM on the problematic databases: I connected to db1/2/3 and ran the vacuum command. On many of the objects I got the following detail message:
DETAIL: x dead row versions cannot be removed yet. I'm the only one working on the database and there are no additional sessions in pg_stat_activity. So why can some of the row versions not be removed?
I tried to reindex the problematic databases but got the following error:
reindexdb: reindexing of database "db1" failed: ERROR: could not access status of transaction 32212695
DETAIL: Could not open file "pg_subtrans/01EB": No such file or directory.
I checked and indeed that file doesn't exist.
I restarted the cluster and got the same error for every database in the log file (in all cases, analyze of pg_catalog.pg_shdepend failed and caused the error):
2018-05-06 23:46:54 +08 30185 DETAIL: Could not open file "pg_subtrans/01EB": No such file or directory.
2018-05-06 23:46:54 +08 30185 CONTEXT: automatic analyze of table "afa.pg_catalog.pg_shdepend"
2018-05-06 23:47:06 +08 30213 ERROR: could not access status of transaction 32635595
I generated a new, empty subtrans file named 01EB and restarted the cluster:
dd if=/dev/zero of=/var/lib/pgsql/data/pg_subtrans/01EB bs=256k count=1
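As a sanity check, the file name and size used in the dd command above can be derived from the failing XID. pg_subtrans stores one 4-byte parent-XID entry per transaction, so with the default 8 kB block size each 32-page SLRU segment covers 65536 XIDs, and the file name is the segment number in %04X format. A minimal sketch (assuming the default block size; the constants mirror PostgreSQL's SLRU layout):

```python
# Sketch: map a transaction ID to its pg_subtrans segment file name.
# Assumes the default 8 kB block size; constants mirror PostgreSQL's
# SLRU layout (4-byte parent XID per entry, 32 pages per segment).
BLCKSZ = 8192
ENTRIES_PER_PAGE = BLCKSZ // 4                           # 2048 XIDs per page
PAGES_PER_SEGMENT = 32
XIDS_PER_SEGMENT = ENTRIES_PER_PAGE * PAGES_PER_SEGMENT  # 65536 XIDs per file

def subtrans_segment(xid: int) -> str:
    """Return the pg_subtrans file name covering the given XID."""
    return "%04X" % (xid // XIDS_PER_SEGMENT)

print(subtrans_segment(32212695))   # the XID from the reindex error -> 01EB
print(PAGES_PER_SEGMENT * BLCKSZ)   # segment size in bytes -> 262144
```

This matches the missing file name ("01EB") and the 256 kB (262144-byte) size used in the dd command above.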
I didn't get any errors in the database log.
Afterwards, I still had duplicate rows in pg_database. I tried again to reindex the problematic databases:
[root@my_host pg_subtrans]# reindexdb db1 -U postgres
Password:
NOTICE: table "pg_catalog.pg_class" was reindexed
and it just got stuck at that point and didn't advance to other tables. In pg_stat_activity I don't see the state_change changing.
Any idea how I can continue further?
Thanks.
It seems that I have duplicate rows in many system tables, not just in pg_database. I had duplicate rows in pg_stat_activity (same pid), and after vacuuming the database they disappeared. However, I still have duplicate rows in pg_database.
Mariel Cherkassky <mariel.cherkassky@gmail.com> writes:
> I'm trying to investigate a database of one of our clients. The database
> version is 9.2.5.
1. Fire their DBA for dereliction of duty.
2. Update to the last available release of 9.2.x (9.2.24, looks like).
3. Vacuum everything and see if it gets better.
Vacuuming may or may not fix the observed data corruption problems, but
it's silly to ignore the fact that they're missing four years worth
of bug fixes in that branch. In particular I'm noticing the first
entry in the change notes for 9.2.6, which recommends "vacuuming all
tables in all databases while having vacuum_freeze_table_age set to zero":
https://www.postgresql.org/docs/9.2/static/release-9-2-6.html
That problem as-described seems to involve rows disappearing, rather than
getting duplicated, but I wouldn't rule out that it could also have
symptoms like this.
If that doesn't fix things, you could then try:
4. Manually eliminate duplicate rows by identifying which one of each pair
seems older and deleting it with DELETE ... WHERE ctid = '(x,y)'. Then
reindex to confirm no duplicates remain.
But you still need step 2, to reduce the odds the problem will recur.
regards, tom lane
1. I'm not really sure if they have one, but it's not my business; I'm just trying to help these guys with an application that is based on Postgres.
2. Yeah, I realized that it is an option, but do you really think it should be the first solution? I tried to search for bugs that seem identical to my case, but I didn't find any.
3. I set vacuum_freeze_table_age to 0 and vacuumed all the databases with duplicates, but it seems that didn't solve the problem.
4. I tried to delete the records according to the value in xmin/xmax, and the result was that after the delete PostgreSQL didn't recognize those databases. I queried the ctid and the xmin/xmax:
postgres=# select xmin,xmax,datname,ctid from pg_Database;
xmin | xmax | datname | ctid
-------+----------+----------------+--------
1802 | 0 | template1 | (0,1)
1803 | 0 | template0 | (0,2)
1806 | 0 | postgres | (0,3)
1809 | 21093518 | db1 | (0,4)
1811 | 21077312 | db2 | (0,5)
2351 | 0 | db1 | (0,21)
3590 | 0 | db3 | (0,22)
3592 | 0 | db2 | (0,23)
12594 | 0 | db4 | (0,24)
I tried to delete db1 with xmax 0 and db2 with xmax 2 (the opposite of what you suggested with the ctid). I thought that if xmax is set to 0 it means the row "didn't have" any update/delete operations, and therefore I should delete those databases. Does it matter which one to delete? I read about the xmin/xmax/ctid columns, and I thought that what I did was legitimate.
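For intuition: xmax = 0 means no transaction has ever deleted or updated that row version, while a nonzero xmax marks it as superseded, so under normal MVCC visibility only one of each duplicate pair should be visible at a time. A toy sketch of the visibility rule (greatly simplified: it ignores snapshots, hint bits, and subtransactions; the XIDs are the ones from the db1 pair in the listing above, with hypothetical commit states):

```python
# Toy MVCC visibility check (greatly simplified). A tuple is visible if
# its inserting transaction (xmin) committed and no committed transaction
# has deleted/updated it (xmax == 0, or xmax not committed).
def visible(xmin, xmax, committed):
    return xmin in committed and (xmax == 0 or xmax not in committed)

committed = {1809, 2351, 21093518}  # hypothetical commit states

# The two duplicate db1 tuples from the pg_database listing above:
print(visible(1809, 21093518, committed))  # old version -> False
print(visible(2351, 0, committed))         # new version -> True
```

That both versions show up in the real catalog suggests the commit-status information for those XIDs was lost, which would be consistent with the "could not access status of transaction" errors elsewhere in this thread.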
Moreover, I realized that I have duplicate rows in more system tables, so I don't really like this solution.
My next attempt will be to upgrade to 9.2.24.
Thanks, Tom!
Regards, Mariel.
As Tom Lane suggested, I upgraded the database to version 9.2.24. I vacuumed (VACUUM VERBOSE) all the databases, and for one of the databases (which has a duplicate record in pg_database) I got the following messages for two of the objects:
2018-08-27 16:57:59 +08 db4 22026 WARNING: relation "a" page 1560 is uninitialized --- fixing
2018-08-27 16:58:00 +08 db4 22026 WARNING: relation "a" page 1561 is uninitialized --- fixing
2018-08-27 16:58:00 +08 db4 22026 WARNING: relation "a" page 1562 is uninitialized --- fixing
2018-08-27 16:58:00 +08 db4 22026 WARNING: relation "a" page 1563 is uninitialized --- fixing
2018-08-27 16:58:00 +08 db4 22026 WARNING: relation "a" page 1564 is uninitialized --- fixing
2018-08-27 16:58:00 +08 db4 22026 WARNING: relation "a" page 1565 is uninitialized --- fixing
2018-08-27 16:58:00 +08 db4 22026 WARNING: relation "a" page 1566 is uninitialized --- fixing
2018-08-27 16:58:00 +08 db4 22026 WARNING: relation "a" page 1567 is uninitialized --- fixing
2018-08-27 16:58:00 +08 db4 22026 WARNING: relation "a" page 1568 is uninitialized --- fixing
2018-08-27 16:58:00 +08 db4 22026 WARNING: relation "a" page 1569 is uninitialized --- fixing
2018-08-27 16:58:23 +08 db4 22026 WARNING: relation "b" page 1031 is uninitialized --- fixing
2018-08-27 16:58:23 +08 db4 22026 WARNING: relation "b" page 1032 is uninitialized --- fixing
2018-08-27 16:58:23 +08 db4 22026 WARNING: relation "b" page 1033 is uninitialized --- fixing
2018-08-27 16:58:23 +08 db4 22026 WARNING: relation "b" page 1034 is uninitialized --- fixing
2018-08-27 16:58:23 +08 db4 22026 WARNING: relation "b" page 1035 is uninitialized --- fixing
2018-08-27 16:58:23 +08 db4 22026 WARNING: relation "b" page 1036 is uninitialized --- fixing
2018-08-27 16:58:23 +08 db4 22026 WARNING: relation "b" page 1037 is uninitialized --- fixing
I tried reindexing the problematic databases:
db1>reindex system db1
2018-08-27 17:09:44 +08 db1 23218 ERROR: could not access status of transaction 32212695
2018-08-27 17:09:44 +08 db1 23218 DETAIL: Could not open file "pg_subtrans/01EB": No such file or directory.
I tried to delete the duplicate rows in pg_database, but the delete doesn't delete anything:
select ctid,xmin,xmax,datname from pg_database order by datname;
ctid | xmin | xmax | datname
--------+-------+----------+----------------
(0,21) | 2351 | 0 | db1
(0,4) | 1809 | 21093518 | db1
(0,3) | 1806 | 0 | postgres
(0,24) | 12594 | 0 | db2
(0,2) | 1803 | 0 | template0
(0,1) | 1802 | 0 | template1
(0,22) | 3590 | 0 | db3
(0,23) | 3592 | 0 | db4
(0,5) | 1811 | 21077312 | db4
(9 rows)
As you can see, db1 and db4 have duplicate records. I tried to delete them:
delete from pg_database where ctid='(0,4)';
DELETE 0
but the record still exists:
select ctid,datname from pg_database where ctid='(0,4)';
ctid | datname
-------+---------
(0,4) | db1
(1 row)
I set zero_damaged_pages to on, but it didn't help either.
How can I continue?
Any idea what I should do?