Обсуждение: Bug #843: pg_clog files problem
Nikolay Hristov (charlie@mail.vega.bg) reports a bug with a severity of 1 The lower the number the more severe it is. Short Description pg_clog files problem Long Description I am running PostgreSQL 7.2.1 distributed with cygwin on Windows NT4 SP6. The data directory is on NTFS drive. I performed the following test: - started a transaction with many(about 50) inserts, updates and deletes); - in the middle of this I unplugged the power cable of the machine. After rebooting, I started the postmaster again and the entries in the log file ware as follows: FindExec: found "/usr/bin/postgres" using argv[0] invoking IpcMemoryCreate(size=1441792) FindExec: found "/usr/bin/postmaster" using argv[0] DEBUG: database system was interrupted at 2002-12-09 15:20:37 EEST DEBUG: checkpoint record is at 0/43B2D20 DEBUG: redo record is at 0/43B2D20; undo record is at 0/0; shutdown TRUE DEBUG: next transaction id: 4565; next oid: 70678 DEBUG: database system was not properly shut down; automatic recovery in progress DEBUG: ReadRecord: record with zero length at 0/43B2D60 DEBUG: redo is not required DEBUG: database system is ready I stopped the server, changed some debug display options and started it again: FindExec: found "/usr/bin/postgres" using argv[0] invoking IpcMemoryCreate(size=1441792) FindExec: found "/usr/bin/postmaster" using argv[0] 2002-12-09 15:42:03 [243] DEBUG: database system was shut down at 2002-12-09 15:41:49 EEST 2002-12-09 15:42:03 [243] DEBUG: checkpoint record is at 0/43B2DA0 2002-12-09 15:42:03 [243] DEBUG: redo record is at 0/43B2DA0; undo record is at 0/0; shutdown TRUE 2002-12-09 15:42:03 [243] DEBUG: next transaction id: 4565; next oid: 70678 2002-12-09 15:42:03 [243] DEBUG: database system is ready After that I established a connection and executed a query (select count(*) from pg_class): 2002-12-09 15:42:59 [180] DEBUG: connection: host=127.0.0.1 user=admin database=lager /usr/bin/postmaster child[180]: starting with (postgres -d2 -v131072 -p lager ) 2002-12-09 15:42:59 [180] DEBUG: InitPostgres 2002-12-09 15:42:59 [180] DEBUG: StartTransactionCommand 2002-12-09 15:42:59 [180] DEBUG: query: select getdatabaseencoding() 2002-12-09 15:42:59 [180] DEBUG: ProcessQuery 2002-12-09 15:42:59 [180] DEBUG: CommitTransactionCommand 2002-12-09 15:42:59 [180] DEBUG: StartTransactionCommand 2002-12-09 15:42:59 [180] DEBUG: query: SELECT usesuper FROM pg_user WHERE usename = 'admin' 2002-12-09 15:42:59 [180] DEBUG: ProcessQuery 2002-12-09 15:42:59 [180] DEBUG: CommitTransactionCommand 2002-12-09 15:43:16 [180] DEBUG: StartTransactionCommand 2002-12-09 15:43:16 [180] DEBUG: query: select count(*) from pg_class; 2002-12-09 15:43:16 [180] DEBUG: ProcessQuery 2002-12-09 15:43:16 [180] FATAL 2: open of /cygdrive/d/TEMP/data/pg_clog/0677 failed: No such file or directory 2002-12-09 15:43:16 [180] DEBUG: proc_exit(2) 2002-12-09 15:43:16 [180] DEBUG: shmem_exit(2) 2002-12-09 15:43:17 [180] DEBUG: exit(2) There is only one file in pg_clog: -rw-r--r-- 1 nike None 8192 Dec 9 15:05 0000 I performed this test a few times during the last two days. This situation was reproduced twice. Sample Code No file was uploaded with this report
pgsql-bugs@postgresql.org writes: > I am running PostgreSQL 7.2.1 distributed with cygwin on > Windows NT4 SP6. 7.2.1 has known bugs; why are you not testing 7.2.3? > The data directory is on NTFS drive. Not sure that we can do much about an inherently unreliable platform :-(. I do not know whether cygwin (or even bare NT) provides any write-ordering guarantees --- anyone able to say anything authoritative about the behavior of fsync on this platform? In any case, there is *no one* who will claim that cygwin/NT offers the sort of production-grade stability that you are evidently looking for. Forget that OS and get yourself a recent Linux release. But having said all that, I suspect that your real issue is not a software problem at all, but hardware: > 2002-12-09 15:43:16 [180] FATAL 2: open of /cygdrive/d/TEMP/data/pg_clog/0677 failed: No such file or directory We've seen several similar reports of attempted clog access far past the actual end of clog (ie, an attempt to determine the commit status of a garbage transaction number) and in every case where it was possible to trace the cause, the cause was corrupted data pages on-disk. A totally trashed disk page will often show this failure before any other evidence appears of the data corruption, just because checking the transaction numbers in tuple headers is one of the first steps in trying to use data. So I rather suspect that the real issue is your disk drive doesn't behave well when it loses power while writing. I can't prove it at a distance, but that's what I'd be looking into if I were you. If you want to try to learn more from the evidence you have, I'd suggest trying to identify the trashed data pages to see if there's any pattern to them. A useful low-level tool for this is pg_filedump from http://sources.redhat.com/rhdb/tools.html. regards, tom lane
Hi I am the bug reporter. I have uploaded some misleading information. Actually the log entries produced on startup after power down are the following: DEBUG: database system was interrupted at 2002-12-09 14:59:34 EEST DEBUG: checkpoint record is at 0/4239AF0 DEBUG: redo record is at 0/4239AF0; undo record is at 0/0; shutdown TRUE DEBUG: next transaction id: 4523; next oid: 62486 DEBUG: database system was not properly shut down; automatic recovery in progress DEBUG: redo starts at 0/4239B30 DEBUG: ReadRecord: record with zero length at 0/43B2CA0 DEBUG: redo done at 0/43B2C78 DEBUG: database system is ready And here follows the reaction of a query that I don't remember. FATAL 2: open of /usr/share/postgresql/data/pg_clog/0640 failed: No such file or directory DEBUG: server process (pid 115) exited with exit code 2 DEBUG: terminating any other active server processes DEBUG: all server processes terminated; reinitializing shared memory and semaphores Then I copied the data directory to my machine and continued there. The log entries from my original message actually should start here. Please excuse me. Best regards, Nikolay Hristov pgsql-bugs@postgresql.org wrote: > Nikolay Hristov (charlie@mail.vega.bg) reports a bug with a severity of 1 > The lower the number the more severe it is. > > Short Description > pg_clog files problem > > Long Description > I am running PostgreSQL 7.2.1 distributed with cygwin on > Windows NT4 SP6. > The data directory is on NTFS drive. > > I performed the following test: > - started a transaction with many(about 50) inserts, updates and deletes); > - in the middle of this I unplugged the power cable of the machine. > > After rebooting, I started the postmaster again and the entries in the log file ware as follows: > > FindExec: found "/usr/bin/postgres" using argv[0] > invoking IpcMemoryCreate(size=1441792) > FindExec: found "/usr/bin/postmaster" using argv[0] > DEBUG: database system was interrupted at 2002-12-09 15:20:37 EEST > DEBUG: checkpoint record is at 0/43B2D20 > DEBUG: redo record is at 0/43B2D20; undo record is at 0/0; shutdown TRUE > DEBUG: next transaction id: 4565; next oid: 70678 > DEBUG: database system was not properly shut down; automatic recovery in progress > DEBUG: ReadRecord: record with zero length at 0/43B2D60 > DEBUG: redo is not required > DEBUG: database system is ready > > I stopped the server, changed some debug display options and started it again: > > FindExec: found "/usr/bin/postgres" using argv[0] > invoking IpcMemoryCreate(size=1441792) > FindExec: found "/usr/bin/postmaster" using argv[0] > 2002-12-09 15:42:03 [243] DEBUG: database system was shut down at 2002-12-09 15:41:49 EEST > 2002-12-09 15:42:03 [243] DEBUG: checkpoint record is at 0/43B2DA0 > 2002-12-09 15:42:03 [243] DEBUG: redo record is at 0/43B2DA0; undo record is at 0/0; shutdown TRUE > 2002-12-09 15:42:03 [243] DEBUG: next transaction id: 4565; next oid: 70678 > 2002-12-09 15:42:03 [243] DEBUG: database system is ready > > > After that I established a connection and executed a query > (select count(*) from pg_class): > > 2002-12-09 15:42:59 [180] DEBUG: connection: host=127.0.0.1 user=admin database=lager > /usr/bin/postmaster child[180]: starting with (postgres -d2 -v131072 -p lager ) > 2002-12-09 15:42:59 [180] DEBUG: InitPostgres > 2002-12-09 15:42:59 [180] DEBUG: StartTransactionCommand > 2002-12-09 15:42:59 [180] DEBUG: query: select getdatabaseencoding() > 2002-12-09 15:42:59 [180] DEBUG: ProcessQuery > 2002-12-09 15:42:59 [180] DEBUG: CommitTransactionCommand > 2002-12-09 15:42:59 [180] DEBUG: StartTransactionCommand > 2002-12-09 15:42:59 [180] DEBUG: query: SELECT usesuper FROM pg_user WHERE usename = 'admin' > 2002-12-09 15:42:59 [180] DEBUG: ProcessQuery > 2002-12-09 15:42:59 [180] DEBUG: CommitTransactionCommand > 2002-12-09 15:43:16 [180] DEBUG: StartTransactionCommand > 2002-12-09 15:43:16 [180] DEBUG: query: select count(*) from pg_class; > 2002-12-09 15:43:16 [180] DEBUG: ProcessQuery > 2002-12-09 15:43:16 [180] FATAL 2: open of /cygdrive/d/TEMP/data/pg_clog/0677 failed: No such file or directory > 2002-12-09 15:43:16 [180] DEBUG: proc_exit(2) > 2002-12-09 15:43:16 [180] DEBUG: shmem_exit(2) > 2002-12-09 15:43:17 [180] DEBUG: exit(2) > > There is only one file in pg_clog: > -rw-r--r-- 1 nike None 8192 Dec 9 15:05 0000 > > I performed this test a few times during the last two days. > This situation was reproduced twice. > > Sample Code > > > No file was uploaded with this report > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: you can get off all lists at once with the unregister command > (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)