Обсуждение: Bug #843: pg_clog files problem

Поиск
Список
Период
Сортировка

Bug #843: pg_clog files problem

От
pgsql-bugs@postgresql.org
Дата:
Nikolay Hristov (charlie@mail.vega.bg) reports a bug with a severity of 1
The lower the number the more severe it is.

Short Description
pg_clog files problem

Long Description
I am running PostgreSQL 7.2.1 distributed with cygwin on
Windows NT4 SP6.
The data directory is on NTFS drive.

I performed the following test:
- started a transaction with many(about 50) inserts, updates and deletes);
- in the middle of this I unplugged the power cable of the machine.

After rebooting, I started the postmaster again and the entries in the log file ware as follows:

FindExec: found "/usr/bin/postgres" using argv[0]
invoking IpcMemoryCreate(size=1441792)
FindExec: found "/usr/bin/postmaster" using argv[0]
DEBUG:  database system was interrupted at 2002-12-09 15:20:37 EEST
DEBUG:  checkpoint record is at 0/43B2D20
DEBUG:  redo record is at 0/43B2D20; undo record is at 0/0; shutdown TRUE
DEBUG:  next transaction id: 4565; next oid: 70678
DEBUG:  database system was not properly shut down; automatic recovery in progress
DEBUG:  ReadRecord: record with zero length at 0/43B2D60
DEBUG:  redo is not required
DEBUG:  database system is ready

I stopped the server, changed some debug display options and started it again:

FindExec: found "/usr/bin/postgres" using argv[0]
invoking IpcMemoryCreate(size=1441792)
FindExec: found "/usr/bin/postmaster" using argv[0]
2002-12-09 15:42:03 [243]    DEBUG:  database system was shut down at 2002-12-09 15:41:49 EEST
2002-12-09 15:42:03 [243]    DEBUG:  checkpoint record is at 0/43B2DA0
2002-12-09 15:42:03 [243]    DEBUG:  redo record is at 0/43B2DA0; undo record is at 0/0; shutdown TRUE
2002-12-09 15:42:03 [243]    DEBUG:  next transaction id: 4565; next oid: 70678
2002-12-09 15:42:03 [243]    DEBUG:  database system is ready


After that I established a connection and executed a query
(select count(*) from pg_class):

2002-12-09 15:42:59 [180]    DEBUG:  connection: host=127.0.0.1 user=admin database=lager
/usr/bin/postmaster child[180]: starting with (postgres -d2 -v131072 -p lager )
2002-12-09 15:42:59 [180]    DEBUG:  InitPostgres
2002-12-09 15:42:59 [180]    DEBUG:  StartTransactionCommand
2002-12-09 15:42:59 [180]    DEBUG:  query: select getdatabaseencoding()
2002-12-09 15:42:59 [180]    DEBUG:  ProcessQuery
2002-12-09 15:42:59 [180]    DEBUG:  CommitTransactionCommand
2002-12-09 15:42:59 [180]    DEBUG:  StartTransactionCommand
2002-12-09 15:42:59 [180]    DEBUG:  query: SELECT usesuper FROM pg_user WHERE usename = 'admin'
2002-12-09 15:42:59 [180]    DEBUG:  ProcessQuery
2002-12-09 15:42:59 [180]    DEBUG:  CommitTransactionCommand
2002-12-09 15:43:16 [180]    DEBUG:  StartTransactionCommand
2002-12-09 15:43:16 [180]    DEBUG:  query: select count(*) from pg_class;
2002-12-09 15:43:16 [180]    DEBUG:  ProcessQuery
2002-12-09 15:43:16 [180]    FATAL 2:  open of /cygdrive/d/TEMP/data/pg_clog/0677 failed: No such file or directory
2002-12-09 15:43:16 [180]    DEBUG:  proc_exit(2)
2002-12-09 15:43:16 [180]    DEBUG:  shmem_exit(2)
2002-12-09 15:43:17 [180]    DEBUG:  exit(2)

There is only one file in pg_clog:
-rw-r--r--    1 nike     None         8192 Dec  9 15:05 0000

I performed this test a few times during the last two days.
This situation was reproduced twice.

Sample Code


No file was uploaded with this report

Re: Bug #843: pg_clog files problem

От
Tom Lane
Дата:
pgsql-bugs@postgresql.org writes:
> I am running PostgreSQL 7.2.1 distributed with cygwin on
> Windows NT4 SP6.

7.2.1 has known bugs; why are you not testing 7.2.3?

> The data directory is on NTFS drive.

Not sure that we can do much about an inherently unreliable platform
:-(.  I do not know whether cygwin (or even bare NT) provides any
write-ordering guarantees --- anyone able to say anything authoritative
about the behavior of fsync on this platform?  In any case, there is *no
one* who will claim that cygwin/NT offers the sort of production-grade
stability that you are evidently looking for.  Forget that OS and get
yourself a recent Linux release.

But having said all that, I suspect that your real issue is not a
software problem at all, but hardware:

> 2002-12-09 15:43:16 [180]    FATAL 2:  open of /cygdrive/d/TEMP/data/pg_clog/0677 failed: No such file or directory

We've seen several similar reports of attempted clog access far past the
actual end of clog (ie, an attempt to determine the commit status of a
garbage transaction number) and in every case where it was possible to
trace the cause, the cause was corrupted data pages on-disk.  A totally
trashed disk page will often show this failure before any other evidence
appears of the data corruption, just because checking the transaction
numbers in tuple headers is one of the first steps in trying to use
data.

So I rather suspect that the real issue is your disk drive doesn't
behave well when it loses power while writing.  I can't prove it at a
distance, but that's what I'd be looking into if I were you.

If you want to try to learn more from the evidence you have, I'd suggest
trying to identify the trashed data pages to see if there's any pattern
to them.  A useful low-level tool for this is pg_filedump from
http://sources.redhat.com/rhdb/tools.html.

            regards, tom lane

Bug #843: pg_clog files problem - clarification

От
Nikolay Hristov
Дата:
Hi

I am the bug reporter.
I have uploaded some misleading information.

Actually the log entries produced on startup after power down are the
following:

DEBUG:  database system was interrupted at 2002-12-09 14:59:34 EEST
DEBUG:  checkpoint record is at 0/4239AF0
DEBUG:  redo record is at 0/4239AF0; undo record is at 0/0; shutdown TRUE
DEBUG:  next transaction id: 4523; next oid: 62486
DEBUG:  database system was not properly shut down; automatic recovery
in progress
DEBUG:  redo starts at 0/4239B30
DEBUG:  ReadRecord: record with zero length at 0/43B2CA0
DEBUG:  redo done at 0/43B2C78
DEBUG:  database system is ready

And here follows the reaction of a query that I don't remember.

FATAL 2:  open of /usr/share/postgresql/data/pg_clog/0640 failed: No
such file or directory
DEBUG:  server process (pid 115) exited with exit code 2
DEBUG:  terminating any other active server processes
DEBUG:  all server processes terminated; reinitializing shared memory
and semaphores


Then I copied the data directory to my machine and continued there.
The log entries from my original message actually should start here.


Please excuse me.

Best regards,
Nikolay Hristov

pgsql-bugs@postgresql.org wrote:
> Nikolay Hristov (charlie@mail.vega.bg) reports a bug with a severity of 1
> The lower the number the more severe it is.
>
> Short Description
> pg_clog files problem
>
> Long Description
> I am running PostgreSQL 7.2.1 distributed with cygwin on
> Windows NT4 SP6.
> The data directory is on NTFS drive.
>
> I performed the following test:
> - started a transaction with many(about 50) inserts, updates and deletes);
> - in the middle of this I unplugged the power cable of the machine.
>
> After rebooting, I started the postmaster again and the entries in the log file ware as follows:
>
> FindExec: found "/usr/bin/postgres" using argv[0]
> invoking IpcMemoryCreate(size=1441792)
> FindExec: found "/usr/bin/postmaster" using argv[0]
> DEBUG:  database system was interrupted at 2002-12-09 15:20:37 EEST
> DEBUG:  checkpoint record is at 0/43B2D20
> DEBUG:  redo record is at 0/43B2D20; undo record is at 0/0; shutdown TRUE
> DEBUG:  next transaction id: 4565; next oid: 70678
> DEBUG:  database system was not properly shut down; automatic recovery in progress
> DEBUG:  ReadRecord: record with zero length at 0/43B2D60
> DEBUG:  redo is not required
> DEBUG:  database system is ready
>
> I stopped the server, changed some debug display options and started it again:
>
> FindExec: found "/usr/bin/postgres" using argv[0]
> invoking IpcMemoryCreate(size=1441792)
> FindExec: found "/usr/bin/postmaster" using argv[0]
> 2002-12-09 15:42:03 [243]    DEBUG:  database system was shut down at 2002-12-09 15:41:49 EEST
> 2002-12-09 15:42:03 [243]    DEBUG:  checkpoint record is at 0/43B2DA0
> 2002-12-09 15:42:03 [243]    DEBUG:  redo record is at 0/43B2DA0; undo record is at 0/0; shutdown TRUE
> 2002-12-09 15:42:03 [243]    DEBUG:  next transaction id: 4565; next oid: 70678
> 2002-12-09 15:42:03 [243]    DEBUG:  database system is ready
>
>
> After that I established a connection and executed a query
> (select count(*) from pg_class):
>
> 2002-12-09 15:42:59 [180]    DEBUG:  connection: host=127.0.0.1 user=admin database=lager
> /usr/bin/postmaster child[180]: starting with (postgres -d2 -v131072 -p lager )
> 2002-12-09 15:42:59 [180]    DEBUG:  InitPostgres
> 2002-12-09 15:42:59 [180]    DEBUG:  StartTransactionCommand
> 2002-12-09 15:42:59 [180]    DEBUG:  query: select getdatabaseencoding()
> 2002-12-09 15:42:59 [180]    DEBUG:  ProcessQuery
> 2002-12-09 15:42:59 [180]    DEBUG:  CommitTransactionCommand
> 2002-12-09 15:42:59 [180]    DEBUG:  StartTransactionCommand
> 2002-12-09 15:42:59 [180]    DEBUG:  query: SELECT usesuper FROM pg_user WHERE usename = 'admin'
> 2002-12-09 15:42:59 [180]    DEBUG:  ProcessQuery
> 2002-12-09 15:42:59 [180]    DEBUG:  CommitTransactionCommand
> 2002-12-09 15:43:16 [180]    DEBUG:  StartTransactionCommand
> 2002-12-09 15:43:16 [180]    DEBUG:  query: select count(*) from pg_class;
> 2002-12-09 15:43:16 [180]    DEBUG:  ProcessQuery
> 2002-12-09 15:43:16 [180]    FATAL 2:  open of /cygdrive/d/TEMP/data/pg_clog/0677 failed: No such file or directory
> 2002-12-09 15:43:16 [180]    DEBUG:  proc_exit(2)
> 2002-12-09 15:43:16 [180]    DEBUG:  shmem_exit(2)
> 2002-12-09 15:43:17 [180]    DEBUG:  exit(2)
>
> There is only one file in pg_clog:
> -rw-r--r--    1 nike     None         8192 Dec  9 15:05 0000
>
> I performed this test a few times during the last two days.
> This situation was reproduced twice.
>
> Sample Code
>
>
> No file was uploaded with this report
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)