Discussion: Perl script failure => Postgres 7.1.2 database corruption
Severity
--------
Major annoyance (having to rebuild a database)... unless there is a
simple workaround I was too ignorant to apply (;-).

Problem
-------
A few days ago, for no obvious reason, a Perl script used to update the
database portion of a 'web site suddenly stopped working. Several hours
later it became apparent that "idle" Postgres processes were queuing up
on the Postgres server machine. Still later it became obvious that at
least one table in the database was damaged to the point that even
pg_dump would not dump it properly.

The 'web site uses Perl CGI (Apache mod_perl) and DBD::Pg to access a
Postgres database on a separate server.

Symptoms
--------
- Backed-up hung ("idle") Postgres processes.
- Undeletable table in database (apparently damaged).
- pg_dump failure.
- psql errors accessing table.
- Perl DBI errors reported in Apache/mod_perl error log.

Background
----------
The Perl script involved here updates the Postgres database from Win32
.DBF files using DBD::Pg and DBD::XBase. The script has run successfully
for several weeks, and in fact ran successfully using the same set of
DBF files last Friday.

In order to minimize the effect of the update on users who may be using
the database, the script performs the entire database update as a single
transaction. In Perl DBI terms, it issues a

    $obj = DBI->connect($db_datasource, $db_username, $db_password,
                        {AutoCommit => 0});

and when the updates have been added without error (which usually
happens) it issues a

    $obj->commit();
    $obj->disconnect();
    $obj = undef;

This time the update died partway through:

    + Marking old records for deletion.

(At this point the script was in the process of issuing DELETE FROM
TABLE xxxxx statements.)

    DBD::Pg::db do failed: pqReadData() -- backend closed the channel
    unexpectedly. at ./update-foundation-db.pl line 244.
    Could not purge old records from summary, stopping at
    ./update-foundation-db.pl line 244.

What triggered the first error is still not completely clear, but it may
have been the result of the account exceeding its disk space quota.
However, even after some 40Mb of files were deleted and the quota was
increased by 50Mb, the script continued to fail at the same point.

The only mildly bright spot in all this is that none of this appeared to
affect 'web site users. SQL queries from site users continued to yield
results throughout (well, up to the point where I deleted the database
(;-)).

Other messages reported by the script included:

    DBD::Pg::db do failed: ERROR:  heap_delete: (am)invalid tid
    at ./update-foundation-db.pl line 244.
    DBD::Pg::db do failed: ERROR:  simple_heap_delete: tuple concurrently
    updated at ./update-foundation-db.pl line 244.
    DBD::Pg::db do failed: ERROR:  Relation 1632370768 does not exist
    at ./update-foundation-db.pl line 244.
    DBD::Pg::db do failed: ERROR:  Relation 419450449 does not exist
    at ./update-foundation-db.pl line 244.

At the same time, "idle" Postgres processes were backing up on the
Postgres server. These only cleared when the Postgres server process was
restarted.

-------
10626 ?  S  0:00 postgres: websiteadmin websiteadmin 206.163.128.1 idle in transaction
(and 8 more like it at one point)
-------

Postgres access from the 'web site logged errors to the Apache error log
for the site's virtual host:

-----------
NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend died
        abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am going to
        terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
[Tue Nov  6 20:52:21 2001] null: Database handle destroyed without
explicit disconnect at /usr/lib/perl5/site_perl/Apache/DBI.pm line 119.
-----------

A number of the 'null:' errors were reported. A similar NOTICE: preceded
the following errors as well:

-----------
[Tue Nov  6 20:52:53 2001] [error] FATAL 2:  XLogFlush: request is not
satisfied at /home1/g/websiteadmin/site/modules//StandardUtil.pm line 48.
[Tue Nov  6 20:52:54 2001] null: DBD::Pg::db rollback failed: rollback
failed at /usr/lib/perl5/site_perl/Apache/DBI.pm line 153.
-----------

I attempted to dump the database with pg_dump, and failed miserably. The
results convinced me that the database was _seriously_ mangled.

-----------
pg_dump -c -Fc -h postgres.sitehost.com -u websiteadmin >db.pgd
Username: websiteadmin
Password:
ERROR:  Relation 419450449 does not exist
PQendcopy: resetting connection
SQL query to dump the contents of Table 'summary' did not execute
correctly.  After we read all the table contents from the backend,
PQendcopy() failed.  Explanation from backend: 'ERROR:  Relation
419450449 does not exist '.
The query was: 'COPY "summary" TO stdout;'.

psql -h postgres.kattare.com -U websiteadmin -W
Password:
Welcome to psql, the PostgreSQL interactive terminal.
--snip--
websiteadmin=> vacuum verbose summary;
NOTICE:  --Relation summary--
NOTICE:  Pages 728: Changed 34, reaped 604, Empty 0, New 0; Tup 1032:
Vac 5094, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 96, MaxLen 2030;
Re-using: Free/Avail. Space 4913484/4913484; EndEmpty/Avail. Pages 0/604.
CPU 0.03s/0.00u sec.
NOTICE:  Index summary_pkey: Pages 31; Tuples 1032: Deleted 5094.
CPU 0.00s/0.06u sec.
NOTICE:  Rel summary: Pages: 728 --> 118; Tuple(s) moved: 1032.
CPU 0.11s/0.16u sec.
NOTICE:  Index summary_pkey: Pages 31; Tuples 1032: Deleted 1032.
CPU 0.00s/0.01u sec.
NOTICE:  --Relation pg_toast_20037--
NOTICE:  Rel pg_toast_20037: TID 1444/14: OID IS INVALID. TUPGONE 1.
NOTICE:  Rel pg_toast_20037: TID 1444/19: OID IS INVALID. TUPGONE 1.
-- Many, MANY similar lines snipped --
NOTICE:  Rel pg_toast_20037: TID 1444/2008: OID IS INVALID. TUPGONE 1.
NOTICE:  Rel pg_toast_20037: TID 1444/2016: OID IS INVALID. TUPGONE 1.
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q
-----------

Workaround
----------
Due to time constraints, and because I had the data available, I deleted
the database with 'dropdb -h', re-created it using 'createdb', and
re-populated it using the now-functioning Perl script.

Another choice might have been to restore the database from a backup,
assuming that the database corruption actually occurred in conjunction
with the initial script failure and not some time earlier.

Environment
-----------
DB Server: Postgres 7.1.2, Slackware Linux 8.?
           SELECT version() reports
           PostgreSQL 7.1.2 on i686-pc-linux-gnu, compiled by GCC egcs-2.91.66
DB Client: Apache 1.3.20, Perl 5.6.1, mod_perl, Slackware 8.1

Questions
---------
1) Are there circumstances under which a "space exceeded" error on a
   client machine _can_ damage a database on the Postgres server
   machine?

2) The Postgres error messages all indicate that the transaction was
   going to be "rolled back", but to all indications (including the hung
   processes (;-)) it was not. Is this a bug? Or am I misreading the
   message?

3) What other things could we have tried to recover this situation?
   Could I have ignored Postgres' claim that it was rolling back the
   transaction, re-connected to the database (with psql, say), and
   issued a ROLLBACK myself?

4) Where would I find information that would help me accurately estimate
   the client's disk space requirements for SQL database processing on a
   Postgres server?

Frank McKenney

Frank McKenney, McKenney Associates
Richmond, Virginia / (804) 320-4887
E-mail: frank_mckenney@mindspring.com
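For reference, the single-transaction update pattern described under
"Background" above, with explicit error handling, might look roughly
like the following in DBI terms. This is an illustrative sketch only:
the table name 'summary' comes from the error messages above, but the
connection parameters, the 'obsolete' column, and the update logic are
placeholders, not the actual update-foundation-db.pl code.

    #!/usr/bin/perl -w
    use strict;
    use DBI;

    # Placeholder connection parameters -- substitute the real ones.
    my $db_datasource = 'dbi:Pg:dbname=websiteadmin;host=postgres.example.com';
    my $db_username   = 'websiteadmin';
    my $db_password   = 'secret';

    # AutoCommit => 0 makes the entire update a single transaction;
    # RaiseError => 1 turns DBD::Pg errors into exceptions we can catch.
    my $dbh = DBI->connect($db_datasource, $db_username, $db_password,
                           {AutoCommit => 0, RaiseError => 1, PrintError => 0})
        or die "connect failed: $DBI::errstr";

    eval {
        # ... load the new rows from the .DBF files here ...

        # Purge the rows marked for deletion (placeholder WHERE clause).
        $dbh->do(q{DELETE FROM summary WHERE obsolete});

        $dbh->commit();
    };
    if ($@) {
        warn "update failed, rolling back: $@";
        # Issue an explicit ROLLBACK rather than leaving the backend
        # sitting "idle in transaction".
        eval { $dbh->rollback() };
    }
    $dbh->disconnect();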
Frank McKenney <frank_mckenney@mindspring.com> writes:
> 1) Are there circumstances under which a "space exceeded" error on
>    a client machine _can_ damage a database on the Postgres server
>    machine?

I don't see how. The error messages you cite all point to the idea that
there was some internal corruption in the database. I'd venture more
than one corrupted block, in fact. It appears that block 1444 of your
summary table's toast relation was clobbered (probably zeroed out in
whole or in part), and the "relation nnnnn does not exist" complaints
look like some bad things had happened to one or more system tables as
well.

Unfortunately, since you deleted the database, all the evidence is gone
and there's no longer much hope of learning any more. If something like
this happens again, it might be worth tar'ing up the $PGDATA tree (while
the postmaster is stopped) for possible forensic analysis.

> 3) What other things could we have tried to recover this situation?

I think you were pretty much out of luck on that database, though
perhaps partial data recovery could have been made if you were willing
to spend time on it.

A more interesting thing to worry about is how to ensure it doesn't
happen again, and here my advice would be to look at the reliability of
your disk drives and I/O hardware. I've seen more than one report of
mysterious data clobbers that eventually traced to bogus disk
controllers, flaky RAM, etc. In particular, I recall several
data-block-suddenly-became-zero failures with hardware origins, and none
that traced to software problems...

			regards, tom lane
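A minimal sketch of the evidence-preserving snapshot suggested above,
assuming pg_ctl and tar are on the PATH, that PGDATA points at the
cluster's data directory, and that /var/tmp has room for the archive
(all of these are assumptions; the paths are illustrative):

    #!/usr/bin/perl -w
    use strict;
    use POSIX qw(strftime);

    my $pgdata  = $ENV{PGDATA} || '/var/lib/pgsql/data';   # assumed location
    my $stamp   = strftime('%Y%m%d-%H%M%S', localtime);
    my $archive = "/var/tmp/pgdata-forensic-$stamp.tar.gz";

    # Stop the postmaster first so the files are not changing under tar.
    system('pg_ctl', 'stop', '-D', $pgdata) == 0
        or die "pg_ctl stop failed: $?";

    system('tar', 'czf', $archive, $pgdata) == 0
        or die "tar failed: $?";

    print "saved $pgdata to $archive; restart the postmaster when ready\n";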
Frank McKenney <frank_mckenney@mindspring.com> writes:
> Hm. So it's likely that the only reason that site users didn't see
> the corruption while all this was going on was that they didn't
> happen to invoke that particular portion of that particular table?

Yes. In fact, since the clobbered data table was a TOAST table, they
could actually have read the affected main-table rows without noticing a
problem, so long as they didn't try to fetch the wide fields.

The "relation does not exist" messages point to some problem in the
system tables as well, but we can no longer guess much about its extent.
It could well have been something that only affected a few tables.

> If it makes any difference (and I wish I had remembered this
> earlier), the 'summary' table contains large formatted HTML-text
> fields, on the order of 8-14K. We needed Postgres 7.1 to do this.

Right, it would have been those fields that were clobbered.

> Looking back, I suppose we could have renamed all the tables to
> bad_summary, bad_xxxx, etc. However, since I'm fairly new to SQL
> (let alone Postgres), I think I'd have been concerned that the
> corruption might "leak out" at some point and corrupt other tables.

The apparent system-table corruption might indeed have some such
behavior; it's hard to tell without more info. My inclination would have
been to rename the trashed database out of the way and leave it there
for future analysis, if you had the disk space to spare.

> Is there a nicely documented procedure for renaming/moving an entire
> database to a different account that I overlooked in my haste?

It's not well documented. The most secure procedure is to stop the
postmaster and then cp -rp or tar the whole $PGDATA tree. If you're
fairly sure that the problem is confined to one database in an
installation, you could just rename that database out of the way.
(There's no RENAME DATABASE command, but in 7.1 I think it'd work to
update the pg_database row with a new name.)

			regards, tom lane
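The rename-the-database-out-of-the-way idea above might look roughly
like this, carrying the same caveat Tom gives (he only thinks a direct
pg_database update works in 7.1, it is not a documented command): it
must be run as a database superuser while connected to some other
database such as template1, and the database names and credentials here
are placeholders.

    #!/usr/bin/perl -w
    use strict;
    use DBI;

    # Connect to template1 as a superuser; you cannot be connected to
    # the database you are about to rename.
    my $dbh = DBI->connect('dbi:Pg:dbname=template1;host=postgres.example.com',
                           'postgres', 'secret', {RaiseError => 1});

    # Rename the damaged database out of the way for later analysis.
    $dbh->do(q{UPDATE pg_database
                  SET datname = 'bad_websiteadmin'
                WHERE datname = 'websiteadmin'});

    $dbh->disconnect();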
** Reply to message from Tom Lane <tgl@sss.pgh.pa.us> on Fri, 09 Nov
   2001 12:36:38 -0500

Tom,

Thanks for taking the time to reply.

I have to say that I'm very impressed with the "expressive power" and
performance of PostgreSQL's implementation of SQL; I used it in this
situation because "it was there" when I took over the project, but what
I'm doing would be difficult or impossible with other SQL products I've
seen. (Or maybe it just lets me get away with really arcane syntax I
shouldn't be using (;-))

> Frank McKenney <frank_mckenney@mindspring.com> writes:
> > 1) Are there circumstances under which a "space exceeded" error on
> >    a client machine _can_ damage a database on the Postgres server
> >    machine?
>
> I don't see how. The error messages you cite all point to the idea
> that there was some internal corruption in the database. I'd venture
> more than one corrupted block, in fact. It appears that block 1444
> of your summary table's toast relation was clobbered (probably zeroed
> out in whole or in part), and the "relation nnnnn does not exist"
> complaints look like some bad things had happened to one or more
> system tables as well.

Hm. So it's likely that the only reason that site users didn't see the
corruption while all this was going on was that they didn't happen to
invoke that particular portion of that particular table? That would make
sense -- the site is lightly loaded right now; its status is "beta
moving into release" and only 8 users have access for database queries.

If it makes any difference (and I wish I had remembered this earlier),
the 'summary' table contains large formatted HTML-text fields, on the
order of 8-14K. We needed Postgres 7.1 to do this.

So it's likely (based on very limited evidence) that whatever corrupted
the database was something limited entirely to the server's environment
(hardware/OS/software)? Or, to put it another way, since I'm wearing my
"application programmer" hat in this situation (as opposed to "systems
programmer"), there's not much I can do to prevent this from recurring?
Other than sacrificing the occasional goat, of course (;-).

> Unfortunately, since you deleted the database, all the evidence is
> gone and there's no longer much hope of learning any more. If
> something like this happens again, it might be worth tar'ing up the
> $PGDATA tree (while the postmaster is stopped) for possible forensic
> analysis.

It was a "tight" situation, and I confess to leaning rather heavily on
the 'web host's support person to delete it. My main interest, at 23:30
or so, was to get _something_ working cleanly and properly before
morning (I finished, what with post-rebuild testing and all, around
03:30).

Looking back, I suppose we could have renamed all the tables to
bad_summary, bad_xxxx, etc. However, since I'm fairly new to SQL (let
alone Postgres), I think I'd have been concerned that the corruption
might "leak out" at some point and corrupt other tables. I've seen this
happen in other (non-Postgres) situations.

Is there a nicely documented procedure for renaming/moving an entire
database to a different account that I overlooked in my haste? I've
skimmed the 7.1 Reference manual to see if it listed anything that
looked right, but if it's in there I missed it.

> > 3) What other things could we have tried to recover this situation?
>
> I think you were pretty much out of luck on that database, though
> perhaps partial data recovery could have been made if you were
> willing to spend time on it.
> A more interesting thing to worry about is how to ensure it doesn't
> happen again, and here my advice would be to look at the reliability
> of your disk drives and I/O hardware. I've seen more than one report
> of mysterious data clobbers that eventually traced to bogus disk
> controllers, flaky RAM, etc. In particular, I recall several
> data-block-suddenly-became-zero failures with hardware origins, and
> none that traced to software problems...

Thank you for cc:ing my web hosting service support group. I'll follow
up with them on this.

Is there a simple, light-load way of testing the integrity of a given
table/database? Experience seems to indicate that

    pg_dump -what -ever database > /dev/null

would do a fairly good job on the entire database, but it would be nice
if there were a 20-second 'verifydb' program. Is there anything else
available along those lines designed for that specific purpose?

Is there a FAQ anywhere that discusses Postgres database corruption and
repair? Anyone know who's writing Chapter 10 for the next release of the
"PostgreSQL Administrator's Guide"? It pops up all _over_ the place when
you use Google to try to find out how to fix a Postgres database (;-).

In any case, thanks for the feedback. We learn, we try to plan for "next
time", and Murphy ensures that we'll have many more such "educational
opportunities" (;-).

Frank McKenney, McKenney Associates
Richmond, Virginia / (804) 320-4887
E-mail: frank_mckenney@mindspring.com
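A rough sketch of the pg_dump-to-/dev/null style of check described
above, written as a cron-able Perl script. The host and database names
are placeholders, authentication options are deliberately left out (they
should match whatever the site's pg_hba.conf expects; the PGUSER and
PGPASSWORD environment variables are one option), and the check only
proves that every table could be read end to end, not that the data is
semantically correct.

    #!/usr/bin/perl -w
    use strict;

    my $host = 'postgres.example.com';   # placeholder host
    my $db   = 'websiteadmin';           # placeholder database

    # Dump the whole database and throw the output away; a clean exit
    # status means every table's contents could be copied out of the
    # backend without error.
    my $status = system("pg_dump -h $host $db > /dev/null");

    die "pg_dump of $db reported errors -- investigate\n" if $status != 0;
    print "pg_dump of $db completed cleanly\n";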