Discussion: Perl script failure => Postgres 7.1.2 database corruption
Severity
--------
Major annoyance (having to rebuild a database)... unless there is a
simple workaround I was too ignorant to apply (;-).

Problem
-------
A few days ago, for no obvious reason, a Perl script used to update the
database portion of a 'web site suddenly stopped working. Several hours
later it became apparent that "idle" Postgres processes were queuing up
on the Postgres server machine. Still later it became obvious that at
least one table in the database was damaged to the point that even
pg_dump would not dump it properly.

The 'web site uses Perl CGI (Apache mod_perl) and DBD::Pg to access a
Postgres database on a separate server.

Symptoms
--------
- Backed-up hung ("idle") Postgres processes.
- Undeletable table in database (apparently damaged).
- pg_dump failure.
- psql errors accessing table.
- Perl DBI errors reported in Apache/mod_perl error log.

Background
----------
The Perl script involved here updates the Postgres database from Win32
.DBF files using DBD::Pg and DBD::XBase. The script has run successfully
for several weeks, and in fact ran successfully using the same set of
DBF files last Friday.

In order to minimize the effect of the update on users who may be using
the database, the script performs the entire database update as a single
transaction. In Perl DBI terms, it issues a

    $obj = DBI->connect($db_datasource, $db_username, $db_password,
                        {AutoCommit => 0});

and when the updates have been added without error (which usually
happens) it issues a

    $obj->commit();
    $obj->disconnect();
    $obj = undef;

This time the update died partway through:

    + Marking old records for deletion.

(At this point the script was in the process of issuing DELETE FROM
TABLE xxxxx statements.)

    DBD::Pg::db do failed: pqReadData() -- backend closed the channel
    unexpectedly. at ./update-foundation-db.pl line 244.
    Could not purge old records from summary, stopping at
    ./update-foundation-db.pl line 244.

What triggered the first error is still not completely clear, but it may
have been the result of the account exceeding its disk space quota.
However, even after some 40Mb of files were deleted and the quota was
increased by 50Mb, the script continued to fail at the same point.

The only mildly bright spot in all this is that none of this appeared to
affect 'web site users. SQL queries from site users continued to yield
results throughout (well, up to the point where I deleted the database
(;-)).

Other messages reported by the script included:

    DBD::Pg::db do failed: ERROR:  heap_delete: (am)invalid tid
    at ./update-foundation-db.pl line 244.
    DBD::Pg::db do failed: ERROR:  simple_heap_delete: tuple concurrently
    updated at ./update-foundation-db.pl line 244.
    DBD::Pg::db do failed: ERROR:  Relation 1632370768 does not exist
    at ./update-foundation-db.pl line 244.
    DBD::Pg::db do failed: ERROR:  Relation 419450449 does not exist
    at ./update-foundation-db.pl line 244.

At the same time, "idle" Postgres processes were backing up on the
Postgres server. These only cleared when the Postgres server process was
restarted.

-------
10626 ?  S  0:00 postgres: websiteadmin websiteadmin 206.163.128.1 idle in transaction
(and 8 more like it at one point)
-------

Postgres access from the 'web site logged errors to the Apache error log
for the site's virtual host:

-----------
NOTICE:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend died
        abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am going to
        terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
[Tue Nov  6 20:52:21 2001] null: Database handle destroyed without
explicit disconnect at /usr/lib/perl5/site_perl/Apache/DBI.pm line 119.
-----------

A number of the 'null:' errors were reported. A similar NOTICE: preceded
the following errors as well:

-----------
[Tue Nov  6 20:52:53 2001] [error] FATAL 2:  XLogFlush: request is not
satisfied at /home1/g/websiteadmin/site/modules//StandardUtil.pm line 48.
[Tue Nov  6 20:52:54 2001] null: DBD::Pg::db rollback failed: rollback
failed at /usr/lib/perl5/site_perl/Apache/DBI.pm line 153.
-----------

I attempted to dump the database with pg_dump, and failed miserably. The
results convinced me that the database was _seriously_ mangled.

-----------
pg_dump -c -Fc -h postgres.sitehost.com -u websiteadmin >db.pgd
Username: websiteadmin
Password:
ERROR:  Relation 419450449 does not exist
PQendcopy: resetting connection
SQL query to dump the contents of Table 'summary' did not execute
correctly.  After we read all the table contents from the backend,
PQendcopy() failed.  Explanation from backend: 'ERROR:  Relation
419450449 does not exist '.
The query was: 'COPY "summary" TO stdout;'.

psql -h postgres.kattare.com -U websiteadmin -W
Password:
Welcome to psql, the PostgreSQL interactive terminal.
--snip--
websiteadmin=> vacuum verbose summary;
NOTICE:  --Relation summary--
NOTICE:  Pages 728: Changed 34, reaped 604, Empty 0, New 0; Tup 1032:
Vac 5094, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 96, MaxLen 2030;
Re-using: Free/Avail. Space 4913484/4913484; EndEmpty/Avail. Pages 0/604.
CPU 0.03s/0.00u sec.
NOTICE:  Index summary_pkey: Pages 31; Tuples 1032: Deleted 5094.
CPU 0.00s/0.06u sec.
NOTICE:  Rel summary: Pages: 728 --> 118; Tuple(s) moved: 1032.
CPU 0.11s/0.16u sec.
NOTICE:  Index summary_pkey: Pages 31; Tuples 1032: Deleted 1032.
CPU 0.00s/0.01u sec.
NOTICE:  --Relation pg_toast_20037--
NOTICE:  Rel pg_toast_20037: TID 1444/14: OID IS INVALID. TUPGONE 1.
NOTICE:  Rel pg_toast_20037: TID 1444/19: OID IS INVALID. TUPGONE 1.
-- Many, MANY similar lines snipped --
NOTICE:  Rel pg_toast_20037: TID 1444/2008: OID IS INVALID. TUPGONE 1.
NOTICE:  Rel pg_toast_20037: TID 1444/2016: OID IS INVALID. TUPGONE 1.
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q
-----------

Workaround
----------
Due to time constraints, and because I had the data available, I deleted
the database with 'dropdb -h', re-created it using 'createdb', and
re-populated it using the now-functioning Perl script.

Another choice might have been to restore the database from a backup,
assuming that the database corruption actually occurred in conjunction
with the initial script failure and not some time earlier.

Environment
-----------
DB Server: Postgres 7.1.2, Slackware Linux 8.?
           SELECT version() reports
           PostgreSQL 7.1.2 on i686-pc-linux-gnu, compiled by GCC egcs-2.91.66
DB Client: Apache 1.3.20, Perl 5.6.1, mod_perl, Slackware 8.1

Questions
---------
1) Are there circumstances under which a "space exceeded" error on a
   client machine _can_ damage a database on the Postgres server
   machine?

2) The Postgres error messages all indicate that the transaction was
   going to be "rolled back", but to all indications (including the hung
   processes (;-)) it was not. Is this a bug? Or am I misreading the
   message?

3) What other things could we have tried to recover this situation?
   Could I have ignored Postgres' claim that it was rolling back the
   transaction, re-connected to the database (with psql, say), and
   issued a ROLLBACK myself?

4) Where would I find information that would help me accurately estimate
   the client's disk space requirements for SQL database processing on a
   Postgres server?

Frank McKenney

Frank McKenney, McKenney Associates
Richmond, Virginia / (804) 320-4887
E-mail: frank_mckenney@mindspring.com
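For reference, the single-transaction update pattern described under
"Background" above, with explicit error handling, might look roughly
like the following in DBI terms. This is an illustrative sketch only:
the table name 'summary' comes from the error messages above, but the
connection parameters, the 'obsolete' column, and the update logic are
placeholders, not the actual update-foundation-db.pl code.

    #!/usr/bin/perl -w
    use strict;
    use DBI;

    # Placeholder connection parameters -- substitute the real ones.
    my $db_datasource = 'dbi:Pg:dbname=websiteadmin;host=postgres.example.com';
    my $db_username   = 'websiteadmin';
    my $db_password   = 'secret';

    # AutoCommit => 0 makes the entire update a single transaction;
    # RaiseError => 1 turns DBD::Pg errors into exceptions we can catch.
    my $dbh = DBI->connect($db_datasource, $db_username, $db_password,
                           {AutoCommit => 0, RaiseError => 1, PrintError => 0})
        or die "connect failed: $DBI::errstr";

    eval {
        # ... load the new rows from the .DBF files here ...

        # Purge the rows marked for deletion (placeholder WHERE clause).
        $dbh->do(q{DELETE FROM summary WHERE obsolete});

        $dbh->commit();
    };
    if ($@) {
        warn "update failed, rolling back: $@";
        # Issue an explicit ROLLBACK rather than leaving the backend
        # sitting "idle in transaction".
        eval { $dbh->rollback() };
    }
    $dbh->disconnect();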
Frank McKenney <frank_mckenney@mindspring.com> writes:
> 1) Are there circumstances under which a "space exceeded" error on
>    a client machine _can_ damage a database on the Postgres server
>    machine?

I don't see how. The error messages you cite all point to the idea that
there was some internal corruption in the database. I'd venture more
than one corrupted block, in fact. It appears that block 1444 of your
summary table's toast relation was clobbered (probably zeroed out in
whole or in part), and the "relation nnnnn does not exist" complaints
look like some bad things had happened to one or more system tables as
well.

Unfortunately, since you deleted the database, all the evidence is gone
and there's no longer much hope of learning any more. If something like
this happens again, it might be worth tar'ing up the $PGDATA tree (while
the postmaster is stopped) for possible forensic analysis.

> 3) What other things could we have tried to recover this situation?

I think you were pretty much out of luck on that database, though
perhaps partial data recovery could have been made if you were willing
to spend time on it.

A more interesting thing to worry about is how to ensure it doesn't
happen again, and here my advice would be to look at the reliability of
your disk drives and I/O hardware. I've seen more than one report of
mysterious data clobbers that eventually traced to bogus disk
controllers, flaky RAM, etc. In particular, I recall several
data-block-suddenly-became-zero failures with hardware origins, and none
that traced to software problems...

			regards, tom lane
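A minimal sketch of the evidence-preserving snapshot suggested above,
assuming pg_ctl and tar are on the PATH, that PGDATA points at the
cluster's data directory, and that /var/tmp has room for the archive
(all of these are assumptions; the paths are illustrative):

    #!/usr/bin/perl -w
    use strict;
    use POSIX qw(strftime);

    my $pgdata  = $ENV{PGDATA} || '/var/lib/pgsql/data';   # assumed location
    my $stamp   = strftime('%Y%m%d-%H%M%S', localtime);
    my $archive = "/var/tmp/pgdata-forensic-$stamp.tar.gz";

    # Stop the postmaster first so the files are not changing under tar.
    system('pg_ctl', 'stop', '-D', $pgdata) == 0
        or die "pg_ctl stop failed: $?";

    system('tar', 'czf', $archive, $pgdata) == 0
        or die "tar failed: $?";

    print "saved $pgdata to $archive; restart the postmaster when ready\n";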
Frank McKenney <frank_mckenney@mindspring.com> writes:
> Hm. So it's likely that the only reason that site users didn't see
> the corruption while all this was going on was that they didn't
> happen to invoke that particular portion of that particular table?

Yes. In fact, since the clobbered data table was a TOAST table, they
could actually have read the affected main-table rows without noticing a
problem, so long as they didn't try to fetch the wide fields.

The "relation does not exist" messages point to some problem in the
system tables as well, but we can no longer guess much about its extent.
It could well have been something that only affected a few tables.

> If it makes any difference (and I wish I had remembered this
> earlier), the 'summary' table contains large formatted HTML-text
> fields, on the order of 8-14K. We needed Postgres 7.1 to do this.

Right, it would have been those fields that were clobbered.

> Looking back, I suppose we could have renamed all the tables to
> bad_summary, bad_xxxx, etc. However, since I'm fairly new to SQL
> (let alone Postgres), I think I'd have been concerned that the
> corruption might "leak out" at some point and corrupt other tables.

The apparent system-table corruption might indeed have some such
behavior; it's hard to tell without more info. My inclination would have
been to rename the trashed database out of the way and leave it there
for future analysis, if you had the disk space to spare.

> Is there a nicely documented procedure for renaming/moving an entire
> database to a different account that I overlooked in my haste?

It's not well documented. The most secure procedure is to stop the
postmaster and then cp -rp or tar the whole $PGDATA tree. If you're
fairly sure that the problem is confined to one database in an
installation, you could just rename that database out of the way.
(There's no RENAME DATABASE command, but in 7.1 I think it'd work to
update the pg_database row with a new name.)

			regards, tom lane
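The rename-the-database-out-of-the-way idea above might look roughly
like this, carrying the same caveat Tom gives (he only thinks a direct
pg_database update works in 7.1, it is not a documented command): it
must be run as a database superuser while connected to some other
database such as template1, and the database names and credentials here
are placeholders.

    #!/usr/bin/perl -w
    use strict;
    use DBI;

    # Connect to template1 as a superuser; you cannot be connected to
    # the database you are about to rename.
    my $dbh = DBI->connect('dbi:Pg:dbname=template1;host=postgres.example.com',
                           'postgres', 'secret', {RaiseError => 1});

    # Rename the damaged database out of the way for later analysis.
    $dbh->do(q{UPDATE pg_database
                  SET datname = 'bad_websiteadmin'
                WHERE datname = 'websiteadmin'});

    $dbh->disconnect();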
** Reply to message from Tom Lane <tgl@sss.pgh.pa.us> on Fri, 09 Nov
   2001 12:36:38 -0500

Tom,

Thanks for taking the time to reply.

I have to say that I'm very impressed with the "expressive power" and
performance of PostgreSQL's implementation of SQL; I used it in this
situation because "it was there" when I took over the project, but what
I'm doing would be difficult or impossible with other SQL products I've
seen. (Or maybe it just lets me get away with really arcane syntax I
shouldn't be using (;-))

> Frank McKenney <frank_mckenney@mindspring.com> writes:
> > 1) Are there circumstances under which a "space exceeded" error on
> >    a client machine _can_ damage a database on the Postgres server
> >    machine?
>
> I don't see how. The error messages you cite all point to the idea
> that there was some internal corruption in the database. I'd venture
> more than one corrupted block, in fact. It appears that block 1444
> of your summary table's toast relation was clobbered (probably zeroed
> out in whole or in part), and the "relation nnnnn does not exist"
> complaints look like some bad things had happened to one or more
> system tables as well.

Hm. So it's likely that the only reason that site users didn't see the
corruption while all this was going on was that they didn't happen to
invoke that particular portion of that particular table? That would make
sense -- the site is lightly loaded right now; its status is "beta
moving into release" and only 8 users have access for database queries.

If it makes any difference (and I wish I had remembered this earlier),
the 'summary' table contains large formatted HTML-text fields, on the
order of 8-14K. We needed Postgres 7.1 to do this.

So it's likely (based on very limited evidence) that whatever corrupted
the database was something limited entirely to the server's environment
(hardware/OS/software)? Or, to put it another way, since I'm wearing my
"application programmer" hat in this situation (as opposed to "systems
programmer"), there's not much I can do to prevent this from recurring?
Other than sacrificing the occasional goat, of course (;-).

> Unfortunately, since you deleted the database, all the evidence is
> gone and there's no longer much hope of learning any more. If
> something like this happens again, it might be worth tar'ing up the
> $PGDATA tree (while the postmaster is stopped) for possible forensic
> analysis.

It was a "tight" situation, and I confess to leaning rather heavily on
the 'web host's support person to delete it. My main interest, at 23:30
or so, was to get _something_ working cleanly and properly before
morning (I finished, what with post-rebuild testing and all, around
03:30).

Looking back, I suppose we could have renamed all the tables to
bad_summary, bad_xxxx, etc. However, since I'm fairly new to SQL (let
alone Postgres), I think I'd have been concerned that the corruption
might "leak out" at some point and corrupt other tables. I've seen this
happen in other (non-Postgres) situations.

Is there a nicely documented procedure for renaming/moving an entire
database to a different account that I overlooked in my haste? I've
skimmed the 7.1 Reference manual to see if it listed anything that
looked right, but if it's in there I missed it.

> > 3) What other things could we have tried to recover this situation?
>
> I think you were pretty much out of luck on that database, though
> perhaps partial data recovery could have been made if you were
> willing to spend time on it.
> A more interesting thing to worry about is how to ensure it doesn't
> happen again, and here my advice would be to look at the reliability
> of your disk drives and I/O hardware. I've seen more than one report
> of mysterious data clobbers that eventually traced to bogus disk
> controllers, flaky RAM, etc. In particular, I recall several
> data-block-suddenly-became-zero failures with hardware origins, and
> none that traced to software problems...

Thank you for cc:ing my web hosting service support group. I'll follow
up with them on this.

Is there a simple, light-load way of testing the integrity of a given
table/database? Experience seems to indicate that

    pg_dump -what -ever database > /dev/null

would do a fairly good job on the entire database, but it would be nice
if there were a 20-second 'verifydb' program. Is there anything else
available along those lines designed for that specific purpose?

Is there a FAQ anywhere that discusses Postgres database corruption and
repair? Anyone know who's writing Chapter 10 for the next release of the
"PostgreSQL Administrator's Guide"? It pops up all _over_ the place when
you use Google to try to find out how to fix a Postgres database (;-).

In any case, thanks for the feedback. We learn, we try to plan for "next
time", and Murphy ensures that we'll have many more such "educational
opportunities" (;-).

Frank McKenney, McKenney Associates
Richmond, Virginia / (804) 320-4887
E-mail: frank_mckenney@mindspring.com
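A rough sketch of the pg_dump-to-/dev/null style of check described
above, written as a cron-able Perl script. The host and database names
are placeholders, authentication options are deliberately left out (they
should match whatever the site's pg_hba.conf expects; the PGUSER and
PGPASSWORD environment variables are one option), and the check only
proves that every table could be read end to end, not that the data is
semantically correct.

    #!/usr/bin/perl -w
    use strict;

    my $host = 'postgres.example.com';   # placeholder host
    my $db   = 'websiteadmin';           # placeholder database

    # Dump the whole database and throw the output away; a clean exit
    # status means every table's contents could be copied out of the
    # backend without error.
    my $status = system("pg_dump -h $host $db > /dev/null");

    die "pg_dump of $db reported errors -- investigate\n" if $status != 0;
    print "pg_dump of $db completed cleanly\n";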