Hello,
Please CC me any reply since I'm not on this list. Should I send this
to -hackers as well?
Yesterday, it became necessary to upgrade our Postgres 6.3.2 database
servers to 6.4.2. We had planned to do this anyway, after hours, but
one particular query in a mission-critical app connecting to 6.3.2 was
causing it to coredump every time, and there were some notes in the
HISTORY file about fixing memory overruns, so it was decided to
upgrade the database server during the day, previous upgrades having
taken less than an hour.
Our machines here are BSD/OS 3.1.
So anyway, first thing I do is compile 6.4.2. A hitch there already:
it is looking for netdb.h in one of the files, but can't find it. A
bit of detective work later, I find that file in /sco/usr/include and
copy it into postgresql-6.4/src/include and things work.
Then, I kill the server and all backends, set pg_hba.conf so only I
can connect, and pg_dumpall -z the database. That worked fine.
I also do a dump appropriate for pg_upgrade, figuring that maybe I'd
try that out of necessary. I also tar all this stuff up, including my
data and bin directories, just in case anything should go haywire.
This will turn out to have been a wise precaution.
I then move data to data.old, gmake install, initdb, and psql -e
template1. It looks to be reloading the database OK for awhile. But
then all of a sudden the screen is dumping pages and pages of psql's
help text (the backslash commands, etc.) Noting that this is not good
at all, I ctrl-C the restore, rm -rf data, initdb again, and try
pg_upgrade. This complains that psql was unsuccessful and doesn't run
either (it did appear to issue queries.) So panic time arrives.
After restoring my old binaries and trying a dump with -d to dump
everything as an INSERT query, and seeing psql spit out literally
thousands of errors after trying to load this back into the new
database, I end up doing this (on a new freshly-initdb'd data
directory in 6.4.2):
psql -e template1 < DumpFile &> results
and:
tail -f results
in another window.
Eventually, I discovered all these problems:
* My User groups were not preserved from the previous version,
so things like grant all to group foo; were failing.
* The dumpfile would do things like create a table as one user,
set the permissions, and then when it tried to use COPY, it would
be connected as a different user that had no permissions to
update. This is one of the most insidious problems; psql does
not stop when it does not have permissions to copy; nor does
it continue reading until the end marker. It simply interprets
all the commands as psql commands and gets thoroughly confused.
By the time I had finally fixed this, I had loaded my 36-megabyte
dumpfile into vi about a dozen times, basically in a
trial-and-error fashion waiting for it to find the next problem.
* Tables containing CONSTRAINT or DEFAULT clauses were not created
by CREATE TABLE (citing parse errors), and causing the same problem
with COPY later on as described above. By that time, I was
several hours behind schedule and finally just deleted those
clauses from the dumpfile so I could get the server back online.
* The old pg_dumpall dumped things as char4, char8, char2, etc.
The new Postgres of course uses char(4) syntax. For some reason,
it doesn't even accept the old syntax, meaning that every single
table that used one of those things was not created either,
and caused problems with COPY. So I wrote a sed to fix these
problems, but that caused problems with CREATE INDEX later. Those
were few enough that they could be manually fixed.
* What was before a char8 is now treated as always having trailing
spaces as a char(8). This is causing extreme havoc with programs
that deal with it. Either that, or the dump corrupted the data
by adding trailing spaces. Either way, changing the data is
NOT acceptable and ought to be fixed. We are storing Unix
usernames. Unix does not put trailing spaces, and so string
comparisons are failing all over the place.
By the time I had tracked down all these problems, the database had
been down for about 4 or 5 hours instead of the 45 minutes that it has
taken me to upgrade in the past. This really is not acceptable for
something used in a mission-critical situation; we have been getting
angry calls and e-mails from customers. I have still not had time to
restore all the groups that were created; I don't understand why they
were not carried over.
In short, I'm not a happy camper. I've always been a PostgreSQL fan,
and I look forward to the features coming in 6.5, but the problems
with this procedure show that people have not tested it. I still find
it hard to believe that nobody else ran into the char8 problems with
upgrading. In my brief narrative above, I've skipped things that
eventually led to no help. For instance, I tried the pg_dumpall from
6.4.2 on 6.3.2, but it didn't help.
I seriously hope that these problems will be fixed before 6.4.3 or
6.5, and that there will be sufficient testing to ensure that the
upgrade process at least works somewhat smoothly. Dumping and
restoring the database is bad enough; it causes a good deal of
downtime, but the process has traditionally been trouble-free. All of
a sudden, it is not, and even someone that has been using this
database since the early Postgre95 days spent many hours trying to fix
it.
--
John Goerzen SouthWind Internet Access, Inc.
E-mail: Business, jgoerzen@southwind.net; Personal, jgoerzen@complete.org
Developer, Debian GNU/Linux <http://www.debian.org>