Discussion: encoding confusion
[BACKGROUND]
I am testing dbmail for our corporate email solution.
We originally tested it on mysql and now we are migrating it to postgresql.
The messages are stored in a longblob field on mysql and a bytea field
in postgresql.
I set the database up as UTF-8, even though we get emails that are not
UTF encoded, mostly because I didn't know what else to try that would
incorporate all the possible encodings. Examples of 3 encodings we
regularly receive are: UTF-8, Windows-1255, ISO-8859-8-I.
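As a quick aside, the reason no single text encoding can absorb raw bytes from all three of those encodings can be illustrated in Python. This is a hypothetical example, not from the original thread: the same Hebrew letter has a different byte representation in each encoding, so bytes valid in one are generally invalid in another.

```python
# The Hebrew letter aleph in the three encodings the poster receives.
aleph = "\u05d0"

assert aleph.encode("utf-8") == b"\xd7\x90"   # two bytes in UTF-8
assert aleph.encode("cp1255") == b"\xe0"      # one byte in Windows-1255
assert aleph.encode("iso8859_8") == b"\xe0"   # same byte in ISO-8859-8

# A Windows-1255 byte sequence is typically not valid UTF-8:
try:
    b"\xe0".decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert not valid_utf8
```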
I transferred the entire database except for one table using the
dbi-link contrib module, connecting directly to MySQL and pulling the
data. The table that did not transfer was the Messageblks table, which
contains the longblob/bytea field mentioned above.
It would not transfer through the dbi-link, so I wrote a python script
(see below) to read a row from mysql and write a row to postgresql
(using pygresql and mysqldb).
When I used pygresql's escape_bytea function to copy the data, it went
smoothly, but the data was corrupt.
When I tried the escape_string function it died because the data it was
moving was not UTF-8.
I finally got it to work by defining the database as SQL-ASCII; with
that, escape_string worked. After the data was all in place, I pg_dumped
and pg_restored into a UTF-8 database, and it surprisingly works now.
[CONFUSION]
What I don't understand is this: if the database can't handle the
non-UTF-8 characters, how does it accept them when I receive an email
(tested, it works) and when I restored the backup?
I also don't understand why the data transfer didn't work into a UTF-8
database but did work into an ASCII database, given that the data can go
into a UTF-8 database from an ASCII database.
Lastly, I wanted to know if anybody has experience moving data from
mysql to postgresql and if what I did is considered normal, or if there
is a better way of doing this.
Thank you
Sim
[Python script]:
import MySQLdb
import MySQLdb.cursors
import pg

pgdb = pg.connect(host="1.2.3.4", user="username", dbname="dbmail",
                  passwd="password")
mydb = MySQLdb.connect(host="localhost", user="mysqluser",
                       passwd="mysqlpassword", db="dbmail",
                       cursorclass=MySQLdb.cursors.SSDictCursor,
                       charset="utf8")

mycrs = mydb.cursor()
mycrs.execute("select * from dbmail_messageblks")

while 1:
    nextrow = mycrs.fetchone()
    if nextrow is None:  # fetchone() returns None when the rows run out
        break
    nextrow["messageblk"] = pg.escape_string(nextrow["messageblk"])
    pgdb.query("insert into dbmail_messageblks"
               "(messageblk,physmessage_id,is_header,"
               "messageblk_idnr,blocksize) "
               "values($field$%(messageblk)s$field$,%(physmessage_id)s,"
               "%(is_header)s,%(messageblk_idnr)s,%(blocksize)s)"
               % nextrow)
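For context on the "012" that shows up later in the thread, here is a minimal sketch (my own illustration, not PyGreSQL's actual implementation) of PostgreSQL's traditional "escape" bytea format: non-printable bytes become \nnn octal escapes and backslashes are doubled. This is how a newline byte (decimal 10) ends up displayed as 012 in a bytea column.

```python
def escape_bytea_sketch(data: bytes) -> str:
    # Approximation of the traditional bytea escape format:
    # printable ASCII passes through, backslash is doubled,
    # everything else becomes a 3-digit octal escape.
    out = []
    for b in data:
        if b == 0x5C:            # backslash
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E:  # printable ASCII
            out.append(chr(b))
        else:                    # e.g. newline (10) -> \012
            out.append("\\%03o" % b)
    return "".join(out)

assert escape_bytea_sketch(b"a\nb") == "a\\012b"
assert escape_bytea_sketch(b"\\") == "\\\\"
```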
Sim Zacks wrote:
> [BACKGROUND]
> I am testing dbmail for our corporate email solution.
>
> We originally tested it on mysql and now we are migrating it to postgresql.
>
> The messages are stored in a longblob field on mysql and a bytea field
> in postgresql.
>
> I set the database up as UTF-8
Not relevant if you're using bytea. Encoding is for character-based
types (varchar, text), not byte-based types.
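The bytes-versus-text distinction Richard is drawing can be shown in plain Python (a hypothetical illustration, not code from the thread): a client that keeps the column as bytes never needs to consult any encoding, while one that decodes it as text must pick an encoding and can fail on data that doesn't match.

```python
# Windows-1255 bytes for a Hebrew word, as in the poster's mail store.
raw = "\u05e9\u05dc\u05d5\u05dd".encode("cp1255")

# Treated as bytes: nothing to decode, nothing can fail.
assert isinstance(raw, bytes)

# Treated as UTF-8 text: the very same bytes are rejected.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok
```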
[snip]
> When I used pygresql's escape_bytea function to copy the data, it went
> smoothly, but the data was corrupt.
> When I tried the escape_string function it died because the data it was
> moving was not UTF-8.
Your Python script seems to think it's dealing with text
rather than a stream of bytes too. I'm not a Python programmer, but I'd
guess it's treating one of the database fields (either MySQL or
PostgreSQL) as text not bytes. You'll need to check the docs for
binary-data handling in your Python libraries.
I'm puzzled as to how the data was corrupted with escape_bytea() - I
can't imagine it would be that difficult for the library to get right.
I'd be suspicious that the source data wasn't what I thought it was.
> [CONFUSION]
> What I don't understand, is that if that database can't handle the non
> UTF characters, how does it allow them in when I receive an email
> (tested, it works) and when I restored the backup.
> I also don't understand why the data transfer didn't work to a UTF
> database, but it worked to an ASCII database, if the data can go into a
> UTF database from an ascii database.
Whatever is going on, it's nothing to do with the bytea type.
> Lastly, I wanted to know if anybody has experience moving data from
> mysql to postgresql and if what I did is considered normal, or if there
> is a better way of doing this.
I think that something in the process is trying to be clever and
treating the blobs as text.
--
Richard Huxton
Archonet Ltd
Sim Zacks wrote:
> We originally tested it on mysql and now we are migrating it
> to postgresql.
>
> The messages are stored in a longblob field on mysql and a bytea field
> in postgresql.
>
> I set the database up as UTF-8, even though we get emails that are not
> UTF encoded, mostly because I didn't know what else to try that would
> incorporate all the possible encodings. Examples of 3 encodings we
> regularly receive are: UTF-8, Windows-1255, ISO-8859-8-I.
[...]
> It would not transfer through the dbi-link, so I wrote a python script
> (see below) to read a row from mysql and write a row to postgresql
> (using pygresql and mysqldb).
> When I used pygresql's escape_bytea function to copy the data, it went
> smoothly, but the data was corrupt.
> When I tried the escape_string function it died because the data it was
> moving was not UTF-8.
>
> I finally got it to work by defining a database as SQL-ASCII and then
> using escape_string worked. After the data was all in place, I pg_dumped
> and pg_restored into a UTF-8 database and it surprisingly works now.

It's very difficult to know what exactly happened unless you have some
examples of a byte sequence that illustrates what you describe: how it
looked in MySQL, how it looked in your Python script, and what you fed
to escape_bytea.

What client encoding did you use in your Python script?

Yours,
Laurenz Albe
The data in the longblob field might be text, which could be causing the
confusion. For example, when I look at the data in the longblob field, I
see \n for a newline, and when I look at the bytea it is 012.

I can only tell you what happened on the client end, in terms of
corruption. I am using the Thunderbird client. When I clicked on a
message, it didn't show the data, and when I looked at the headers, it
was just a big mess. I'm guessing that somehow the newlines didn't work
and the headers and message were overlaid on top of each other.

Richard Huxton wrote:
> Your Python script seems to think it's dealing with text
> rather than a stream of bytes too. I'm not a Python programmer, but I'd
> guess it's treating one of the database fields (either MySQL or
> PostgreSQL) as text not bytes. You'll need to check the docs for
> binary-data handling in your Python libraries.
>
> I'm puzzled as to how the data was corrupted with escape_bytea() - I
> can't imagine it would be that difficult for the library to get right.
> I'd be suspicious that the source data wasn't what I thought it was.
Sim Zacks wrote:
> The data in the longblob field might be text, which could be causing the
> confusion. For example, when I look at the data in the longblob field, I
> see \n for a newline and when I look at the bytea it is 012.

That's right - newline is ASCII 10 (or 012 in octal).

> I can only tell you what happened on the client end, in terms of
> corruption. I am using the Thunderbird client. When I clicked on a
> message, it didn't show the data, and when I looked at the headers, it
> was just a big mess. I'm guessing that somehow the newlines didn't work
> and the headers and message were overlaid on top of each other.

Well, that might be a problem with dbmail's setup rather than the
database. I think headers are restricted to ASCII only (the body is a
different matter).

The best bet is to be certain whether the database is to blame. Find a
problem entry, dump that one row to a file from MySQL, do the same from
PostgreSQL, and also from the midpoint in your Python code doing the
transfer. Then use a hex editor / dumper (e.g. "hexdump -C" on Linux) to
see which bytes differ in the files.

--
Richard Huxton
Archonet Ltd
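Richard's dump-and-compare suggestion could be sketched in Python along these lines. The helper names and the sample byte strings below are mine, not from the thread; the idea is just to find the first differing offset and print a hexdump-style line around it.

```python
def first_difference(a: bytes, b: bytes):
    # Return the offset of the first mismatching byte, or None if equal.
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

def hexline(data: bytes, offset: int = 0) -> str:
    # One "hexdump -C"-style line: offset, hex bytes, printable view.
    chunk = data[offset:offset + 16]
    hexpart = " ".join("%02x" % c for c in chunk)
    text = "".join(chr(c) if 32 <= c < 127 else "." for c in chunk)
    return "%08x  %-47s  |%s|" % (offset, hexpart, text)

# Hypothetical rows dumped from MySQL and PostgreSQL:
mysql_row = b"Subject: test\r\nFrom: a@b\r\n"
pg_row = b"Subject: test\nFrom: a@b\n"  # \r bytes lost in transfer

diff = first_difference(mysql_row, pg_row)
assert diff == 13  # the first mismatch is the missing \r byte
```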
On Jun 11, 2008, at 9:03 AM, Richard Huxton wrote:
> Sim Zacks wrote:
>> The data in the longblob field might be text, which could be causing
>> the confusion. For example, when I look at the data in the longblob
>> field, I see \n for a newline and when I look at the bytea it is 012.
>
> That's right - newline is ASCII 10 (or 012 in octal).

Just off the top of my head, but aren't mail headers supposed to be
terminated by \r\n? In that case you're missing a byte 13 (or 015 octal)
before the 012 byte. That could explain your missing line breaks.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.
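A quick check along the lines Alban suggests (a hypothetical example, with sample data of my own): count CRLF versus bare-LF line endings in a header blob to see whether the \r bytes survived the transfer.

```python
def line_ending_counts(blob: bytes):
    # Returns (number of CRLF endings, number of bare LF endings).
    crlf = blob.count(b"\r\n")
    bare_lf = blob.count(b"\n") - crlf
    return crlf, bare_lf

intact = b"Subject: hi\r\nFrom: a@b\r\n\r\nbody\n"
stripped = intact.replace(b"\r\n", b"\n")  # what a lossy transfer yields

assert line_ending_counts(intact) == (3, 1)
assert line_ending_counts(stripped) == (0, 4)
```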