Обсуждение: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
My db is in utf-8, I have a row in my table say tmp_article and I wanted to generate ts_vector from the article content: select to_tsvector(content) from tmp_article; But I got this error: ERROR: invalid byte sequence for encoding "UTF8": 0xf481 I am wondering how this could happen. I think if there was invalid UTF8 bytes in the content, it shouldn't have been able to inserted into the tmp_article table as I sometimes see similar errors when inserting records to tmp_article. Am I right?
2012/4/14 Rural Hunter <ruralhunter@gmail.com>
My db is in utf-8, I have a row in my table say tmp_article and I wanted to generate ts_vector from the article content:
select to_tsvector(content) from tmp_article;
But I got this error:
ERROR: invalid byte sequence for encoding "UTF8": 0xf481
I am wondering how this could happen. I think if there was invalid UTF8 bytes in the content, it shouldn't have been able to inserted into the tmp_article table as I sometimes see similar errors when inserting records to tmp_article. Am I right?
This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
Try to set client_encoding='LATIN1'
and then execute
Thanks & Regards,
Raghu Ram
EnterpriseDB: http://www.enterprisedb.com
Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
От
Rural Hunter
Дата:
doesn't work either.<br /><br /> db=# show client_encoding;<br /> client_encoding <br /> -----------------<br /> UTF8<br/> (1 row)<br /><br /> db=# set client_encoding='LATIN1';<br /> SET<br /> db=# show client_encoding;<br /> client_encoding<br /> -----------------<br /> LATIN1<br /> (1 row)<br /><br /> db=# select to_tsvector(content) from tmp_article;<br/> ERROR: invalid byte sequence for encoding "UTF8": 0xf481<br /><br /> 于 2012/4/14 10:15, raghu ram 写道:<blockquote cite="mid:CALnrrJTPzsNvaFP_qHisMS=wwp2bkvK=gk_wPDzC-fdaXEkypA@mail.gmail.com" type="cite"><br /><br /><divclass="gmail_quote">2012/4/14 Rural Hunter <span dir="ltr"><<a href="mailto:ruralhunter@gmail.com" moz-do-not-send="true">ruralhunter@gmail.com</a>></span><br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> My db is in utf-8, I have a row in my table say tmp_article and I wantedto generate ts_vector from the article content:<br /> select to_tsvector(content) from tmp_article;<br /> But I gotthis error:<br /> ERROR: invalid byte sequence for encoding "UTF8": 0xf481<br /><br /> I am wondering how this couldhappen. I think if there was invalid UTF8 bytes in the content, it shouldn't have been able to inserted into the tmp_articletable as I sometimes see similar errors when inserting records to tmp_article. Am I right?<span class="HOEnZb"><fontcolor="#888888"><br /><br /></font></span></blockquote><div><br /></div><div><p class="p1">This errorcan also happen if the <span class="s1">byte</span> <span class="s1">sequence</span> does not match the <span class="s1">encoding</span>expected by the server, which is controlled by "client_encoding".</div><div>Try to set client_encoding='LATIN1' </div><div><br/></div><div>and then execute </div></div><div><br /></div> select to_tsvector(content)from tmp_article;<br clear="all" /><div><br /></div> -- <br /><p>Thanks & Regards,<p>Raghu Ram<p>EnterpriseDB:<a href="http://www.enterprisedb.com/" moz-do-not-send="true" target="_blank"><span>http://www.enterprisedb.com</span></a><br/></blockquote><br />
On Sat, Apr 14, 2012 at 9:31 AM, Rural Hunter <ruralhunter@gmail.com> wrote:
-- doesn't work either.
db=# show client_encoding;
client_encoding
-----------------
UTF8
(1 row)
db=# set client_encoding='LATIN1';
SET
db=# show client_encoding;
client_encoding
-----------------
LATIN1
(1 row)
db=# select to_tsvector(content) from tmp_article;
ERROR: invalid byte sequence for encoding "UTF8": 0xf481
Try to set client_encoding='SQL_ASCII'and then executeselect to_tsvector(content) from tmp_article;
Thanks & Regards,
Raghu Ram
EnterpriseDB: http://www.enterprisedb.com
Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
От
Rural Hunter
Дата:
still doesn't work.<br /> db=# set client_encoding='SQL_ASCII' ;<br /> SET<br /> db=# select to_tsvector(content) from tmp_article;<br/> ERROR: invalid byte sequence for encoding "UTF8": 0xf481<br /><br /><br /> 于 2012/4/15 1:38, raghu ram写道: <blockquote cite="mid:CALnrrJSCuGpJTRx_Vg8efcs+o-t0=PABpcOz5AJ+jkZ=jTNq+g@mail.gmail.com" type="cite"><br /><br /><divclass="gmail_quote">On Sat, Apr 14, 2012 at 9:31 AM, Rural Hunter <span dir="ltr"><<a href="mailto:ruralhunter@gmail.com"moz-do-not-send="true">ruralhunter@gmail.com</a>></span> wrote:<br /><blockquote class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div bgcolor="#FFFFFF" text="#000000">doesn't work either.<br /><br /> db=# show client_encoding;<br /> client_encoding <br /> -----------------<br/> UTF8<br /> (1 row)<br /><br /> db=# set client_encoding='LATIN1';<br /> SET<br /> db=# show client_encoding;<br/> client_encoding <br /> -----------------<br /> LATIN1<br /> (1 row)<br /><br /> db=# select to_tsvector(content)from tmp_article; <div class="im"><br /> ERROR: invalid byte sequence for encoding "UTF8": 0xf481<br/></div></div></blockquote><div><br /></div><div><blockquote type="cite"><div class="gmail_quote"><div>Try to setclient_encoding='SQL_ASCII' </div><div><br /></div><div>and then execute </div></div><div><br /></div> select to_tsvector(content)from tmp_article;<br clear="all" /></blockquote></div><div><br /></div></div> -- <br /><p>Thanks &Regards,<p>Raghu Ram<p>EnterpriseDB: <a href="http://www.enterprisedb.com/" moz-do-not-send="true" target="_blank"><span>http://www.enterprisedb.com</span></a><br/></blockquote><br />
Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
От
"Albe Laurenz"
Дата:
Rural Hunter wrote: > My db is in utf-8, I have a row in my table say tmp_article and I wanted > to generate ts_vector from the article content: > select to_tsvector(content) from tmp_article; > But I got this error: > ERROR: invalid byte sequence for encoding "UTF8": 0xf481 > > I am wondering how this could happen. I think if there was invalid UTF8 > bytes in the content, it shouldn't have been able to inserted into the > tmp_article table as I sometimes see similar errors when inserting > records to tmp_article. Am I right? You are right in theory. A lot depends on your PostgreSQL version, because the efforts to prevent invalid strings from entering the database have led to changes over the versions. Older versions are more permissive. To test the theory that the contents of the table are bad, you can test if the same happens if you SELECT convert_to(content, 'UTF8') FROM tmp_article; Yours, Laurenz Albe
Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
От
Rural Hunter
Дата:
=>SELECT convert_to(content, 'UTF8') FROM tmp_article; This works. My pg is at latest 9.1.3 on ubuntu 10.04 server. We have millions of data in the db but this is the only one we met the problem. The bad data is inserted in recent days and we upgraded to 9.1.3 right after it was released. 于 2012/4/16 16:31, Albe Laurenz 写道: > Rural Hunter wrote: >> My db is in utf-8, I have a row in my table say tmp_article and I > wanted >> to generate ts_vector from the article content: >> select to_tsvector(content) from tmp_article; >> But I got this error: >> ERROR: invalid byte sequence for encoding "UTF8": 0xf481 >> >> I am wondering how this could happen. I think if there was invalid > UTF8 >> bytes in the content, it shouldn't have been able to inserted into the >> tmp_article table as I sometimes see similar errors when inserting >> records to tmp_article. Am I right? > You are right in theory. A lot depends on your PostgreSQL version, > because > the efforts to prevent invalid strings from entering the database have > led to changes over the versions. Older versions are more permissive. > > To test the theory that the contents of the table are bad, you can > test if the same happens if you > > SELECT convert_to(content, 'UTF8') FROM tmp_article; > > Yours, > Laurenz Albe >
Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
От
"Albe Laurenz"
Дата:
Please don't top post. Rural Hunter wrote: >>> My db is in utf-8, I have a row in my table say tmp_article and I wanted >>> to generate ts_vector from the article content: >>> select to_tsvector(content) from tmp_article; >>> But I got this error: >>> ERROR: invalid byte sequence for encoding "UTF8": 0xf481 >>> >>> I am wondering how this could happen. I think if there was invalid UTF8 >>> bytes in the content, it shouldn't have been able to inserted into the >>> tmp_article table as I sometimes see similar errors when inserting >>> records to tmp_article. Am I right? >> You are right in theory. A lot depends on your PostgreSQL version, >> because >> the efforts to prevent invalid strings from entering the database have >> led to changes over the versions. Older versions are more permissive. >> >> To test the theory that the contents of the table are bad, you can >> test if the same happens if you >> >> SELECT convert_to(content, 'UTF8') FROM tmp_article; > =>SELECT convert_to(content, 'UTF8') FROM tmp_article; > This works. My pg is at latest 9.1.3 on ubuntu 10.04 server. We have > millions of data in the db but this is the only one we met the problem. > The bad data is inserted in recent days and we upgraded to 9.1.3 right > after it was released. Hmm, that is strange. Could you find the row that causes the problem and post the result of "CAST (content AS bytea)" for this row? Do you use any nonstandard text search parsers or dictionaries? What is the text search configuration you use (parameter default_text_search_config)? Yours, Laurenz Albe
Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
От
Rural Hunter
Дата:
Hi Albe, Yes, I'm using a Chinese tsearch2 plugin called nlpbamboo and the default default_text_search_config is 'chinesecfg'. I tested "select to_tsvector('english',content) from tmp_article" and this works. So it's the problem of nlpbamboo? The result of 'CAST (content AS bytea)' is shown below. btw, what's the meaning of "Please don't top post."? I don't understand but sorry if I caused any problem. \xe6a0b8e5bf83e68f90e7a4baefbc9ae69caae69da5e4b889e5b9b4e5b086e698afe694bfe5ba9ce99c80e6b182e9878ae694beefbc8ce8a18ce4b89ae9ab98e9809fe68890e995bfe79a84e697b6e69c9fefbc8ce585ac e58fb8e79baee5898de582a8e5a487e9a1b9e79baee4bc97e5a49aefbc8ce4b89ae7bba9e5ad98e59ca8e59091e4b88ae8b685e9a284e69c9fe79a84e58fafe883bdefbc8ce5bd93e5898de79a84e4bcb0e580bce6b0b4e5b 9b3e4b88de9ab98efbc8ce68c89e785a73230313220e5b9b43235efbd9e333020e5808de79a84502f45efbc8ce585ace58fb8e4bcb0e580bce58cbae997b4e59ca832322e3439efbd9e32362e393920e58583e58cbae997b4 efbc8ce8be83e79baee5898de882a1e4bbb7e7baa6e69c893425efbd9e323525e79a84e4b88ae58d87e7a9bae997b4efbc8ce7bbb4e68c81e2809ce68ea8e88d90e2809de79a84e68a95e8b584e8af84e7baa7e380820ae8b 083e7a094e58685e5aeb90ae8bf91e697a5efbc8ce58f82e58aa0e4ba86e585ace58fb8e7bb84e7bb87e79a84e68a95e8b584e88085e68ea5e5be85e6b4bbe58aa8efbc8ce4b88ee585ace58fb8e9ab98e7aea1e8bf9be8a1 8ce4ba86e4baa4e6b581e380820ae4b8bbe8a681e8a782e782b90ae2809ce58d81e4ba8cc2b7e4ba94e2809de69c9fe997b4e694bfe5ba9ce99c80e6b182e6988ee698bee5a29ee58aa0efbc8ce585ace58fb8e58aa0e9809 fe8b791e9a9ace59c88e59cb0e380820ae7bb8fe8bf87e58c97e4baace38081e4b88ae6b5b7e38081e69dade5b79ee38081e5ae81e6b3a2e7ad89e58588e8bf9be59f8ee5b882e794b5e5ad90e694bfe58aa1e5bbbae8aebe e79a84e7a4bae88c83e4bd9ce794a8efbc8ce59084e59cb0e694bfe5ba9ce68488e58f91e58585e58886e8aea4e8af86e588b0e695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe5afb9e694bfe5ba9ce59 f8ee5b882e7aea1e79086e6b0b4e5b9b3e68f90e58d87e68980e8b5b7e588b0e79a84e9878de5a4a7e4bd9ce794a8efbc8ce8808ce99a8fe79d80e68890e58a9fe6a188e4be8be79a84e4b88de696ade7a7afe7b4afefbc8c e694bfe5ba9ce887aae8baabe5928ce585ace58fb8e5afb9e99c80e6b182e79a84e79086e8a7a3e4b99fe983bde69bb4e58aa0e6b7b1e585a5e38082e7bb8fe8bf873230313020e5b9b4e5928c3230313120e5b9b4e79a84e 694bfe5ba9ce68da2e5b18ae4b88ee8a784e58892e588b6e5ae9aefbc8c32303132efbd9e3230313420e5b9b4e698afe59cb0e696b9e694bfe5ba9ce2809ce58d81e4ba8cc2b7e4ba94e2809de5bbbae8aebee585b7e4bd93 e5ae9ee696bde5b9b4efbc8ce79baee5898de59bbde58685e7baa63236303020e4b8aae58ebfe7baa7e68896e4bba5e4b88ae79a84e5b882efbc88e58ebfefbc89e4b8adefbc8ce5b7b2e7bb8fe5bbbae8aebee4ba86e695b 0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe79a84e4bb8533303020e69da5e5aeb6efbc8ce794b1e4ba8ee59cb0e696b9e694bfe5ba9ce59ca8e5bbbae8aebee696b9e99da2e79a84e68a95e585a5e585b7 e69c89e78886e58f91e5bc8fe79a84e789b9e782b9efbc8ce88083e89991e588b0e585ace58fb83230313220e5b9b4e4b880e5ada3e5baa6e4b89ae7bba9e79a84e5a4a7e5b985e5a29ee995bfefbc8ce79bb8e4bfa1e4bb8 ee4bb8ae5b9b4e5bc80e5a78be79a843320e5b9b4efbc8ce5b086e698afe695b4e4b8aae8a18ce4b89ae5bfabe9809fe58f91e5b195e79a84e697b6e69c9fe38082e585ace58fb8e698afe8a18ce4b89ae4b8ade79a84e9be 99e5a4b4e4bc81e4b89aefbc8ce5b882e59cbae58da0e69c89e78e87e59ca8363025e4bba5e4b88ae38082e8808ce4b894e585ace58fb8e59ca8e59084e4bba3e8a1a8e680a7e59f8ee5b882e79a84e9a1b9e79baee585b7e 69c89e6a0b7e69dbfe69588e5ba94efbc8ce58fafe4bba5e4b8bae585ace58fb8e5b8a6e69da5e69bb4e5a49ae79a84e694bfe5ba9ce5aea2e688b7efbc8ce5b9b6e4b894e5ae9ee78eb0e4ba86e99c80e6b182e79086e8a7 a32de99c80e6b182e5ae9ee78eb02de99c80e6b182e5bc95e5afbce79a84e889afe680a7e5beaae78eafe38082e59ba0e6ada4efbc8ce8aea4e4b8bae59ca8e69c80e8bf91e79a84e4b880e4b8a4e5b9b4efbc8ce585ace58 fb8e5b086e4bba5e9ab98e4ba8ee8a18ce4b89ae6b0b4e5b9b3e79a84e5a29ee9809fe8bf85e78c9be58f91e5b195efbc8ce4b99fe58db3e8bf9be585a5e8b791e9a9ace59c88e59cb0efbc8ce5b7a9e59bbae5b882e59cba e58da0e69c89e78e87e79a84e697b6e69c9fe38082e6ada4e5908eefbc8ce99a8fe79d80e694bfe5ba9ce5908ee7bbade68a95e585a5e58fafe883bde587bae78eb0e79a84e591a8e69c9fe680a7e6b3a2e58aa8efbc8ce58 5ace58fb8e9809ae8bf87e59084e9a1b9e4b89ae58aa1e7bb93e69e84e79a84e8b083e88a82e5ae9ee78eb0e5b9b3e7a8b3e8be83e5bfabe9809fe79a84e5a29ee995bfefbc8ce59ca8e8bf99e4b880e998b6e6aeb5efbc8c e9a284e8aea1e5a29ee9809fe887b3e5b091e4b99fe59ca8323025efbd9e333025e58cbae997b4e38082e5a29ee58aa0e7b3bbe7bb9fe99b86e68890e4b89ae58aa1e9878fefbc8ce689a9e5a4a7e694b6e585a5e8a784e6a 8a1e5b9b6e58aa0e5bcbae5aea2e688b7e7b298e680a7e7b297e795a5e4bcb0e7ae97efbc8ce79bb4e8be96e5b882e38081e79c81e4bc9ae7baa7e59f8ee5b882e38081e59cb0e7baa7e5b882efbc88e4b88de590abe79c81 e4bc9ae59f8ee5b882efbc89e38081e5b882e8be96e58cbae38081e58ebfe7baa7e5b882efbc88e58ebfefbc89e695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe79a84e5bbbae8aebee8b4b9e794a8e7b aa6e4b8ba3530303020e4b887e58583e380813330303020e4b887e58583e380813235303020e4b887e58583e380813130303020e4b887e58583e5928c38303020e4b887e58583e38082e68c89e785a7e79baee5898de68891 e59bbde59f8ee5b882e695b0e9878fe8aea1e7ae97efbc8ce4b88de7ae97e5908ee69c9fe58d87e7baa7e7bbb4e68aa4e79a84e8b4b9e794a8efbc8ce695b4e4b8aae5b882e59cbae8a784e6a8a1e59ca832353020e4babfe 4bba5e4b88ae38082e9a284e8aea1e887b33230313520e5b9b4e5ba95efbc8ce695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe5b086e887b3e5b091e8a686e79b96e585a8e983a8e59cb0e7baa7e4bba5 e4b88ae59f8ee5b882e38081333025e79a84e5b882e8be96e58cbae5928c313025e79a84e58ebfe7baa7e5b88228e58ebf29e38082e784b6e8808ce79bb8e5afb9e4ba8ee5a682e6ada4e5a4a7e79a84e5b882e59cbaefbc8 ce585ace58fb8e8bdafe4bbb6e4baa7e59381e79a84e4bbbde9a29de4b990e8a782e4bcb0e8aea1e4bb85e59ca8333025e79a84e6af94e4be8be38082e794b1e4ba8ee694bfe5ba9ce8aea2e58d95e69bb4e5a49ae580bee5 9091e4ba8ee4bba5e7a1ace4bbb6e98787e8b4ade4b8bae789b9e782b9e79a84e7b3bbe7bb9fe99b86e68890efbc8ce585ace58fb8e58f82e4b88ee7b3bbe7bb9fe99b86e68890e4b89ae58aa1e4b880e696b9e99da2e58fa fe4bba5e9809ae8bf87e68f90e4be9be695b4e4bd93e8a7a3e586b3e696b9e6a188e68f90e58d87e7ab9ee4ba89e4bc98e58abfefbc8ce5a29ee58aa0e4b8ade6a087e6a682e78e87efbc8ce4bb8ee8808ce689a9e5a4a7e6 94b6e585a5e8a784e6a8a1efbc8ce58fa6e4b880e696b9e99da2e4b99fe58fafe4bba5e5a29ee58aa0e5aea2e688b7e7b298e680a7efbc8ce7a8b3e5ae9ae69caae69da5e79a84e5a29ee995bfe38082e59ba0e6ada4efbc8 ce9a284e8aea1e585ace58fb8e69caae69da5e4bc9ae58aa0e5bcbae7b3bbe7bb9fe99b86e68890e4b89ae58aa1efbc8ce69bb4e58aa0e58585e58886e79a84e4baabe58f97e5b882e59cbae79a84e5a29ee995bfe380820a e695b0e68daee699aee69fa5e4b88ee4bfa1e681afe98787e99b86e58db3e5b086e78886e58f91efbc8ce69caae69da5e5b086e68890e9878de8a681e588a9e6b6a6e69da5e6ba90e4b98be4b880e38082e887aa323031302 0e5b9b4e5bc80e5a78befbc8ce585ace58fb8e5bc80e5a78be4b8bae694bfe5ba9ce79bb8e585b3e983a8e997a8e68f90e4be9be695b0e5ad97e58c96e59f8ee5b882e7aea1e79086e7b3bbe7bb9fe5bbbae8aebee5928ce8 bf90e890a5e4b8ade79a84e79bb8e585b3e695b0e68daee699aee69fa5e5928ce4bfa1e681afe98787e99b86e69c8de58aa1efbc8ce5b9b6e4b8bae6ada4e68a95e585a53430303020e4bd99e4b887e58583e7949fe4baa73 13020e58fb0e59fbae4ba8e4c6164796275673320e79a84e8bda6e8bdbde6bf80e58589e689abe68f8fe7b3bbe7bb9fe5928c3320e58fb0e4bebfe690bae5bc8fe6bf80e58589e689abe68f8fe7b3bbe7bb9fe38082323031 3120e5b9b4e5b7b2e5ae8ce688903220e58fb0e8bda6efbc8ce9a284e8aea13230313220e5b9b4e4b88ae58d8ae5b9b4e69c8934efbd9e3520e58fb0e8bda6e58fafe68a95e585a53320e7bbb4e5ae9ee699afe6b58be9878 fe79a84e5ba94e794a8e38082e5afb9e4ba8ee8bf99e4b880e696b0e79a84e5b882e59cbaefbc8ce6ada6e6b189e7ab8be5be97e7a9bae997b4e4bd9ce4b8bae58588e8a18ce88085efbc8ce5b882e59cbae58da0e69c89e7 8e87e8bebe363025e5b7a6e58fb3efbc8ce585ace58fb8e58899e4b88de588b0333025efbc8ce4bd86e585ace58fb8e794b1e4ba8ee58fafe68f90e4be9be59fbae4ba8ee695b0e5ad97e59f8ee5b882e7aea1e79086e7b3b be7bb9fe79a84e4b880e4bd93e58c96e8a7a3e586b3e696b9e6a188efbc8ce4b894e59ca8e6b58be9878fe7b2bee5baa6e4b88ae79bb8e6af94e5afb9e6898be585b7e69c89e7ab9ee4ba89e4bc98e58abfefbc8ce9a284e8 aea1e8b685e8b68ae5afb9e6898be68890e4b8bae9be99e5a4b4e4bc81e4b89ae5b086e698afe5be88e5bfabe4bc9ae58f91e7949fe79a84e4ba8be68385e38082e9a284e8aea13230313220e5b9b4e8afa5e4b89ae58aa1e 8bf9be585a5e78886e58f91e69c9fefbc8ce4b894e69caae69da533efbd9e3520e5b9b4e58fafe4bf9de68c81353025e5b7a6e58fb3e79a84e5b9b4e59d87e5a48de59088e5a29ee995bfe380820ae4babae58a9be68890e6 9cace4b88ae58d87e58e8be58a9be69c89e68980e7bc93e8a7a3efbc8ce5a4a7e9a29de9a1b9e79baee6918ae4bd8ee7a094e58f91e68890e69cace38082e794b1e4ba8ee8bf91e4b8a4e5b9b4e9809ae88380e78e87e9ab9 8e4bc81efbc8ce4bba5e58f8ae7a7bbe58aa8e4ba92e88194e7bd91e7ad89e696b0e585b4495420e7bb86e58886e8a18ce4b89ae79a84e585b4e8b5b7efbc8ce5afb9e4bca0e7bb9fe8bdafe4bbb6e4bc81e4b89ae980a0e6 8890e4babae58a9be68890e69cace4b88ae6b6a8e79a84e5b7a8e5a4a7e58e8be58a9befbc8ce585ace58fb8e4babae58a9be68890e69cace4b99fe5a4a7e5b985e4b88ae58d87efbc8ce4bcb0e7ae97e585b6e5a29ee5b98 5e887b3e5b091e59ca8333025e4bba5e4b88ae38082e8808ce99a8fe79d80e5ae8fe8a782e7bb8fe6b58ee5bda2e58abfe79a84e58f98e58c96efbc8ce4bba5e58f8ae585ace58fb8e6ada6e6b189e7a094e58f91e4b8ade5 bf83e79a84e5bbbae8aebee5ae8ce68890efbc8ce9a284e8aea13230313220e5b9b4e4babae58a9be68890e69cace4b88ae58d87e79a84e58e8be58a9be5b086e5be97e588b0e8be83e5a4a7e7bc93e8a7a3e38082e8808ce 794b1e4ba8ee585ace58fb8e8bdafe4bbb6e7b3bbe7bb9fe79a84e4baa7e59381e58c96e7a88be5baa6e8be83e9ab98efbc8ce7a094e58f91e4b88ee5ae9ee696bde79a84e880a6e59088e5baa6e8be83e69dbeefbc8ce99a 8fe79d80e59088e5908ce9a29de79a84e4b88ae58d87efbc8ce8bf99e4b880e4bc98e58abfe5b086e68488e58f91e6988ee698bee79a84e6918ae4bd8ee7a094e58f91e68890e69cace38082e5b0bde7aea1e7b3bbe7bb9fe 99b86e68890e9a1b9e79baee5b086e69c89e68980e5a29ee58aa0efbc8ce4bd86e7bbbce59088e4b88ae8bfb0e4b8a4e696b9e99da2e59ba0e7b4a0efbc8ce4bf9de5ae88e4bcb0e8aea1efbc8ce585ace58fb8e6af9be588 a9e78e87e6b0b4e5b9b3e887b3e5b091e58fafe4bba5e7bb93e69d9fe4b8a4e5b9b4e69da5e79a84e4b88be9998de8b68be58abfe8808ce7a8b3e5ae9ae59ca8363025e4bba5e4b88ae380820af48186bff483bfb2f48393b ff4828dbe203ff484998ff4848ebbf4839583f480aabcf481adaff4838686f4819997203ff4838d90f482b2b4f4848d8ff484b692f48196a1f480abbd0ae69cace69687e794b1e882a1e7a5a8e68a95e8b584e5b7a5e585b7 3ae9b9b0e79cbce4b8aae882a1e699bae883bde8b79fe8b8aae7b3bbe7bb9fe68f90e4be9befbc8ce6aca2e8bf8ee8bdace8bdbdefbc8ce8afb7e6b3a8e6988ee587bae5a484e380820ae79bb8e585b3e696b0e997bbefbc9 a0ae8b4a3e4bbbbe7bc96e8be91efbc9ae5bb96e6b3bde5878c 于 2012/4/16 21:34, Albe Laurenz 写道: > Please don't top post. > > Rural Hunter wrote: >>>> My db is in utf-8, I have a row in my table say tmp_article and I wanted >>>> to generate ts_vector from the article content: >>>> select to_tsvector(content) from tmp_article; >>>> But I got this error: >>>> ERROR: invalid byte sequence for encoding "UTF8": 0xf481 >>>> >>>> I am wondering how this could happen. I think if there was invalid UTF8 >>>> bytes in the content, it shouldn't have been able to inserted into the >>>> tmp_article table as I sometimes see similar errors when inserting >>>> records to tmp_article. Am I right? >>> You are right in theory. A lot depends on your PostgreSQL version, >>> because >>> the efforts to prevent invalid strings from entering the database have >>> led to changes over the versions. Older versions are more permissive. >>> >>> To test the theory that the contents of the table are bad, you can >>> test if the same happens if you >>> >>> SELECT convert_to(content, 'UTF8') FROM tmp_article; >> =>SELECT convert_to(content, 'UTF8') FROM tmp_article; >> This works. My pg is at latest 9.1.3 on ubuntu 10.04 server. We have >> millions of data in the db but this is the only one we met the problem. >> The bad data is inserted in recent days and we upgraded to 9.1.3 right >> after it was released. > Hmm, that is strange. > Could you find the row that causes the problem and post the > result of "CAST (content AS bytea)" for this row? > > Do you use any nonstandard text search parsers or dictionaries? > What is the text search configuration you use (parameter > default_text_search_config)? > > Yours, > Laurenz Albe
Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
От
"Albe Laurenz"
Дата:
Rural Hunter wrote: >>>>> My db is in utf-8, I have a row in my table say tmp_article and I wanted >>>>> to generate ts_vector from the article content: >>>>> select to_tsvector(content) from tmp_article; >>>>> But I got this error: >>>>> ERROR: invalid byte sequence for encoding "UTF8": 0xf481 >> Do you use any nonstandard text search parsers or dictionaries? >> What is the text search configuration you use (parameter >> default_text_search_config)? > Yes, I'm using a Chinese tsearch2 plugin called nlpbamboo and the > default default_text_search_config is 'chinesecfg'. I tested "select > to_tsvector('english',content) from tmp_article" and this works. So it's > the problem of nlpbamboo? That's probably the problem - it seems to emit something that is not proper UTF-8 sometimes. Do you get the error if you try Chinese settings without nlpbamboo? > btw, what's the meaning of "Please don't top post."? I > don't understand but sorry if I caused any problem. http://en.wikipedia.org/wiki/Posting_style Yours, Laurenz Albe
Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
От
Rural Hunter
Дата:
于 2012/4/17 18:06, Albe Laurenz 写道: > Rural Hunter wrote: > That's probably the problem - it seems to emit something that is > not proper UTF-8 sometimes. > > Do you get the error if you try Chinese settings without nlpbamboo? How can I do this? The Chinese processing is provided by nlpbamboo.
Re: invalid byte sequence for encoding "UTF8": 0xf481 - how could this happen?
От
"Albe Laurenz"
Дата:
Rural Hunter wrote: >> Do you get the error if you try Chinese settings without nlpbamboo? > How can I do this? The Chinese processing is provided by nlpbamboo. Er, sorry, forget what I said. I have no experience with Chinese. I guess you should try to ask the nlpbamboo people. Is there anything about encodings in the documentation? Yours, Laurenz Albe