Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores
From | Dan O'Hara |
---|---|
Subject | Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores |
Date | |
Msg-id | 557802370910221010k5669e9f0v559213d998e286d3@mail.gmail.com |
In reply to | Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-bugs |
Thanks for having a look at this bug.

According to section 12.8.2 of the postgres manual, ts_parse is supposed to recognize different types of data, one of which (#4) is an email address. The list of recognized data formats for parse can be selected via this query:

SELECT * FROM ts_token_type('default');

The example in the bug I reported is a valid email address, according to the RFC, but isn't recognized as such by the full text search in postgres.

This bug will have a real impact on anybody using ts functions to locate email addresses, as only some of them are found in the query.

Regards
Dan

On Thu, Oct 22, 2009 at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 28, 2009 at 9:59 AM, Dan O'Hara <danarasoftware@gmail.com> wrote:
>>
>> The following bug has been logged online:
>>
>> Bug reference:      5021
>> Logged by:          Dan O'Hara
>> Email address:      danarasoftware@gmail.com
>> PostgreSQL version: 8.3.7
>> Operating system:   win32
>> Description:        ts_parse doesn't recognize email addresses with
>> underscores
>> Details:
>>
>> In the following example,
>>
>> select distinct token as email
>> from ts_parse('default', ' first_last@yahoo.com '   )
>> where tokid = 4
>>
>> ts_parse returns last@yahoo.com rather than first_last@yahoo.com.  It seems
>> that any text prior to the underscore is truncated.  If the portion
>> following the underscore is only numeric, such as this example,
>>
>> select distinct token as email
>> from ts_parse('default', ' bill_2000@yahoo.com '   )
>> where tokid = 4
>>
>> then ts_parse returns nothing at all.
>>
>> section 3.2.3 of RFC 5322 indicates that underscores are valid characters in
>> an email address.
>>
>> http://tools.ietf.org/html/rfc5322
>
> I don't think this has much to do with email addresses.  If you do:
>
> select token from ts_parse('a_b');
>
> ...you get three tokens.  In your case you're pulling out the fourth
> token, but some of your examples don't have four tokens, so then you
> get nothing at all.
>
> I'm not real familiar with ts_parse(), but I'm thinking that it
> doesn't have any special casing for email addresses and is just
> intended to parse text for full-text-search - in which case splitting
> on _ is a pretty good algorithm.
>
> ...Robert
>

-- 
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware@gmail.com
613 288-8733
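
The `tokid = 4` filter in the report selects by token type, not by token position: the `tokid` column returned by ts_parse holds the same type number that ts_token_type lists, which is how #4 corresponds to the email type mentioned above. A minimal sketch, assuming the `default` parser and the sample address from the report, that shows each token next to its type name so the split at the underscore can be inspected directly:

```sql
-- List every token the default parser produces for the sample address,
-- joined to the parser's token-type table so each tokid gets a readable name.
SELECT p.tokid,
       t.alias,        -- short type name from ts_token_type
       t.description,  -- human-readable description of the type
       p.token
FROM ts_parse('default', 'first_last@yahoo.com') AS p
JOIN ts_token_type('default') AS t
  ON t.tokid = p.tokid;
```

The result depends on the server version, so no expected output is shown here.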