Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores
From | Dan O'Hara |
---|---|
Subject | Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores |
Date | |
Msg-id | 557802370910221254k624306eg81ae6176eb3bd9d4@mail.gmail.com |
In reply to | Re: BUG #5021: ts_parse doesn't recognize email addresses with underscores (Euler Taveira de Oliveira <euler@timbira.com>) |
List | pgsql-bugs |
I agree that it isn't easy to determine whether a given text is a valid email address. As I couldn't use ts_parse, I ended up using a regex, which worked substantially better at pulling out the emails from the text stream. I haven't looked at the code, but perhaps it is possible to do the same thing here? Even a regex that is 99% correct would be better than the current tokenizer, which is only right about 80-85% of the time.

My workaround looked something like this:

select regexp_matches(resumetext,E'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}','gi') as email
from "Resume"

cheers
Dan

On Thu, Oct 22, 2009 at 3:39 PM, Euler Taveira de Oliveira <euler@timbira.com> wrote:
> Robert Haas wrote:
>> I'm not real familiar with ts_parse(), but I'm thinking that it
>> doesn't have any special casing for email addresses and is just
>> intended to parse text for full-text-search - in which case splitting
>> on _ is a pretty good algorithm.
>>
> It is a bug. The tsearch claims to identify types of tokens but it doesn't
> correctly identify any valid e-mail addresses. As Dan stated ts_parse() fails
> to recognize an e-mail address. For example, foo+bar@baz.com is a valid e-mail
> but the function fails to report that.
>
> It is not that simple to identify an e-mail address that agrees with RFC. As
> that code is a state machine, IMHO it decides too early (when it finds _) that
> that string is not an e-mail address. AFAIR, that's not a one-line fix.
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo.bar@baz.com');
>       email
> ─────────────────
>  foo.bar@baz.com
> (1 row)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo+bar@baz.com');
>    email
> ─────────────
>  foo
>  +
>  bar@baz.com
> (3 rows)
>
> euler=# select distinct token as email from ts_parse('default',
> 'foo_bar@baz.com');
>    email
> ─────────────
>  foo
>  bar@baz.com
>  _
> (3 rows)
>
>
> --
>  Euler Taveira de Oliveira
>  http://www.timbira.com/
>

--
-------------------------------------------------------------------
Dan O'Hara
Danara Software Systems, Inc.
danarasoftware@gmail.com
613 288-8733
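
For readers who want to try the regex workaround from the message without a database, here is a minimal sketch in Python. It reuses the same character classes as the SQL pattern. The choice of Python, the function name `extract_emails`, and the sample text are assumptions made for illustration and are not part of the original thread. `re.IGNORECASE` stands in for the `i` flag and `findall` for the `g` flag.

```python
import re

# Same pattern as the SQL workaround in the message above.
# re.IGNORECASE plays the role of the 'i' flag passed to regexp_matches.
EMAIL_RE = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", re.IGNORECASE)


def extract_emails(text):
    """Return every substring of `text` that matches EMAIL_RE, in order.

    Hypothetical helper name; findall plays the role of the 'g' flag.
    """
    return EMAIL_RE.findall(text)


if __name__ == "__main__":
    # Illustrative sample using the three addresses discussed in the thread.
    sample = "Contact foo_bar@baz.com, foo+bar@baz.com or foo.bar@baz.com."
    for address in extract_emails(sample):
        print(address)
```

On the sample line this returns all three addresses whole, including the `_` and `+` forms that ts_parse splits in the quoted psql output. The pattern is only an approximation of a valid address: the `{2,4}` bound, for instance, cuts a longer top-level domain such as `.museum` down to its first four letters.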