Re: dot to be considered as a word delimiter?
From | Sushant Sinha |
---|---|
Subject | Re: dot to be considered as a word delimiter? |
Date | |
Msg-id | 9fb559330906021340l1f9f520s57310aa034af3511@mail.gmail.com |
In response to | Re: dot to be considered as a word delimiter? (Kenneth Marshall <ktm@rice.edu>) |
Responses |
Re: dot to be considered as a word delimiter?
Re: dot to be considered as a word delimiter? |
List | pgsql-hackers |
Fair enough. I agree that there is a valid need for returning such tokens as a host. But I think there is also a definite need to break them down into individual words. This will help in cases where a document is missing a space between words.
So what we can do is: return the entire compound word as Host and also break it down into individual words. I can put up a patch for this if you guys agree.
Returning multiple tokens for the same word is a feature of the text search parser as explained in the documentation here:
http://www.postgresql.org/docs/8.3/static/textsearch-parsers.html
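To make the proposal concrete, here is a rough Python sketch of the behaviour I have in mind, not the actual parser change. The `tokenize` function, the token type names, and the simplified host pattern are all assumptions for illustration; the real parser's host rules are more involved.

```python
import re

# Hypothetical token type names, loosely modelled on the parser's
# "host" and "asciiword" aliases seen in ts_debug output.
HOST = "host"
WORD = "asciiword"

# Simplified host pattern (an assumption): alphanumeric labels joined
# by single dots, with at least one dot.
_HOST_RE = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+")
_WORD_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Return (type, token) pairs; a host yields itself plus its parts."""
    tokens = []
    for chunk in text.split():
        if _HOST_RE.fullmatch(chunk):
            # Keep the compound token so host searches still work ...
            tokens.append((HOST, chunk))
            # ... and also emit each dot-separated part as a word.
            tokens.extend((WORD, part) for part in chunk.split("."))
        else:
            tokens.extend((WORD, w) for w in _WORD_RE.findall(chunk))
    return tokens
```

With the input 'Mr.J.Sai Deepak', this sketch yields the host token Mr.J.Sai followed by the words Mr, J, Sai and Deepak, so a search for either the full host or an individual word can match.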
Thanks,
Sushant.
On Tue, Jun 2, 2009 at 8:47 AM, Kenneth Marshall <ktm@rice.edu> wrote:
On Mon, Jun 01, 2009 at 08:22:23PM -0500, Kevin Grittner wrote:
> Sushant Sinha <sushant354@gmail.com> wrote:
>
> > I think that dot should be considered as a word delimiter because
> > when dot is not followed by a space, most of the time it is an error
> > in typing. Besides, there are not many valid English words that have
> > a dot in between.
>
> It's not treating it as an English word, but as a host name.
>
> select ts_debug('english', 'Mr.J.Sai Deepak');
> ts_debug
> ---------------------------------------------------------------------------
> (host,Host,Mr.J.Sai,{simple},simple,{mr.j.sai})
> (blank,"Space symbols"," ",{},,)
> (asciiword,"Word, all ASCII",Deepak,{english_stem},english_stem,{deepak})
> (3 rows)
>
> You could run it through a dictionary which would deal with host
> tokens differently. Just be aware of what you'll be doing to
> www.google.com if you run into it.
>
> I hope this helps.
>
> -Kevin
>
In our uses for full text indexing, it is much more important to be able to find host names and URLs than to find mistyped names.
My two cents.
Cheers,
Ken