Re: Google Summer of Code 2008
From | Oleg Bartunov |
---|---|
Subject | Re: Google Summer of Code 2008 |
Date | |
Msg-id | Pine.LNX.4.64.0803082219280.10010@sn.sai.msu.ru |
In response to | Re: Google Summer of Code 2008 (Jan Urbański <j.urbanski@students.mimuw.edu.pl>) |
Responses |
Re: Google Summer of Code 2008
Re: Google Summer of Code 2008 |
List | pgsql-hackers |
On Sat, 8 Mar 2008, Jan Urbański wrote:

> Oleg Bartunov wrote:
>> Jan,
>>
>> the problem is known and well requested. From your proposal it's not
>> clear what the idea is.
>>> Tom Lane wrote:
>>>> Jan Urbański <j.urbanski@students.mimuw.edu.pl>
>>>> writes:
>>>>> 2. Implement better selectivity estimates for FTS.
>
> OK, after reading through some of the code, the idea is to write a custom
> typanalyze function for tsvector columns. It could look inside the tsvectors,
> compute the most commonly appearing lexemes and store that information in
> pg_statistic. Then there should be a custom selectivity function for @@ and
> friends, that would look at the lexemes in pg_statistic, see if the tsquery
> it got matches some/any of them and return a result based on that.

Such a function already exists: it's ts_stat(). The problem with ts_stat() is
its performance, since it sequentially scans ALL tsvectors. It's possible to
write a special function for the tsvector data type, which will be used by
analyze, but I'm not sure sampling is a good approach here. The way we could
improve the performance of gathering stats using ts_stat() is to process only
new documents. It may not be as fast as it looks because of a lot of updates,
so one needs to think more about it.

>
> I have a feeling that in many cases identifying the top 50 to 300 lexemes
> would be enough to talk about text search selectivity with a degree of
> confidence. At least we wouldn't give overly low estimates for queries
> looking for very popular words, which I believe is worse than giving an overly
> high estimate for an obscure query (am I wrong here?).

Unfortunately, selectivity estimation for a query is much more difficult than
just estimating the frequency of an individual word.
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
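
Illustrative note on the exchange above: a minimal Python sketch of the two points under discussion, not PostgreSQL code and not the implementation either author had in mind. It assumes each document is already reduced to a list of lexemes, and every name in it (most_common_lexemes, lexeme_selectivity, and_selectivity, DEFAULT_SEL) is hypothetical. The first function keeps the top-k lexemes with their document frequencies, as the proposed typanalyze function might; the last one shows why a whole query is harder to estimate than a single word, since multiplying per-lexeme frequencies assumes the lexemes occur independently.

```python
from collections import Counter

# Assumed fallback selectivity for lexemes outside the top-k list.
DEFAULT_SEL = 0.001


def most_common_lexemes(sample_docs, k):
    """Map each of the k most frequent lexemes to the fraction of
    sampled documents that contain it."""
    doc_freq = Counter()
    for lexemes in sample_docs:
        doc_freq.update(set(lexemes))  # count a lexeme once per document
    n = len(sample_docs)
    return {lex: count / n for lex, count in doc_freq.most_common(k)}


def lexeme_selectivity(mcl, lexeme, default=DEFAULT_SEL):
    """Estimated fraction of rows matching a single-lexeme query."""
    return mcl.get(lexeme, default)


def and_selectivity(mcl, lexemes, default=DEFAULT_SEL):
    """Naive estimate for 'a & b & ...': the product of per-lexeme
    frequencies, which is only right if lexemes are independent."""
    sel = 1.0
    for lex in lexemes:
        sel *= lexeme_selectivity(mcl, lex, default)
    return sel


def actual_and_fraction(docs, lexemes):
    """True fraction of documents containing all the given lexemes."""
    wanted = set(lexemes)
    return sum(1 for d in docs if wanted <= set(d)) / len(docs)


sample = [
    ["new", "york"],
    ["new", "york"],
    ["postgres"],
    ["postgres"],
]
mcl = most_common_lexemes(sample, k=3)
estimated = and_selectivity(mcl, ["new", "york"])
actual = actual_and_fraction(sample, ["new", "york"])
print(estimated, actual)
```

In this toy sample "new" and "york" always appear together, so the independence-based estimate comes out at half the true fraction even though both single-lexeme frequencies are exact.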