Re: Proposal: q-gram GIN and GiST indexes
От | Robert Haas |
---|---|
Тема | Re: Proposal: q-gram GIN and GiST indexes |
Дата | |
Msg-id | BANLkTi=dmsdDzBEZhVtNnaY11cuvVQ-h-Q@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: Proposal: q-gram GIN and GiST indexes (Alexander Korotkov <aekorotkov@gmail.com>) |
Ответы |
Re: Proposal: q-gram GIN and GiST indexes
|
Список | pgsql-hackers |
On Tue, Apr 5, 2011 at 4:52 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote: > On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov >> <aekorotkov@gmail.com> wrote: >> > relatively small when q <= 5. Accordingly, I think we should expect >> > indexes >> > to be usable with at least with q = 5. >> >> I defer to your opinion on this, since you know more about it than I >> do. But I think it would still be worthwhile to write a quick Perl >> script and calculate the number q-grams in various sample texts for >> various values of q. The worst case is surely exponential in q, so >> it'd be nice to have some evidence of what the real-world behavior is. > > Here is distribution of numbers of different q-grams count in various > datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for > example, lowercased) should have lower counts. > q ds1 ds2 ds3 ds4 > 2 2313 3461 1625 1288 > 3 15146 25094 14090 10728 > 4 58510 105908 69127 47499 > 5 161801 298466 182680 110929 > 6 351175 633750 331090 176336 > 7 613299 1049088 496426 234730 > 8 921962 1450715 657965 283698 > 9 1248339 1793158 802188 321261 > 10 1556838 2066775 926043 348058 > ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes > ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes > ds3 - set of person first and last names, 2142298 bytes > ds4 - english dictionary, 931708 bytes > Sure, q-grams count grows with q increasing. At low q we can see > approximately exponential grow. At high q grow is slowing and it is > approximately linear. > In the worst case count of q-grams is exponential in q if we think data > volume to be much higher then number of possible q-grams. But with high q > real limitation is total number of q-grams extracted from dataset. In worst > case each extracted q-gram is unique. This means that entries pages number > would be comparable with data pages number. In this case index size with > high q would be few times greater that index size with low q. So with q=5, the index will be approximately 10x larger than with q=3.Maybe that's OK, I'm not sure. But it is a big difference. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
В списке pgsql-hackers по дате отправления: