Re: gsoc, text search selectivity and dllist enhancements
| | |
|---|---|
| From | Jan Urbański |
| Subject | Re: gsoc, text search selectivity and dllist enhancements |
| Date | |
| Msg-id | 48766DC8.1000006@students.mimuw.edu.pl |
| In reply to | Re: gsoc, text search selectivity and dllist enhancements (Tom Lane <tgl@sss.pgh.pa.us>) |
| Responses | Re: gsoc, text search selectivity and dllist enhancements |
| List | pgsql-hackers |
Tom Lane wrote:
> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>> Do you think it's worthwhile to implement the LC algorithm in C and send
>> it out, so others could try it out? Heck, maybe it's worthwhile to
>> replace the current compute_minimal_stats() algorithm with LC and see
>> how that compares?
>
> Very possibly. I repeat that the current implementation of
> compute_minimal_stats is very ad-hoc code and wasn't written with an eye
> to high performance. Replacing it with an algorithm that someone
> actually thought about might well be worth doing.

Here's a patch that combines both patches included here:
http://archives.postgresql.org/message-id/48649261.5040703@students.mimuw.edu.pl
and adds a C implementation of the Lossy Counting algorithm. It defines two
typanalyze functions, ts_typanalyze_std and ts_typanalyze_lc, so you can see
what statistics are gathered by each of them. It's meant for easy applying to
HEAD, updating pg_type and running ANALYZE on a few tables with tsvectors
(i.e. testing, not committing).

My observations are: the LC algorithm beats the previous one by a fairly
large margin (20-30%) timewise. The results are almost identical; I got
discrepancies of about 0.05 for some lexemes' frequencies. I intend to stick
with LC for tsvectors, and that will let me throw away the Dllist changes. If
I want to keep my GSoC schedule I won't be able to implement LC for general
statistics gathering, but it's trivial. If no one gets around to it, I can do
it after the Summer of Code (if only to see how it'll work).

Oh, one important thing. You need to choose a bucket width for the LC
algorithm, that is, decide after how many elements you prune your data
structure. I chose to prune after every twenty tsvectors.
You might consider:
- picking some other arbitrary value
- making it depend on the largest tsvector size
- making it depend on the statistics_target
- pruning after each X lexemes instead of after each Y tsvectors, because
  now the buckets will vary in width and you can argue that the order of
  input makes a difference again. OTOH the situation here is a bit
  different: you get streams of mutually different elements (lexemes
  inside a tsvector are all different) and pruning in the middle of such
  a stream might be unfair for lexemes that are still to be processed.
  Hmm, dunno.

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin