Thread: TS: Limited cover density ranking

TS: Limited cover density ranking

From
karavelov@mail.bg
Date:
Hello,

I have developed a variation of the cover density ranking functions that counts only covers that are smaller than a specified limit. It is useful for finding combinations of terms that appear near one another. Here is an example of usage:

-- normal cover density ranking: not changed
luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
 ts_rank_cd
------------
  0.0333333
(1 row)

-- limited to 2
luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
 ts_rank_cd
------------
          0
(1 row)

luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a&d'));
 ts_rank_cd
------------
        0.1
(1 row)

-- limited to 3
luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
 ts_rank_cd
------------
  0.0333333
(1 row)

luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a&d'));
 ts_rank_cd
------------
   0.133333
(1 row)

Please find attached a patch against the 9.1.2 sources. I preferred to make a patch rather than a separate extension because it is only a one-statement change in the calc_rank_cd function. If I had to make an extension, a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension.

I have some questions:

1. Is it interesting to develop it further (documentation, cleanup, etc.) for inclusion in one of the next versions? If so, there are some further questions:

- Should I overload ts_rank_cd (as in the examples above and in the patch), or should I define a new set of functions, for example ts_rank_lcd?
- Should I define these new SQL-level functions in core, or should I keep only the two-line change in calc_rank_cd() and define the new functions as an extension? If we prefer the latter, could I overload core functions with functions defined in extensions?
- Finally, there is always the possibility of duplicating the code and making an independent extension.

2. If I run the patched version on a cluster that was initialized with an unpatched server, is there a way to register the new functions in the system catalog without reinitializing the cluster?

Best regards
luben

--
Luben Karavelov
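
The numbers in the example can be reproduced with a small model. The Python sketch below is neither the patch nor the code in tsrank.c. It assumes single-word lexemes, a uniform lexeme weight of 0.1 (the default for weight class D), no normalization, and a plain AND query. It also assumes that a cover is dropped when the position distance between its first and last term exceeds the limit; that rule is inferred only from the outputs above. The names find_covers and rank_cd are hypothetical.

```python
def find_covers(doc, terms):
    """Return (first_index, last_index) pairs of minimal covers.

    doc is a list of (position, term) pairs sorted by position that
    contains only occurrences of the query terms.
    """
    covers = []
    start = 0
    while start < len(doc):
        # The cover must reach at least the first occurrence of every term.
        last = -1
        for term in terms:
            hits = [i for i in range(start, len(doc)) if doc[i][1] == term]
            if not hits:
                return covers  # some term no longer occurs: no more covers
            last = max(last, hits[0])
        # Shrink from the left: latest occurrence of each term up to `last`.
        first = last
        for term in terms:
            hits = [i for i in range(start, last + 1) if doc[i][1] == term]
            first = min(first, hits[-1])
        covers.append((first, last))
        start = first + 1
    return covers


def rank_cd(text, terms, limit=None, weight=0.1):
    """Cover density rank under the simplifying assumptions stated above."""
    doc = [(pos, tok) for pos, tok in enumerate(text.split(), start=1)
           if tok in terms]
    total = 0.0
    for first, last in find_covers(doc, terms):
        distance = doc[last][0] - doc[first][0]
        if limit is not None and distance > limit:
            continue  # assumed form of the limit check
        # Words inside the cover that are not query terms.
        noise = distance - (last - first)
        # With uniform weights the base score of a cover equals `weight`.
        total += weight / (1 + noise)
    return total


doc1 = 'a b c d e g h i j k'
doc2 = 'a b c d e g h i j k a d'
for limit in (None, 2, 3):
    print(limit,
          round(rank_cd(doc1, ('a', 'd'), limit), 7),
          round(rank_cd(doc2, ('a', 'd'), limit), 7))
```

Under these assumptions the sketch gives 0 and 0.1 for limit 2, and 0.0333333 and 0.1333333 for limit 3, which agrees with the psql output above. In the second document the cover running from 'd' at position 4 to 'a' at position 11 is the one that both limits exclude.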

Re: TS: Limited cover density ranking

From
karavelov@mail.bg
Date:
And here is the patch that I forgot to attach.
> Hello,
>
> I have developed a variation of the cover density ranking functions that counts only covers that are smaller than a specified limit. It is useful for finding combinations of terms that appear near one another. Here is an example of usage:

...

>
> Please find attached a patch against the 9.1.2 sources. I preferred to make a patch rather than a separate extension because it is only a one-statement change in the calc_rank_cd function. If I had to make an extension, a lot of code would be duplicated between backend/utils/adt/tsrank.c and the extension.
>
--
Luben Karavelov
Attachments

Re: TS: Limited cover density ranking

From
Sushant Sinha
Date:
The rank counts 1/coversize. So bigger covers will not have much impact
anyway. What is the need of the patch?

-Sushant.

On Fri, 2012-01-27 at 18:06 +0200, karavelov@mail.bg wrote:
> Hello,
>
> I have developed a variation of cover density ranking functions that
> counts only covers that are smaller than a specified limit. [...]

Re: TS: Limited cover density ranking

From
karavelov@mail.bg
Date:
----- Quote from Sushant Sinha (sushant354@gmail.com), on 27.01.2012 at 18:32 -----

> The rank counts 1/coversize. So bigger covers will not have much impact
> anyway. What is the need of the patch?
>
> -Sushant.
>

If you want to find only combinations of words that are close to one another, with the patch you could use something like:

WITH a AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('a&d') AS query)
SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3,vec,query)>0;

I could not find another way to make this type of query. If there is an alternative, I am open to suggestions.

Best regards
--
Luben Karavelov
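
To make the point about filtering concrete, here is a rough Python sketch. It assumes that each cover contributes weight / (1 + number of non-query words inside it), with a uniform weight of 0.1; this is a simplified model that happens to agree with the outputs earlier in the thread, not the actual calc_rank_cd code. The names cover_contribution and is_close_match are hypothetical, and the limit check is an assumed reading of the patch.

```python
def cover_contribution(distance, n_terms=2, weight=0.1):
    """Score of one cover: weight / (1 + non-query words inside it)."""
    noise = distance - (n_terms - 1)
    return weight / (1 + noise)


def is_close_match(cover_distances, limit=None):
    """Mimic `rank > 0` as a filter over the covers of one document."""
    kept = [d for d in cover_distances if limit is None or d <= limit]
    return sum(cover_contribution(d) for d in kept) > 0


# 'a' and 'd' are 3 positions apart in 'a b c d e g h i j k'.
print(is_close_match([3]))           # True: unlimited rank is positive
print(is_close_match([3], limit=3))  # True: the cover fits the limit
print(is_close_match([3], limit=2))  # False: the only cover is too wide
print(is_close_match([500]))         # True: tiny, but still above zero
```

In this model a wide cover adds little to the unlimited rank but never brings it to zero, so `rank > 0` filters nothing beyond what `@@` already does. With a limit, the rank becomes zero as soon as no cover is narrow enough.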

Re: TS: Limited cover density ranking

From
karavelov@mail.bg
Date:
----- Quote from karavelov@mail.bg, on 27.01.2012 at 18:48 -----

> ----- Quote from Sushant Sinha (sushant354@gmail.com), on 27.01.2012 at 18:32 -----
>
>> The rank counts 1/coversize. So bigger covers will not have much impact
>> anyway. What is the need of the patch?
>>
>> -Sushant.
>>
>
> If you want to find only combinations of words that are close to one another, with the patch you could use something like:
>
> WITH a AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('a&d') AS query)
> SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3,vec,query)>0;
>

Another example: if you want to match 'b c d' only, you could use:

WITH A AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('b&c&d') AS query)
SELECT * FROM A WHERE vec @@ query AND ts_rank_cd(2,vec,query)>0;

The catch is that it will also match 'b d c', 'd c b', 'd b c', 'c d b' and 'c b d', so it is not a
replacement for exact phrase matching, but it is something that I find useful.

--
Luben Karavelov
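
The order-insensitivity described above can be checked with a brute-force Python sketch. It assumes the limit is compared against the distance between the first and last matched positions, which is a guess at the intended semantics and not taken from the patch; within_limit is a hypothetical helper.

```python
from itertools import permutations, product


def within_limit(text, terms, limit):
    """True if every term occurs and some choice of one occurrence per
    term spans at most `limit` positions."""
    tokens = text.split()
    positions = [[i for i, tok in enumerate(tokens, start=1) if tok == term]
                 for term in terms]
    # product() is empty when a term is missing, so any() is then False.
    return any(max(combo) - min(combo) <= limit
               for combo in product(*positions))


terms = ('b', 'c', 'd')
for perm in permutations(terms):
    phrase = ' '.join(perm)
    print(phrase, within_limit(phrase, terms, 2))

print(within_limit('a b c d e g h i j k', terms, 2))  # True
print(within_limit('b x c d', terms, 2))              # False
```

All six orderings of the three terms fit in a span of 2, so a distance limit alone cannot distinguish the phrase 'b c d' from its permutations.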

Re: TS: Limited cover density ranking

From
Oleg Bartunov
Date:
I suggest you work on a more general approach; see
http://www.sai.msu.su/~megera/wiki/2009-08-12 for example.

btw, I don't like that you changed the ts_rank_cd arguments.

Oleg
On Fri, 27 Jan 2012, karavelov@mail.bg wrote:

> Hello,
>
> I have developed a variation of cover density ranking functions that counts only covers that are smaller than a specified limit. [...]
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


Re: TS: Limited cover density ranking

From
karavelov@mail.bg
Date:
----- Quote from Oleg Bartunov (oleg@sai.msu.su), on 28.01.2012 at 21:04 -----

> I suggest you work on a more general approach; see
> http://www.sai.msu.su/~megera/wiki/2009-08-12 for example.
>
> btw, I don't like that you changed the ts_rank_cd arguments.

Hello Oleg,

Thanks for the feedback.

Is it OK to begin by adding an extra argument and a check in calc_rank_cd?

I could change the function names in order not to overload the ts_rank_cd
arguments. My proposal is:

at SQL level:
ts_rank_lcd([weights], tsvector, tsquery, limit, [method])

at C level:
ts_ranklcd_wttlf
ts_ranklcd_wttl
ts_ranklcd_ttlf
ts_ranklcd_ttl

Adding the functions could be done as an extension, but they are just
trampolines into calc_rank_cd().

I agree that what you describe in the wiki page is a more general approach. So this:

SELECT ts_rank_lcd(to_tsvector('a b c'), to_tsquery('a&c'),2)>0;

could be replaced with

SELECT to_tsvector('a b c') @@ to_tsquery('(a ?2 c)|(c ?2 a) ');

but if we need to look for 3 or more nearby terms regardless of order, the tsquery
with the '?' operator will become quite complicated. For example

SELECT tsvec @@
'(a ? b ? c)| (a ? c ? b) | (b ? a ? c) | (b ? c ? a) | (c ? a ? b) | (c ? b ? a)'::tsquery;

is the same as

SELECT ts_rank_lcd(tsvec, 'a&b&c'::tsquery,2)>0;

This is why I think that the general approach does not exclude the
usefulness of the approach that I am proposing.

Best regards

--
Luben Karavelov
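
How quickly the ordered form grows can be seen by generating it. The Python sketch below only builds the query text, using the '?' operator exactly as it is written in this thread (the syntax proposed on the wiki page, not something assumed to exist in a released server). The function name unordered_proximity_query is hypothetical.

```python
from itertools import permutations
from math import factorial


def unordered_proximity_query(terms, op='?'):
    """Build one ordered alternative per permutation of the terms."""
    alternatives = ['(' + f' {op} '.join(p) + ')' for p in permutations(terms)]
    return ' | '.join(alternatives)


print(unordered_proximity_query(['a', 'b', 'c']))
for n in range(2, 7):
    print(n, 'terms ->', factorial(n), 'alternatives')
```

For three terms this yields the six alternatives written out above; the count is n! for n terms, so six terms would already need 720 alternatives, while the limit-based form stays a single AND query.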