Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words
From | Kyotaro Horiguchi |
---|---|
Subject | Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words |
Date | |
Msg-id | 20220725.113608.1175924917662229386.horikyota.ntt@gmail.com |
In reply to | BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words (PG Bug reporting form <noreply@postgresql.org>) |
Responses | Re: BUG #17556: ts_headline does not correctly find matches when separated by 4,999 words |
List | pgsql-bugs |
At Fri, 22 Jul 2022 14:06:43 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
> The following bug has been logged on the website:
>
> Bug reference: 17556
> Logged by: Alex Malek
> Email address: magicagent@gmail.com
> PostgreSQL version: 14.4
> Operating system: Red Hat
> Description:
>
> Correct results when 4,998 words separate search terms:
>
> # select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4998) || ' labor',
>     $$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<, MaxFragments=100, MaxWords=7, MinWords=3') ;
>      ts_headline
> ---------------------
>  >ipsum< ... >labor<
> (1 row)
>
> Add one more word between terms being searched for, to total 4,999, and
> terms are not found:
>
> # select ts_headline('baz baz baz ipsum ' || repeat(' foo ',4999) || ' labor',
>     $$'ipsum' & 'labor'$$::tsquery, 'StartSel=>, StopSel=<, MaxFragments=100, MaxWords=7, MinWords=3') ;
>  ts_headline
> -------------
>  baz baz baz
> (1 row)

When ts_headline searches the document, it splits the document into segments whose length is limited by a value called max_cover internally, which is not configurable for now [1]. In the latter case above, it is MaxFragments * max(MaxWords * 10, 100) = 100 * max(70, 100) = 10000 "words", where runs of whitespace are also counted as words.

The document has 10007 "words", where 'ipsum' is the 7th word and 'labor' is the 10007th word. The two words do not fall within a single 10000-word segment, so the match is missed. ts_headline instead returns the first MinWords words, as you see.

This is not a bug but designed behavior. However, we might want to document that behavior.

This could be "improved" as noted in [1], but in this specific case I doubt the usefulness of ts_headline picking the match up when the two words are that far apart from each other, in exchange for possible degradation.

[1] For developers, wparser_def.c:2582

> * We might eventually make max_cover a user-settable parameter, but for
> * now, just compute a reasonable value based on max_words and
> * max_fragments.

regards.

-- Kyotaro Horiguchi NTT Open Source Software Center
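
The arithmetic in the message above can be checked with a small Python sketch. It is only a rough model, not the PostgreSQL implementation: the helper names (`max_cover`, `first_position`, `fits_in_cover`) are hypothetical, the tokenizer simply treats each run of whitespace and each run of non-whitespace as one "word" (as the message describes), the guard for `MaxFragments = 0` is an assumption, and the "both terms within one window of max_cover words" test is a simplification of what ts_headline does internally.

```python
import re


def max_cover(max_words: int, max_fragments: int) -> int:
    """Window size per the formula quoted in the message:
    MaxFragments * max(MaxWords * 10, 100)."""
    cover = max(max_words * 10, 100)
    # Assumption: only scale when fragment mode is on (MaxFragments > 0).
    if max_fragments > 0:
        cover *= max_fragments
    return cover


def first_position(text: str, word: str) -> int:
    """1-based position of `word`, counting whitespace runs as words too.
    This is a crude stand-in for the default text search parser."""
    tokens = re.findall(r"\s+|\S+", text)
    return tokens.index(word) + 1


def fits_in_cover(text: str, a: str, b: str, cover: int) -> bool:
    """Simplified model: do both terms fit inside one window of `cover` words?"""
    first, last = sorted((first_position(text, a), first_position(text, b)))
    return last - first + 1 <= cover


cover = max_cover(max_words=7, max_fragments=100)

doc_4998 = "baz baz baz ipsum " + " foo " * 4998 + " labor"
doc_4999 = "baz baz baz ipsum " + " foo " * 4999 + " labor"

for name, doc in (("4998", doc_4998), ("4999", doc_4999)):
    print(name, first_position(doc, "ipsum"), first_position(doc, "labor"),
          fits_in_cover(doc, "ipsum", "labor", cover))
```

Under these assumptions the 4,998-word document keeps both terms inside one 10000-word window, while the 4,999-word document needs 10001 words to span from 'ipsum' to 'labor', which agrees with the positions given in the message.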