Re: Counting the number of repeated phrases in a column
От | benj.dev@laposte.net |
---|---|
Тема | Re: Counting the number of repeated phrases in a column |
Дата | |
Msg-id | 9ae7115c-89c8-1960-6ca5-d899f710aa0b@laposte.net обсуждение исходный текст |
Ответ на | Re: Counting the number of repeated phrases in a column (Merlin Moncure <mmoncure@gmail.com>) |
Ответы |
Re: Counting the number of repeated phrases in a column
|
Список | pgsql-general |
Le 27/01/2022 à 18:35, Merlin Moncure a écrit : > On Thu, Jan 27, 2022 at 11:09 AM Rob Sargent <robjsargent@gmail.com> wrote: >> >> On 1/27/22 10:03, Merlin Moncure wrote: >> >> On Wed, Jan 26, 2022 at 5:23 PM Merlin Moncure <mmoncure@gmail.com> wrote: >> >> with s as (select 'Hello World Hello World' as sentence) >> select >> phrase, >> array_upper(string_to_array((select sentence from s), phrase), 1) - >> 1 as occurrances >> from >> ( >> select array_to_string(x, ' ') as phrase >> from >> ( >> select distinct v[a:b] x >> from regexp_split_to_array((select sentence from s), ' ') v >> cross join lateral generate_series(1, array_upper(v, 1)) a >> cross join lateral generate_series(a + 1, array_upper(v, 1)) b >> ) q >> ) q; >> >> Simplified to: >> select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences >> from regexp_split_to_array('Hello World Hello World', ' ') v >> cross join lateral generate_series(1, array_upper(v, 1)) a >> cross join lateral generate_series(a + 1, array_upper(v, 1)) b >> group by 1; >> >> phrase │ occurances >> ─────────────────────────┼──────────── >> World Hello │ 1 >> Hello World Hello │ 1 >> Hello World │ 2 >> Hello World Hello World │ 1 >> World Hello World │ 1 >> >> merlin >> >> >> And since we're looking for repeated phrases maybe add >> >> having count(*) > 1 > > thanks. also, testing on actual data, I noticed that a couple other > things are mandatory, mainly doing a bit of massaging before > tokenizing: > > select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences > from > ( > select array_agg(t) v > from > ( > select trim(replace(unnest(v), E'\n', '')) t > from regexp_split_to_array(<sentence>, ' ') v > ) q > where length(t) > 1 > ) q > cross join lateral generate_series(1, array_upper(v, 1)) a > cross join lateral generate_series(a + 1, array_upper(v, 1)) b > group by 1 > having count(*) > 1; > > We are definitely in N^2 space here, so look for things to start > breaking down for sentences > 1000 words. > > merlin > (for better complexity) you may search about "Ukkonen suffix tree" Similar problem as yours : https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp
Вложения
В списке pgsql-general по дате отправления: