Re: Picking 25 samples of every domain
From | Benoit Izac
Subject | Re: Picking 25 samples of every domain
Date |
Msg-id | 87iojrrb5h.fsf@izac.org
In reply to | Picking 25 samples of every domain (Gary Warner <gar@askgar.com>)
List | pgsql-novice
On 09/10/2014 at 23:25, Gary Warner wrote:

> I have a set of Postgres tables that are related to URLs found in
> email. The tables are BIG. 40-50 million records per day. We are
> using them for some research into tricks spammers use to confound
> blacklists. When we parse the URLs, we pull out the "domain" portion of
> each URL and store it in a field called "top_domain". The full URL is
> available as "link".
>
> Through various forms of randomization, customization, and wild-carding,
> a domain may have as many as 1 million URLs per day. I am needing a
> query that would grab a sample number of URLs per domain (let's say 25
> for conversation) . . . something that in pseudo-code might look like
>
> for each top_domain in urltable do
>     select top_domain, link limit 25;

CREATE TYPE urltable_type AS (top_domain text, link text);

CREATE OR REPLACE FUNCTION urltable_sample(integer)
RETURNS SETOF urltable_type AS $$
DECLARE
    td text;
BEGIN
    FOR td IN SELECT DISTINCT top_domain FROM urltable LOOP
        -- quote_literal() safely escapes the domain value, so domains
        -- containing quotes cannot break the dynamic statement
        RETURN QUERY EXECUTE
            'SELECT top_domain::text, link::text FROM urltable'
            || ' WHERE top_domain = ' || quote_literal(td)
            || ' LIMIT ' || $1;
    END LOOP;
    RETURN;
END
$$ LANGUAGE plpgsql;

SELECT top_domain, link FROM urltable_sample(25);

> Thoughts on the fastest way to do a query like that?

No, but I'm not an expert.

--
Benoit Izac
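[A single-pass alternative, sketched here under the same assumed schema (a table `urltable` with `top_domain` and `link` columns): a window function can number the rows within each domain and keep the first 25, avoiding one dynamic query per distinct domain. This is not from the thread, just a common pattern for per-group limits:

```sql
-- Number rows within each top_domain partition, then keep at most 25 per domain.
-- Without an ORDER BY in the OVER clause the 25 rows are an arbitrary sample;
-- add e.g. ORDER BY random() inside OVER (...) for a randomized pick (slower).
SELECT top_domain, link
FROM (
    SELECT top_domain,
           link,
           row_number() OVER (PARTITION BY top_domain) AS rn
    FROM urltable
) AS numbered
WHERE rn <= 25;
```

Whether this beats the loop depends on indexes and the number of distinct domains; at 40-50 million rows per day it is worth comparing both plans with EXPLAIN ANALYZE.]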