Re: Picking 25 samples of every domain
From | Benoit Izac
Subject | Re: Picking 25 samples of every domain
Date |
Msg-id | 87iojrrb5h.fsf@izac.org
In reply to | Picking 25 samples of every domain (Gary Warner <gar@askgar.com>)
List | pgsql-novice
On 09/10/2014 at 23:25, Gary Warner wrote:

> I have a set of Postgres tables that are related to URLs found in
> email. The tables are BIG. 40-50 million records per day. We are
> using them for some research into tricks spammers use to confound
> blacklists. When we parse the URLs, we pull out the "domain" portion of
> each URL and store it in a field called "top_domain". The full URL is
> available as "link".
>
> Through various forms of randomization, customization, and wild-carding,
> a domain may have as many as 1 million URLs per day. I am needing a
> query that would grab a sample number of URLs per domain (let's say 25
> for conversation) . . . something that in pseudo-code might look like
>
> for each top_domain in urltable do
>     select top_domain, link limit 25;

CREATE TYPE urltable_type AS (top_domain text, link text);

CREATE OR REPLACE FUNCTION urltable_sample(integer)
RETURNS SETOF urltable_type AS $$
DECLARE
    td text;
BEGIN
    FOR td IN SELECT DISTINCT top_domain FROM urltable LOOP
        -- quote_literal() safely escapes the domain value, so domains
        -- containing quotes cannot break the dynamic statement
        RETURN QUERY EXECUTE
            'SELECT top_domain::text, link::text FROM urltable'
            || ' WHERE top_domain = ' || quote_literal(td)
            || ' LIMIT ' || $1;
    END LOOP;
    RETURN;
END
$$ LANGUAGE plpgsql;

SELECT top_domain, link FROM urltable_sample(25);

> Thoughts on the fastest way to do a query like that?

No, but I'm not an expert.

--
Benoit Izac
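[A single-pass alternative, sketched here under the same assumed schema (a table `urltable` with `top_domain` and `link` columns): a window function can number the rows within each domain and keep the first 25, avoiding one dynamic query per distinct domain. This is not from the thread, just a common pattern for per-group limits:

```sql
-- Number rows within each top_domain partition, then keep at most 25 per domain.
-- Without an ORDER BY in the OVER clause the 25 rows are an arbitrary sample;
-- add e.g. ORDER BY random() inside OVER (...) for a randomized pick (slower).
SELECT top_domain, link
FROM (
    SELECT top_domain,
           link,
           row_number() OVER (PARTITION BY top_domain) AS rn
    FROM urltable
) AS numbered
WHERE rn <= 25;
```

Whether this beats the loop depends on indexes and the number of distinct domains; at 40-50 million rows per day it is worth comparing both plans with EXPLAIN ANALYZE.]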