Re: Using PostgreSQL to store URLs for a web crawler
| From | Fabio Pardi |
|---|---|
| Subject | Re: Using PostgreSQL to store URLs for a web crawler |
| Date | |
| Msg-id | 3f53e28c-c550-ada4-feb8-5c70daa10d8d@portavita.eu |
| In reply to | Using PostgreSQL to store URLs for a web crawler (Simon Connah <scopensource@gmail.com>) |
| List | pgsql-novice |
Hi Simon,

no question is a stupid question.

Postgres can handle a great deal of data if properly sized and configured.

Additionally, in your case, I would store and index the reverse of the URL, to make indexed searches faster.

regards,

fabio pardi

On 12/28/18 6:15 PM, Simon Connah wrote:
> First of all, apologies if this is a stupid question, but I've never used
> a database for something like this and I'm not sure if what I want is
> possible with PostgreSQL.
>
> I'm writing a simple web crawler in Python. The general task it needs to
> do is first get a list of domain names and URL paths from the database,
> then download the HTML associated with each URL and save it in an object
> store.
>
> Then another process goes through the downloaded HTML, extracts all the
> links on the page and saves them to the database (if the URL does not
> already exist in the database), so the next time the web crawler runs it
> picks up even more URLs to crawl. This process is exponential, so the
> number of URLs saved in the database will grow very quickly.
>
> I'm just a bit concerned that saving so much data to PostgreSQL this
> quickly would cause performance issues.
>
> Is this something PostgreSQL could handle well? I'm not going to be
> running the PostgreSQL server myself. I'll be using Amazon RDS to host it.
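As a rough illustration of the two points above (not part of the thread itself), here is a minimal sketch assuming psycopg2 and a hypothetical `crawl_queue` table with a `url` column. It creates an expression index on `reverse(url)`, which is one reading of Fabio's reverse-URL tip, and inserts newly discovered links with `INSERT ... ON CONFLICT DO NOTHING`, matching the "only if the URL does not already exist" step Simon describes.

```python
# Minimal sketch only; the table/column names and the connection string are
# assumptions made for illustration, not taken from the thread.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS crawl_queue (
    id      bigserial PRIMARY KEY,
    url     text NOT NULL UNIQUE,
    fetched boolean NOT NULL DEFAULT false
);

-- Expression index on the reversed URL text. Crawled URLs share long
-- common prefixes (scheme + host), so reversing puts the characters that
-- differ first, which can make btree comparisons and prefix searches on
-- the reversed value cheaper.
CREATE INDEX IF NOT EXISTS crawl_queue_url_rev_idx
    ON crawl_queue (reverse(url) text_pattern_ops);
"""

def store_links(conn, links):
    """Insert discovered URLs, silently skipping ones that already exist."""
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO crawl_queue (url) VALUES (%s)"
            " ON CONFLICT (url) DO NOTHING",
            [(u,) for u in links],
        )
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=crawler")  # placeholder connection string
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()
    store_links(conn, ["https://www.postgresql.org/list/pgsql-novice/"])
```

The rationale assumed here is that indexing the reversed text front-loads the part of the URL that actually varies; whether that matches Fabio's exact intent, and whether it helps for a given query pattern, would need to be checked against the real workload.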