Re: Using PostgreSQL to store URLs for a web crawler
| From | Fabio Pardi |
|---|---|
| Subject | Re: Using PostgreSQL to store URLs for a web crawler |
| Date | |
| Msg-id | 3f53e28c-c550-ada4-feb8-5c70daa10d8d@portavita.eu |
| In reply to | Using PostgreSQL to store URLs for a web crawler (Simon Connah <scopensource@gmail.com>) |
| List | pgsql-novice |
Hi Simon,

no question is a stupid question.

Postgres can handle a great deal of data if properly sized and configured.

Additionally, in your case, I would store and index the reverse of the URL, to make indexed searches faster.

regards,

fabio pardi

On 12/28/18 6:15 PM, Simon Connah wrote:
> First of all, apologies if this is a stupid question, but I've never used
> a database for something like this and I'm not sure if what I want is
> possible with PostgreSQL.
>
> I'm writing a simple web crawler in Python. The general task it needs to
> do is first get a list of domain names and URL paths from the database,
> then download the HTML associated with each URL and save it in an object
> store.
>
> Then another process goes through the downloaded HTML, extracts all the
> links on the page and saves them to the database (if the URL does not
> already exist in the database), so the next time the web crawler runs it
> picks up even more URLs to crawl. This process is exponential, so the
> number of URLs saved in the database will grow very quickly.
>
> I'm just a bit concerned that saving so much data to PostgreSQL this
> quickly would cause performance issues.
>
> Is this something PostgreSQL could handle well? I'm not going to be
> running the PostgreSQL server myself. I'll be using Amazon RDS to host it.
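As a rough illustration of the two points above (not part of the thread itself), here is a minimal sketch assuming psycopg2 and a hypothetical `crawl_queue` table with a `url` column. It creates an expression index on `reverse(url)`, which is one reading of Fabio's reverse-URL tip, and inserts newly discovered links with `INSERT ... ON CONFLICT DO NOTHING`, matching the "only if the URL does not already exist" step Simon describes.

```python
# Minimal sketch only; the table/column names and the connection string are
# assumptions made for illustration, not taken from the thread.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS crawl_queue (
    id      bigserial PRIMARY KEY,
    url     text NOT NULL UNIQUE,
    fetched boolean NOT NULL DEFAULT false
);

-- Expression index on the reversed URL text. Crawled URLs share long
-- common prefixes (scheme + host), so reversing puts the characters that
-- differ first, which can make btree comparisons and prefix searches on
-- the reversed value cheaper.
CREATE INDEX IF NOT EXISTS crawl_queue_url_rev_idx
    ON crawl_queue (reverse(url) text_pattern_ops);
"""

def store_links(conn, links):
    """Insert discovered URLs, silently skipping ones that already exist."""
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO crawl_queue (url) VALUES (%s)"
            " ON CONFLICT (url) DO NOTHING",
            [(u,) for u in links],
        )
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=crawler")  # placeholder connection string
    with conn.cursor() as cur:
        cur.execute(DDL)
    conn.commit()
    store_links(conn, ["https://www.postgresql.org/list/pgsql-novice/"])
```

The rationale assumed here is that indexing the reversed text front-loads the part of the URL that actually varies; whether that matches Fabio's exact intent, and whether it helps for a given query pattern, would need to be checked against the real workload.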