Re: questions on (parallel) COPY and when to REINDEX
От | Janet Jacobsen |
---|---|
Тема | Re: questions on (parallel) COPY and when to REINDEX |
Дата | |
Msg-id | 4A76508D.6010509@lbl.gov обсуждение исходный текст |
Ответ на | Re: questions on (parallel) COPY and when to REINDEX (Andy Colson <andy@squeakycode.net>) |
Список | pgsql-general |
Hi, Andy. Thanks for the responses and information. Just to let you know... what we are storing in the db are the image attributes - about 40 of them - not the images. So the COPY is reading an ascii file of the image attributes. It turns out to be useful to have the image attributes handy - much better than reading the image headers. The images are available on spinning disk, and the image locations are in the db. Thanks, Janet On 02/08/2009 05:59 p.m., Andy Colson wrote: >>> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote: >>>> My questions are: >>>> (2) Should I REINDEX these two tables daily after the pipeline >>>> completes? Is this what other people do in practice? >>> >>> No need, but as people are querying as soon as data is arriving, an >>> analyse of the table you just copied to should help performance - >>> the query plans will be derived from more accurate table statistics >>> that way. >> >> The files that are being copied into the table(s) are between 200 to >> 1,000 >> each, but there are hundreds of these small files every day. Would you >> recommend running analyze after every COPY? > > Just once, at the end. Assuming you are only deleting 100k records > and re-adding/updating another 100k in a batch. That's not so many > records it'll through the stats out of whack. (If you were > dropping/updating 15M in a batch you might wanna analyze in the middle) > >>> >>> You wrote that your pipeline runs for a period of 4 hours and the >>> table is about 15M rows now. What is taking up all that time? I >>> understand why you'd want to parallelise that process, but where do >>> you expect the gains? >> >> >> We're processing images, and the data processing and analysis takes >> up most of the time, but the images can be processed/analyzed in >> parallel. >> We've been doing all of the data loading at the end - one COPY at a >> time. Originally that made sense because the researchers wanted to >> check the images before loading the data/analysis results into the db. > > Ah! Images! When you are doing the COPY are you escaping the data? > You cant "just" copy a binary file. > > -Andy
В списке pgsql-general по дате отправления: