Re: Populating large tables with occasional bad values
From | John T. Dow
---|---
Subject | Re: Populating large tables with occasional bad values
Date |
Msg-id | 200806132215.m5DMFRrO015727@web2.nidhog.com
In response to | Populating large tables with occasional bad values ("John T. Dow" <john@johntdow.com>)
List | pgsql-jdbc
I have a solution that I am happy with, yielding nearly a 10-fold improvement in speed.

I tried a quick and dirty experiment with different batch sizes on a small table of 750 rows. Without using batches, the insert took 64 seconds. (Auto commit was true for this and the other tests.) Batch size 10: 13 seconds. Batch size 100: 5 seconds. This code was very simple -- if there was any error in a batch, the entire batch was discarded. This gives an idea of what performance would be like. Of course, with a batch size of 10 and two errors, the number of rows loaded is short by 17 because two entire batches were discarded. With a batch size of 100, it's short by 150. (I also neglected to send the last, partial batch, this being just a quick and dirty experiment.) Anyway, a batch size of 10 yields a performance improvement factor of 5; with a batch size of 100, it's 10.

Craig wrote:

>IMO the best solution with batches is to use batches to store your
>planned operations in a staging table using INSERTs that cannot fail
>under normal circumstances. Then issue a single PL/PgSQL call or a bulk
>UPDATE ... SELECT with an appropriate WHERE clause to apply the updates
>and catch problems.
>
>The approach you describe will work, but it'll be a bit ugly and not
>very fast. If failures are very rare you might find it to be OK, but
>your suggestion of using only 10 statements per batch suggests that
>failures aren't *that* rare.

I need reasonable performance, not the best that money can buy. My clients have tables with tens of thousands of rows, not hundreds of thousands or millions. Also, I want a general-purpose solution that will work with no special knowledge of the table to be loaded -- just the information obtainable from the metadata (such as column names and types). A staging table such as you suggest could be created automatically (copy the columns but change them all to character varying so they load). But then locating the rows with bad values would be a messy bit of SQL, although it could be generated by my code. Still, I'd have to bring over the good rows (casting character varying to integer or whatever) and then remove them, leaving the bad rows behind.

I'll clean up my dirty code, e.g. make sure it sends the last partial batch. I'll also have it iterate to insert the rows in any batch that's rejected. The simplest thing to do is to resend the rows in the rejected batch one by one. A more elaborate approach would be to resend a partial batch up to the point of the problem, skip the known problem row, then send another partial batch with the remaining rows, and keep doing that until every row in the original batch has either been successfully inserted or written to the error log. (Sketches of both pieces follow below.)
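For concreteness, a minimal sketch of the batch loop described above, using plain JDBC. The connection string, table, columns, and Row type are hypothetical stand-ins, and the error handling is the quick-and-dirty version that discards a whole batch on any failure:

    import java.sql.*;
    import java.util.List;

    public class BatchLoadSketch {
        // Hypothetical stand-in for a row read from the input file.
        static class Row {
            final int id; final String name;
            Row(int id, String name) { this.id = id; this.name = name; }
        }

        static final int BATCH_SIZE = 100; // roughly 10x faster than row-at-a-time in the tests above

        public static void main(String[] args) throws SQLException {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/mydb", "user", "secret"); // hypothetical
            conn.setAutoCommit(true); // auto commit was true in all of the timed tests

            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO target_table (id, name) VALUES (?, ?)"); // hypothetical table

            int inBatch = 0;
            for (Row r : readRows()) {
                ps.setInt(1, r.id);
                ps.setString(2, r.name);
                ps.addBatch();
                if (++inBatch == BATCH_SIZE) {
                    sendBatch(ps);
                    inBatch = 0;
                }
            }
            if (inBatch > 0) {
                sendBatch(ps); // don't forget the last, partial batch
            }
            ps.close();
            conn.close();
        }

        // Quick-and-dirty handling: any failure discards the whole batch,
        // which is why the row counts in the experiment came up short.
        static void sendBatch(PreparedStatement ps) throws SQLException {
            try {
                ps.executeBatch();
            } catch (BatchUpdateException e) {
                ps.clearBatch(); // drop anything the driver left queued
                System.err.println("Batch discarded: " + e.getMessage());
            }
        }

        // Dummy data source standing in for the real input file.
        static List<Row> readRows() {
            return List.of(new Row(1, "alpha"), new Row(2, "beta"));
        }
    }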
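And a sketch of the simple recovery step: when a batch is rejected, clear it and resend its rows one by one, so only the genuinely bad rows end up in the error log. This assumes the caller collects the rows of the current batch, and it reuses the hypothetical Row type and prepared statement from the sketch above; with auto commit on, each resent row is its own transaction, so one bad row doesn't poison the rest. The logError() helper is likewise hypothetical:

    // Replacement for sendBatch() above: falls back to row-at-a-time
    // inserts when the batch as a whole is rejected.
    static void sendBatchWithRetry(PreparedStatement ps, List<Row> rowsInBatch)
            throws SQLException {
        try {
            ps.executeBatch();
        } catch (BatchUpdateException e) {
            ps.clearBatch(); // discard whatever the driver left queued
            for (Row r : rowsInBatch) {
                try {
                    ps.setInt(1, r.id);
                    ps.setString(2, r.name);
                    ps.executeUpdate(); // one row at a time
                } catch (SQLException bad) {
                    logError(r, bad); // only the bad row is lost
                }
            }
        }
    }

    // Hypothetical error-log helper.
    static void logError(Row r, SQLException e) {
        System.err.println("Row " + r.id + " rejected: " + e.getMessage());
    }

The more elaborate partial-batch variant could instead inspect BatchUpdateException.getUpdateCounts() to see how far the batch got before failing, skip the offending row, and resend the remainder as a new batch.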
John