BUG #15115: temporary AND unlogged tables for parallel db import
От | PG Bug reporting form |
---|---|
Тема | BUG #15115: temporary AND unlogged tables for parallel db import |
Дата | |
Msg-id | 152120772359.1215.16174625790404871093@wrigleys.postgresql.org обсуждение исходный текст |
Ответы |
Re: BUG #15115: temporary AND unlogged tables for parallel db import
|
Список | pgsql-bugs |
The following bug has been logged on the website: Bug reference: 15115 Logged by: ew Email address: ennowein@googlemail.com PostgreSQL version: 10.3 Operating system: Centos 7 Description: For fresh (large-) database import, we would like to be able to use temporary tables which are also unlogged, here is why: - Source data is organized in files per month (several million entries each) of which we have multiple years - Each source (-month) is independent of all others - Each source month is imported as its own session with multiple preparation operations (4 tables + 2 indexes) - Each session is completely independent of all others including the main db status until the final insertion/commit step - All session data is deleted immediately after insertion into the main db A nice way to implement this is to use gnu parallel to start worker sessions per month-input-file on a parallel machine (or several, actually). For this, we implemented an sql script that uses the fact that temp tables are visible only locally (otherwise, we'd have to name them all specifically). We can tune the system quite well by the number of workers/cpu/RAM/disk performance. Utilizing this parallel capability, we can reduce the import time by an order of magnitude - from several days to a few hours. The "safety" aspect here is irrelevant in this stage as in case of a crash, we'd simply restart the process from scratch. Specifically WAL logging and checkpointing of the temporary data is very cumbersome, slows the process, makes it actually less "reliable" (as it can lead to a quasi-deadlock by causing traffic jams and swapping which in turn practically stops the process). Ironically, we can gain a little bit of optimization by unlogging the main db - which is the only data we are actually interested in (we turn it back on right after import of course). There are some forum notes that we found which claim that "temporary" and "unlogged" share the functionality of disabling WAL but according to what we see that doesn't seem to be the case, the parallel import sessions generate a huge (and process limiting) amount of logging data despite being "temporary". We would gladly share all data with the developers.
В списке pgsql-bugs по дате отправления: