Re: parallel pg_restore blocks on heavy random read I/O on all children processes
From: Dimitrios Apostolou
Subject: Re: parallel pg_restore blocks on heavy random read I/O on all children processes
Msg-id: 3764bde7-ca09-a8a3-489c-0166f772245d@gmx.net
In reply to: Re: parallel pg_restore blocks on heavy random read I/O on all children processes (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: parallel pg_restore blocks on heavy random read I/O on all children processes
List: pgsql-performance
On Sun, 23 Mar 2025, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> On Thu, 20 Mar 2025, Tom Lane wrote:
>>> I am betting that the problem is that the dump's TOC (table of
>>> contents) lacks offsets to the actual data of the database objects,
>>> and thus the readers have to reconstruct that information by scanning
>>> the dump file. Normally, pg_dump will back-fill offset data in the
>>> TOC at completion of the dump, but if it's told to write to an
>>> un-seekable output file then it cannot do that.
>
>> Further questions:
>
>> * The seeking pattern in pg_restore seems non-sensical to me: reading 4K,
>>   jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
>>   hour. /Maybe/ something to improve there... Where can I read more about
>>   the format?
>
> It's reading data blocks (or at least the headers thereof), which have
> a limited size. I don't think that size has changed since circa 1999,
> so maybe we could consider increasing it; but I doubt we could move
> the needle very far that way.

If it could be increased substantially, it would cut the number of seeks
by a large factor and definitely make a difference. Would it have a
negative impact on the size of small backups, though?

However, I would have expected the kernel to handle this read-then-seek
pattern better: the 4K read() should trigger a large read-ahead that
covers the next seek and read. I'm not sure why I don't see that
happening; it's a fairly typical setup (NVMe SSD with an ext4 filesystem
on Linux 6.11).

As a workaround, pg_restore could read() the whole archive sequentially
to construct the TOC, instead of jumping around. I'm split on this one:
it would be much faster on the compressed archive (~800GB), but not by
as much on the uncompressed one (>5TB).

I'm also under the impression that all the pg_restore child processes
are doing the same thing: seeking and reading through the whole file,
each for its own sake. Can you verify or disprove that?
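To make the scan pattern concrete, here is a toy sketch of why an
offset-less TOC forces read-a-little/seek-a-lot behavior. It is NOT the
real pg_dump custom-format layout; it just models "length-prefixed data
blocks" so the rebuild loop is visible:

```python
import os
import struct
import tempfile

def write_archive(path, payload_sizes):
    """Write a toy archive: each 'data block' is a 4-byte little-endian
    length header followed by that many payload bytes. This is a
    hypothetical layout for illustration only."""
    with open(path, "wb") as f:
        for size in payload_sizes:
            f.write(struct.pack("<I", size))
            f.write(b"\0" * size)

def scan_offsets(path):
    """Rebuild block offsets by reading each small header and seeking
    past the payload -- the pattern that shows up as heavy random read
    I/O when the TOC lacks offsets and every reader must scan."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            hdr = f.read(4)              # small read of just the header
            if len(hdr) < 4:
                break                    # end of archive
            offsets.append(pos)
            (size,) = struct.unpack("<I", hdr)
            f.seek(size, os.SEEK_CUR)    # skip the payload: one seek per block
    return offsets
```

With millions of blocks, that loop issues one small read plus one seek
per block, which matches the observed "read 4K, jump 8-12K" trace.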
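On the read-ahead point: one way an application can nudge the kernel is
posix_fadvise(2). This is a sketch of the kind of hint that could help
a sequential TOC scan (it is an assumption about a possible mitigation,
not something pg_restore currently does):

```python
import os

def open_sequential(path):
    """Open a file read-only and, where supported (Linux), hint the
    kernel that access will be sequential so it enlarges the
    read-ahead window. POSIX_FADV_RANDOM would do the opposite and
    disable read-ahead entirely."""
    fd = os.open(path, os.O_RDONLY)
    if hasattr(os, "posix_fadvise"):  # not available on all platforms
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
    return fd
```

Whether the default read-ahead heuristics kick in for a
read-4K/seek-12K pattern is exactly the open question above; an
explicit hint sidesteps the heuristics.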
Thank you for the detailed response.

Dimitris