Re: CREATE DATABASE with filesystem cloning
From | Thomas Munro |
---|---|
Subject | Re: CREATE DATABASE with filesystem cloning |
Date | |
Msg-id | CA+hUKGJycV7PBu_+6RVo=_r-aU81ag-o8JyhcKYm40dPNp5B+g@mail.gmail.com |
In response to | Re: CREATE DATABASE with filesystem cloning (Peter Eisentraut <peter@eisentraut.org>) |
Responses | Re: CREATE DATABASE with filesystem cloning (two replies) |
List | pgsql-hackers |
On Wed, Oct 11, 2023 at 7:40 PM Peter Eisentraut <peter@eisentraut.org> wrote:
> On 07.10.23 07:51, Thomas Munro wrote:
> > Here is an experimental POC of fast/cheap database cloning.
>
> Here are some previous discussions of this:
>
> https://www.postgresql.org/message-id/flat/20131001223108.GG23410%40saarenmaa.fi
> https://www.postgresql.org/message-id/flat/511B5D11.4040507%40socialserve.com
> https://www.postgresql.org/message-id/flat/bc9ca382-b98d-0446-f699-8c5de2307ca7%402ndquadrant.com
>
> (I don't see any clear conclusions in any of these threads, but it might
> be good to check them in any case.)

Thanks. Wow, quite a lot of people have written an experimental patch like this. I would say the things that have changed since those are:

* copy_file_range() became the preferred way to do this on Linux AFAIK (instead of various raw ioctls)
* FreeBSD adopted Linux's copy_file_range()
* OpenZFS 2.2 implemented range-based cloning
* XFS enabled reflink support by default
* Apple invented APFS with cloning
* Several OSes adopted XFS, BTRFS, ZFS, APFS by default
* copy_file_range() went in the direction of not revealing how the copying is done (no flags to force behaviour)

Here's a rebase.

The main thing that is missing is support for redo. It's mostly trivial I think: probably just a record type for "try cloning first", and then teaching that clone function to fall back to the regular copy path if it fails in recovery. Do you agree with that idea? Another approach would be to let it fail if it doesn't work on the replica, so you don't end up using dramatically different amounts of disk space, but that seems terrible because now your replica is broken. So probably fallback with a logged warning (?), though I'm not sure exactly which errnos to give that treatment to.

One thing to highlight about COW file system semantics: PostgreSQL behaves differently when space runs out. When writing relation data, e.g. ZFS sometimes fails like bullet point 2 in this ENOSPC article[1], while XFS usually fails like bullet point 1. A database on XFS that has been cloned in this way might presumably start to fail like bullet point 2, e.g. when checkpointing dirty pages, instead of its usual extension-time-only ENOSPC-rolls-back-your-transaction behaviour.

[1] https://wiki.postgresql.org/wiki/ENOSPC
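
For illustration only, the "try cloning first, then fall back to the regular copy path" idea might look roughly like the sketch below. It is written in Python rather than taken from the patch, the function name `clone_or_copy` is hypothetical, and the set of errnos that trigger the fallback is a guess, since which errnos deserve that treatment is exactly the open question above. Note also that copy_file_range() does not reveal whether it actually cloned blocks or copied bytes, so the return value only says which code path ran.

```python
import errno
import os
import shutil

# Assumption: errnos treated as "cloning unavailable here, copy instead".
# The right set is an open question; this list is only a plausible guess.
FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EINVAL, errno.EOPNOTSUPP}


def clone_or_copy(src_path, dst_path):
    """Copy src_path to dst_path, trying copy_file_range() first.

    Returns "copy_file_range" or "fallback" to report which path ran.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        try:
            if not hasattr(os, "copy_file_range"):
                # Platform or Python build without the call: same as ENOSYS.
                raise OSError(errno.ENOSYS, "copy_file_range not available")
            while remaining > 0:
                # With no explicit offsets, both file positions advance.
                n = os.copy_file_range(src.fileno(), dst.fileno(), remaining)
                if n == 0:
                    break  # source shrank underneath us
                remaining -= n
            return "copy_file_range"
        except OSError as e:
            if e.errno not in FALLBACK_ERRNOS:
                raise  # e.g. ENOSPC or EIO: a real failure, not "unsupported"
            # Discard any partial result and redo it with plain read/write.
            src.seek(0)
            dst.seek(0)
            dst.truncate()
            shutil.copyfileobj(src, dst)
            return "fallback"
```

In a recovery context, the "fallback" branch is where the logged warning would go. Errors outside the chosen set are re-raised, so an ENOSPC on the replica still surfaces rather than being silently retried.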