Re: pg_upgrade reflink support on OpenZFS
| From | Marcel Menzel |
|---|---|
| Subject | Re: pg_upgrade reflink support on OpenZFS |
| Date | |
| Msg-id | 5fd60425-db26-4700-b716-5be3762acd33@menzel.de |
| In reply to | Re: pg_upgrade reflink support on OpenZFS (Thomas Munro <thomas.munro@gmail.com>) |
| List | pgsql-general |
On 15/11/2025 05:17, Thomas Munro wrote:
> On Sat, Nov 15, 2025 at 7:16 AM Marcel Menzel <marcel@menzel.de> wrote:
>> For the PostgreSQL upgrade to version 18, I took the opportunity to test
>> the reflink support in pg_upgrade (with --clone) on OpenZFS 2.3.4 /
>> Linux 6.15.11 and it worked flawlessly, being a huge time saver here.
>
> Nice!
>
>> I've looked into the documentation for pg_upgrade and it's only
>> mentioning btrfs and XFS on Linux and not FreeBSD at all, so I thought
>> it'd be an interesting heads-up to report that Linux gained a 3rd FS and
>> also I think FreeBSD in general the ability for doing reflink copies.
>
> It does mention both Linux and FreeBSD under --copy-file-range.  I
> didn't try to list all the relevant file systems there though, partly
> because I didn't feel like documenting all the quirks (only works if
> you created your XFS file system with the feature enabled, might need
> to frobnicate ZFS sysctl, which NFS clients and servers can push it
> down, likewise for non-COW file systems and device drivers, etc etc).
> It might be nice to find a decent reference for all that stuff
> somewhere else and point to it, but I don't think we can maintain that
> accurately ourselves.
>
> I was actually surprised to hear that ioctl(dest_fd, FICLONE, src_fd)
> worked for you.  I knew that it was really BTRFS's ioctl and XFS
> accepted it too, but I didn't know that ZFS also understood it[1] in
> 2.3.  They apparently didn't really expect anyone to call it, and
> since ZFS 2.4 is apparently about to ship without it[2], it seems like
> a bad time to add it to the documentation for --clone.

Oh, I haven't looked at upcoming versions yet, but yeah, in that case it
doesn't make sense to mention this.

>> OpenZFS has been supporting this since 2.2 but has had it disabled due
>> to data corruption bugs, now since 2.3 the sysctl (zfs_bclone_enabled on
>> Linux, vfs.zfs.bclone_enabled on FreeBSD) has been enabled by default so
>> only the zpool feature "block_cloning" has to be enabled, which might be
>> the case when running "zpool upgrade".
>
> Yeah, those data corruption reports (which turned out to be
> misattributed IIRC?) provided one reason to keep the old BTRFS ioctl()
> under --clone but add the new behaviour under --copy-file-range.
> --copy-file-range should work for all COW filesystems on Linux via
> proper VFS entrypoints, and is the official way to do this from user
> space.  Perhaps we should eventually harmonise this under a single
> option and drop the ioctl() stuff.  One semantic change would be that
> copy_file_range() means "copy with your best trick" (could be cloning,
> network/driver pushdown or user space buffer copy, silently selecting
> the behaviour), while the BTRFS ioctl() means "clone or fail" IIRC, so
> that was another reason to want a separate option for now.

I haven't looked closely at the copy_file_range() syscall and how tools
interact with it in detail yet, but I've found this interesting GitHub
comment[3], which gives me a clearer picture now. It's totally
understandable why the OpenZFS developers removed the compat for those
BTRFS ioctls, since they now have a proper replacement. Peeking at the
OpenZFS docs[4][5], they also mention the copy_file_range() syscall
invoking the BRT, so I guess I'll use pg_upgrade with --copy-file-range
the next time.

> For reference, the macOS copyfile() call used for --clone has flags
> that should cause it to fail if it can't clone IIUC, while the Windows
> CopyFile() call used for --copy might even clone blocks on ReFS even
> if you don't specify --clone... huh.
>
>> I haven't had the possibility to check this on FreeBSD yet, but I don't
>> see why this should not work as I also can't spot anything in the
>> OpenZFS docs regarding reflink / block cloning limitations on FreeBSD.
>> Also I saw one of the OpenZFS devs writing on Reddit about block cloning
>> being supported on FreeBSD v14.
>
> It always succeeds on FreeBSD, but it only actually clones if you set
> vfs.zfs.bclone_enabled=1.  I've tested all our "clone" features with
> that and they work nicely.  The sysctl wasn't on by default in FreeBSD
> 14.x, but 15 is about to ship and the "experimental" label was removed
> in man 4 zfs.
>
> If you haven't seen them yet, you might also like these COW tricks:
>
> Shared storage of basic catalog tables when you have a lot of databases:
> SET file_copy_method = CLONE;
> CREATE DATABASE ... STRATEGY=FILE_COPY;
>
> Fast database clone/snapshot of very large databases (caveats: users
> can't be connected to source, checkpoint forced):
> SET file_copy_method = CLONE;
> CREATE DATABASE ... STRATEGY=FILE_COPY TEMPLATE=source_db;
>
> Combine a chain of incremental backups and a full backup to produce a
> new full backup, sharing disk blocks with the ancestor backups:
> pg_combinebackup --copy-file-range
>
> That last one is a really powerful use of copy_file_range()'s subfile
> cloning powers.  Another subfile cloning trick I've proposed before is
> making relation segment size user-controllable, and then allowing
> pg_upgrade to migrate between segment sizes by splicing them together.

Oh, those are really handy commands, especially the last one, yes. Many
thanks for pointing these out!

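To make the "clone or fail" versus "copy with your best trick" distinction from the thread concrete, here is a minimal Python sketch for Linux. It is not how pg_upgrade is implemented. The helper names `clone_or_fail` and `copy_best_effort` are hypothetical, and the numeric `FICLONE` value is an assumption that holds for the common x86 and ARM ioctl encoding.

```python
import fcntl
import os

# Assumption: FICLONE is _IOW(0x94, 9, int), i.e. 0x40049409 on x86/ARM Linux.
# Some architectures encode ioctl numbers differently.
FICLONE = 0x40049409


def clone_or_fail(src_path, dst_path):
    """'Clone or fail': share blocks via the FICLONE ioctl or raise OSError."""
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            # Same argument order as ioctl(dest_fd, FICLONE, src_fd).
            fcntl.ioctl(dst, FICLONE, src)
        finally:
            os.close(dst)
    finally:
        os.close(src)


def copy_best_effort(src_path, dst_path):
    """'Copy with your best trick': the kernel may clone, offload or copy."""
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            use_cfr = hasattr(os, "copy_file_range")
            while True:
                if use_cfr:
                    try:
                        # Offsets are omitted, so both fd positions advance.
                        n = os.copy_file_range(src, dst, 1 << 30)
                    except OSError:
                        # Not supported here: continue with a buffer copy
                        # from wherever the file offsets currently are.
                        use_cfr = False
                        continue
                else:
                    buf = os.read(src, 1 << 20)
                    n = len(buf)
                    done = 0
                    while done < n:
                        done += os.write(dst, buf[done:])
                if n == 0:  # end of source file
                    break
        finally:
            os.close(dst)
    finally:
        os.close(src)
```

Calling `clone_or_fail("a", "b")` on a file system without reflink support raises `OSError` and leaves an empty `b` behind, whereas `copy_best_effort("a", "b")` always produces a full copy and gives no indication of whether any blocks were actually shared.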
> [1] https://github.com/openzfs/zfs/commit/9927f219f1e9f4ee886d426190500abf5b1d602e
> [2] https://github.com/openzfs/zfs/commit/4800181b3b950d67a62aca7c9e28d34c8b303242

[3] https://github.com/openzfs/zfs/pull/13392#issuecomment-1742172842
[4] https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#block_cloning
[5] https://openzfs.github.io/openzfs-docs/man/master/7/zfsconcepts.7.html#Block_cloning