Discussion: pg_upgrade reflink support on OpenZFS
Hello List,

For the PostgreSQL upgrade to version 18, I took the opportunity to test
the reflink support in pg_upgrade (with --clone) on OpenZFS 2.3.4 /
Linux 6.15.11. It worked flawlessly and was a huge time saver here.

I've looked into the documentation for pg_upgrade and it only mentions
btrfs and XFS on Linux and does not mention FreeBSD at all, so I thought
it'd be an interesting heads-up to report that Linux has gained a 3rd FS
with reflink support and that, I think, FreeBSD in general has gained the
ability to do reflink copies.

OpenZFS has supported this since 2.2 but had it disabled due to data
corruption bugs. Since 2.3, the sysctl (zfs_bclone_enabled on Linux,
vfs.zfs.bclone_enabled on FreeBSD) has been enabled by default, so only
the zpool feature "block_cloning" has to be enabled, which might already
be the case after running "zpool upgrade".

I haven't had the chance to check this on FreeBSD yet, but I don't see
why it should not work, as I also can't spot anything in the OpenZFS docs
regarding reflink / block cloning limitations on FreeBSD. I also saw one
of the OpenZFS devs writing on Reddit about block cloning being supported
on FreeBSD 14.

Regards,
Marcel
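For readers who want to check the tunable mentioned above before trying pg_upgrade --clone, here is a minimal Python sketch. It assumes that on Linux the OpenZFS tunable is exposed as a kernel module parameter under /sys/module/zfs/parameters/; that path and the helper name bclone_enabled are assumptions made for this illustration and are not stated in the message. The pool feature itself can be inspected separately with "zpool get feature@block_cloning" on the pool in question.

```python
from pathlib import Path

# Assumed location of the OpenZFS tunable on Linux (module parameter).
# On FreeBSD the equivalent is the sysctl vfs.zfs.bclone_enabled.
ZFS_BCLONE_PARAM = Path("/sys/module/zfs/parameters/zfs_bclone_enabled")


def bclone_enabled(param_path=ZFS_BCLONE_PARAM):
    """Return True/False for the tunable, or None if it cannot be read
    (for example, the zfs module is not loaded or the path differs)."""
    try:
        return Path(param_path).read_text().strip() != "0"
    except OSError:
        return None


if __name__ == "__main__":
    print("zfs_bclone_enabled:", bclone_enabled())
```

A result of None only means the parameter file could not be read; it says nothing about whether the pool has the block_cloning feature enabled.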
On Sat, Nov 15, 2025 at 7:16 AM Marcel Menzel <marcel@menzel.de> wrote:
> For the PostgreSQL upgrade to version 18, I took the opportunity to test
> the reflink support in pg_upgrade (with --clone) on OpenZFS 2.3.4 /
> Linux 6.15.11 and it worked flawlessly, being a huge time saver here.

Nice!

> I've looked into the documentation for pg_upgrade and it's only
> mentioning btrfs and XFS on Linux and not FreeBSD at all, so I thought
> It'd be an interesting heads-up to report that Linux gained a 3rd FS and
> also I think FreeBSD in general the ability for doing reflink copies.

It does mention both Linux and FreeBSD under --copy-file-range. I
didn't try to list all the relevant file systems there though, partly
because I didn't feel like documenting all the quirks (only works if
you created your XFS file system with the feature enabled, might need
to frobnicate ZFS sysctl, which NFS clients and servers can push it
down, likewise for non-COW file systems and device drivers, etc etc).
It might be nice to find a decent reference for all that stuff
somewhere else and point to it, but I don't think we can maintain that
accurately ourselves.

I was actually surprised to hear that ioctl(dest_fd, FICLONE, src_fd)
worked for you. I knew that it was really BTRFS's ioctl and XFS
accepted it too, but I didn't know that ZFS also understood it[1] in
2.3. They apparently didn't really expect anyone to call it, and
since ZFS 2.4 is apparently about to ship without it[2], it seems like
a bad time to add it to the documentation for --clone.

> OpenZFS has been supporting this since 2.2 but has had it disabled due
> to data corruption bugs, now since 2.3 the sysctl (zfs_bclone_enabled on
> Linux, vfs.zfs.bclone_enabled on FreeBSD) has been enabled by default so
> only the zpool feature "block_cloning" has to be enabled, which might be
> the case when running "zpool upgrade".

Yeah, those data corruption reports (which turned out to be
misattributed IIRC?) provided one reason to keep the old BTRFS ioctl()
under --clone but add the new behaviour under --copy-file-range.
--copy-file-range should work for all COW filesystems on Linux via
proper VFS entrypoints, and is the official way to do this from user
space. Perhaps we should eventually harmonise this under a single
option and drop the ioctl() stuff. One semantic change would be that
copy_file_range() means "copy with your best trick" (could be cloning,
network/driver pushdown or user space buffer copy, silently selecting
the behaviour), while the BTRFS ioctl() means "clone or fail" IIRC, so
that was another reason to want a separate option for now.

For reference, the macOS copyfile() call used for --clone has flags
that should cause it to fail if it can't clone IIUC, while the Windows
CopyFile() call used for --copy might even clone blocks on ReFS even
if you don't specify --clone... huh.

> I haven't had the possibility to check this on FreeBSD yet, but I don't
> see why this should not work as I also can't spot anything in the
> OpenZFS docs regarding reflink / block cloning limitations on FreeBSD.
> Also I saw one of the OpenZFS devs writing on Reddit about block cloning
> being supported on FreeBSD v14.

It always succeeds on FreeBSD, but it only actually clones if you set
vfs.zfs.bclone_enabled=1. I've tested all our "clone" features with
that and they work nicely. The sysctl wasn't on by default in FreeBSD
14.x, but 15 is about to ship and the "experimental" label was removed
in man 4 zfs.

If you haven't seen them yet, you might also like these COW tricks:

Shared storage of basic catalog tables when you have a lot of databases:
  SET file_copy_method = CLONE;
  CREATE DATABASE ... STRATEGY=FILE_COPY;

Fast database clone/snapshot of very large databases (caveats: users
can't be connected to source, checkpoint forced):
  SET file_copy_method = CLONE;
  CREATE DATABASE ... STRATEGY=FILE_COPY TEMPLATE=source_db;

Combine a chain of incremental backups and a full backup to produce a
new full backup, sharing disk blocks with the ancestor backups:
  pg_combinebackup --copy-file-range

That last one is a really powerful use of copy_file_range()'s subfile
cloning powers. Another subfile cloning trick I've proposed before is
making relation segment size user-controllable, and then allowing
pg_upgrade to migrate between segment sizes by splicing them together.

[1] https://github.com/openzfs/zfs/commit/9927f219f1e9f4ee886d426190500abf5b1d602e
[2] https://github.com/openzfs/zfs/commit/4800181b3b950d67a62aca7c9e28d34c8b303242
On 15/11/2025 05:17, Thomas Munro wrote:
> On Sat, Nov 15, 2025 at 7:16 AM Marcel Menzel <marcel@menzel.de> wrote:
>> For the PostgreSQL upgrade to version 18, I took the opportunity to test
>> the reflink support in pg_upgrade (with --clone) on OpenZFS 2.3.4 /
>> Linux 6.15.11 and it worked flawlessly, being a huge time saver here.
>
> Nice!
>
>> I've looked into the documentation for pg_upgrade and it's only
>> mentioning btrfs and XFS on Linux and not FreeBSD at all, so I thought
>> It'd be an interesting heads-up to report that Linux gained a 3rd FS and
>> also I think FreeBSD in general the ability for doing reflink copies.
>
> It does mention both Linux and FreeBSD under --copy-file-range. I
> didn't try to list all the relevant file systems there though, partly
> because I didn't feel like documenting all the quirks (only works if
> you created your XFS file system with the feature enabled, might need
> to frobnicate ZFS sysctl, which NFS clients and servers can push it
> down, likewise for non-COW file systems and device drivers, etc etc).
> It might be nice to find a decent reference for all that stuff
> somewhere else and point to it, but I don't think we can maintain that
> accurately ourselves.
>
> I was actually surprised to hear that ioctl(dest_fd, FICLONE, src_fd)
> worked for you. I knew that it was really BTRFS's ioctl and XFS
> accepted it too, but I didn't know that ZFS also understood it[1] in
> 2.3. They apparently didn't really expect anyone to call it, and
> since ZFS 2.4 is apparently about to ship without it[2], it seems like
> a bad time to add it to the documentation for --clone.

Oh, I haven't looked at the upcoming versions yet, but yeah, it doesn't
make sense to mention this then.

>> OpenZFS has been supporting this since 2.2 but has had it disabled due
>> to data corruption bugs, now since 2.3 the sysctl (zfs_bclone_enabled on
>> Linux, vfs.zfs.bclone_enabled on FreeBSD) has been enabled by default so
>> only the zpool feature "block_cloning" has to be enabled, which might be
>> the case when running "zpool upgrade".
>
> Yeah, those data corruption reports (which turned out to be
> misattributed IIRC?) provided one reason to keep the old BTRFS ioctl()
> under --clone but add the new behaviour under --copy-file-range.
> --copy-file-range should work for all COW filesystems on Linux via
> proper VFS entrypoints, and is the official way to do this from user
> space. Perhaps we should eventually harmonise this under a single
> option and drop the ioctl() stuff. One semantic change would be that
> copy_file_range() means "copy with your best trick" (could be cloning,
> network/driver pushdown or user space buffer copy, silently selecting
> the behaviour), while the BTRFS ioctl() means "clone or fail" IIRC, so
> that was another reason to want a separate option for now.

I haven't looked closely at the copy_file_range() syscall and how tools
interact with it yet, but I've found this interesting GitHub comment[3],
which gives me a clearer picture now. It's totally understandable that
the OpenZFS developers are removing the compat for those BTRFS ioctls,
since they now have a proper replacement. Peeking at the OpenZFS
docs[4][5], they also mention the copy_file_range() syscall invoking the
BRT, so I guess I'll use pg_upgrade with --copy-file-range the next time.

> For reference, the macOS copyfile() call used for --clone has flags
> that should cause it to fail if it can't clone IIUC, while the Windows
> CopyFile() call used for --copy might even clone blocks on ReFS even
> if you don't specify --clone... huh.
>
>> I haven't had the possibility to check this on FreeBSD yet, but I don't
>> see why this should not work as I also can't spot anything in the
>> OpenZFS docs regarding reflink / block cloning limitations on FreeBSD.
>> Also I saw one of the OpenZFS devs writing on Reddit about block cloning
>> being supported on FreeBSD v14.
>
> It always succeeds on FreeBSD, but it only actually clones if you set
> vfs.zfs.bclone_enabled=1. I've tested all our "clone" features with
> that and they work nicely. The sysctl wasn't on by default in FreeBSD
> 14.x, but 15 is about to ship and the "experimental" label was removed
> in man 4 zfs.
>
> If you haven't seen them yet, you might also like these COW tricks:
>
> Shared storage of basic catalog tables when you have a lot of databases:
>   SET file_copy_method = CLONE;
>   CREATE DATABASE ... STRATEGY=FILE_COPY;
>
> Fast database clone/snapshot of very large databases (caveats: users
> can't be connected to source, checkpoint forced):
>   SET file_copy_method = CLONE;
>   CREATE DATABASE ... STRATEGY=FILE_COPY TEMPLATE=source_db;
>
> Combine a chain of incremental backups and a full backup to produce a
> new full backup, sharing disk blocks with the ancestor backups:
>   pg_combinebackup --copy-file-range
>
> That last one is a really powerful use of copy_file_range()'s subfile
> cloning powers. Another subfile cloning trick I've proposed before is
> making relation segment size user-controllable, and then allowing
> pg_upgrade to migrate between segment sizes by splicing them together.

Oh, those are really handy commands, especially the last one, yes. Many
thanks for pointing these out!

> [1] https://github.com/openzfs/zfs/commit/9927f219f1e9f4ee886d426190500abf5b1d602e
> [2] https://github.com/openzfs/zfs/commit/4800181b3b950d67a62aca7c9e28d34c8b303242

[3] https://github.com/openzfs/zfs/pull/13392#issuecomment-1742172842
[4] https://openzfs.github.io/openzfs-docs/man/master/7/zpool-features.7.html#block_cloning
[5] https://openzfs.github.io/openzfs-docs/man/master/7/zfsconcepts.7.html#Block_cloning