Large files for relations
From | Thomas Munro
---|---
Subject | Large files for relations
Msg-id | CA+hUKG+BGXwMbrvzXAjL8VMGf25y_ga_XnO741g10y0=m6dDiA@mail.gmail.com
List | pgsql-hackers
Big PostgreSQL databases use and regularly open/close huge numbers of file descriptors and directory entries for various anachronistic reasons, one of which is the 1GB RELSEG_SIZE thing. The segment management code is trickier than you might think and also still harbours known bugs. A nearby analysis of yet another obscure segment life cycle bug reminded me of this patch set to switch to simple large files and eventually drop all that.

I originally meant to develop the attached sketch-quality code further and try proposing it in the 16 cycle, while I was down the modernisation rabbit hole[1], but then I got sidetracked: at some point I believed that the 56 bit relfilenode thing might be necessary for correctness, but then I found a set of rules that seem to hold up without that. I figured I might as well post what I have early in the 17 cycle as a "concept" patch to see which way the flames blow.

There are various boring details due to Windows, and then a load of fairly obvious changes, and then a whole can of worms about how we'd handle the transition for the world's fleet of existing databases. I'll cut straight to that part. Different choices on aggressiveness could be made, but here are the straw-man answers I came up with so far:

1. All new relations would be in large format only. No 16384.N files, just 16384, which can grow to MaxBlockNumber * BLCKSZ.

2. The existence of a file 16384.1 means that this smgr relation is in legacy segmented format that came from pg_upgrade (note that we don't unlink that file once it exists, even when truncating the fork, until we eventually drop the relation). A sketch of what that check could look like follows this list.

3. Forks that were pg_upgrade'd from earlier releases using hard links or reflinks would implicitly be in large format if they only had one segment, and otherwise they could stay in the traditional format for a grace period of N major releases, after which we'd plan to drop segment support. pg_upgrade's [ref]link mode would therefore be the only way to get a segmented relation, other than a developer-only trick for testing/debugging.

4. Every opportunity to convert a multi-segment fork to large format would be taken: pg_upgrade in copy mode, basebackup, COPY DATABASE, VACUUM FULL, TRUNCATE, etc. You can see approximately working sketch versions of all the cases I thought of so far in the attached.

5. The main places that do file-level copying of relations would use copy_file_range() to do the splicing, so that on file systems that are smart enough (XFS, ZFS, BTRFS, ...) with qualifying source and destination, the operation can be very fast, and other degrees of optimisation are available to the kernel too, even for file systems without block sharing magic (pushing down block range copies to hardware/network storage, etc). The copy_file_range() stuff could also be proposed independently (I vaguely recall it was discussed a few times before); it's just that it really comes into its own when you start splicing files together, as needed here, and it's also been adopted by FreeBSD with the same interface as Linux and has an efficient implementation in bleeding edge ZFS there.
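To make rule 2 concrete, here is a minimal hypothetical sketch, not code from the attached patches: the helper name and the direct access() probe are my illustrative assumptions (the real thing would go through the usual fd.c machinery), but the rule itself is just "does <path>.1 exist?":

    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: report whether the fork at "path" (e.g.
     * "base/5/16384") is in legacy segmented format.  Per rule 2 the
     * mere existence of "<path>.1" marks the fork as segmented, and
     * that file is never unlinked (even on truncation) until the
     * relation is dropped, so one probe at open time is enough.
     */
    static bool
    fork_is_segmented(const char *path)
    {
        char        segpath[1024];

        snprintf(segpath, sizeof(segpath), "%s.1", path);
        return access(segpath, F_OK) == 0;
    }

The point of the never-unlink detail is that the check stays a single cheap existence test and a fork that shrinks back under 1GB can't silently change format underneath concurrent backends.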
Stepping back, the main ideas are: (1) for some users of large databases, the conversion would happen painlessly at upgrade time without their even really noticing, using modern file system facilities where possible for speed; (2) for anyone who wants to defer that, because of a lack of fast copy_file_range() and a desire to avoid prolonged downtime by using links or reflinks, concatenation can be put off for the next N releases, giving a total of 5 + N years of option to defer the work, and in that case there are also many ways to proactively change to large format before the time comes, with varying degrees of granularity and disruption. For example, set up a new replica and fail over, or VACUUM FULL tables one at a time, etc.

There are plenty of things left to do in this patch set: pg_rewind doesn't understand optional segmentation yet, there are probably more things like that, and I expect there are some ssize_t vs pgoff_t confusions I missed that could bite a 32 bit system. But you can see the basics working on a typical system.

I am not aware of any modern/non-historic filesystem[2] that can't do large files with ease. Anyone know of anything to worry about on that front? I think the main collateral damage would be weird old external tools, like some weird old version of Windows tar I occasionally see mentioned, that sort of thing, but that'd just be another case of "well don't use that then", I guess? What else might we need to think about, outside PostgreSQL?

What other problems might occur inside PostgreSQL? Clearly we'd need to figure out a decent strategy to automate testing of all of the relevant transitions. We could test the splicing code paths with an optional test suite that you might enable along with a small segment size (as we're already testing on CI and probably BF after the last round of segmentation bugs). To test the messy Windows off_t API stuff convincingly, we'd need actual > 4GB files, I think? Maybe doable cheaply with file system hole punching tricks.

Speaking of file system holes, this patch set doesn't touch buffile.c. That code wants to use segments for two extra purposes: (1) parallel CREATE INDEX merges workers' output using segmentation tricks as if there were holes in the file; this could perhaps be replaced with large files that make use of actual OS-level holes, but I didn't feel like additionally claiming that all computers have sparse files -- perhaps another approach is needed anyway; (2) buffile.c deliberately spreads large buffiles around across multiple temporary tablespaces using segments, supposedly for space management reasons. So although it initially looks like a nice safe little place to start using large files, we'd need an answer to those design choices first.

/me dons flameproof suit and goes back to working on LLVM problems for a while

[1] https://wiki.postgresql.org/wiki/AllComputers
[2] https://en.wikipedia.org/wiki/Comparison_of_file_systems
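PS: To make the splicing behind 0008/0010/0011 concrete, here's a minimal sketch assuming Linux/FreeBSD copy_file_range() semantics. The helper name, the open flags, and the lseek-to-end are my illustrative assumptions, not the attached patches' code, which would also need the read/write fallback for EXDEV/ENOSYS/EINTR cases:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: append the segment file at seg_path onto the
     * end of the large file open as dst_fd, letting the kernel share
     * blocks (reflink) or push the copy down to storage where it can.
     * Returns 0 on success, -1 on error (caller falls back or bails).
     */
    static int
    append_segment(int dst_fd, const char *seg_path)
    {
        struct stat st;
        int         src_fd;

        src_fd = open(seg_path, O_RDONLY);
        if (src_fd < 0)
            return -1;
        if (fstat(src_fd, &st) < 0 ||
            lseek(dst_fd, 0, SEEK_END) < 0)
        {
            close(src_fd);
            return -1;
        }
        while (st.st_size > 0)
        {
            /* NULL offsets: use and advance both files' own cursors */
            ssize_t n = copy_file_range(src_fd, NULL, dst_fd, NULL,
                                        (size_t) st.st_size, 0);

            if (n <= 0)
            {
                close(src_fd);
                return -1;      /* e.g. EXDEV: fall back to read/write */
            }
            st.st_size -= n;
        }
        close(src_fd);
        return 0;
    }

Called on a fresh destination for each piece in order (16384, then 16384.1, 16384.2, ...), this turns a legacy segmented fork into one large file; with a block-sharing file system underneath, each call can complete without copying any data at all.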
Attachments
- 0001-Assert-that-pgoff_t-is-wide-enough.patch
- 0002-Use-pgoff_t-in-system-call-replacements-on-Windows.patch
- 0003-Support-large-files-on-Windows-in-our-VFD-API.patch
- 0004-Use-pgoff_t-instead-of-off_t-in-more-places.patch
- 0005-Use-large-files-for-relation-storage.patch
- 0006-Detect-copy_file_range-function.patch
- 0007-Use-copy_file_range-to-implement-copy_file.patch
- 0008-Teach-copy_file-to-concatenate-segmented-files.patch
- 0009-Use-copy_file_range-in-pg_upgrade.patch
- 0010-Teach-pg_upgrade-to-concatenate-segmented-files.patch
- 0011-Teach-basebackup-to-concatenate-segmented-files.patch