Discussion: Write lifetime hints for NVMe
Hi,

From what I see, support for write lifetime hints for NVMe multi-streaming was merged into the Linux kernel some time ago [1]. In theory it allows data written together to be placed together on the media, so it can also be erased together, which minimizes garbage collection and results in reduced write amplification as well as more efficient flash utilization [2]. I couldn't find any discussion about that on hackers, so I decided to experiment with this feature a bit. My idea was to test a quite naive approach: every file descriptor related to temporary files gets `RWH_WRITE_LIFE_SHORT` assigned, and all the rest get `RWH_WRITE_LIFE_EXTREME`. The attached patch is a dead simple POC without any infrastructure around it to enable/disable hints.

It turns out that it's possible to perform benchmarks on some EC2 instance types (e.g. c5) with the corresponding version of the kernel, since they expose a volume as an nvme device:

```
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     vol01cdbc7ec86f17346 Amazon Elastic Block Store               1         0.00 B / 8.59 GB           512 B + 0 B      1.0
```

To get some baseline results I've run several rounds of pgbench on these quite modest instances (dedicated, with optimized EBS), with a slightly adjusted `max_wal_size` and otherwise default configuration:

    $ pgbench -s 200 -i
    $ pgbench -T 600 -c 2 -j 2

Analyzing `strace` output I can see that during this test there was a significant number of operations on pg_stat_tmp and xlogtemp, so I assume write lifetime hints should have some effect.

As a result I've got a latency reduction of about 5-8% (but so far these numbers are unstable, probably because of virtualization).

```
# without patch
number of transactions actually processed: 491945
latency average = 2.439 ms
tps = 819.906323 (including connections establishing)
tps = 819.908755 (excluding connections establishing)
```

```
# with patch
number of transactions actually processed: 521805
latency average = 2.300 ms
tps = 869.665330 (including connections establishing)
tps = 869.668026 (excluding connections establishing)
```

So I have a few questions:

* Does it sound interesting and worthwhile to create a proper patch?
* Maybe someone else has similar results?
* Any suggestions about what the best/worst case scenarios for this kind of hints might be?

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c75b1d9421f80f4143e389d2d50ddfc8a28c8c35
[2]: https://regmedia.co.uk/2016/09/23/0_storage-intelligence-prodoverview-2015-0.pdf
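For readers not familiar with the interface: the hints are set per file descriptor via fcntl(). Below is a minimal sketch of that kernel API under the same split described above (short-lived temporary files vs. everything else). It is only an illustration, not the attached patch; the file names and the helper function are hypothetical, and the fallback constants simply mirror the values in `<linux/fcntl.h>` for systems whose libc headers don't expose them yet.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Values mirror include/uapi/linux/fcntl.h (Linux >= 4.13); define them
 * only if the libc headers don't already expose them. */
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT           1036    /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT    2
#endif
#ifndef RWH_WRITE_LIFE_EXTREME
#define RWH_WRITE_LIFE_EXTREME  5
#endif

/* Tag an already-open file descriptor with an expected write lifetime. */
static int
set_write_life_hint(int fd, uint64_t hint)
{
    /* F_SET_RW_HINT takes a pointer to a uint64_t hint value. */
    if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
    {
        perror("fcntl(F_SET_RW_HINT)");
        return -1;
    }
    return 0;
}

int
main(void)
{
    /* Hypothetical file names, just for the sake of the example. */
    int temp_fd = open("pgsql_tmp_example", O_CREAT | O_WRONLY, 0600);
    int data_fd = open("relation_example", O_CREAT | O_WRONLY, 0600);

    if (temp_fd < 0 || data_fd < 0)
        return 1;

    /* Short-lived data (temporary files) vs. long-lived data (the rest). */
    set_write_life_hint(temp_fd, RWH_WRITE_LIFE_SHORT);
    set_write_life_hint(data_fd, RWH_WRITE_LIFE_EXTREME);

    close(temp_fd);
    close(data_fd);
    return 0;
}
```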
Attachments
On 01/27/2018 02:20 PM, Dmitry Dolgov wrote:
> ...
> ```
> # without patch
> number of transactions actually processed: 491945
> latency average = 2.439 ms
> tps = 819.906323 (including connections establishing)
> tps = 819.908755 (excluding connections establishing)
> ```
>
> ```
> # with patch
> number of transactions actually processed: 521805
> latency average = 2.300 ms
> tps = 869.665330 (including connections establishing)
> tps = 869.668026 (excluding connections establishing)
> ```

Aren't those numbers far lower than you'd expect from NVMe storage? I do have an NVMe drive (Intel 750) in my machine, and I can do thousands of transactions on it with two clients. Seems a bit suspicious.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 27 January 2018 at 16:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Aren't those numbers far lower than you'd expect from NVMe storage? I do
> have an NVMe drive (Intel 750) in my machine, and I can do thousands of
> transactions on it with two clients. Seems a bit suspicious.

Maybe NVMe storage can provide much higher numbers in general, but there are resource limitations from AWS itself. I was using c5.large, which is the smallest possible instance of the c5 type, so maybe that explains the absolute numbers - but I can recheck anyway, just in case I missed something.
On 01/27/2018 08:06 PM, Dmitry Dolgov wrote:
>> On 27 January 2018 at 16:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>> Aren't those numbers far lower than you'd expect from NVMe storage? I do
>> have an NVMe drive (Intel 750) in my machine, and I can do thousands of
>> transactions on it with two clients. Seems a bit suspicious.
>
> Maybe NVMe storage can provide much higher numbers in general, but there are
> resource limitations from AWS itself. I was using c5.large, which is the
> smallest possible instance of the c5 type, so maybe that explains the absolute
> numbers - but I can recheck anyway, just in case I missed something.

According to [1], the C5 instances don't have actual NVMe devices (say, storage in a PCIe slot or connected via M.2) but EBS volumes exposed as NVMe devices. That would certainly explain the low IOPS numbers, as EBS has built-in throttling. I don't know how many of the NVMe features this EBS variant supports.

Amazon actually does provide instance types (f1 and i3) with real NVMe devices. That's what I'd be testing.

I can do some testing on my system with NVMe storage, to see if there really is any change thanks to the patch.

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
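As a quick way to find out whether a given volume/kernel combination at least retains the hints, one could set a hint and read it back with F_GET_RW_HINT. This is only a sketch of that sanity check (not something from the thread); it exercises the kernel-side API only and says nothing about whether the underlying device actually implements NVMe streams. The probe file name is arbitrary.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Fallback definitions mirroring include/uapi/linux/fcntl.h. */
#ifndef F_GET_RW_HINT
#define F_GET_RW_HINT           1035    /* F_LINUX_SPECIFIC_BASE + 11 */
#endif
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT           1036    /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT    2
#endif

int
main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "hint_probe_example";
    uint64_t    hint = RWH_WRITE_LIFE_SHORT;
    uint64_t    readback = 0;
    int         fd = open(path, O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Kernels without the feature (< 4.13) return EINVAL here. */
    if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
    {
        perror("fcntl(F_SET_RW_HINT)");
        return 1;
    }

    if (fcntl(fd, F_GET_RW_HINT, &readback) < 0)
    {
        perror("fcntl(F_GET_RW_HINT)");
        return 1;
    }

    printf("hint set to %llu, read back %llu\n",
           (unsigned long long) hint, (unsigned long long) readback);
    close(fd);
    return 0;
}
```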
> On 27 January 2018 at 23:53, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Amazon actually does provide instance types (f1 and i3) with real NVMe
> devices. That's what I'd be testing.

Yes, indeed, that's a better target for testing, thanks. I'll write back when I get some results.