Re: Large block sizes support in Linux

From: Pankaj Raghav
Subject: Re: Large block sizes support in Linux
Date:
Msg-id: 08f433f0-28b8-49c7-8b77-dcf4ae7d5047@pankajraghav.com
In response to: Re: Large block sizes support in Linux  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: Re: Large block sizes support in Linux  (Bruce Momjian <bruce@momjian.us>)
List: pgsql-hackers
Hi Tomas and Bruce,

>>> My knowledge of Postgres internals is limited, so I'm wondering if there
>>> are any optimizations or potential optimizations that Postgres could
>>> leverage once we have LBS support on Linux?
>>
>> We have discussed this in the past, and in fact in the early years we
>> thought we didn't need fsync since the BSD file system was 8k at the
>> time.
>>
>> What we later realized is that we have no guarantee that the file system
>> will write to the device in the specified block size, and even if it
>> does, the I/O layers between the OS and the device might not, since many
>> devices use 512 byte blocks or other sizes.
>>
> 
> Right, but things change over time - current storage devices support
> much larger sectors (LBA format), usually 4K. And if you do I/O with
> this size, it's usually atomic.
> 
> AFAIK if you built Postgres with 4K pages, on a device with 4K LBA
> format, that would not need full-page writes - we always do I/O in 4k
> pages, and block layer does I/O (during writeback from page cache) with
> minimum guaranteed size = logical block size. 4K are great for OLTP
> systems in general, it'd be even better if we didn't need to worry about
> torn pages (but the tricky part is to be confident it's safe to disable
> them on a particular system).
> 
> I did watch the talk linked by Pankaj, and IIUC the promise of the LBS
> patches is that this benefit would apply even to larger page sizes
> (= fs page size). Which right now you can't even mount, but
> the patches allow that. So for example it would be possible to create an
> XFS filesystem with 8kB pages, and then we'd read/write 8kB pages as
> usual, and we'd know that the page cache always writes out either the
> whole page or none of it. Which right now is not guaranteed to happen,
> it's possible to e.g. write the page as two 4K requests, even if all
> other things are set properly (drive has 4K logical/physical sectors).
> 
> At least that's my understanding ...
> 
> Pankaj, could you clarify what the guarantees provided by LBS are going
> to be? The talk uses wording like "should be" and "hint" in a couple
> places, and there's also stuff I'm not 100% familiar with.
> 
> If we create a filesystem with 8K blocks, and we only ever do writes
> (and reads) in 8K chunks (our default page size), what guarantees that
> gives us? What if the underlying device has LBA format with only 4K (or
> perhaps even just 512B), how would that affect the guarantees?
> 

Yes, the whole FS block is managed as one unit (also on a physically contiguous
page), so we send the whole FS block while performing writeback. This is not
guaranteed when the FS block size is 4k and the DB page size is 8k, as it might
be sent as two different requests, as you have indicated.

The LBA format will not affect the guarantee of sending the whole FS block without
splitting as long as the FS block size is less than the maximum IO transfer size*.
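To make that concrete, here is a small sketch of mine (not something from the
patches) that reads the relevant queue limits from sysfs and checks that an 8k
FS block fits in a single transfer; the device name ("nvme0n1") and the 8k block
size are just placeholder assumptions:

/* Sketch: check that an 8k filesystem block fits in one device transfer.
 * The device name ("nvme0n1") and the 8k block size are illustrative. */
#include <stdio.h>

static long read_sysfs_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;

    if (f)
    {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    long max_kb = read_sysfs_long("/sys/block/nvme0n1/queue/max_hw_sectors_kb");
    long lbs = read_sysfs_long("/sys/block/nvme0n1/queue/logical_block_size");
    long fs_block = 8192;   /* e.g. an XFS filesystem created with -b size=8192 */

    if (max_kb < 0 || lbs < 0)
        return 1;

    printf("logical block size: %ld bytes, max transfer: %ld KiB\n", lbs, max_kb);

    /* The FS block can go out unsplit as long as it fits in one transfer. */
    if (fs_block <= max_kb * 1024L)
        printf("FS block of %ld bytes fits in a single IO\n", fs_block);

    return 0;
}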

But another issue is that even though the host has done its job, the device might
have a smaller atomic guarantee, thereby making the write not power-fail safe.

> The other thing is - is there a reliable way to say when the guarantees
> actually apply? I mean, how would the administrator *know* it's safe to
> set full_page_writes=off, or even better how could we verify this when
> the database starts (and complain if it's not safe to disable FPW)?
> 

This is an excellent question that needs a bit of community discussion to
expose a device-agnostic value that userspace can trust.

There might be a talk this year at LSFMM about untorn writes[1] in the buffered IO
path. I will make sure to bring this question up.

At the moment, Linux exposes the physical block size with atomic guarantees also
taken into account; for NVMe in particular, it uses NAWUPF and AWUPF when setting
the physical block size (/sys/block/<dev>/queue/physical_block_size).
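As a rough illustration (a simplified sketch, not the actual driver code): AWUPF
and NAWUPF are 0-based counts of logical blocks, so the atomic write unit in
bytes comes out roughly like this:

/* Simplified sketch: deriving an atomic write size in bytes from the NVMe
 * AWUPF/NAWUPF fields, which are 0-based counts of logical blocks. */
#include <stdint.h>
#include <stdio.h>

static uint32_t nvme_atomic_write_bytes(uint16_t awupf,  /* controller-wide */
                                        uint16_t nawupf, /* per-namespace, 0 if unused */
                                        uint32_t logical_block_size)
{
    /* Prefer the namespace-specific value when the namespace reports one. */
    uint16_t units = nawupf ? nawupf : awupf;

    return (uint32_t) (units + 1) * logical_block_size;
}

int main(void)
{
    /* Example: AWUPF = 7 on a 512-byte LBA format => 4096-byte atomic unit. */
    printf("%u\n", (unsigned) nvme_atomic_write_bytes(7, 0, 512));
    return 0;
}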

A system admin could use the value exposed via physical_block_size as a hint for
setting full_page_writes=off. Of course, this also requires the device to give
atomic guarantees.

The optimal configuration would be DB page size == FS block size == device atomic size.
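A minimal sketch of what such a startup check could look like (the data
directory, the device name and the 8k page size below are placeholder
assumptions; it compares the DB page size against the FS block size from
statvfs() and the device's physical_block_size):

/* Sketch: check "DB page size == FS block size == device atomic size".
 * The data directory and device name are illustrative placeholders. */
#include <stdio.h>
#include <sys/statvfs.h>

#define DB_PAGE_SIZE 8192   /* BLCKSZ in a default PostgreSQL build */

static long read_sysfs_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;

    if (f)
    {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    struct statvfs sv;
    long phys_bs;

    if (statvfs("/var/lib/postgresql/data", &sv) != 0)
        return 1;

    /* physical_block_size already folds in the NVMe atomic guarantees. */
    phys_bs = read_sysfs_long("/sys/block/nvme0n1/queue/physical_block_size");

    if ((long) sv.f_bsize == DB_PAGE_SIZE && phys_bs == DB_PAGE_SIZE)
        printf("DB page size, FS block size and device atomic size all match\n");
    else
        printf("mismatch: fs=%lu, device=%ld, db=%d -- keep full_page_writes on\n",
               (unsigned long) sv.f_bsize, phys_bs, DB_PAGE_SIZE);

    return 0;
}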

> It's easy to e.g. take a backup on one filesystem and restore it on
> another one, and forget those may have different block sizes etc. I'm
> not sure it's possible in a 100% reliable way (tablespaces?).
> 
> 
> regards
> 

[1] https://lore.kernel.org/linux-fsdevel/20240228061257.GA106651@mit.edu/

* A small caveat: I am most familiar with NVMe, so my answers might be based on
my experience in NVMe.


