Discussion: WAL Re-Writes
Following up on the earlier discussion about WAL re-writes [1], I have
done some investigation in that area which I would like to share.
Currently we always write WAL in 8KB blocks, which can lead to a lot of
re-writing of data for small transactions. Consider the case where the
amount to be written is usually < 4KB: we still write it in 8KB chunks,
and that is the major source of re-writes. I have tried various options
to reduce this re-writing of data.
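As a rough worked example (illustrative numbers, not taken from the
measurements below): if an average commit produces ~1.5KB of WAL,
padding each write out to a full 8KB block issues more than five times
the necessary bytes, and a run of small commits keeps re-writing the
same 8KB block until it fills up.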
First, I tried writing the WAL in exactly the size that the transaction
(or whatever else) requested from XLogWrite(); the patch for this
experiment is attached. I have written a small test patch
(calculate_wal_written_by_backend_v1.patch) to calculate the amount of
WAL written by the XLogWrite() API and found that the actual WAL writes
by PostgreSQL were cut in half with the patch applied (pgbench tpc-b
workload with 4 clients), but unfortunately this led to a significant
decrease (more than 50%) in TPS. Jan Wieck then found, by observing the
OS stats in /sys/block/<devname>/stat, that the patch reduced writes
but introduced reads. The probable explanation is that, because the
patch does not write on OS block boundaries, the OS has to read the
block to complete the write operation. The reason the OS could not find
the corresponding block in memory is that, while closing the WAL file,
we use POSIX_FADV_DONTNEED when wal_level is less than 'archive', which
leads to this problem. So the conclusion from this experiment is that
although we can avoid re-writing WAL data by doing exact writes, it can
lead to a significant reduction in TPS.
Next, I tried writing the WAL in chunks and introduced a GUC,
wal_write_chunk_size, to experiment with different chunk sizes
(patch - write_wal_chunks_v1.patch). I have noticed that at a 4096-byte
chunk size (the OS block size) there is approximately a 35% reduction
in WAL writes (at different client counts for the pgbench read-write
workload), both according to my test patch
calculate_wal_written_by_backend_v1.patch and according to the OS stats
in /sys/block/<devname>/stat. So while I see a good reduction in WAL
writes, the TPS increase is only 1~5% for read-write workloads. In some
cases at a lower client count (4), I have seen increases of up to
10~15% across multiple runs, but I did not find a clear trend
suggesting that lower client counts will always see such a good
improvement; on the other hand, I have not observed any regression with
a 4096-byte WAL chunk size in my tests so far. One likely reason we
might not see much improvement at high client counts is the logic in
XLogFlush(), where we combine the WAL writes from multiple clients; the
combined size is then greater than 4096 bytes, in which case we write
8K blocks anyway. For all other chunk sizes (512, 1024 and 2048 bytes),
I observed that the smaller the chunk size, the better the reduction in
WAL writes, but the TPS trend is just the opposite (the smaller the
chunk size, the worse the TPS), and the probable reason is the same as
explained in the previous paragraph.
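For illustration, here is a minimal sketch of the chunked-write idea in
C; the names (wal_write_chunk_size, wal_bytes_to_write) are
illustrative and not taken from the attached patch:

#include <stddef.h>

#define XLOG_BLCKSZ 8192                  /* WAL block size, as in PostgreSQL */

static int wal_write_chunk_size = 4096;   /* hypothetical GUC: OS block size */

/*
 * Instead of always padding a WAL write out to a full XLOG_BLCKSZ page,
 * pad only to the next wal_write_chunk_size boundary, capped at the full
 * block size that the current code always writes.
 */
static size_t
wal_bytes_to_write(size_t requested)
{
    size_t  chunk = (size_t) wal_write_chunk_size;
    size_t  padded = ((requested + chunk - 1) / chunk) * chunk;

    return (padded < XLOG_BLCKSZ) ? padded : XLOG_BLCKSZ;
}

With a 1.5KB request this pads to 4KB instead of 8KB; the observed ~35%
reduction is smaller than that because XLogFlush() often combines
requests past the 4KB mark.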
Thoughts?
Note -
1. The OS-level reduction in WAL writes was measured with WAL and data
on separate disks.
2. I can share the detailed performance data if required, but I thought
it better to first share the approach of the patches.
3. The patches are more at a proof-of-concept stage than a real
implementation, but I don't think it would take much effort to improve
them if we settle on a particular approach as acceptable.
On 27/01/2016 13:30, Amit Kapila wrote:
> Thoughts?

Are the decreases observed with SSD as well as spinning rust? I might
imagine that decreasing the wear would be advantageous, especially if
the performance decrease is less with low read latency.
On Thu, Jan 28, 2016 at 1:34 AM, james <james@mansionfamily.plus.com> wrote:
> On 27/01/2016 13:30, Amit Kapila wrote:
>> Thoughts?
>
> Are the decreases observed with SSD as well as spinning rust?
The test was done with WAL on SSD and data on spinning rust, but I
think the results should be similar if we had done it the other way
around as well. Having said that, I think it is still worthwhile to
test it that way, and I will do so.
> I might imagine that decreasing the wear would be advantageous,
Yes.
> especially if the performance decrease is less with low read latency.
Let me clarify again here that with a 4096-byte chunk size there is no
performance decrease observed; rather, there is a performance increase,
though a relatively small one (1~5%), along with a reduction of ~35% in
disk writes. Only if we do exact writes or write in chunks smaller than
the OS block size (512 or 1024 bytes) do we see a performance decrease,
mainly for wal_level < 'archive', but then the writes are smaller
still. I would also like to mention that what we call the reduction in
disk writes is taken from the 7th column of the stat file [1]
(write sectors - number of sectors written; for details, refer to the
documentation of the stat file [1]).
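(For reference, the sectors counted there are 512-byte units, so
multiplying the delta of that column by 512 gives the number of bytes
written to the device.)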
EnterpriseDB: http://www.enterprisedb.com
On 01/27/2016 08:30 AM, Amit Kapila wrote:
> operation. Now why OS couldn't find the corresponding block in
> memory is that, while closing the WAL file, we use
> POSIX_FADV_DONTNEED if wal_level is less than 'archive' which
> lead to this problem. So with this experiment, the conclusion is that
> though we can avoid re-write of WAL data by doing exact writes, but
> it could lead to significant reduction in TPS.

POSIX_FADV_DONTNEED isn't the only way those blocks would vanish from
OS buffers. If I am not mistaken, we recycle WAL segments in a
round-robin fashion. In a properly configured system, where the reason
for a checkpoint is usually "time" rather than "xlog", a recycled WAL
file being written to was closed and not touched for about a complete
checkpoint_timeout or longer. You must have a really big amount of
spare RAM in the machine to still find those blocks in memory.
Basically we are talking about the active portion of your database,
shared buffers, the sum of all process local memory and the complete
pg_xlog directory content fitting into RAM.

Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info
On 1/31/16 3:26 PM, Jan Wieck wrote:
> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>> operation. Now why OS couldn't find the corresponding block in
>> memory is that, while closing the WAL file, we use
>> POSIX_FADV_DONTNEED if wal_level is less than 'archive' which
>> lead to this problem. So with this experiment, the conclusion is that
>> though we can avoid re-write of WAL data by doing exact writes, but
>> it could lead to significant reduction in TPS.
>
> POSIX_FADV_DONTNEED isn't the only way how those blocks would vanish
> from OS buffers. If I am not mistaken we recycle WAL segments in a round
> robin fashion. In a properly configured system, where the reason for a
> checkpoint is usually "time" rather than "xlog", a recycled WAL file
> written to had been closed and not touched for about a complete
> checkpoint_timeout or longer. You must have a really big amount of spare
> RAM in the machine to still find those blocks in memory. Basically we
> are talking about the active portion of your database, shared buffers,
> the sum of all process local memory and the complete pg_xlog directory
> content fitting into RAM.

But that's only going to matter when the segment is newly recycled. My
impression from Amit's email is that the OS was repeatedly reading even
in the same segment?

Either way, I would think it wouldn't be hard to work around this by
spewing out a bunch of zeros to the OS in advance of where we actually
need to write, preventing the need for reading back from disk.

Amit, did you do performance testing with archiving enabled and a
no-op archive_command?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 1/31/16 3:26 PM, Jan Wieck wrote:
>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>> operation. Now why OS couldn't find the corresponding block in
>>> memory is that, while closing the WAL file, we use
>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive' which
>>> lead to this problem. So with this experiment, the conclusion is that
>>> though we can avoid re-write of WAL data by doing exact writes, but
>>> it could lead to significant reduction in TPS.
>>
>> POSIX_FADV_DONTNEED isn't the only way how those blocks would vanish
>> from OS buffers. If I am not mistaken we recycle WAL segments in a round
>> robin fashion. In a properly configured system, where the reason for a
>> checkpoint is usually "time" rather than "xlog", a recycled WAL file
>> written to had been closed and not touched for about a complete
>> checkpoint_timeout or longer. You must have a really big amount of spare
>> RAM in the machine to still find those blocks in memory. Basically we
>> are talking about the active portion of your database, shared buffers,
>> the sum of all process local memory and the complete pg_xlog directory
>> content fitting into RAM.
I think that could only be a problem if the reads were happening at the
write or fsync call, but that is not the case here. Further
investigation on this point reveals that the reads are not for the
fsync operation; rather, they happen when we call
posix_fadvise(,,POSIX_FADV_DONTNEED). Although this behaviour (writing
in chunks that are not a multiple of the OS page size can lead to reads
if followed by a call to posix_fadvise(,,POSIX_FADV_DONTNEED)) is not
very clearly documented, the reason for it is that the fadvise() call
maps the specified data range (in our case the whole file) onto the
list of pages and then invalidates them, removing them from the OS
cache; any writes that are misaligned with respect to the OS page size
during writing/fsyncing of the file can then cause additional reads,
because not everything we write will end on an OS page boundary. This
theory is based on the fadvise code [1] and some googling [2], which
suggest that misaligned writes followed by POSIX_FADV_DONTNEED can
cause this kind of problem. A colleague of mine, Dilip Kumar, has also
verified it by writing a simple open/write/fsync/fadvise/close program.
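For reference, a minimal standalone sketch of that kind of test (the
file name and sizes are illustrative, and this is not Dilip's actual
program): write a misaligned amount, fsync, then drop the cached pages
with POSIX_FADV_DONTNEED while watching /sys/block/<devname>/stat for
induced reads.

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char    buf[512];       /* deliberately smaller than the 4K OS page */
    int     fd;
    int     rc;

    memset(buf, 'x', sizeof(buf));

    fd = open("/tmp/fadvise_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Misaligned write: only part of an OS page is dirtied. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
    {
        perror("write");
        return 1;
    }

    if (fsync(fd) != 0)
    {
        perror("fsync");
        return 1;
    }

    /*
     * Dropping the cached pages here is the step that was observed to
     * trigger extra reads for the partially written page.
     */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    close(fd);
    return 0;
}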
> But that's only going to matter when the segment is newly recycled. My
> impression from Amit's email is that the OS was repeatedly reading even
> in the same segment?
As explained above, the reads only happen while the file is being
closed.
> Either way, I would think it wouldn't be hard to work around this by
> spewing out a bunch of zeros to the OS in advance of where we actually
> need to write, preventing the need for reading back from disk.
I think we can simply prohibit setting wal_chunk_size to a value other
than the OS page size or XLOG_BLCKSZ (whichever is smaller) if
wal_level is less than 'archive'. That avoids the problem of extra
reads for misaligned writes, because we won't call fadvise().
We could even choose to always write on the OS page boundary or at
XLOG_BLCKSZ (whichever is smaller); since the OS page size is 4K in
many cases, that alone can save a significant amount of re-writes.
A minimal sketch of such a check is below.
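This is only a sketch with entirely hypothetical names; the real check
would live wherever the GUC/startup validation happens:

#include <stdbool.h>

/*
 * Allow a sub-page wal_chunk_size only when wal_level is high enough
 * that we never issue posix_fadvise(POSIX_FADV_DONTNEED) on WAL file
 * close, since that is what turns misaligned writes into read-back.
 */
static bool
wal_chunk_size_is_valid(int wal_chunk_size, int os_page_size,
                        bool wal_level_allows_archiving)
{
    if (wal_chunk_size >= os_page_size)
        return true;            /* page-aligned or larger: always safe */

    return wal_level_allows_archiving;
}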
> Amit, did you do performance testing with archiving enabled and a
> no-op archive_command?
No, but what kind of advantage are you expecting from such
tests?
On Wed, Feb 3, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>
>> On 1/31/16 3:26 PM, Jan Wieck wrote:
>>>
>>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>>>
>>>> operation. Now why OS couldn't find the corresponding block in
>>>> memory is that, while closing the WAL file, we use
>>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive' which
>>>> lead to this problem. So with this experiment, the conclusion is that
>>>> though we can avoid re-write of WAL data by doing exact writes, but
>>>> it could lead to significant reduction in TPS.
>>>
>>>
>>> POSIX_FADV_DONTNEED isn't the only way how those blocks would vanish
>>> from OS buffers. If I am not mistaken we recycle WAL segments in a round
>>> robin fashion. In a properly configured system, where the reason for a
>>> checkpoint is usually "time" rather than "xlog", a recycled WAL file
>>> written to had been closed and not touched for about a complete
>>> checkpoint_timeout or longer. You must have a really big amount of spare
>>> RAM in the machine to still find those blocks in memory. Basically we
>>> are talking about the active portion of your database, shared buffers,
>>> the sum of all process local memory and the complete pg_xlog directory
>>> content fitting into RAM.
>
>
>
> I think that could only be problem if reads were happening at write or
> fsync call, but that is not the case here. Further investigation on this
> point reveals that the reads are not for fsync operation, rather they
> happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
> Although this behaviour (writing in non-OS-page-cache-size chunks could
> lead to reads if followed by a call to posix_fadvise
> (,,POSIX_FADV_DONTNEED)) is not very clearly documented, but the
> reason for the same is that fadvise() call maps the specified data range
> (which in our case is whole file) into the list of pages and then invalidate
> them which will further lead to removing them from OS cache, now any
> misaligned (w.r.t OS page-size) writes done during writing/fsyncing to file
> could cause additional reads as everything written by us will not be on
> OS-page-boundary.
On further testing, it has been observed that misaligned writes can
cause reads even when the blocks belonging to the file are not in
memory, so I think what Jan is describing is right. The only case with
absolutely zero chance of reads is when we write on the OS page
boundary, which is generally 4K. However, I still think it is okay to
provide an option for writing WAL in smaller chunks (512 bytes,
1024 bytes, etc.) for the cases where they are beneficial, such as when
wal_level is greater than or equal to 'archive', and to keep the
default at the OS page size if that is smaller than 8K.
On Wed, Feb 3, 2016 at 7:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On further testing, it has been observed that misaligned writes could
> cause reads even when blocks related to file are not in-memory, so
> I think what Jan is describing is right. The case where there is
> absolutely zero chance of reads is when we write in OS-page boundary
> which is generally 4K. However I still think it is okay to provide an
> option for WAL writing in smaller chunks (512 bytes, 1024 bytes, etc)
> for the cases when these are beneficial like when wal_level is
> greater than equal to Archive and keep default as OS-page size if
> the same is smaller than 8K.

Hmm, a little research seems to suggest that 4kB pages are standard on
almost every system we might care about: x86_64, x86, Power, Itanium,
ARMv7. Sparc uses 8kB, though, and a search through the Linux kernel
sources (grep for PAGE_SHIFT) suggests that there are other obscure
architectures that can at least optionally use larger pages, plus a few
that can use smaller ones.

I'd like this to be something that users don't have to configure, and
it seems like that should be possible. We can detect the page size on
non-Windows systems using sysconf(_SC_PAGESIZE), and on Windows by
using GetSystemInfo. And I think it's safe to make this decision at
configure time, because the page size is a function of the hardware
architecture (it seems there are obscure systems that support multiple
page sizes, but I don't care about them particularly). So what I think
we should do is set an XLOG_WRITESZ along with XLOG_BLCKSZ and set it
to the smaller of XLOG_BLCKSZ and the system page size. If we can't
determine the system page size, assume 4kB.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
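As a rough illustration of the detection Robert describes (this is not
code from any posted patch, and XLOG_WRITESZ is only the proposed name;
it is written as a runtime helper for brevity, whereas the actual
proposal is to record the value at configure time):

#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif

#define XLOG_BLCKSZ 8192        /* WAL block size, as in the PostgreSQL tree */

/*
 * XLOG_WRITESZ candidate: the smaller of XLOG_BLCKSZ and the OS page
 * size, falling back to 4kB when the page size cannot be determined.
 */
static int
xlog_write_size(void)
{
    long        pagesize;

#ifdef WIN32
    SYSTEM_INFO si;

    GetSystemInfo(&si);
    pagesize = (long) si.dwPageSize;
#else
    pagesize = sysconf(_SC_PAGESIZE);
#endif

    if (pagesize <= 0)
        pagesize = 4096;

    return (pagesize < XLOG_BLCKSZ) ? (int) pagesize : XLOG_BLCKSZ;
}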
On Wed, Feb 3, 2016 at 7:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Feb 3, 2016 at 7:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On further testing, it has been observed that misaligned writes could
> > cause reads even when blocks related to file are not in-memory, so
> > I think what Jan is describing is right. The case where there is
> > absolutely zero chance of reads is when we write in OS-page boundary
> > which is generally 4K. However I still think it is okay to provide an
> > option for WAL writing in smaller chunks (512 bytes , 1024 bytes, etc)
> > for the cases when these are beneficial like when wal_level is
> > greater than equal to Archive and keep default as OS-page size if
> > the same is smaller than 8K.
>
> Hmm, a little research seems to suggest that 4kB pages are standard on
> almost every system we might care about: x86_64, x86, Power, Itanium,
> ARMv7. Sparc uses 8kB, though, and a search through the Linux kernel
> sources (grep for PAGE_SHIFT) suggests that there are other obscure
> architectures that can at least optionally use larger pages, plus a
> few that can use smaller ones.
>
> I'd like this to be something that users don't have to configure, and
> it seems like that should be possible. We can detect the page size on
> non-Windows systems using sysconf(_SC_PAGESIZE), and on Windows by
> using GetSystemInfo. And I think it's safe to make this decision at
> configure time, because the page size is a function of the hardware
> architecture (it seems there are obscure systems that support multiple
> page sizes, but I don't care about them particularly). So what I
> think we should do is set an XLOG_WRITESZ along with XLOG_BLCKSZ and
> set it to the smaller of XLOG_BLCKSZ and the system page size. If we
> can't determine the system page size, assume 4kB.
I think deciding it automatically, without requiring the user to
configure it, certainly has merit, but what about the cases where users
can benefit from configuring it themselves, such as when we use the
PG_O_DIRECT flag for WAL (with O_DIRECT it bypasses the OS buffers and
won't cause misaligned writes even for smaller chunk sizes like
512 bytes)? Some googling [1] reveals that other databases also give
users an option to configure the WAL block/chunk size (as BLOCKSIZE),
although they seem to base the chunk size on the disk sector size.
An additional thought, which is not necessarily related to this patch:
if the user chooses, or we decide, to write in 512-byte chunks, which
is usually the disk sector size, then can't we think of avoiding the
CRC for each record in such cases, because each WAL write will itself
be atomic? While reading, if we process in wal-chunk-sized units, then
I think it should be possible to detect end-of-WAL based on the data
read.
On Mon, Feb 8, 2016 at 12:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think deciding it automatically without user require to configure it,
> certainly has merits, but what about some cases where user can get
> benefits by configuring themselves like the cases where we use
> PG_O_DIRECT flag for WAL (with o_direct, it will by bypass OS
> buffers and won't cause misaligned writes even for smaller chunk sizes
> like 512 bytes or so). Some googling [1] reveals that other databases
> also provides user with option to configure wal block/chunk size (as
> BLOCKSIZE), although they seem to decide chunk size based on
> disk-sector size.

Well, if you can prove that we need that flexibility, then we should
have a GUC. Where's the benchmarking data to support that conclusion?

> An additional thought, which is not necessarily related to this patch is,
> if user chooses and or we decide to write in 512 bytes sized chunks,
> which is usually a disk sector size, then can't we think of avoiding
> CRC for each record for such cases, because each WAL write in
> it-self will be atomic. While reading, if we process in wal-chunk-sized
> units, then I think it should be possible to detect end-of-wal based
> on data read.

Gosh, taking CRCs off of WAL records sounds like a terrible idea. I'm
not sure why you think that writing in sector-sized chunks would make
that any more safe, because to me it seems like it wouldn't. But even
if it does, it's hard to believe that we don't derive some reliability
from CRCs that we would lose without them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-02-08 10:38:55 +0530, Amit Kapila wrote:
> I think deciding it automatically without user require to configure it,
> certainly has merits, but what about some cases where user can get
> benefits by configuring themselves like the cases where we use
> PG_O_DIRECT flag for WAL (with o_direct, it will by bypass OS
> buffers and won't cause misaligned writes even for smaller chunk sizes
> like 512 bytes or so). Some googling [1] reveals that other databases
> also provides user with option to configure wal block/chunk size (as
> BLOCKSIZE), although they seem to decide chunk size based on
> disk-sector size.

FWIW, you usually can't do that small writes with O_DIRECT. Usually it
has to be 4KB (pagesize) sized, aligned (4kb again) writes. And on
filesystems that do support doing such writes, they essentially fall
back to doing buffered IO.

> An additional thought, which is not necessarily related to this patch is,
> if user chooses and or we decide to write in 512 bytes sized chunks,
> which is usually a disk sector size, then can't we think of avoiding
> CRC for each record for such cases, because each WAL write in
> it-self will be atomic. While reading, if we process in wal-chunk-sized
> units, then I think it should be possible to detect end-of-wal based
> on data read.

O_DIRECT doesn't give any useful guarantees to do something like the
above. It doesn't have any ordering or durability implications. You
still need to do fdatasyncs and such.

Besides, with the new CRC implementation, that doesn't really seem like
such a large win anyway.

Greetings,

Andres Freund
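For illustration, a minimal standalone sketch of the alignment point
Andres makes (Linux-specific; the file name is purely illustrative):
with O_DIRECT both the buffer address and the transfer length generally
need to be aligned, and durability still requires an explicit
fdatasync.

#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    size_t  pagesize = (size_t) sysconf(_SC_PAGESIZE);
    void   *buf;
    int     fd;

    /* O_DIRECT needs an aligned buffer; plain malloc() is not enough. */
    if (posix_memalign(&buf, pagesize, pagesize) != 0)
    {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, pagesize);

    fd = open("/tmp/odirect_test.dat", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
    {
        perror("open(O_DIRECT)");
        return 1;
    }

    /*
     * A full, aligned page is accepted; a 512-byte write in its place
     * may be rejected or fall back to buffered IO, depending on the
     * filesystem and device.
     */
    if (write(fd, buf, pagesize) != (ssize_t) pagesize)
        perror("write");

    /* O_DIRECT by itself provides no durability or ordering guarantee. */
    if (fdatasync(fd) != 0)
        perror("fdatasync");

    close(fd);
    free(buf);
    return 0;
}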
On Mon, Feb 8, 2016 at 8:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Feb 8, 2016 at 12:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think deciding it automatically without user require to configure it,
> > certainly has merits, but what about some cases where user can get
> > benefits by configuring themselves like the cases where we use
> > PG_O_DIRECT flag for WAL (with o_direct, it will by bypass OS
> > buffers and won't cause misaligned writes even for smaller chunk sizes
> > like 512 bytes or so). Some googling [1] reveals that other databases
> > also provides user with option to configure wal block/chunk size (as
> > BLOCKSIZE), although they seem to decide chunk size based on
> > disk-sector size.
>
> Well, if you can prove that we need that flexibility, then we should
> have a GUC. Where's the benchmarking data to support that conclusion?
It is not posted yet because some more work is needed to complete the
benchmark results when PG_O_DIRECT is used (mainly with open_sync and
open_datasync); I will do so. But I think the main thing to take care
of is that, since smaller-chunk-sized writes are useful only in some
cases, we need to ensure that users don't get baffled by the option.
There are multiple ways to provide it:
a) At startup, ensure that if the user has set a smaller chunk size
(other than 4KB, which will be the default decided at configure time in
the way you described upthread) and PG_O_DIRECT can be used as decided
in get_sync_bit(), then allow it; otherwise either return an error or
just reset it to the 4KB default.
b) Mention in the docs that it is better not to tinker with the
wal_chunk_size GUC unless the other relevant settings are in place
(such as wal_sync_method = open_sync or open_datasync, and the default
wal_level).
c) Yet another option is to just go with 4KB-sized chunks for now,
since the benefit of smaller chunks applies only to a subset of the
cases we can support.
The reason I think it is beneficial to provide an option of writing in
smaller chunks is that it can reduce the amount of re-writes by a
higher percentage where it can be used. For example, at 4KB there is a
~35% reduction; with smaller chunks it could give us savings of up to
50% or 70%, depending on the chunk size.
>
> > An additional thought, which is not necessarily related to this patch is,
> > if user chooses and or we decide to write in 512 bytes sized chunks,
> > which is usually a disk sector size, then can't we think of avoiding
> > CRC for each record for such cases, because each WAL write in
> > it-self will be atomic. While reading, if we process in wal-chunk-sized
> > units, then I think it should be possible to detect end-of-wal based
> > on data read.
>
> Gosh, taking CRCs off of WAL records sounds like a terrible idea. I'm
> not sure why you think that writing in sector-sized chunks would make
> that any more safe, because to me it seems like it wouldn't. But even
> if it does, it's hard to believe that we don't derive some reliability
> from CRCs that we would lose without them.
>
I think the point here is not about more safety; rather, it is about
whether writing in disk-sector-sized chunks gives reliability equal to
CRCs, because if it does, then skipping the CRC calculation for each
record, both while writing and during replay, can save CPU and should
in turn lead to better performance. The reason I thought it could give
equal reliability is that disk-sector writes are atomic, so that should
buy us that reliability. I admit that much more analysis/research is
required before doing that, and we can pursue it later if it proves to
be valuable in terms of performance and reliability. I mentioned it
here to point out that writing in smaller chunks has other potential
benefits.
On Mon, Feb 8, 2016 at 8:16 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-02-08 10:38:55 +0530, Amit Kapila wrote:
> > I think deciding it automatically without user require to configure it,
> > certainly has merits, but what about some cases where user can get
> > benefits by configuring themselves like the cases where we use
> > PG_O_DIRECT flag for WAL (with o_direct, it will by bypass OS
> > buffers and won't cause misaligned writes even for smaller chunk sizes
> > like 512 bytes or so). Some googling [1] reveals that other databases
> > also provides user with option to configure wal block/chunk size (as
> > BLOCKSIZE), although they seem to decide chunk size based on
> > disk-sector size.
>
> FWIW, you usually can't do that small writes with O_DIRECT. Usually it
> has to be 4KB (pagesize) sized, aligned (4kb again) writes. And on
> filesystems that do support doing such writes, they essentially fall
> back to doing buffered IO.
>
I have not observed this during the tests (the observation is based on
the fact that whenever the OS buffer cache is used, writing in chunks
smaller than 4K leads to reads and in turn decreases performance). I
don't see such an implication in the documentation either.
> > An additional thought, which is not necessarily related to this patch is,
> > if user chooses and or we decide to write in 512 bytes sized chunks,
> > which is usually a disk sector size, then can't we think of avoiding
> > CRC for each record for such cases, because each WAL write in
> > it-self will be atomic. While reading, if we process in wal-chunk-sized
> > units, then I think it should be possible to detect end-of-wal based
> > on data read.
>
> O_DIRECT doesn't give any useful guarantees to do something like the
> above. It doesn't have any ordering or durability implications. You
> still need to do fdatasyncs and such.
>
It doesn't need to, if we use the O_SYNC flag, which we always use
whenever we use O_DIRECT mode for WAL writes.
> Besides, with the new CRC implementation, that doesn't really seem like
> such a large win anyway.
>
I haven't checked yet how big a win we can get if we avoid CRCs and
still provide the same reliability, but I think it can certainly save
CPU instructions both during writes and during replay, and performance
should be better than it is currently.