Обсуждение: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory
pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory
От
SATYANARAYANA NARLAPURAM
Дата:
Hi Hackers,
I noticed that pg_receivewal fails to stream when the partial file to write is not fully initialized and fails with the error message something like below. This requires an extra step of deleting the partial file that is not fully initialized before starting the pg_receivewal. Attaching a simple patch that creates a temp file, fully initialize it and rename the file to the desired wal segment name.
"error: write-ahead log file "000000010000000000000003.partial" has 8396800 bytes, should be 0 or 16777216"
Thanks,
Satya
Вложения
On Sun, Jan 02, 2022 at 09:27:43PM -0800, SATYANARAYANA NARLAPURAM wrote: > I noticed that pg_receivewal fails to stream when the partial file to write > is not fully initialized and fails with the error message something like > below. This requires an extra step of deleting the partial file that is not > fully initialized before starting the pg_receivewal. Attaching a simple > patch that creates a temp file, fully initialize it and rename the file to > the desired wal segment name. Are you referring to the pre-padding when creating a new partial segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until the file is fully created? What kind of error did you see? I guess that a write() with ENOSPC would be more likely, but you got a different problem? I don't disagree with improving such cases, but we should not do things so as there is a risk of leaving behind an infinite set of segments in case of repeated errors, and partial segments are already a kind of temporary file. - if (dir_data->sync) + if (shouldcreatetempfile) + { + if (durable_rename(tmpsuffixpath, targetpath) != 0) + { + close(fd); + unlink(tmpsuffixpath); + return NULL; + } + } durable_rename() does a set of fsync()'s, but --no-sync should not flush any data. -- Michael
Вложения
Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory
От
SATYANARAYANA NARLAPURAM
Дата:
Thanks Michael!
On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Jan 02, 2022 at 09:27:43PM -0800, SATYANARAYANA NARLAPURAM wrote:
> I noticed that pg_receivewal fails to stream when the partial file to write
> is not fully initialized and fails with the error message something like
> below. This requires an extra step of deleting the partial file that is not
> fully initialized before starting the pg_receivewal. Attaching a simple
> patch that creates a temp file, fully initialize it and rename the file to
> the desired wal segment name.
Are you referring to the pre-padding when creating a new partial
segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
the file is fully created? What kind of error did you see? I guess
that a write() with ENOSPC would be more likely, but you got a
different problem?
I see two cases, 1/ when no space is left on the device and 2/ when the process is taken down forcibly (a VM/container crash)
I don't disagree with improving such cases, but we
should not do things so as there is a risk of leaving behind an
infinite set of segments in case of repeated errors
Do you see a problem with the proposed patch that leaves the files behind, at least in my testing I don't see any files left behind?
, and partial
segments are already a kind of temporary file.
if the .partial file exists with not zero-padded up to the wal segment size (WalSegSz), then open_walfile fails with the below error. I have two options here, 1/ to continue padding the existing partial file and let it zero up to WalSegSz , 2/create a temp file as I did in the patch. I thought the latter is safe because it can handle corrupt cases as described below. Thoughts?
* When streaming to files, if an existing file exists we verify that it's
* either empty (just created), or a complete WalSegSz segment (in which
* case it has been created and padded). Anything else indicates a corrupt
* file. Compressed files have no need for padding, so just ignore this
* case.
* either empty (just created), or a complete WalSegSz segment (in which
* case it has been created and padded). Anything else indicates a corrupt
* file. Compressed files have no need for padding, so just ignore this
* case.
- if (dir_data->sync)
+ if (shouldcreatetempfile)
+ {
+ if (durable_rename(tmpsuffixpath, targetpath) != 0)
+ {
+ close(fd);
+ unlink(tmpsuffixpath);
+ return NULL;
+ }
+ }
durable_rename() does a set of fsync()'s, but --no-sync should not
flush any data.
I need to look into this further, without this I am seeing random file close and rename failures and disconnecting the stream. Also it appears we are calling durable_rename when we are closing the file (dir_close) even without --no-sync. Should we worry about the padding case?
--
Michael
Hello pg_receivewal creates this .partial WAL file during WAL streaming and it is already treating this file as a temporary file.It will fill this .partial file with zeroes up to 16777216 by default before streaming real WAL data on it. If your .partial file is only 8396800 bytes, then this could mean that pg_receivewal is terminated abruptly while it is appendingzeroes or your system runs out of disk space. Do you have any error message? If this is case, the uninitialized .partial file should still be all zeroes, so it should be ok to delete it and have pg_receivewalto recreate a new .partial file. Also, in your patch, you are using pad_to_size argument in function dir_open_for_write to determine if it needs to createa temp file, but I see that this function is always given a pad_to_size = 16777216 , and never 0. Am I missing something? Cary Huang =========== HighGo Software Canada
On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote: > > On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz> wrote: >> >> On Sun, Jan 02, 2022 at 09:27:43PM -0800, SATYANARAYANA NARLAPURAM wrote: >> > I noticed that pg_receivewal fails to stream when the partial file to write >> > is not fully initialized and fails with the error message something like >> > below. This requires an extra step of deleting the partial file that is not >> > fully initialized before starting the pg_receivewal. Attaching a simple >> > patch that creates a temp file, fully initialize it and rename the file to >> > the desired wal segment name. >> >> Are you referring to the pre-padding when creating a new partial >> segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until >> the file is fully created? What kind of error did you see? I guess >> that a write() with ENOSPC would be more likely, but you got a >> different problem? > > I see two cases, 1/ when no space is left on the device and 2/ when the process is taken down forcibly (a VM/containercrash) Yeah, these cases can occur leaving uninitialized .partial files which can be a problem for both pg_receivewal and pg_basebackup that uses dir_open_for_write (CreateWalDirectoryMethod). >> I don't disagree with improving such cases, but we >> should not do things so as there is a risk of leaving behind an >> infinite set of segments in case of repeated errors > > Do you see a problem with the proposed patch that leaves the files behind, at least in my testing I don't see any filesleft behind? With the proposed patch, it doesn't leave the unpadded .partial files. Also, the v2 patch always removes a leftover .partial.temp file before it creates a new one. >> , and partial >> segments are already a kind of temporary file. > > > if the .partial file exists with not zero-padded up to the wal segment size (WalSegSz), then open_walfile fails with thebelow error. I have two options here, 1/ to continue padding the existing partial file and let it zero up to WalSegSz, 2/create a temp file as I did in the patch. I thought the latter is safe because it can handle corrupt cases asdescribed below. Thoughts? The temp file approach looks clean. >> - if (dir_data->sync) >> + if (shouldcreatetempfile) >> + { >> + if (durable_rename(tmpsuffixpath, targetpath) != 0) >> + { >> + close(fd); >> + unlink(tmpsuffixpath); >> + return NULL; >> + } >> + } >> >> durable_rename() does a set of fsync()'s, but --no-sync should not >> flush any data. Fixed this in v2. Another thing I found while working on this is the way the dir_open_for_write does padding - it doesn't retry in case of partial writes of blocks of size XLOG_BLCKSZ, unlike what core postgres does with pg_pwritev_with_retry in XLogFileInitInternal. Maybe dir_open_for_write should use the same approach. Thoughts? I fixed couple of issues with v1 (which was using static local variables in dir_open_for_write, not using durable_rename/rename for dir_data->sync true/false cases, not considering compression method none while setting shouldcreatetempfile true), improved comments and added commit message. Please review the v2 further. Regards, Bharath Rupireddy.
Вложения
At Sat, 9 Apr 2022 18:03:01 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM > <satyanarlapuram@gmail.com> wrote: > > > > On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz> wrote: > >> Are you referring to the pre-padding when creating a new partial > >> segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until > >> the file is fully created? What kind of error did you see? I guess > >> that a write() with ENOSPC would be more likely, but you got a > >> different problem? > > > > I see two cases, 1/ when no space is left on the device and 2/ when the process is taken down forcibly (a VM/containercrash) > > Yeah, these cases can occur leaving uninitialized .partial files which > can be a problem for both pg_receivewal and pg_basebackup that uses > dir_open_for_write (CreateWalDirectoryMethod). > > >> I don't disagree with improving such cases, but we > >> should not do things so as there is a risk of leaving behind an > >> infinite set of segments in case of repeated errors > > > > Do you see a problem with the proposed patch that leaves the files behind, at least in my testing I don't see any filesleft behind? I guess that Michael took this patch as creating a temp file name such like "tmp.n" very time finding an incomplete file. > With the proposed patch, it doesn't leave the unpadded .partial files. > Also, the v2 patch always removes a leftover .partial.temp file before > it creates a new one. > > >> , and partial > >> segments are already a kind of temporary file. I'm not sure this is true for pg_receivewal case. The .partial file is not a temporary file but the current working file for the tool. > > if the .partial file exists with not zero-padded up to the wal segment size (WalSegSz), then open_walfile fails withthe below error. I have two options here, 1/ to continue padding the existing partial file and let it zero up to WalSegSz, 2/create a temp file as I did in the patch. I thought the latter is safe because it can handle corrupt cases asdescribed below. Thoughts? I think this patch shouldn't involve pg_basebackup. I agree to Cary that deleting the erroring file should be fine. We already "skipping" (complete = non-.partial) WAL files with a wrong size in FindStreamingStart so we can error-out with suggesting a hint. $ pg_receivewal -D xlog -p 5432 -h /tmp pg_receivewal: error: segment file "0000000100000022000000F5.partial" has incorrect size 8404992 hint: You can continue after removing the file. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Sorry for the terrible typos.. At Sat, 9 Apr 2022 18:03:01 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in > On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM > <satyanarlapuram@gmail.com> wrote: > > > > On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz> wrote: > >> Are you referring to the pre-padding when creating a new partial > >> segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until > >> the file is fully created? What kind of error did you see? I guess > >> that a write() with ENOSPC would be more likely, but you got a > >> different problem? > > > > I see two cases, 1/ when no space is left on the device and 2/ when the process is taken down forcibly (a VM/containercrash) > > Yeah, these cases can occur leaving uninitialized .partial files which > can be a problem for both pg_receivewal and pg_basebackup that uses > dir_open_for_write (CreateWalDirectoryMethod). > > >> I don't disagree with improving such cases, but we > >> should not do things so as there is a risk of leaving behind an > >> infinite set of segments in case of repeated errors > > > > Do you see a problem with the proposed patch that leaves the files behind, at least in my testing I don't see any filesleft behind? I guess that Michael took this patch as creating a temp file with a name such like "tmp.n" every time finding an incomplete file. > With the proposed patch, it doesn't leave the unpadded .partial files. > Also, the v2 patch always removes a leftover .partial.temp file before > it creates a new one. > > >> , and partial > >> segments are already a kind of temporary file. I'm not sure this is true for pg_receivewal case. The .partial file is not a temporary file but the current working file for the tool. > > if the .partial file exists with not zero-padded up to the wal segment size (WalSegSz), then open_walfile fails withthe below error. I have two options here, 1/ to continue padding the existing partial file and let it zero up to WalSegSz, 2/create a temp file as I did in the patch. I thought the latter is safe because it can handle corrupt cases asdescribed below. Thoughts? I think this patch shouldn't involve pg_basebackup. I agree to Cary that deleting the erroring file should be fine. We already "skipping" (complete = non-.partial) WAL files with a wrong size in FindStreamingStart so we can error-out with suggesting a hint. $ pg_receivewal -D xlog -p 5432 -h /tmp pg_receivewal: error: segment file "0000000100000022000000F5.partial" has incorrect size 8404992 hint: You can continue after removing the file. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory
От
SATYANARAYANA NARLAPURAM
Дата:
On Sun, Apr 10, 2022 at 11:16 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Sorry for the terrible typos..
At Sat, 9 Apr 2022 18:03:01 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM
> <satyanarlapuram@gmail.com> wrote:
> >
> > On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier <michael@paquier.xyz> wrote:
> >> Are you referring to the pre-padding when creating a new partial
> >> segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until
> >> the file is fully created? What kind of error did you see? I guess
> >> that a write() with ENOSPC would be more likely, but you got a
> >> different problem?
> >
> > I see two cases, 1/ when no space is left on the device and 2/ when the process is taken down forcibly (a VM/container crash)
>
> Yeah, these cases can occur leaving uninitialized .partial files which
> can be a problem for both pg_receivewal and pg_basebackup that uses
> dir_open_for_write (CreateWalDirectoryMethod).
>
> >> I don't disagree with improving such cases, but we
> >> should not do things so as there is a risk of leaving behind an
> >> infinite set of segments in case of repeated errors
> >
> > Do you see a problem with the proposed patch that leaves the files behind, at least in my testing I don't see any files left behind?
I guess that Michael took this patch as creating a temp file with a
name such like "tmp.n" every time finding an incomplete file.
> With the proposed patch, it doesn't leave the unpadded .partial files.
> Also, the v2 patch always removes a leftover .partial.temp file before
> it creates a new one.
>
> >> , and partial
> >> segments are already a kind of temporary file.
I'm not sure this is true for pg_receivewal case. The .partial file
is not a temporary file but the current working file for the tool.
Correct. The idea is to make sure the file is fully allocated before treating it as a current file.
> > if the .partial file exists with not zero-padded up to the wal segment size (WalSegSz), then open_walfile fails with the below error. I have two options here, 1/ to continue padding the existing partial file and let it zero up to WalSegSz , 2/create a temp file as I did in the patch. I thought the latter is safe because it can handle corrupt cases as described below. Thoughts?
I think this patch shouldn't involve pg_basebackup. I agree to Cary
that deleting the erroring file should be fine.
We already "skipping" (complete = non-.partial) WAL files with a wrong
size in FindStreamingStart so we can error-out with suggesting a hint.
$ pg_receivewal -D xlog -p 5432 -h /tmp
pg_receivewal: error: segment file "0000000100000022000000F5.partial" has incorrect size 8404992
hint: You can continue after removing the file.
The idea here is to make pg_receivewal self sufficient and reduce human/third party tool interaction. Ideal case is running pg_Receivewal as a service for wal archiving.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Apr 11, 2022 at 01:21:23PM -0700, SATYANARAYANA NARLAPURAM wrote: > Correct. The idea is to make sure the file is fully allocated before > treating it as a current file. Another problem comes to compression, as the pre-padding cannot be applied in this case because zlib and lz4 don't know the size of the compressed segment until we reach 16MB of data received, but you can get a good estimate as long as you know how much space is left on a device. FWIW, I had to deal with this problem a couple of years ago for the integration of an archiver in a certain thing, and the requirement was that the WAL archiver service had to be a maximum self-aware and automated, which is what you wish to achieve here. It basically came down to measure how much WAL one wishes to keep in the WAL archives for the sizing of the disk partition storing the archives (aka how much back in time you want to go), in combination to how much WAL would get produced on a rather-linear production load. Another thing is that you never really want to stress too much your partition so as it gets filled at 100%, as there could be opened files and the kind that consume more space than the actual amount of data stored, but you'd usually want to keep up to 70~90% of it. At the end, we finished with: - A dependency to statvfs(), which is not portable on WIN32, to find out how much space was left on the partition (f_blocks*f_bsize for the total size and f_bfree*f_bsize for the free size I guess, by looking at its man page). - Control the amount of WAL to keep around using a percentage rate of maximum disk space allowed (or just a percentage of free disk space), with pg_receivewal doing a cleanup of up to WalSegSz worth of data for the oldest segments. The segments of the oldest TLIs are removed first. For any compression algorithm, unlinking this much amount of data is not necessary but that's fine as you usually just remove one compressed or uncompressed segment per cycle, at it does not matter with dozens of gigs worth of WAL archives, or even more. -- Michael
Вложения
On Tue, Apr 12, 2022 at 5:34 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Apr 11, 2022 at 01:21:23PM -0700, SATYANARAYANA NARLAPURAM wrote: > > Correct. The idea is to make sure the file is fully allocated before > > treating it as a current file. > > Another problem comes to compression, as the pre-padding cannot be > applied in this case because zlib and lz4 don't know the size of the > compressed segment until we reach 16MB of data received, but you can > get a good estimate as long as you know how much space is left on a > device. FWIW, I had to deal with this problem a couple of years ago > for the integration of an archiver in a certain thing, and the > requirement was that the WAL archiver service had to be a maximum > self-aware and automated, which is what you wish to achieve here. It > basically came down to measure how much WAL one wishes to keep in the > WAL archives for the sizing of the disk partition storing the > archives (aka how much back in time you want to go), in combination to > how much WAL would get produced on a rather-linear production load. > > Another thing is that you never really want to stress too much your > partition so as it gets filled at 100%, as there could be opened files > and the kind that consume more space than the actual amount of data > stored, but you'd usually want to keep up to 70~90% of it. At the > end, we finished with: > - A dependency to statvfs(), which is not portable on WIN32, to find > out how much space was left on the partition (f_blocks*f_bsize for > the total size and f_bfree*f_bsize for the free size I guess, by > looking at its man page). > - Control the amount of WAL to keep around using a percentage rate of > maximum disk space allowed (or just a percentage of free disk space), > with pg_receivewal doing a cleanup of up to WalSegSz worth of data for > the oldest segments. The segments of the oldest TLIs are removed > first. For any compression algorithm, unlinking this much amount of > data is not necessary but that's fine as you usually just remove one > compressed or uncompressed segment per cycle, at it does not matter > with dozens of gigs worth of WAL archives, or even more. Thanks for sharing this. Will the write operations (in dir_open_for_write) for PG_COMPRESSION_GZIP and PG_COMPRESSION_LZ4 take longer compared to prepadding for non-compressed files? I would like to know if there's any problem with the proposed fix. I think we need the same fix proposed in this thread for tar_open_for_write as well because it also does prepadding for non-compressed files. In general, I agree that making pg_receivewal self-aware and automating things by itself is really a great idea. This will avoid manual effort. For instance, pg_receivewal can try with different streaming start LSNs (restart_lsn of its slot or server insert LSN) not just the latest LSN found in its target directory which will particularly be helpful in case its source server has changed the timeline or for some reason unable to serve the WAL. Regards, Bharath Rupireddy.
On Mon, Apr 18, 2022 at 02:50:17PM +0530, Bharath Rupireddy wrote: > Thanks for sharing this. Will the write operations (in > dir_open_for_write) for PG_COMPRESSION_GZIP and PG_COMPRESSION_LZ4 > take longer compared to prepadding for non-compressed files? The first write operations for gzip and lz4 consists in writing their respective headers in the resulting file, which should be a couple of dozen bytes, at most. So that's surely going to be cheaper than the pre-padding done for a full segment with the flush induced after writing WalSegSz bytes worth of zeros. > I would like to know if there's any problem with the proposed fix. There is nothing done for the case of compressed segments, meaning that you would see the same problem when being in the middle of writing a segment compressed with gzip or lz4 in the middle of writing it, and that's what you want to avoid here. So the important part is not the pre-padding, it is to make sure that there is enough space reserved for the handling of a full segment before beginning the work on it. -- Michael
Вложения
On Tue, Apr 19, 2022 at 10:42 AM Michael Paquier <michael@paquier.xyz> wrote: > > > I would like to know if there's any problem with the proposed fix. > > There is nothing done for the case of compressed segments, meaning > that you would see the same problem when being in the middle of > writing a segment compressed with gzip or lz4 in the middle of writing > it, and that's what you want to avoid here. So the important part is > not the pre-padding, it is to make sure that there is enough space > reserved for the handling of a full segment before beginning the work > on it. Right. We find enough disk space and go to write and suddenly the write operations fail for some reason or the VM crashes because of a reason other than disk space. I think the foolproof solution is to figure out the available disk space before prepadding or compressing and also use the write-first-to-temp-file-and-then-rename-it-to-original-file as proposed in the earlier patches in this thread. Having said that, is there a portable way that we can find out the disk space available? I read your response upthread that statvfs isn't portable to WIN32 platforms. So, we just say that part of the fix we proposed here (checking disk space before prepadding or compressing) isn't supported on WIN32 and we just do the temp file thing for WIN32 alone? Regards, Bharath Rupireddy.
On Fri, Apr 22, 2022 at 07:17:37PM +0530, Bharath Rupireddy wrote: > Right. We find enough disk space and go to write and suddenly the > write operations fail for some reason or the VM crashes because of a > reason other than disk space. I think the foolproof solution is to > figure out the available disk space before prepadding or compressing > and also use the > write-first-to-temp-file-and-then-rename-it-to-original-file as > proposed in the earlier patches in this thread. Yes, what would count here is only the amount of free space in a partition. The total amount of space available becomes handy once you begin introducing things like percentage-based quota policies for the disk when archiving. The free amount of space could be used to define a policy based on the maximum number of bytes you need to leave around, as well, but this is not perfect science as this depends of what FSes decide to do underneath. There are a couple of designs possible here. When I had to deal with my upthread case I have chosen one as I had no need to worry only about Linux, it does not mean that this is the best choice that would fit with the long-term community picture. This comes down to how much pg_receivewal should handle automatically, and how it should handle it. > Having said that, is there a portable way that we can find out the > disk space available? I read your response upthread that statvfs isn't > portable to WIN32 platforms. So, we just say that part of the fix we > proposed here (checking disk space before prepadding or compressing) > isn't supported on WIN32 and we just do the temp file thing for WIN32 > alone? Something like GetDiskFreeSpaceA() would do the trick on WIN32. https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getdiskfreespacea When it comes to compression, creating a temporary file would only lead to its deletion, mostly, and extra I/O is never free. Anyway, why would you need the extra logic of the temporary file at all? That's basically what the .partial file is as pg_receivewal begins streaming at the beginning of a segment, to the partial file, each time it sees fit to restart a streaming cycle. -- Michael
Вложения
On Mon, Apr 25, 2022 at 6:38 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Apr 22, 2022 at 07:17:37PM +0530, Bharath Rupireddy wrote: > > Right. We find enough disk space and go to write and suddenly the > > write operations fail for some reason or the VM crashes because of a > > reason other than disk space. I think the foolproof solution is to > > figure out the available disk space before prepadding or compressing > > and also use the > > write-first-to-temp-file-and-then-rename-it-to-original-file as > > proposed in the earlier patches in this thread. > > Yes, what would count here is only the amount of free space in a > partition. The total amount of space available becomes handy once you > begin introducing things like percentage-based quota policies for the > disk when archiving. The free amount of space could be used to define > a policy based on the maximum number of bytes you need to leave > around, as well, but this is not perfect science as this depends of > what FSes decide to do underneath. There are a couple of designs > possible here. When I had to deal with my upthread case I have chosen > one as I had no need to worry only about Linux, it does not mean that > this is the best choice that would fit with the long-term community > picture. This comes down to how much pg_receivewal should handle > automatically, and how it should handle it. Thanks. I'm not sure why we are just thinking of crashes due to out-of-disk space. Figuring out free disk space before writing a huge file (say a WAL file) is a problem in itself to the core postgres as well, not just pg_receivewal. I think we are off-track a bit here. Let me illustrate what's the whole problem is and the idea: If the node/VM on which pg_receivewal runs, goes down/crashes or fails during write operation while padding the target WAL file (the .partial file) with zeros, the unfilled target WAL file ((let me call this file a partially padded .partial file) will be left over and subsequent reads/writes to that it will fail with "write-ahead log file \"%s\" has %zd bytes, should be 0 or %d" error which requires manual intervention to remove it. In a service, this manual intervention is what we would like to avoid. Let's not much bother right now for compressed file writes (for now at least) as they don't have a prepadding phase. The proposed solution is to make the prepadding atomic - prepad the XXXX.partial file as XXXX.partial.tmp name and after the prepadding rename (durably if sync option is chosen for pg_receivewal) to XXXX.partial. Before prepadding XXXX.partial.tmp, delete the XXXX.partial.tmp if it exists. The above problem isn't unique to pg_receivewal alone, pg_basebackup too uses CreateWalDirectoryMethod and dir_open_for_write via ReceiveXlogStream. IMHO, pg_receivewal checking for available disk space before writing any file should better be discussed separately? Regards, Bharath Rupireddy.
On Mon, Apr 25, 2022 at 5:17 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Apr 25, 2022 at 6:38 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Fri, Apr 22, 2022 at 07:17:37PM +0530, Bharath Rupireddy wrote: > > > Right. We find enough disk space and go to write and suddenly the > > > write operations fail for some reason or the VM crashes because of a > > > reason other than disk space. I think the foolproof solution is to > > > figure out the available disk space before prepadding or compressing > > > and also use the > > > write-first-to-temp-file-and-then-rename-it-to-original-file as > > > proposed in the earlier patches in this thread. > > > > Yes, what would count here is only the amount of free space in a > > partition. The total amount of space available becomes handy once you > > begin introducing things like percentage-based quota policies for the > > disk when archiving. The free amount of space could be used to define > > a policy based on the maximum number of bytes you need to leave > > around, as well, but this is not perfect science as this depends of > > what FSes decide to do underneath. There are a couple of designs > > possible here. When I had to deal with my upthread case I have chosen > > one as I had no need to worry only about Linux, it does not mean that > > this is the best choice that would fit with the long-term community > > picture. This comes down to how much pg_receivewal should handle > > automatically, and how it should handle it. > > Thanks. I'm not sure why we are just thinking of crashes due to > out-of-disk space. Figuring out free disk space before writing a huge > file (say a WAL file) is a problem in itself to the core postgres as > well, not just pg_receivewal. > > I think we are off-track a bit here. Let me illustrate what's the > whole problem is and the idea: > > If the node/VM on which pg_receivewal runs, goes down/crashes or fails > during write operation while padding the target WAL file (the .partial > file) with zeros, the unfilled target WAL file ((let me call this file > a partially padded .partial file) will be left over and subsequent > reads/writes to that it will fail with "write-ahead log file \"%s\" > has %zd bytes, should be 0 or %d" error which requires manual > intervention to remove it. In a service, this manual intervention is > what we would like to avoid. Let's not much bother right now for > compressed file writes (for now at least) as they don't have a > prepadding phase. > > The proposed solution is to make the prepadding atomic - prepad the > XXXX.partial file as XXXX.partial.tmp name and after the prepadding > rename (durably if sync option is chosen for pg_receivewal) to > XXXX.partial. Before prepadding XXXX.partial.tmp, delete the > XXXX.partial.tmp if it exists. > > The above problem isn't unique to pg_receivewal alone, pg_basebackup > too uses CreateWalDirectoryMethod and dir_open_for_write via > ReceiveXlogStream. > > IMHO, pg_receivewal checking for available disk space before writing > any file should better be discussed separately? Here's the v3 patch after rebasing. I just would like to reiterate the issue the patch is trying to solve: At times (when no space is left on the device or when the process is taken down forcibly (VM/container crash)), there can be leftover uninitialized .partial files (note that padding i.e. filling 16MB WAL files with all zeros is done in non-compression cases) due to which pg_receivewal fails to come up after the crash. To address this, the proposed patch makes the padding 16MB WAL files atomic (first write a .temp file, pad it and durably rename it to the original .partial file, ready to be filled with received WAL records). This approach is similar to what core postgres achieves atomicity while creating new WAL file (see how XLogFileInitInternal() creates xlogtemp.%d file first and then how InstallXLogFileSegment() durably renames it to original WAL file). Thoughts? Regards, Bharath Rupireddy.
Вложения
Hi Bharath,
Idea to atomically allocate WAL file by creating tmp file and renaming it is nice.
I have one question though:
How is partially temp file created will be cleaned if the VM crashes or out of disk space cases? Does it endup creating multiple files for every VM crash/disk space during process of pg_receivewal?
Thoughts?
Thanks,
Mahendrakar.
On Mon, 25 Jul 2022 at 16:42, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Apr 25, 2022 at 5:17 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Mon, Apr 25, 2022 at 6:38 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Apr 22, 2022 at 07:17:37PM +0530, Bharath Rupireddy wrote:
> > > Right. We find enough disk space and go to write and suddenly the
> > > write operations fail for some reason or the VM crashes because of a
> > > reason other than disk space. I think the foolproof solution is to
> > > figure out the available disk space before prepadding or compressing
> > > and also use the
> > > write-first-to-temp-file-and-then-rename-it-to-original-file as
> > > proposed in the earlier patches in this thread.
> >
> > Yes, what would count here is only the amount of free space in a
> > partition. The total amount of space available becomes handy once you
> > begin introducing things like percentage-based quota policies for the
> > disk when archiving. The free amount of space could be used to define
> > a policy based on the maximum number of bytes you need to leave
> > around, as well, but this is not perfect science as this depends of
> > what FSes decide to do underneath. There are a couple of designs
> > possible here. When I had to deal with my upthread case I have chosen
> > one as I had no need to worry only about Linux, it does not mean that
> > this is the best choice that would fit with the long-term community
> > picture. This comes down to how much pg_receivewal should handle
> > automatically, and how it should handle it.
>
> Thanks. I'm not sure why we are just thinking of crashes due to
> out-of-disk space. Figuring out free disk space before writing a huge
> file (say a WAL file) is a problem in itself to the core postgres as
> well, not just pg_receivewal.
>
> I think we are off-track a bit here. Let me illustrate what's the
> whole problem is and the idea:
>
> If the node/VM on which pg_receivewal runs, goes down/crashes or fails
> during write operation while padding the target WAL file (the .partial
> file) with zeros, the unfilled target WAL file ((let me call this file
> a partially padded .partial file) will be left over and subsequent
> reads/writes to that it will fail with "write-ahead log file \"%s\"
> has %zd bytes, should be 0 or %d" error which requires manual
> intervention to remove it. In a service, this manual intervention is
> what we would like to avoid. Let's not much bother right now for
> compressed file writes (for now at least) as they don't have a
> prepadding phase.
>
> The proposed solution is to make the prepadding atomic - prepad the
> XXXX.partial file as XXXX.partial.tmp name and after the prepadding
> rename (durably if sync option is chosen for pg_receivewal) to
> XXXX.partial. Before prepadding XXXX.partial.tmp, delete the
> XXXX.partial.tmp if it exists.
>
> The above problem isn't unique to pg_receivewal alone, pg_basebackup
> too uses CreateWalDirectoryMethod and dir_open_for_write via
> ReceiveXlogStream.
>
> IMHO, pg_receivewal checking for available disk space before writing
> any file should better be discussed separately?
Here's the v3 patch after rebasing.
I just would like to reiterate the issue the patch is trying to solve:
At times (when no space is left on the device or when the process is
taken down forcibly (VM/container crash)), there can be leftover
uninitialized .partial files (note that padding i.e. filling 16MB WAL
files with all zeros is done in non-compression cases) due to which
pg_receivewal fails to come up after the crash. To address this, the
proposed patch makes the padding 16MB WAL files atomic (first write a
.temp file, pad it and durably rename it to the original .partial
file, ready to be filled with received WAL records). This approach is
similar to what core postgres achieves atomicity while creating new
WAL file (see how XLogFileInitInternal() creates xlogtemp.%d file
first and then how InstallXLogFileSegment() durably renames it to
original WAL file).
Thoughts?
Regards,
Bharath Rupireddy.
On Sun, Jul 31, 2022 at 8:36 PM mahendrakar s <mahendrakarforpg@gmail.com> wrote: > >> On Mon, 25 Jul 2022 at 16:42, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: >> Here's the v3 patch after rebasing. >> >> I just would like to reiterate the issue the patch is trying to solve: >> At times (when no space is left on the device or when the process is >> taken down forcibly (VM/container crash)), there can be leftover >> uninitialized .partial files (note that padding i.e. filling 16MB WAL >> files with all zeros is done in non-compression cases) due to which >> pg_receivewal fails to come up after the crash. To address this, the >> proposed patch makes the padding 16MB WAL files atomic (first write a >> .temp file, pad it and durably rename it to the original .partial >> file, ready to be filled with received WAL records). This approach is >> similar to what core postgres achieves atomicity while creating new >> WAL file (see how XLogFileInitInternal() creates xlogtemp.%d file >> first and then how InstallXLogFileSegment() durably renames it to >> original WAL file). >> >> Thoughts? > > Hi Bharath, > > Idea to atomically allocate WAL file by creating tmp file and renaming it is nice. Thanks for reviewing it. > I have one question though: > How is partially temp file created will be cleaned if the VM crashes or out of disk space cases? Does it endup creatingmultiple files for every VM crash/disk space during process of pg_receivewal? > > Thoughts? It is handled in the patch, see [1]. Attaching v4 patch herewith which now uses the temporary file suffix '.tmp' as opposed to v3 patch '.temp'. This is just to be in sync with other atomic file write codes in the core - autoprewarm, pg_stat_statement, slot, basebacup, replorigin, snapbuild, receivelog and so on. Please review the v4 patch. [1] + /* + * Actual file doesn't exist. Now, create a temporary file pad it + * and rename to the target file. The temporary file may exist from + * the last failed attempt (failed during partial padding or + * renaming or some other). If it exists, let's play safe and + * delete it before creating a new one. + */ + snprintf(tmpsuffixpath, sizeof(tmpsuffixpath), "%s.%s", + targetpath, "temp"); + + if (dir_existsfile(tmpsuffixpath)) + { + if (unlink(tmpsuffixpath) != 0) + { + dir_data->lasterrno = errno; + return NULL; + } + } -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Вложения
On Thu, Aug 4, 2022 at 11:59 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Sun, Jul 31, 2022 at 8:36 PM mahendrakar s > <mahendrakarforpg@gmail.com> wrote: > > > >> On Mon, 25 Jul 2022 at 16:42, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > >> Here's the v3 patch after rebasing. > >> > >> I just would like to reiterate the issue the patch is trying to solve: > >> At times (when no space is left on the device or when the process is > >> taken down forcibly (VM/container crash)), there can be leftover > >> uninitialized .partial files (note that padding i.e. filling 16MB WAL > >> files with all zeros is done in non-compression cases) due to which > >> pg_receivewal fails to come up after the crash. To address this, the > >> proposed patch makes the padding 16MB WAL files atomic (first write a > >> .temp file, pad it and durably rename it to the original .partial > >> file, ready to be filled with received WAL records). This approach is > >> similar to what core postgres achieves atomicity while creating new > >> WAL file (see how XLogFileInitInternal() creates xlogtemp.%d file > >> first and then how InstallXLogFileSegment() durably renames it to > >> original WAL file). > >> > >> Thoughts? > > > > Hi Bharath, > > > > Idea to atomically allocate WAL file by creating tmp file and renaming it is nice. > > Thanks for reviewing it. > > > I have one question though: > > How is partially temp file created will be cleaned if the VM crashes or out of disk space cases? Does it endup creatingmultiple files for every VM crash/disk space during process of pg_receivewal? > > > > Thoughts? > > It is handled in the patch, see [1]. > > Attaching v4 patch herewith which now uses the temporary file suffix > '.tmp' as opposed to v3 patch '.temp'. This is just to be in sync with > other atomic file write codes in the core - autoprewarm, > pg_stat_statement, slot, basebacup, replorigin, snapbuild, receivelog > and so on. > > Please review the v4 patch. I've done some more testing today (hacked the code a bit by adding pg_usleep(10000L); in pre-padding loop and crashing the pg_receivewal process to produce the warning [1]) and found that the better place to remove ".partial.tmp" leftover files is in FindStreamingStart() because there we do a traversal of all the files in target directory along the way to remove if ".partial.tmp" file(s) is/are found. Please review the v5 patch further. [1] pg_receivewal: warning: segment file "0000000100000006000000B9.partial" has incorrect size 15884288, skipping -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Вложения
Hi Bharath,
I reviewed your patch. Minor comments.
1. Why are we not using durable_unlink instead of unlink to remove the partial tmp files?
2. Below could be a simple if(shouldcreatetempfile){} else{} as in error case we need to return NULL.
+ if (errno != ENOENT || !shouldcreatetempfile)
+ {
+ dir_data->lasterrno = errno;
+ return NULL;
+ }
+ else if (shouldcreatetempfile)
+ {
+ /*
+ * Actual file doesn't exist. Now, create a temporary file pad it
+ * and rename to the target file. The temporary file may exist from
+ * the last failed attempt (failed during partial padding or
+ * renaming or some other). If it exists, let's play safe and
+ * delete it before creating a new one.
+ */
+ snprintf(tmpsuffixpath, MAXPGPATH, "%s.tmp", targetpath);
+
+ if (dir_existsfile(tmpsuffixpath))
+ {
+ if (unlink(tmpsuffixpath) != 0)
+ {
+ dir_data->lasterrno = errno;
+ return NULL;
+ }
+ }
+
+ fd = open(tmpsuffixpath, flags | O_CREAT, pg_file_create_mode);
+ if (fd < 0)
+ {
+ dir_data->lasterrno = errno;
+ return NULL;
+ }
+ }
On Mon, 8 Aug 2022 at 11:59, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Thu, Aug 4, 2022 at 11:59 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Sun, Jul 31, 2022 at 8:36 PM mahendrakar s
> <mahendrakarforpg@gmail.com> wrote:
> >
> >> On Mon, 25 Jul 2022 at 16:42, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> >> Here's the v3 patch after rebasing.
> >>
> >> I just would like to reiterate the issue the patch is trying to solve:
> >> At times (when no space is left on the device or when the process is
> >> taken down forcibly (VM/container crash)), there can be leftover
> >> uninitialized .partial files (note that padding i.e. filling 16MB WAL
> >> files with all zeros is done in non-compression cases) due to which
> >> pg_receivewal fails to come up after the crash. To address this, the
> >> proposed patch makes the padding 16MB WAL files atomic (first write a
> >> .temp file, pad it and durably rename it to the original .partial
> >> file, ready to be filled with received WAL records). This approach is
> >> similar to what core postgres achieves atomicity while creating new
> >> WAL file (see how XLogFileInitInternal() creates xlogtemp.%d file
> >> first and then how InstallXLogFileSegment() durably renames it to
> >> original WAL file).
> >>
> >> Thoughts?
> >
> > Hi Bharath,
> >
> > Idea to atomically allocate WAL file by creating tmp file and renaming it is nice.
>
> Thanks for reviewing it.
>
> > I have one question though:
> > How is partially temp file created will be cleaned if the VM crashes or out of disk space cases? Does it endup creating multiple files for every VM crash/disk space during process of pg_receivewal?
> >
> > Thoughts?
>
> It is handled in the patch, see [1].
>
> Attaching v4 patch herewith which now uses the temporary file suffix
> '.tmp' as opposed to v3 patch '.temp'. This is just to be in sync with
> other atomic file write codes in the core - autoprewarm,
> pg_stat_statement, slot, basebacup, replorigin, snapbuild, receivelog
> and so on.
>
> Please review the v4 patch.
I've done some more testing today (hacked the code a bit by adding
pg_usleep(10000L); in pre-padding loop and crashing the pg_receivewal
process to produce the warning [1]) and found that the better place to
remove ".partial.tmp" leftover files is in FindStreamingStart()
because there we do a traversal of all the files in target directory
along the way to remove if ".partial.tmp" file(s) is/are found.
Please review the v5 patch further.
[1] pg_receivewal: warning: segment file
"0000000100000006000000B9.partial" has incorrect size 15884288,
skipping
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Fri, Aug 19, 2022 at 1:37 PM mahendrakar s <mahendrakarforpg@gmail.com> wrote: > > Hi Bharath, > I reviewed your patch. Minor comments. Thanks. > 1. Why are we not using durable_unlink instead of unlink to remove the partial tmp files? durable_unlink() issues fsync on the parent directory, if used, those fsync() calls will be per partial.tmp file. Moreover, durable_unlink() is backend-only, not available for tools i.e. FRONTEND code. If we don't durably remove the pratial.tmp file, it will get deleted in the next cycle anyways, so no problem there. > 2. Below could be a simple if(shouldcreatetempfile){} else{} as in error case we need to return NULL. Yeah, that way it is much simpler. Please review the attached v6 patch. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Вложения
Changes look good to me.
Thanks,
Mahendrakar.
On Fri, 19 Aug 2022 at 17:28, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Fri, Aug 19, 2022 at 1:37 PM mahendrakar s
<mahendrakarforpg@gmail.com> wrote:
>
> Hi Bharath,
> I reviewed your patch. Minor comments.
Thanks.
> 1. Why are we not using durable_unlink instead of unlink to remove the partial tmp files?
durable_unlink() issues fsync on the parent directory, if used, those
fsync() calls will be per partial.tmp file. Moreover, durable_unlink()
is backend-only, not available for tools i.e. FRONTEND code. If we
don't durably remove the pratial.tmp file, it will get deleted in the
next cycle anyways, so no problem there.
> 2. Below could be a simple if(shouldcreatetempfile){} else{} as in error case we need to return NULL.
Yeah, that way it is much simpler.
Please review the attached v6 patch.
--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Fri, Aug 19, 2022 at 5:27 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Please review the attached v6 patch. I'm attaching the v7 patch rebased on to the latest HEAD. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
On Wed, Sep 21, 2022 at 8:03 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, Aug 19, 2022 at 5:27 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > Please review the attached v6 patch. > > I'm attaching the v7 patch rebased on to the latest HEAD. v7 patch was failing on Windows [1]. This is because of the full path name being sent to dir_existsfile() instead of just sending the temp file name. I fixed the issue. Thanks Michael Paquier for an offlist chat. PSA v8 patch. [1] t/010_pg_basebackup.pl ... 134/? # Failed test 'pg_basebackup reports checksum mismatch stderr /(?^s:^WARNING.*checksum verification failed)/' # at t/010_pg_basebackup.pl line 769. # 'unrecognized win32 error code: 123WARNING: checksum verification failed in file "./base/5/16399", block 0: calculated 4C09 but expected B3F6 # pg_basebackup: error: checksum error occurred -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
On Thu, Sep 22, 2022 at 07:16:41AM +0530, Bharath Rupireddy wrote: > t/010_pg_basebackup.pl ... 134/? > # Failed test 'pg_basebackup reports checksum mismatch stderr > /(?^s:^WARNING.*checksum verification failed)/' > # at t/010_pg_basebackup.pl line 769. > # 'unrecognized win32 error code: 123WARNING: > checksum verification failed in file "./base/5/16399", block 0: > calculated 4C09 but expected B3F6 > # pg_basebackup: error: checksum error occurred Shouldn't we extend the mapping table in win32error.c so as the information provided is more useful when seeing this error, then? There could be other code path able to trigger this failure, or other hackers working on separate features that could benefit from this extra information. -- Michael
Вложения
On Thu, Sep 22, 2022 at 10:07 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Sep 22, 2022 at 07:16:41AM +0530, Bharath Rupireddy wrote: > > t/010_pg_basebackup.pl ... 134/? > > # Failed test 'pg_basebackup reports checksum mismatch stderr > > /(?^s:^WARNING.*checksum verification failed)/' > > # at t/010_pg_basebackup.pl line 769. > > # 'unrecognized win32 error code: 123WARNING: > > checksum verification failed in file "./base/5/16399", block 0: > > calculated 4C09 but expected B3F6 > > # pg_basebackup: error: checksum error occurred > > Shouldn't we extend the mapping table in win32error.c so as the > information provided is more useful when seeing this error, then? > There could be other code path able to trigger this failure, or other > hackers working on separate features that could benefit from this > extra information. Thanks. I will start a separate thread to discuss that. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Sep 22, 2022 at 7:16 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > PSA v8 patch. I simplified the code a bit by using a fixed temporary file name (thanks Michael Paquier for the suggestion) much like the core does in XLogFileInitInternal(). If there's any left-over temp file from a crash or failure in dir_open_for_write(), that gets deleted in the next call. Please review the v9 patch further. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
2024-01 Commitfest. Hi, this patch was marked in CF as "Needs Review", but there has been no activity on this thread for 15+ months. Since there seems not much interest, I have changed the status to "Returned with Feedback" [1]. Feel free to propose a stronger use case for the patch and add an entry for the same. ====== [1] https://commitfest.postgresql.org/46/3503/ Kind Regards, Peter Smith.