Thread: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"


odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Andres Freund
Date:
Hi,

My buildfarm animal grassquit just showed an odd failure [1] in REL_11_STABLE:

ok 10 - standby is in recovery
# Running: pg_ctl -D
/mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata
promote
waiting for server to promote....pg_ctl: control file appears to be corrupt
not ok 11 - pg_ctl promote of standby runs

#   Failed test 'pg_ctl promote of standby runs'
#   at /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 474.


I didn't find other references to this kind of failure. Nor has the error
re-occurred on grassquit.


I don't immediately see a way for this message to be hit that doesn't indicate
a bug somewhere. We should be updating the control file atomically and reading
it atomically.


The failure has to be happening in wait_for_postmaster_promote(), because the
standby2 is actually successfully promoted.

Greetings,

Andres Freund

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2022-11-22%2016%3A33%3A57



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Michael Paquier
Date:
On Tue, Nov 22, 2022 at 05:42:24PM -0800, Andres Freund wrote:
> The failure has to be happening in wait_for_postmaster_promote(), because the
> standby2 is actually successfully promoted.

That's the one under -fsanitize=address.  It really smells to me like
a bug with a race condition all over it.
--
Michael

Attachments

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Alvaro Herrera
Date:
On 2022-Nov-22, Andres Freund wrote:

> ok 10 - standby is in recovery
> # Running: pg_ctl -D
> /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata
> promote
> waiting for server to promote....pg_ctl: control file appears to be corrupt
> not ok 11 - pg_ctl promote of standby runs
> 
> #   Failed test 'pg_ctl promote of standby runs'
> #   at /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/../pgsql/src/test/perl/TestLib.pm line 474.

This reminded me of a proposal I saw yesterday:
https://postgr.es/m/02fe0063-bf77-90d0-3cf5-e9fe7c2a487b@postgrespro.ru
I think trying to store more stuff in pg_control is dangerous and we
should resist it.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Wed, Nov 23, 2022 at 2:42 PM Andres Freund <andres@anarazel.de> wrote:
> The failure has to be happening in wait_for_postmaster_promote(), because the
> standby2 is actually successfully promoted.

I assume this is ext4.  Presumably anything that reads the control file
without interlocking against writes (pg_ctl, pg_checksums, pg_resetwal,
pg_control_system(), ...) could see garbage.  I have lost track of the
versions and the thread, but I worked out at some point by experimentation
that this only started relatively recently for concurrent read() and
write(), but always happened with concurrent pread() and pwrite().  The
control file code uses the non-p variants, which didn't mash old/new data
like grated cheese under concurrency due to some implementation detail, but
now do.



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Nov 23, 2022 at 2:42 PM Andres Freund <andres@anarazel.de> wrote:
> > The failure has to be happening in wait_for_postmaster_promote(), because the
> > standby2 is actually successfully promoted.
>
> I assume this is ext4.  Presumably anything that reads the
> controlfile, like pg_ctl, pg_checksums, pg_resetwal,
> pg_control_system(), ... by reading without interlocking against
> writes could see garbage.  I have lost track of the versions and the
> thread, but I worked out at some point by experimentation that this
> only started relatively recently for concurrent read() and write(),
> but always happened with concurrent pread() and pwrite().  The control
> file uses the non-p variants which didn't mash old/new data like
> grated cheese under concurrency due to some implementation detail, but
> now does.

As for what to do about it, some ideas:

1.  Use advisory range locking.  (This would be an advisory version of
what many other filesystems do automatically, AFAIK.  Does Windows
have a thing like POSIX file locking, or need it here?)
2.  Retry after a short time on checksum failure.  The probability is
already minuscule, and becomes pretty close to 0 if we read thrice
100ms apart.
3.  Some scheme that involves renaming the file into place.  (That
might be a pain on Windows; it only works for the relmap thing because
all readers and writers are in the backend and use an LWLock to avoid
silly handle semantics.)
4.  ???

My first thought is that 2 is the appropriate level of complexity for this
rare and stupid problem.



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> I assume this is ext4.  Presumably anything that reads the
>> controlfile, like pg_ctl, pg_checksums, pg_resetwal,
>> pg_control_system(), ... by reading without interlocking against
>> writes could see garbage.  I have lost track of the versions and the
>> thread, but I worked out at some point by experimentation that this
>> only started relatively recently for concurrent read() and write(),
>> but always happened with concurrent pread() and pwrite().  The control
>> file uses the non-p variants which didn't mash old/new data like
>> grated cheese under concurrency due to some implementation detail, but
>> now does.

Ugh.

> As for what to do about it, some ideas:
> 2.  Retry after a short time on checksum failure.  The probability is
> already miniscule, and becomes pretty close to 0 if we read thrice
> 100ms apart.

> First thought is that 2 is appropriate level of complexity for this
> rare and stupid problem.

Yeah, I was thinking the same.  A variant could be "repeat until
we see the same calculated checksum twice".

            regards, tom lane





Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Thu, Nov 24, 2022 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > As for what to do about it, some ideas:
> > 2.  Retry after a short time on checksum failure.  The probability is
> > already miniscule, and becomes pretty close to 0 if we read thrice
> > 100ms apart.
>
> > First thought is that 2 is appropriate level of complexity for this
> > rare and stupid problem.
>
> Yeah, I was thinking the same.  A variant could be "repeat until
> we see the same calculated checksum twice".

Hmm.  While writing a comment to explain why that's good enough, I
realised it's not really true for a standby that control file writes
are always expected to be far apart in time.  XLogFlush->
UpdateMinRecoveryPoint() could coincide badly with our N attempts for
any small N and for any nap time, which I think makes your idea better
than mine.

With some cartoon-level understanding of what's going on (to wit: I
think the kernel just pins the page but doesn't use a page-level
content lock or range lock, so what you're seeing is raw racing memcpy
calls and unsynchronised cache line effects), I guess you'd be fairly
likely to make "progress" in seeing more new data even if you didn't
sleep in between, but who knows.  So I have a 10ms sleep to make
progress very likely; given your algorithm it doesn't matter if you
didn't make all the progress, just some.  Since this is reachable from
SQL, I think we also need a CFI call so you can't get uninterruptibly
stuck here?
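
In pseudo-C, the shape I have in mind is roughly this (just a sketch, with a
hypothetical read_control_file_once() helper standing in for the usual
open/read/close dance inside get_controlfile(); the attached patch may
differ in details):

pg_crc32c   crc;
pg_crc32c   last_crc = 0;
bool        have_last_crc = false;

for (;;)
{
    /* hypothetical helper: open, read and close pg_control once */
    read_control_file_once(DataDir, ControlFile);

    INIT_CRC32C(crc);
    COMP_CRC32C(crc, ControlFile, offsetof(ControlFileData, crc));
    FIN_CRC32C(crc);

    if (EQ_CRC32C(crc, ControlFile->crc))
    {
        *crc_ok_p = true;       /* matches the stored value: done */
        break;
    }
    if (have_last_crc && EQ_CRC32C(crc, last_crc))
    {
        *crc_ok_p = false;      /* stable but wrong: really corrupt */
        break;
    }
    last_crc = crc;
    have_last_crc = true;

#ifndef FRONTEND
    CHECK_FOR_INTERRUPTS();     /* reachable from SQL; stay interruptible */
#endif
    pg_usleep(10 * 1000L);      /* 10ms nap before re-reading */
}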

I wrote a stupid throw-away function to force a write.  If you have an
ext4 system to hand (xfs, zfs, apfs, ufs, others don't suffer from
this) you can do:

 do $$ begin for i in 1..100000000 loop perform
pg_update_control_file(); end loop; end; $$;

... while you also do:

 select pg_control_system();
 \watch 0.001

... and you'll soon see:

ERROR:  calculated CRC checksum does not match value stored in file

The attached draft patch fixes it.

Attachments

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Thu, Nov 24, 2022 at 2:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... and you'll soon see:
>
> ERROR:  calculated CRC checksum does not match value stored in file

I forgot to mention: this reproducer only seems to work if fsync =
off.  I don't know why, but I recall that was true also for bug
#17064.



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: "Anton A. Melnikov"
Date:
Hello!

On 24.11.2022 04:02, Thomas Munro wrote:
> On Thu, Nov 24, 2022 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@gmail.com> writes:
> 
> ERROR:  calculated CRC checksum does not match value stored in file
> 
> The attached draft patch fixes it.

I tried to catch this error on my PC, but failed to do so within a reasonable time.
The 1s interval is probably too long for me.
It seems there is a more reliable way to reproduce this bug with the 0001 patch applied:
In the first backend:

do $$ begin loop perform pg_update_control_file(); end loop; end; $$;

In the second one:

do $$ begin loop perform pg_control_system(); end loop; end; $$;

It fails almost immediately with:
"ERROR:  calculated CRC checksum does not match value stored in file"
both with fsync = off and fsync = on.
I checked this on master and REL_11_STABLE.

I also checked for a few hours that patch 0002 fixes this error,
but there are some questions about its logical structure.
The equality of the previous and newly calculated crc is checked only
if the last crc calculation was wrong, i.e. not equal to the value stored in the file.
It is very unlikely that the previous and new crc can match in that case, so, in fact,
the loop will spin until the crc is calculated correctly. In other words,
this algorithm is practically equivalent to an infinite loop of reading from the file
and calculating the crc while (EQ_CRC32C(crc, ControlFile->crc) != true).
But then it could be simplified significantly by removing the checksum equality checks
and the bool first_try, and by limiting the maximum number of iterations
with some constant in, e.g., a for loop to avoid a theoretically possible freeze.

Or maybe use the variant suggested by Tom Lane, i.e., as far as I understand,
repeat the open-read-close-calculate-crc sequence twice without a pause between
them and check the two calculated crcs for equality. If they match, exit and return
the bool result of comparing the last calculation with the value from the file;
if not, take a pause and repeat everything from the beginning.
In that case there is no need to check *crc_ok_p inside get_controlfile()
as in the present version; I think that is more logically correct,
since this variable is intended for top-level functions and the logic
inside get_controlfile() should not depend on its value.

I also found a warning in the 0001 patch for the master branch. On my PC gcc gives:

xlog.c:2507:1: warning: no previous prototype for ‘pg_update_control_file’ [-Wmissing-prototypes]
  2507 | pg_update_control_file()

I fixed it by adding #include "utils/fmgrprotos.h" to xlog.c and
PG_FUNCTION_ARGS to pg_update_control_file().

With the best wishes,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Tue, Jan 31, 2023 at 2:10 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
> Also checked for a few hours that the patch 0002 fixes this error,
> but there are some questions to its logical structure.

Hi Anton,

Thanks for looking!

> The equality between the previous and newly calculated crc is checked only
> if the last crc calculation was wrong, i.e not equal to the value stored in the file.
> It is very unlikely that in this case the previous and new crc can match, so, in fact,
> the loop will spin until crc is calculated correctly. In the other words,
> this algorithm is practically equivalent to an infinite loop of reading from a file
> and calculating crc while(EQ_CRC32C(crc, ControlFile->crc) != true).

Maybe it's unlikely that two samples will match while running that
torture test, because it's overwriting the file as fast as it can.
But the idea is that a real system isn't writing the control file most
of the time.

> But then it can be simplified significantly by removing checksums equality checks,
> bool fist_try and by limiting the maximum number of iterations
> with some constant in the e.g. for loop to avoid theoretically possible freeze.

Yeah, I was thinking that we should also put a limit on the loop, just
to be cautious.

Primary servers write the control file very infrequently.  Standbys write it
more frequently, while writing data out, maybe every few seconds on a
busy system writing out a lot of data.  UpdateMinRecoveryPoint() makes
some effort to avoid updating the file too often.  You definitely see
bursts of repeated flushes that might send this thing in a loop for a
while if the timings were exactly wrong, but usually with periodic
gaps; I haven't really studied the expected behaviour too closely.

> Or maybe use the variant suggested by Tom Lane, i.e, as far as i understand,
> repeat the file_open-read-close-calculate_crc sequence twice without a pause between
> them and check the both calculated crcs for the equality. If they match, exit and return
> the bool result of comparing between the last calculation with the value from the file,
> if not, take a pause and repeat everything from the beginning.

Hmm.  Would it be good enough to do two read() calls with no sleep in
between?  How sure are we that a concurrent write will manage to
change at least one bit that our second read can see?  I guess it's
likely, but maybe hypervisors, preemptible kernels, I/O interrupts or
a glitch in the matrix could decrease our luck?  I really don't know.
So I figured I should add a small sleep between the reads to change
the odds in our favour.  But we don't want to slow down all reads of
the control file with a sleep, do we?  So I thought we should only
bother doing this slow stuff if the actual CRC check fails, a low
probability event.

Clearly there is an element of speculation or superstition here.  I
don't know what else to do if both PostgreSQL and ext4 decided not to
add interlocking.  Maybe we should rethink that.  How bad would it
really be if control file access used POSIX file locking?  I mean, the
writer is going to *fsync* the file, so it's not like one more wafer
thin system call is going to hurt too much.



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Tue, Jan 31, 2023 at 5:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Clearly there is an element of speculation or superstition here.  I
> don't know what else to do if both PostgreSQL and ext4 decided not to
> add interlocking.  Maybe we should rethink that.  How bad would it
> really be if control file access used POSIX file locking?  I mean, the
> writer is going to *fsync* the file, so it's not like one more wafer
> thin system call is going to hurt too much.

Here's an experimental patch for that alternative.  I wonder if
someone would want to be able to turn it off for some reason -- maybe
some NFS problem?  It's less back-patchable, but maybe more
principled?
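
For reference, on the Unix side the locking amounts to something like this
minimal fcntl() sketch (not the attached patch; error handling and retries
are omitted, and the function names are only suggestive):

#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Take an advisory lock covering the whole of pg_control; F_SETLKW waits. */
static int
lock_controlfile(int fd, bool exclusive)
{
    struct flock lck;

    memset(&lck, 0, sizeof(lck));
    lck.l_type = exclusive ? F_WRLCK : F_RDLCK;
    lck.l_whence = SEEK_SET;
    lck.l_start = 0;
    lck.l_len = 0;              /* 0 means "to the end of the file" */

    return fcntl(fd, F_SETLKW, &lck);
}

/* Release the lock again (closing the descriptor also releases it). */
static int
unlock_controlfile(int fd)
{
    struct flock lck;

    memset(&lck, 0, sizeof(lck));
    lck.l_type = F_UNLCK;
    lck.l_whence = SEEK_SET;

    return fcntl(fd, F_SETLKW, &lck);
}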

I don't know if Windows suffers from this type of problem.
Unfortunately its equivalent functionality LockFile() looks non-ideal
for this purpose: if your program crashes, the documentation is very
vague on when exactly it is released by the OS, but it's not
immediately on process exit.  That seems non-ideal for a control file
you might want to access again very soon after a crash, to be able to
recover.

A thought in passing: if UpdateMinRecoveryPoint() performance is an
issue, maybe we should figure out how to use fdatasync() instead of
fsync().

Attachments

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: "Anton A. Melnikov"
Date:
Hi, Thomas!

There are two variants of the patch now.

1) As for the first workaround:

On 31.01.2023 07:09, Thomas Munro wrote:
> 
> Maybe it's unlikely that two samples will match while running that
> torture test, because it's overwriting the file as fast as it can.
> But the idea is that a real system isn't writing the control file most
> of the time.
> 
........
> Yeah, I was thinking that we should also put a limit on the loop, just
> to be cautious.

At first I didn't understand that the equality check between the previously
calculated crc and the current one on the second and later attempts was intended
for the case when the pg_control file really is corrupted.

Indeed, by adding a few debugging variables and running the torture test,
I found that there were ~4000 crc errors (~0.0003%) in ~125 million reads from the file,
and there was no case where a crc error appeared twice in a row.
So a second, let alone a third, successive crc error can be neglected,
and for this workaround a simpler and maybe clearer algorithm seems possible.
For instance:

for (try = 0; try < 3; try++)
{
    /* open, read from and close pg_control; calculate crc */

    *crc_ok_p = EQ_CRC32C(crc, ControlFile->crc);

    if (*crc_ok_p)
        break;
}

2) As for the second variant of the patch with POSIX locks:

On 31.01.2023 14:38, Thomas Munro wrote:
> On Tue, Jan 31, 2023 at 5:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Clearly there is an element of speculation or superstition here.  I
>> don't know what else to do if both PostgreSQL and ext4 decided not to
>> add interlocking.  Maybe we should rethink that.  How bad would it
>> really be if control file access used POSIX file locking?  I mean, the
>> writer is going to *fsync* the file, so it's not like one more wafer
>> thin system call is going to hurt too much.
> 
> Here's an experimental patch for that alternative.  I wonder if
> someone would want to be able to turn it off for some reason -- maybe
> some NFS problem?  It's less back-patchable, but maybe more
> principled?

It looks very strange to me that there may be cases where the cluster data
is stored on NFS. I can't figure out how that can be useful.

I think this variant of the patch is a proper solution to the problem,
not a workaround. I found no problems on Linux.
+1 for this variant.

Might we add a custom error message for EDEADLK,
since it is absent from errcode_for_file_access()?

> I don't know if Windows suffers from this type of problem.
> Unfortunately its equivalent functionality LockFile() looks non-ideal
> for this purpose: if your program crashes, the documentation is very
> vague on when exactly it is released by the OS, but it's not
> immediately on process exit.  That seems non-ideal for a control file
> you might want to access again very soon after a crash, to be able to
> recover.

Unfortunately I've not had time to reproduce the problem and apply this patch on
Windows yet, but I'm going to do it soon on Windows 10. If a crc error
occurs there, then we might use the workaround from the first
variant of the patch.

> A thought in passing: if UpdateMinRecoveryPoint() performance is an
> issue, maybe we should figure out how to use fdatasync() instead of
> fsync().

Maybe choose it in accordance with the wal_sync_method GUC?


Sincerely yours,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Wed, Feb 1, 2023 at 5:04 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
> On 31.01.2023 14:38, Thomas Munro wrote:
> > Here's an experimental patch for that alternative.  I wonder if
> > someone would want to be able to turn it off for some reason -- maybe
> > some NFS problem?  It's less back-patchable, but maybe more
> > principled?
>
> It looks very strange to me that there may be cases where the cluster data
> is stored in NFS. Can't figure out how this can be useful.

Heh.  There are many interesting failure modes, but people do it.  I
guess my more general question when introducing any new system call
into this code is how some unusual system I can't even access is going
to break.  Maybe some obscure filesystem will fail with EOPNOTSUPP, or
take 5 seconds and then fail because there is no lock server
configured or whatever, so that's why I don't think we can back-patch
it, and we probably need a way to turn it off.

> i think this variant of the patch is a normal solution
> of the problem, not workaround. Found no problems on Linux.
> +1 for this variant.

I prefer it too.

> Might add a custom error message for EDEADLK
> since it absent in errcode_for_file_access()?

Ah, good thought.  I think it shouldn't happen™, so it's OK that
errcode_for_file_access() would classify it as ERRCODE_INTERNAL_ERROR.

Other interesting errors are:

ENOLCK: system limits exceeded; PANIC seems reasonable
EOPNOTSUPP: this file doesn't support locking (seen on FreeBSD man
pages, not on POSIX)

> > I don't know if Windows suffers from this type of problem.
> > Unfortunately its equivalent functionality LockFile() looks non-ideal
> > for this purpose: if your program crashes, the documentation is very
> > vague on when exactly it is released by the OS, but it's not
> > immediately on process exit.  That seems non-ideal for a control file
> > you might want to access again very soon after a crash, to be able to
> > recover.
>
> Unfortunately i've not had time to reproduce the problem and apply this patch on
> Windows yet but i'm going to do it soon on windows 10. If a crc error
> will occur there, then we might use the workaround from the first
> variant of the patch.

Thank you for investigating.  I am afraid to read your results.

One idea would be to proceed with LockFile() for Windows, with a note
suggesting you file a bug with your OS vendor if you ever need it to
get unstuck.  Googling this subject, I read that MongoDB used to
suffer from stuck lock files, until an OS bug report led to recent
versions releasing locks more promptly.  I find that sufficiently
scary that I would want to default the feature to off on Windows, even
if your testing shows that it does really need it.

> > A thought in passing: if UpdateMinRecoveryPoint() performance is an
> > issue, maybe we should figure out how to use fdatasync() instead of
> > fsync().
>
> May be choose it in accordance with GUC wal_sync_method?

Here's a patch like that.  I don't know if it's a good idea for
wal_sync_method to affect other kinds of files or not, but, then, it
already does (fsync_writethrough changes this behaviour).  I would
only want to consider this if we also stop choosing "open_datasync" as
a default on at least Windows.
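
To be clear about what I mean by using fdatasync() for the control file, the
flush step would end up looking roughly like this (hypothetical helper; the
real patch dispatches on wal_sync_method rather than taking a flag):

#include <stdbool.h>
#include <unistd.h>

/*
 * Flush pg_control after overwriting it in place.  fdatasync() is enough
 * when we only care about the data: the file's size and name never change,
 * so there is no metadata worth forcing out.
 */
static int
flush_controlfile(int fd, bool data_only)
{
    return data_only ? fdatasync(fd) : fsync(fd);
}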

Attachments

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: "Anton A. Melnikov"
Date:
Hi, Thomas!

Thanks for your rapid answer, and sorry for my delayed reply.

On 01.02.2023 09:45, Thomas Munro wrote:
>> Might add a custom error message for EDEADLK
>> since it absent in errcode_for_file_access()?
> 
> Ah, good thought.  I think it shouldn't happen™, so it's OK that
> errcode_for_file_access() would classify it as ERRCODE_INTERNAL_ERROR.

Yes, I also think that is impossible, since the lock is taken on
the entire file, so ERRCODE_INTERNAL_ERROR is right here.

> Other interesting errors are:
> 
> ENOLCK: system limits exceeded; PANIC seems reasonable
> EOPNOTSUPP: this file doesn't support locking (seen on FreeBSD man
> pages, not on POSIX)

Agreed that ENOLCK should be a PANIC, or at least FATAL. Maybe it's even better
to make it FATAL to allow other backends to survive?
As for EOPNOTSUPP, maybe fall back to the workaround from the
first variant of the patch? (In my previous letter I forgot the pause
after the break;, of course.)

>>> I don't know if Windows suffers from this type of problem.
>>> Unfortunately its equivalent functionality LockFile() looks non-ideal
>>> for this purpose: if your program crashes, the documentation is very
>>> vague on when exactly it is released by the OS, but it's not
>>> immediately on process exit.  That seems non-ideal for a control file
>>> you might want to access again very soon after a crash, to be able to
>>> recover.
>>
>> Unfortunately i've not had time to reproduce the problem and apply this patch on
>> Windows yet but i'm going to do it soon on windows 10. If a crc error
>> will occur there, then we might use the workaround from the first
>> variant of the patch.
> 
> Thank you for investigating.  I am afraid to read your results.

At first it seemed to me that this is not a problem at all, since MSDN
guarantees sector-by-sector atomicity:
"Physical Sector: The unit for which read and write operations to the device
are completed in a single operation. This is the unit of atomic write..."
https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile
(Of course, only if the 512 bytes lie at the beginning of the file at zero
offset, but that is our case. The current size of ControlFileData is
296 bytes at offset = 0.)

I tried to verify this fact experimentally and was very surprised.
Unfortunately it crashed within an hour during the torture test:
2023-02-13 15:10:52.675 MSK [5704] LOG:  starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
2023-02-13 15:10:52.768 MSK [5704] LOG:  database system is ready to accept connections
@@@@@@ sizeof(ControlFileData) = 296
.........
2023-02-13 16:10:41.997 MSK [9380] ERROR:  calculated CRC checksum does not match value stored in file

But fortunately, this only happens when fsync=off.
I also did several experiments with fsync=on and found more encouraging behavior:
the stress test with sizeof(ControlFileData) = 512+8 = 520 bytes failed after 4.5 hours,
but the one with the ordinary sizeof(ControlFileData) = 296 did not crash in more than 12 hours.

It seems that in that case the behavior matches MSDN.  So if it is possible
to use fsync() under Windows even when the GUC fsync is off, it may be a solution
for this problem. If so, there is no need to lock the pg_control file under Windows at all.

>> May be choose it in accordance with GUC wal_sync_method?
> 
> Here's a patch like that.  I don't know if it's a good idea for
> wal_sync_method to affect other kinds of files or not, but, then, it
> already does (fsync_writethough changes this behaviour).   

+1. Looks like it needs a little fix:

+++ b/src/common/controldata_utils.c
@@ -316,7 +316,7 @@ update_controlfile(const char *DataDir,
                         if (pg_fsync(fd) != 0)
                                 ereport(PANIC,
                                                 (errcode_for_file_access(),
-                                                errmsg("could not fdatasync file \"%s\": %m",
+                                                errmsg("could not fsync file \"%s\": %m",
                                                                 ControlFilePath)));

And it may be combined with 0001-Lock-pg_control-while-reading-or-writing.patch

> I would
> only want to consider this if we also stop choosing "open_datasync" as
> a default on at least Windows.

I didn't quite understand this point. Could you clarify it for me, please? If the performance
of UpdateMinRecoveryPoint() weren't a problem we could just use fsync on all platforms.
   

Sincerely yours,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: "Anton A. Melnikov"
Date:
Hi, Thomas!

On 14.02.2023 06:38, Anton A. Melnikov wrote:
> I also did several experiments with fsync=on and found more encouraging behavior:
> the stress test with sizeof(ControlFileData) = 512+8 = 520 bytes failed after 4.5 hours,
> but the one with the ordinary sizeof(ControlFileData) = 296 did not crash in more than 12 hours.

Nonetheless it crashed after 18 hours:

2023-02-13 18:07:21.476 MSK [7640] LOG:  starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
2023-02-13 18:07:21.483 MSK [7640] LOG:  listening on IPv6 address "::1", port 5432
2023-02-13 18:07:21.483 MSK [7640] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-02-13 18:07:21.556 MSK [1940] LOG:  database system was shut down at 2023-02-13 18:07:12 MSK
2023-02-13 18:07:21.590 MSK [7640] LOG:  database system is ready to accept connections
@@@@@@@@@@@@@ sizeof(ControlFileData) = 296
2023-02-13 18:12:21.545 MSK [9532] LOG:  checkpoint starting: time
2023-02-13 18:12:21.583 MSK [9532] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0
recycled;write=0.003 s, sync=0.009 s, total=0.038 s; sync files=2, longest=0.005 s, average=0.005 s; distance=0 kB,
estimate=0kB; lsn=0/17AC388, redo lsn=0/17AC350
 
2023-02-14 12:12:21.738 MSK [8676] ERROR:  calculated CRC checksum does not match value stored in file
2023-02-14 12:12:21.738 MSK [8676] CONTEXT:  SQL statement "SELECT pg_control_system()"
    PL/pgSQL function inline_code_block line 1 at PERFORM
2023-02-14 12:12:21.738 MSK [8676] STATEMENT:  do $$ begin loop perform pg_control_system(); end loop; end; $$;


So all of the following is incorrect:

> It seems that in that case the behavior matches MSDN.  So if it is possible
> to use fsync() under Windows even when the GUC fsync is off, it may be a solution
> for this problem. If so, there is no need to lock the pg_control file under Windows at all.

and cannot be a solution.


Sincerely yours,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
> First of all it seemed to me that is not a problem at all since msdn
> guarantees sector-by-sector atomicity.
> "Physical Sector: The unit for which read and write operations to the device
> are completed in a single operation. This is the unit of atomic write..."
> https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering
> https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile
> (Of course, only if the 512 bytes lays from the beginning of the file with a zero
> offset, but this is our case. The current size of ControlFileData is
> 296 bytes at offset = 0.)

There are two kinds of atomicity that we rely on for the control file today:

* atomicity on power loss (= device property, in case of overwrite filesystems)
* atomicity of concurrent reads and writes (= VFS or kernel buffer
pool interlocking policy)

I assume that documentation is talking about the first thing (BTW, I
suspect that documentation is also now wrong in one special case: NTFS
filesystems mounted on DAX devices don't have sectors or sector-based
atomicity unless you turn on BTT and slow them down[1]; that might
eventually be something for PostgreSQL to think about, and it applies
to other OSes too).

With this patch we would stop relying on the second thing.  Supposedly
POSIX requires read/write atomicity, and many file systems offer it in
a strong form (internal range locking) or maybe a weak accidental form
(page-level locking).  Since some extremely popular implementations
just don't care, and Windows isn't even a POSIX, we just have to do it
ourselves.

BTW there are at least two other places where PostgreSQL already knows
that concurrent reads and writes are possibly non-atomic (and we also
don't even try to get the alignment right, making the question moot):
pg_basebackup, which enables full_page_writes implicitly if you didn't
have the GUC on already, and pg_rewind, which refuses to run if you
haven't enabled full_page_writes explicitly (as discussed on another
thread recently; that's an annoying difference, and also an
annoying behaviour if you know your storage doesn't really need it!)

> I tried to verify this fact experimentally and was very surprised.
> Unfortunately it crashed in an hour during torture test:
> 2023-02-13 15:10:52.675 MSK [5704] LOG:  starting PostgreSQL 16devel, compiled by Visual C++ build 1929, 64-bit
> 2023-02-13 15:10:52.768 MSK [5704] LOG:  database system is ready to accept connections
> @@@@@@ sizeof(ControlFileData) = 296
> .........
> 2023-02-13 16:10:41.997 MSK [9380] ERROR:  calculated CRC checksum does not match value stored in file

Thanks.  I'd seen reports of this in discussions on the 'net, but
those had no authoritative references or supporting test results.  The
fact that fsync made it take longer (in your following email) makes
sense and matches Linux.  It inserts a massive pause in the middle of
the experiment loop, affecting the probabilities.

Therefore, we need a solution for Windows too.  I tried to write the
equivalent code, in the attached.  I learned a few things: Windows
locks are mandatory (that is, if you lock a file, reads/writes can
fail, unlike Unix).  Combined with async release, that means we
probably also need to lock the file in xlog.c, when reading it in
xlog.c:ReadControlFile() (see comments).  Before I added that, on a CI
run, I found that the read in there would sometimes fail, and adding
the locking fixed that.  I am a bit confused, though, because I
expected that to be necessary only if someone managed to crash while
holding the write lock, which the CI tests shouldn't do.  Do you have
any ideas?

While contemplating what else a mandatory file lock might break, I
remembered that basebackup.c also reads the control file.  Hrmph.  Not
addressed yet; I guess it might need to acquire/release around
sendFile(sink, XLOG_CONTROL_FILE, ...)?

> > I would
> > only want to consider this if we also stop choosing "open_datasync" as
> > a default on at least Windows.
>
> I didn't quite understand this point. Could you clarify it for me, please? If the performance
> of UpdateMinRecoveryPoint() wasn't a problem we could just use fsync in all platforms.

The level of durability would be weakened on Windows.  On all systems
except Linux and FreeBSD, we default to wal_sync_method=open_datasync,
so then we would start using FILE_FLAG_WRITE_THROUGH for the control
file too.  We know from reading and experimentation that
FILE_FLAG_WRITE_THROUGH doesn't flush the drive cache on consumer
Windows hardware, but its fdatasync-like thing does[2].  I have not
thought too hard about the consequences of using different durability
levels for different categories of file, but messing with write
ordering can't be good for crash recovery, so I'd rather increase WAL
durability than decrease control file durability.  If a Windows user
complains that it makes their fancy non-volatile cache slow down, they
can always adjust the settings in PostgreSQL, their OS, or their
drivers etc.  I think we should just make fdatasync the default on all
systems.

Here's a patch like that.

[1] https://learn.microsoft.com/en-us/windows-server/storage/storage-spaces/persistent-memory-direct-access
[2]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Ba-7r4GpADsasCnuDBiqC1c31DAQQco2FayVtB9V3sQw%40mail.gmail.com#460bfa5a6b571cc98c575d23322e0b25

Attachments

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Fri, Feb 17, 2023 at 4:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> While contemplating what else a mandatory file lock might break, I
> remembered that basebackup.c also reads the control file.  Hrmph.  Not
> addressed yet; I guess it might need to acquire/release around
> sendFile(sink, XLOG_CONTROL_FILE, ...)?

If we go this way, I suppose, in theory at least, someone with
external pg_backup_start()-based tools might also want to hold the
lock while copying pg_control.  Otherwise they might fail to open it
on Windows (where that patch uses a mandatory lock) or copy garbage on
Linux (as they can today, I assume), with non-zero probability -- at
least when copying files from a hot standby.  Or backup tools might
want to get the file contents through some entirely different
mechanism that does the right interlocking (whatever that might be,
maybe inside the server).  Perhaps this is not so much the localised
systems programming curiosity I thought it was, and has implications
that'd need to be part of the documented low-level backup steps.  It
makes me like the idea a bit less.  It'd be good to hear from backup
gurus what they think about that.

One cute hack I thought of to make the file lock effectively advisory
on Windows is to lock a byte range *past the end* of the file (the
documentation says you can do that).  That shouldn't stop programs
that want to read the file without locking and don't know/care about
our scheme (that is, pre-existing backup tools that haven't considered
this problem and remain oblivious or accept the (low) risk of torn
reads), but will block other participants in our scheme.
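
In Win32 terms, that hack would look something like this sketch (untested,
and the offset assumes pg_control's fixed PG_CONTROL_FILE_SIZE of 8192
bytes):

#include <windows.h>

/* Lock one byte just past the end of pg_control, so only programs that
 * opt in to this scheme ever contend; oblivious readers are unaffected. */
static BOOL
lock_controlfile(HANDLE file, BOOL exclusive)
{
    OVERLAPPED  overlapped = {0};

    overlapped.Offset = 8192;   /* first byte past the end of the file */

    return LockFileEx(file,
                      exclusive ? LOCKFILE_EXCLUSIVE_LOCK : 0,
                      0,        /* reserved */
                      1, 0,     /* lock a single byte */
                      &overlapped);
}

static BOOL
unlock_controlfile(HANDLE file)
{
    OVERLAPPED  overlapped = {0};

    overlapped.Offset = 8192;

    return UnlockFileEx(file, 0, 1, 0, &overlapped);
}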

If we went back to the keep-rereading-until-it-stops-changing model,
then an external backup tool would need to be prepared to do that too,
in theory at least.  Maybe some already do something like that?

Or maybe the problem is/was too theoretical before...



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: "Anton A. Melnikov"
Date:
Hi, Thomas!

On 17.02.2023 06:21, Thomas Munro wrote:
> There are two kinds of atomicity that we rely on for the control file today:
> 
> * atomicity on power loss (= device property, in case of overwrite filesystems)
> * atomicity of concurrent reads and writes (= VFS or kernel buffer
> pool interlocking policy)
> 
> I assume that documentation is talking about the first thing 

I think this is true, as the documentation only talks about write operations,
not about reads.

> (BTW, I
> suspect that documentation is also now wrong in one special case: NTFS
> filesystems mounted on DAX devices don't have sectors or sector-based
> atomicity unless you turn on BTT and slow them down[1]; that might
> eventually be something for PostgreSQL to think about, and it applies
> to other OSes too).

A very interesting find! For instance, Intel® Optane™ Persistent Memory modules
already reach 512GB and can potentially be used for cluster data. As a
first step it would be good to understand what Microsoft means by
large memory pages and what size they are talking about, that is, where
the reasonable boundary for using BTT lies; I suppose this will help choose
whether to use BTT or to write our own DAX-aware code.

> With this patch we would stop relying on the second thing.  Supposedly
> POSIX requires read/write atomicity, and many file systems offer it in
> a strong form (internal range locking) or maybe a weak accidental form
> (page-level locking).  Since some extremely popular implementations
> just don't care, and Windows isn't even a POSIX, we just have to do it
> ourselves.

Yes, indeed, unless it's an atomic or transactional filesystem; in
such a case there is almost nothing to do. On the other hand, it seems
such systems do not exist in reality, although the idea has been
discussed many times. I googled some information on this topic, e.g.
[1], [2], [3], but all of the projects were abandoned or, like
Microsoft's own development, deprecated.

> BTW there are at least two other places where PostgreSQL already knows
> that concurrent reads and writes are possibly non-atomic (and we also
> don't even try to get the alignment right, making the question moot):
> pg_basebackup, which enables full_page_writes implicitly if you didn't
> have the GUC on already, and pg_rewind, which refuses to run if you
> haven't enabled full_page_writes explicitly (as recently discussed on
> another thread recently; that's an annoying difference, and also an
> annoying behaviour if you know your storage doesn't really need it!)

That seems a good topic for a separate thread and patch. Would you provide a
link to the thread you mentioned, please?
  
> Therefore, we need a solution for Windows too.  I tried to write the
> equivalent code, in the attached.  I learned a few things: Windows
> locks are mandatory (that is, if you lock a file, reads/writes can
> fail, unlike Unix).  Combined with async release, that means we
> probably also need to lock the file in xlog.c, when reading it in
> xlog.c:ReadControlFile() (see comments).  Before I added that, on a CI
> run, I found that the read in there would sometimes fail, and adding
> the locking fixed that.  I am a bit confused, though, because I
> expected that to be necessary only if someone managed to crash while
> holding the write lock, which the CI tests shouldn't do.  Do you have
> any ideas?

Unfortunately, no ideas so far. Was it a pg_control CRC error or an I/O error?
Maybe logs of such a failure were saved somewhere? I would like to see
them if possible.

> While contemplating what else a mandatory file lock might break, I
> remembered that basebackup.c also reads the control file.  Hrmph.  Not
> addressed yet; I guess it might need to acquire/release around
> sendFile(sink, XLOG_CONTROL_FILE, ...)?

Here, possibly pass a boolean flag into sendFile()? When it is true,
take the lock after OpenTransientFile() and release it before
CloseTransientFile() when under Windows.

There are also two other places where reads or writes of pg_control
occur: WriteControlFile() in xlog.c and read_controlfile() in pg_resetwal.c.
For the second case, locking is definitely not necessary, as
the server is stopped. The same seems true for the first case, as BootStrapXLOG(),
inside which WriteControlFile() is called, must be called
only once at system installation.

Since I've smoothly moved on to code review here, a
suggestion at your discretion: add error messages to get_controlfile()
and update_controlfile() if unlock_controlfile() fails.

> I think we should just make fdatasync the default on all
> systems.

Agreed. And maybe choose UPDATE_CONTROLFILE_FDATASYNC as the default
case in UpdateControlFile(), since fdatasync is now the default on all
systems and pg_control's metadata is of no interest?


On 22.02.2023 03:10, Thomas Munro wrote:
> If we go this way, I suppose, in theory at least, someone with
> external pg_backup_start()-based tools might also want to hold the
> lock while copying pg_control.  Otherwise they might fail to open it
> on Windows (where that patch uses a mandatory lock) 

As for external pg_backup_start()-based tools, if somebody wants to take the
lock while copying pg_control, I suppose that's a normal case. They may have to wait
a bit until we release the lock, as in lock_controlfile(). Moreover
this is a very good thing to want, as it guarantees the integrity of pg_control,
provided one uses F_SETLKW rather than the non-waiting F_SETLK.

Backing up locked files is commonplace on Windows, and all standard
backup tools can do it via VSS (volume shadow copy), including the
built-in Windows backup tool. Just in case, I tested it experimentally.
During the torture test I first tried to copy pg_control directly and predictably got:
Error: Cannot read C:\pgbins\master\data\global\pg_control!
"The process cannot access the file because another process has locked a portion of the file."
But a copy using the built-in Windows backup utility succeeded via VSS.

> One cute hack I thought of to make the file lock effectively advisory
> on Windows is to lock a byte range *past the end* of the file (the
> documentation says you can do that).  That shouldn't stop programs
> that want to read the file without locking and don't know/care about
> our scheme (that is, pre-existing backup tools that haven't considered
> this problem and remain oblivious or accept the (low) risk of torn
> reads), but will block other participants in our scheme.

A very interesting idea. It makes sense when someone is using external
backup tools that cannot work with VSS. But using such tools
under Windows seems highly doubtful to me, as they would not be able to back up
many other applications or the Windows system itself.
Let me join your suggestion that it'd be good to hear from backup
gurus what they think about that.

> or copy garbage on
> Linux (as they can today, I assume), with non-zero probability -- at
> least when copying files from a hot standby.
> Or backup tools might
> want to get the file contents through some entirely different
> mechanism that does the right interlocking (whatever that might be,
> maybe inside the server).  Perhaps this is not so much the localised
> systems programming curiosity I thought it was, and has implications
> that'd need to be part of the documented low-level backup steps.  It
> makes me like the idea a bit less.  It'd be good to hear from backup
> gurus what they think about that.
> If we went back to the keep-rereading-until-it-stops-changing model,
> then an external backup tool would need to be prepared to do that too,
> in theory at least.  Maybe some already do something like that?
> 
> Or maybe the problem is/was too theoretical before...

As far as I understand, this problem has always existed, but the probability of
hitting it is extremely small in practice, which is directly pointed out in
the docs [4]:
"So while it is theoretically a weak spot, pg_control does not seem
to be a problem in practice."

> Here's a patch like that.

In this patch, the problem is solved for the live database and
maybe remains for some possible cases of an external backup. On the whole,
I think it can be stated that this is a sensible step forward.

Just like last time, I ran a long stress test under Windows with the current patch.
There were no errors for more than 3 days, even with fsync=off.

[1] https://lwn.net/Articles/789600/
[2] https://github.com/ut-osa/txfs
[3] https://en.wikipedia.org/wiki/Transactional_NTFS
[4] https://www.postgresql.org/docs/devel/wal-internals.html


With the best wishes!

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company







Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: Thomas Munro
Date:
On Fri, Feb 24, 2023 at 11:12 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
> On 17.02.2023 06:21, Thomas Munro wrote:
> > BTW there are at least two other places where PostgreSQL already knows
> > that concurrent reads and writes are possibly non-atomic (and we also
> > don't even try to get the alignment right, making the question moot):
> > pg_basebackup, which enables full_page_writes implicitly if you didn't
> > have the GUC on already, and pg_rewind, which refuses to run if you
> > haven't enabled full_page_writes explicitly (as recently discussed on
> > another thread recently; that's an annoying difference, and also an
> > annoying behaviour if you know your storage doesn't really need it!)
>
> It seems a good topic for a separate thread patch. Would you provide a
> link to the thread you mentioned please?

https://www.postgresql.org/message-id/flat/367d01a7-90bb-9b70-4cda-248e81cc475c%40cosium.com

> > Therefore, we need a solution for Windows too.  I tried to write the
> > equivalent code, in the attached.  I learned a few things: Windows
> > locks are mandatory (that is, if you lock a file, reads/writes can
> > fail, unlike Unix).  Combined with async release, that means we
> > probably also need to lock the file in xlog.c, when reading it in
> > xlog.c:ReadControlFile() (see comments).  Before I added that, on a CI
> > run, I found that the read in there would sometimes fail, and adding
> > the locking fixed that.  I am a bit confused, though, because I
> > expected that to be necessary only if someone managed to crash while
> > holding the write lock, which the CI tests shouldn't do.  Do you have
> > any ideas?
>
> Unfortunately, no ideas so far. Was it a pg_control CRC or I/O errors?
> Maybe logs of such a fail were saved somewhere? I would like to see
> them if possible.

I think it was this one:

https://cirrus-ci.com/task/5004082033721344

For example, see subscription/011_generated which failed like this:

2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC:  could not
read file "global/pg_control": Permission denied

That was fixed after I changed it to also do locking in xlog.c
ReadControlFile(), in the version you tested.  There must be something
I don't understand going on, because that cluster wasn't even running
before: it had just been created by initdb.

But, anyway, I have a new idea that makes that whole problem go away
(though I'd still like to understand what happened there):

With the "pseudo-advisory lock for Windows" idea from the last email,
we don't need to bother with file system level locking in many places.
Just in controldata_utils.c, for FE/BE interlocking (that is, we don't
need to use that for interlocking of concurrent reads and writes that
are entirely in the backend, because we already have an LWLock that we
could use more consistently).  Changes:

1. xlog.c mostly uses no locking
2.  basebackup.c now acquires ControlFileLock (see the sketch below this list)
3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking
4. lock past the end (pseudo-advisory locking for Windows)
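
Change 2 is the simplest of these; roughly (a sketch only, eliding the real
sendFile() arguments):

    /*
     * basebackup.c: hold ControlFileLock in shared mode while pg_control is
     * read and appended to the backup, so the copy that lands in the base
     * backup can't be torn by a concurrent update_controlfile().
     */
    LWLockAcquire(ControlFileLock, LW_SHARED);
    /* ... sendFile(sink, XLOG_CONTROL_FILE, ...) ... */
    LWLockRelease(ControlFileLock);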

Note that when we recover from a basebackup or pg_backup_start()
backup, we use the backup label to find a redo start location in the
WAL (see read_backup_label()), BUT we still read the copied pg_control
file (one that might be too "new", so we don't use its redo pointer).
So it had better not be torn, or the recovery will fail.  So, in this
version I protected that sendFile() with ControlFileLock.  But...

Isn't that a bit strange?  To go down this path we would also need to
document the need to copy the control file with the file locked to
avoid a rare failure, in the pg_backup_start() documentation.  That's
annoying (I don't even know how to do it with easy shell commands;
maybe we should provide a program that locks and cats the file...?).
Could we make better use of the safe copy that we have in the log?
Then the pg_backup_start() subproblem would disappear.  Conceptually,
that'd be just like the way we use FPI for data pages copied during a
backup.  I'm not sure about any of that, though, it's just an idea,
not tested.
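
(To show what I mean by a program that locks and cats the file, here's a toy
sketch; entirely hypothetical, no such tool exists today:)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hold a shared advisory lock on pg_control while streaming it to stdout,
 * so the copy taken during a low-level backup can't be torn. */
int
main(int argc, char **argv)
{
    int         fd;
    struct flock lck = {0};
    char        buf[8192];
    ssize_t     n;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s path/to/global/pg_control\n", argv[0]);
        return EXIT_FAILURE;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0)
    {
        perror("open");
        return EXIT_FAILURE;
    }

    lck.l_type = F_RDLCK;       /* shared lock over the whole file */
    lck.l_whence = SEEK_SET;
    if (fcntl(fd, F_SETLKW, &lck) != 0)
    {
        perror("fcntl");
        return EXIT_FAILURE;
    }

    while ((n = read(fd, buf, sizeof(buf))) > 0)
    {
        if (write(STDOUT_FILENO, buf, n) != n)
        {
            perror("write");
            return EXIT_FAILURE;
        }
    }

    close(fd);                  /* closing the descriptor releases the lock */
    return EXIT_SUCCESS;
}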

> > Or maybe the problem is/was too theoretical before...
>
> As far as i understand, this problem has always been, but the probability of
> this is extremely small in practice, which is directly pointed in
> the docs [4]:
> "So while it is theoretically a weak spot, pg_control does not seem
> to be a problem in practice."

I guess that was talking about power loss atomicity again?  Knowledge
of the read/write atomicity problem seems to be less evenly
distributed (and I think it became more likely in Linux > 3.something;
and the Windows situation possibly hadn't been examined by anyone
before).

> > Here's a patch like that.
>
> In this patch, the problem is solved for the live database and
> maybe remains for some possible cases of an external backup. In a whole,
> i think it can be stated that this is a sensible step forward.
>
> Just like last time, i ran a long stress test under windows with current patch.
> There were no errors for more than 3 days even with fsync=off.

Thanks!

Attachments

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

From: "Anton A. Melnikov"
Date:
Hi, Thomas!

On 04.03.2023 00:39, Thomas Munro wrote:
>> It seems a good topic for a separate thread patch. Would you provide a
>> link to the thread you mentioned please?
> 
> https://www.postgresql.org/message-id/flat/367d01a7-90bb-9b70-4cda-248e81cc475c%40cosium.com

Thanks! The important words there:
>> But I fail to see how full_page_writes prevents this since it only act on writes

> It ensures the damage is later repaired during WAL replay. Which can only
> happen if the WAL contains the necessary information to do so - the full page
> writes.

Together with the docs about restoring a corrupted pg_control in the
pg_resetwal utility description, these words made me wonder whether
to save the contents of pg_control into the WAL at the beginning and end of
each checkpoint and teach pg_resetwal to read them. It would be like
a periodic backup of pg_control to a safe place.
This thought has nothing to do with this patch and this thread, and in general
I'm not sure whether it has any practical value or whether, on the contrary, it
may lead to some difficulties. If it seems to make sense, we
can think about it further.

> For example, see subscription/011_generated which failed like this:
> 
> 2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC:  could not
> read file "global/pg_control": Permission denied
> 
> That was fixed after I changed it to also do locking in xlog.c
> ReadControlFile(), in the version you tested.  There must be something
> I don't understand going on, because that cluster wasn't even running
> before: it had just been created by initdb.
> 
> But, anyway, I have a new idea that makes that whole problem go away
> (though I'd still like to understand what happened there):

It seems to be a race between the first read of pg_control in PostmasterMain(),
in LocalProcessControlFile(false), and the second one in SubPostmasterMain() here:
    /*
     * (re-)read control file, as it contains config. The postmaster will
     * already have read this, but this process doesn't know about that.
     */
    LocalProcessControlFile(false);

which crashed, according to the crash log crashlog-postgres.exe_19a0_2023-02-16_06-57-26-675.txt

> With the "pseudo-advisory lock for Windows" idea from the last email,
> we don't need to bother with file system level locking in many places.
> Just in controldata_utils.c, for FE/BE interlocking (that is, we don't
> need to use that for interlocking of concurrent reads and writes that
> are entirely in the backend, because we already have an LWLock that we
> could use more consistently).  Changes:
> 
> 1. xlog.c mostly uses no locking
> 2. basebackup.c now acquires ControlFileLock
> 3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking
> 4. lock past the end (pseudo-advisory locking for Windows)

Although the change in 1. reintroduces the problem described above,
4. fixes it. And I did not find any other places where ReadControlFile()
can be called from different processes. That's all ok.
Thanks to 4., it is no longer necessary to use VSS to copy the pg_control file;
it can be copied in the ordinary way even during the torture test. This is very good.
I really like the idea of using an LWLock where possible.
In general, I think these changes make the patch more lightweight and fast.

Also, I ran tests for more than a day in stress mode with fsync=off under Windows
and Linux and found no problems. The patch tester also passes without errors.

I would like to move this patch to RFC, since I don't see any problems
in either the code or the tests, but the pg_backup_start() subproblem confuses me.
Maybe move it to a separate patch in a distinct thread?
There are a number of suggestions and questions to discuss, such as:


> Note that when we recover from a basebackup or pg_backup_start()
> backup, we use the backup label to find a redo start location in the
> WAL (see read_backup_label()), BUT we still read the copied pg_control
> file (one that might be too "new", so we don't use its redo pointer).
> So it had better not be torn, or the recovery will fail.  So, in this
> version I protected that sendFile() with ControlFileLock.  But...
> 
> Isn't that a bit strange?  To go down this path we would also need to
> document the need to copy the control file with the file locked to
> avoid a rare failure, in the pg_backup_start() documentation.  That's
> annoying (I don't even know how to do it with easy shell commands;
> maybe we should provide a program that locks and cats the file...?).

The variant with a separate utility looks good, with a recommendation
in the docs to use it for copying pg_control.

It also seems possible to expose a function in SQL, such as
export_pg_control('dst_path') with the destination path as an argument,
and call it before pg_backup_stop().
Or pass the pg_control destination path to pg_backup_stop() as an extra argument.
Or save pg_control to a predetermined location during pg_backup_stop() and specify
in the docs that one needs to copy it from there. I suppose that we have the right
to ask the user to perform some manual manipulation here, as with backup_label.

> Could we make better use of the safe copy that we have in the log?
> Then the pg_backup_start() subproblem would disappear.  Conceptually,
> that'd be just like the way we use FPI for data pages copied during a
> backup.  I'm not sure about any of that, though, it's just an idea,
> not tested.

Sorry, I didn't understand the question about the log. Would you please
explain which log you mean, and where I can see the creation of this
safe copy in the code?
   
With the best wishes,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Thomas Munro
Дата:
On Wed, Mar 8, 2023 at 4:43 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote:
> On 04.03.2023 00:39, Thomas Munro wrote:
> > Could we make better use of the safe copy that we have in the log?
> > Then the pg_backup_start() subproblem would disappear.  Conceptually,
> > that'd be just like the way we use FPI for data pages copied during a
> > backup.  I'm not sure about any of that, though, it's just an idea,
> > not tested.
>
> Sorry, I didn't understand the question about the log. Would you please
> explain which log you mean, and where I can see the creation of this
> safe copy in the code?

Sorry, I was confused; please ignore that part.  We don't have a copy
of the control file anywhere else.  (Perhaps we should, but that could
be a separate topic.)



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
"Anton A. Melnikov"
Дата:
On 08.03.2023 07:28, Thomas Munro wrote:
> Sorry, I was confused; please ignore that part.  We don't have a copy
> of the control file anywhere else.  (Perhaps we should, but that could
> be a separate topic.)

That's all right! I fully agree that this could be a separate topic.

Sincerely yours,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Daniel Gustafsson
Дата:
This patch no longer applies and needs a rebase.

Given where we are in the commitfest, do you think this patch has the potential
to go in or should it be moved?

--
Daniel Gustafsson




Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Thomas Munro
Дата:
This is a frustrating thread, because despite the last patch solving
most of the problems we discussed, it doesn't address the
low-level-backup procedure in a nice way.  We'd have to tell users
they have to flock that file, or add a new step "pg_controldata --raw
> pg_control", which seems weird when they already have a connection
to the server.

Maybe it just doesn't matter if eg the pg_controldata program can
spuriously fail if pointed at a running system, and I was being too
dogmatic trying to fix even that.  Maybe we should just focus on
fixing backups.  Even there, I am beginning to suspect we are solving
this problem in the wrong place when a higher level change could
simplify the problem away.

Idea for future research:  Perhaps pg_backup_stop()'s label-file
output should include the control file image (suitably encoded)?  Then
the recovery-from-label code could completely ignore the existing
control file, and overwrite it using that copy.  It's already
partially ignoring it, by using the label file's checkpoint LSN
instead of the control file's.  Perhaps the captured copy could
include the correct LSN already, simplifying that code, and the low
level backup procedure would not need any additional steps or caveats.
No more atomicity problem for low-level-backups... but probably not
something we would back-patch, for such a rare failure mode.
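
Purely as an illustration of that untested idea (the "CONTROL FILE" field name and
the helper are invented for this sketch): pg_backup_stop() could append a hex-encoded
image of pg_control to the label-file text, and recovery-from-label would reconstruct
pg_control from it instead of trusting whatever file the backup tool copied.

/* requires lib/stringinfo.h and utils/builtins.h (hex_encode) */
static void
append_control_image(StringInfo labelfile, const ControlFileData *control)
{
    char        hexbuf[sizeof(ControlFileData) * 2 + 1];
    uint64      len;

    len = hex_encode((const char *) control, sizeof(ControlFileData), hexbuf);
    hexbuf[len] = '\0';
    appendStringInfo(labelfile, "CONTROL FILE: %s\n", hexbuf);
}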

Here's a new minimal patch that solves only the bugs in basebackup +
the simple SQL-facing functions that read the control file, by simply
acquiring ControlFileLock in the obvious places.  This should be
simple enough for back-patching?
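
For concreteness, the shape of the backend-side change is roughly this (a sketch, not
the actual diff; the helper name is invented, and pg_control_system() and friends would
call it instead of calling get_controlfile() directly):

/* e.g. in src/backend/utils/misc/pg_controldata.c */
static ControlFileData *
read_controlfile_locked(void)
{
    ControlFileData *control;
    bool        crc_ok;

    /* don't overlap a concurrent UpdateControlFile() */
    LWLockAcquire(ControlFileLock, LW_SHARED);
    control = get_controlfile(DataDir, &crc_ok);
    LWLockRelease(ControlFileLock);

    if (!crc_ok)
        ereport(ERROR,
                (errmsg("calculated CRC checksum does not match value stored in file")));

    return control;
}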

Perhaps we could use the control file image from server memory, but
that requires us to be certain that its CRC is always up to date.
That seems to be true, but I didn't want to require it for this, and
it doesn't seem important for non-performance-critical code.

Thoughts?

As for the other topics that came up in this thread, I kicked the
wal_sync_method thing out to its own thread[1].  (There was a logical
chain connecting these topics: "can I add file lock system calls
here?" -> "well if someone is going to complain that it's performance
critical then why are we using unnecessarily slow pg_fsync()?" ->
"well if we change that to pg_fdatasync() we have to address known
weakness/kludge on macOS first".  I don't like the flock stuff
anymore, but I do want to fix the known macOS problem independently.
Hereby disentangled.)

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BF0EL4Up6yVYbbcWse4xKaqW4wc2xpw67Pq9FjmByWVg%40mail.gmail.com

Вложения

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Stephen Frost
Дата:
Greetings,

(Adding David Steele into the CC on this one...)

* Thomas Munro (thomas.munro@gmail.com) wrote:
> This is a frustrating thread, because despite the last patch solving
> most of the problems we discussed, it doesn't address the
> low-level-backup procedure in a nice way.  We'd have to tell users
> they have to flock that file, or add a new step "pg_controldata --raw
> > pg_control", which seems weird when they already have a connection
> to the server.

flock'ing the file strikes me as dangerous to ask external tools to do
due to the chances of breaking the running system if they don't do it
right.  I'm generally open to the idea of having the backup tool have to
do more complicated work to be correct but that just seems likely to
cause issues.  Also- haven't looked yet, but I'm not sure that's even
possible if your backup tool is running as a user who only has read
access to the data directory?  I don't want us to give up on that
feature.

> Maybe it just doesn't matter if eg the pg_controldata program can
> spuriously fail if pointed at a running system, and I was being too
> dogmatic trying to fix even that.  Maybe we should just focus on
> fixing backups.  Even there, I am beginning to suspect we are solving
> this problem in the wrong place when a higher level change could
> simplify the problem away.

For a running system.. perhaps pg_controldata should be connecting to
the database and calling functions there?  Or just complain that the
system is online and tell the user to do that?

> Idea for future research:  Perhaps pg_backup_stop()'s label-file
> output should include the control file image (suitably encoded)?  Then
> the recovery-from-label code could completely ignore the existing
> control file, and overwrite it using that copy.  It's already
> partially ignoring it, by using the label file's checkpoint LSN
> instead of the control file's.  Perhaps the captured copy could
> include the correct LSN already, simplifying that code, and the low
> level backup procedure would not need any additional steps or caveats.
> No more atomicity problem for low-level-backups... but probably not
> something we would back-patch, for such a rare failure mode.

I like this general direction and wonder if we could possibly even push
a bit harder on it: have the backup_label include the control file's
contents in some form that is understood and then tell tools to *not*
copy the running pg_control file ... and maybe even complain if a
pg_control file exists when we detect that backup_label has the control
file's image.  We've certainly had problems in the past where people
would just nuke the backup_label file, even though they were restoring
from a backup, because they couldn't start the system up since their
restore command wasn't set up properly or their WAL archive was missing.

Being able to get rid of the control file being in the backup at all
would make it harder for someone to get to a corrupted-but-running
system and that seems like it's a good thing.

> Here's a new minimal patch that solves only the bugs in basebackup +
> the simple SQL-facing functions that read the control file, by simply
> acquiring ControlFileLock in the obvious places.  This should be
> simple enough for back-patching?

I don't particularly like fixing this in a way that only works for
pg_basebackup and means that the users of other backup tools don't have
a solution to this problem.  What are we supposed to tell users of
pgBackRest when they see this fix for pg_basebackup in the next set of
minor releases and they ask us if we've addressed this risk?

We might be able to accept the 're-read on CRC failure' approach, if it
were being used for pg_controldata and we documented that external
tools and especially backup tools using the low-level API are required
to check the CRC and to re-read on failure if accessing a running
system.

While it's not ideal, maybe we could get away with changing the contents
of the backup_label as part of a back-patch?  The documentation, at
least on a quick look, says to copy the second field from pg_backup_stop
into a backup_label file but doesn't say anything about what those
contents are or if they can change.  That would at least address the
concern of backups ending up with a corrupt pg_control file and not
being able to be restored, even if tools aren't updated to verify the
CRC or similar.  Of course, that's a fair bit of code churn for a
backpatch, which I certainly understand as being risky.  If we can't
back-patch that, it might still be the way to go moving forward, while
also telling tools to check the CRC.  (I'm not going to try to figure
out some back-patchable pretend solution for this for shell scripts that
pretend to be able to backup running PG databases; this issue is
probably the least of their issues anyway...)

A couple of interesting notes on this though- pgBackRest doesn't only
read the pg_control file at backup time, we also check it at
archive_command time, to make sure that the system ID and version that
are in the control file match up with the information in the WAL file
we're getting ready to archive and that those match with the system ID
and version of the repo/stanza into which we are pushing the WAL file.
We do read the control file on the replica but that isn't the one we
actually push into the repo as part of a backup- that's always the one
we read from the primary (we don't currently support 'backup just from
the replica').
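
The archive-time check is conceptually just a comparison of the identifiers in the
WAL segment's long page header against pg_control; something like the following (an
illustration of the idea, not pgBackRest's actual code):

/* requires catalog/pg_control.h and access/xlog_internal.h */
static bool
wal_matches_cluster(const ControlFileData *control, const char *first_page)
{
    const XLogLongPageHeaderData *longhdr =
        (const XLogLongPageHeaderData *) first_page;

    /* the first page of each segment carries the long header */
    return longhdr->xlp_sysid == control->system_identifier &&
           longhdr->xlp_seg_size == control->xlog_seg_size &&
           longhdr->xlp_xlog_blcksz == control->xlog_blcksz;
}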

Coming out of our discussion regarding this, we're likely to move
forward on the check-CRC-and-re-read approach for the next pgBackRest
release.  If PG provides a better solution for us to use, great, but
given that this has been shown to happen, we're not intending to wait
around for PG to provide us with a better fix.

Thanks,

Stephen

Вложения

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Robert Haas
Дата:
On Fri, Jul 21, 2023 at 8:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Idea for future research:  Perhaps pg_backup_stop()'s label-file
> output should include the control file image (suitably encoded)?  Then
> the recovery-from-label code could completely ignore the existing
> control file, and overwrite it using that copy.  It's already
> partially ignoring it, by using the label file's checkpoint LSN
> instead of the control file's.  Perhaps the captured copy could
> include the correct LSN already, simplifying that code, and the low
> level backup procedure would not need any additional steps or caveats.
> No more atomicity problem for low-level-backups... but probably not
> something we would back-patch, for such a rare failure mode.

I don't really know what the solution is, but this is a general
problem with the low-level backup API, and I think it sucks pretty
hard. Here, we're talking about the control file, but the same problem
exists with the data files. We try to work around that but it's all
hacks. Unless your backup tool has special magic powers of some kind,
you can't take a backup using either pg_basebackup or the low-level
API and then check that individual blocks have valid checksums, or
that they have sensible, interpretable contents, because they might
not. (Yeah, I know we have code to verify checksums during a base
backup, but as discussed elsewhere, it doesn't work.) It's also why we
have to force full-page write on during a backup. But the whole thing
is nasty because you can't really verify anything about the backup you
just took. It may be full of gibberish blocks but don't worry because,
if all goes well, recovery will fix it. But you won't really know
whether recovery actually does fix it. You just kind of have to cross
your fingers and hope.

It's unclear to me how we could do better, especially when using the
low-level API. BASE_BACKUP could read via shared_buffers instead of
the FS, and I think that might be a good idea if we can defend
adequately against cache poisoning, but with the low-level API someone
may just be calling a FS-level snapshot primitive. Unless we're
prepared to pause all writes while that happens, I don't know how to
do better.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Thomas Munro
Дата:
On Tue, Jul 25, 2023 at 6:04 AM Stephen Frost <sfrost@snowman.net> wrote:
> * Thomas Munro (thomas.munro@gmail.com) wrote:
> > Here's a new minimal patch that solves only the bugs in basebackup +
> > the simple SQL-facing functions that read the control file, by simply
> > acquiring ControlFileLock in the obvious places.  This should be
> > simple enough for back-patching?
>
> I don't particularly like fixing this in a way that only works for
> pg_basebackup and means that the users of other backup tools don't have
> a solution to this problem.  What are we supposed to tell users of
> pgBackRest when they see this fix for pg_basebackup in the next set of
> minor releases and they ask us if we've addressed this risk?
>
> We might be able to accept the 're-read on CRC failure' approach, if it
> were being used for pg_controldata and we documented that external
> tools and especially backup tools using the low-level API are required
> to check the CRC and to re-read on failure if accessing a running
> system.

Hi Stephen, David, and thanks for looking.  Alright, let's try that idea out.

0001 + 0002.  Acquire the lock in the obvious places in the backend,
to completely exclude the chance of anything going wrong for the easy
cases, including pg_basebackup.  (This is the v4 from yesterday).
And...

0003.  Document this problem and how to detect it, in the
low-level-backup section.  Better words welcome.  And...

0004.  Retries for front-end programs, using the idea suggested by Tom
(to wit: if it fails, retry until it fails with the same CRC value
twice).  It's theoretically imperfect, but it's highly unlikely to
fail in practice and the best we can do without file locks or a
connection to the server AFAICT.  (This is the first patch I posted,
adjusted to give up after 10 (?) loops, and not to bother at all in
backend code since that takes ControlFileLock with the 0001 patch).
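
In condensed form, the retry rule in 0004 amounts to something like this (a sketch
only; the patch's loop is in src/common/controldata_utils.c and differs in detail,
and read_pg_control() is a stand-in for reading the raw bytes of global/pg_control):

/* requires catalog/pg_control.h and port/pg_crc32c.h */
static bool
read_control_until_stable(const char *DataDir, ControlFileData *control)
{
    pg_crc32c   last_crc = 0;
    bool        have_last = false;

    for (int attempt = 0; attempt < 10; attempt++)
    {
        pg_crc32c   crc;

        if (!read_pg_control(DataDir, control))
            return false;

        /* recompute the CRC over everything that precedes the stored CRC */
        INIT_CRC32C(crc);
        COMP_CRC32C(crc, (char *) control, offsetof(ControlFileData, crc));
        FIN_CRC32C(crc);

        if (EQ_CRC32C(crc, control->crc))
            return true;            /* consistent copy: done */

        /*
         * Mismatch.  It might just be a torn read of a file that is being
         * rewritten concurrently, so try again -- but if two consecutive
         * reads produce the same (bad) CRC, the file really is corrupt.
         */
        if (have_last && EQ_CRC32C(crc, last_crc))
            return false;
        last_crc = crc;
        have_last = true;
        pg_usleep(10 * 1000);       /* 10 ms before re-reading */
    }
    return false;
}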

> While it's not ideal, maybe we could get away with changing the contents
> of the backup_label as part of a back-patch?  The documentation, at
> least on a quick look, says to copy the second field from pg_backup_stop
> into a backup_label file but doesn't say anything about what those
> contents are or if they can change.  That would at least address the
> concern of backups ending up with a corrupt pg_control file and not
> being able to be restored, even if tools aren't updated to verify the
> CRC or similar.  Of course, that's a fair bit of code churn for a
> backpatch, which I certainly understand as being risky.  If we can't
> back-patch that, it might still be the way to go moving forward, while
> also telling tools to check the CRC.  (I'm not going to try to figure
> out some back-patchable pretend solution for this for shell scripts that
> pretend to be able to backup running PG databases; this issue is
> probably the least of their issues anyway...)

I think you're probably right that anyone following those instructions
would be OK if we just back-patched such a thing, but it all seems a
little too much to me.  +1 to looking into that for v17 (and I guess
maybe someone could eventually argue for back-patching much later with
experience).  I'm sure other solutions are possible too... other
places to put a safe atomic copy of the control file could include: in
the WAL, or in extra files (pg_control.XXX) in the data directory +
some infallible way for recovery to choose which one to start up from.
Or something.

Вложения

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Thomas Munro
Дата:
On Tue, Jul 25, 2023 at 8:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> (Yeah, I know we have code to verify checksums during a base
> backup, but as discussed elsewhere, it doesn't work.)

BTW the code you are referring to there seems to think 4KB
page-halves are atomic; not sure if that's imagining page-level
locking in ancient Linux (?), or imagining default setvbuf() buffer
size observed with some specific implementation of fread(), or
confusing power-failure-sector-based atomicity with concurrent access
atomicity, or something else, but for the record what we actually see
in this scenario on ext4 is the old/new page contents mashed together
on much smaller boundaries (maybe cache lines), caused by duelling
concurrent memcpy() to/from, independent of any buffer/page-level
implementation details we might have been thinking of with that code.
Makes me wonder if it's even technically sound to examine the LSN.

> It's also why we
> have to force full-page write on during a backup. But the whole thing
> is nasty because you can't really verify anything about the backup you
> just took. It may be full of gibberish blocks but don't worry because,
> if all goes well, recovery will fix it. But you won't really know
> whether recovery actually does fix it. You just kind of have to cross
> your fingers and hope.

Well, not without also scanning the WAL for FPIs, anyway...  And
conceptually, that's why I think we probably want an 'FPI' of the
control file somewhere.



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Thomas Munro
Дата:
While chatting to Robert and Andres about all this, a new idea came
up.  Or, rather, one of the first ideas that was initially rejected,
now resurrected to try out a suggestion of Andres’s on how to
de-pessimise it.  Unfortunately, it also suffers from Windows-specific
problems that I originally mentioned at the top of this thread but
had since repressed.  Arrrghgh.

First, the good news:

We could write out a whole new control file, and durable_rename() it
into place.  We don’t want to do that in general, because we don’t
want to slow down UpdateMinRecoveryPoint().  The new concept is to do
that only if a backup is in progress.  That requires a bit of
interlocking with backup start/stop (ie when runningBackups is
changing in shmem, we don’t want to overlap with UpdateControlFile()'s
decision on how to do it).  Here is a patch to try that out.  No more
weasel wording needed for the docs; basebackup and low-level file
system backup should always see an atomic control file (and
occasionally also copy a harmless pg_control.tmp file).  Then we only
need the gross retry-until-stable hack for front-end programs.
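
In compressed form, the decision looks roughly like this (an illustrative fragment,
not the patch itself; "backup_is_running" stands in for the interlocked check of
runningBackups in shared memory, and write_control_image() for "the same CRC-stamped,
PG_CONTROL_FILE_SIZE-padded image update_controlfile() writes today, but to an
arbitrary path"):

    if (!backup_is_running)
    {
        /* normal, fast path: overwrite global/pg_control in place */
        update_controlfile(DataDir, ControlFile, true);
    }
    else
    {
        /* backup path: write a whole new file and rename it into place */
        char        path[MAXPGPATH];
        char        tmppath[MAXPGPATH];

        snprintf(path, sizeof(path), "%s/%s", DataDir, XLOG_CONTROL_FILE);
        snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);
        write_control_image(tmppath, ControlFile);
        durable_rename(tmppath, path, PANIC);   /* fsyncs file and directory */
    }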

And the bad news:

In my catalogue-of-Windows-weirdness project[1], I learned in v3-0003 that:

+ fd = open(path, O_CREAT | O_EXCL | O_RDWR | PG_BINARY, 0777);
+ PG_EXPECT_SYS(fd >= 0, "touch name1.txt");
+ PG_REQUIRE_SYS(fd < 0 || close(fd) == 0);
+
+ fd = open(path2, O_RDWR | PG_BINARY, 0777);
+ PG_EXPECT_SYS(fd >= 0, "open name2.txt");
+ make_path(path2, "name2.txt");
+#ifdef WIN32
+
+ /*
+ * Windows can't rename over an open non-unlinked file, even with
+ * have_posix_unlink_semantics.
+ */
+ pgwin32_dirmod_loops = 2; /* minimize looping to fail fast in testing */
+ PG_EXPECT_SYS(rename(path, path2) == -1,
+   "Windows: can't rename name1.txt -> name2.txt while name2.txt is open");
+ PG_EXPECT_EQ(errno, EACCES);
+ PG_EXPECT_SYS(unlink(path) == 0, "unlink name1.txt");
+#else
+ PG_EXPECT_SYS(rename(path, path2) == 0,
+   "POSIX: can rename name1.txt -> name2.txt while name2.txt is open");
+#endif
+ PG_EXPECT_SYS(close(fd) == 0);

Luckily the code in dirmod.c:pgrename() should retry lots of times if
a concurrent transient opener/reader comes along, so I think that
should be OK in practice (but if backups_r_us.exe holds the file open
for 10 seconds while we're trying to rename it, I assume we'll PANIC);
call that problem #1.  What is slightly more disturbing is the clue in
the "Cygwin cleanup" thread[2] that rename() can fail to be 100%
atomic, so that a concurrent call to open() can fail with ENOENT (cf.
the POSIX requirement "... a link named new shall remain visible to
other processes throughout the renaming operation and refer either to
the file referred to by new or old ...").  Call that problem #2, a
problem that already causes us rare breakage (for example: could not
open file "pg_logical/snapshots/0-14FE6B0.snap").

I know that problem #1 can be fixed by applying v3-0004 from [1] but
that leads to impossible decisions like revoking support for non-NTFS
filesystems as discussed in that thread, and we certainly couldn't
back-patch that anyway.  I assume problem #2 can too.

That makes me want to *also* acquire ControlFileLock, for base
backup's read of pg_control.  Even though it seems redundant with the
rename() trick (the rename() trick should be enough for low-level
*and* basebackup on ext4), it would at least avoid the above
Windowsian pathologies during base backups.

I'm sorry for the patch/idea-churn in this thread.  It's like
Whac-a-Mole.  Blasted non-POSIX-compliant moles.  New patches
attached.  Are they getting better?

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BajSQ_8eu2AogTncOnZ5me2D-Cn66iN_-wZnRjLN%2Bicg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Be13wK0PBX5Z63CCwWm7MfRQuwBRabM_3aKWSko2AUww%40mail.gmail.com

Вложения

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
David Steele
Дата:
Hi Thomas,

On 7/26/23 06:06, Thomas Munro wrote:
> While chatting to Robert and Andres about all this, a new idea came
> up.  Or, rather, one of the first ideas that was initially rejected,
> now resurrected to try out a suggestion of Andres’s on how to
> de-pessimise it.  Unfortunately, it also suffers from Windows-specific
> problems that I originally mentioned at the top of this thread but
> had since repressed.  Arrrghgh.
> 
> First, the good news:
> 
> We could write out a whole new control file, and durable_rename() it
> into place.  We don’t want to do that in general, because we don’t
> want to slow down UpdateMinRecoveryPoint().  The new concept is to do
> that only if a backup is in progress.  That requires a bit of
> interlocking with backup start/stop (ie when runningBackups is
> changing in shmem, we don’t want to overlap with UpdateControlFile()'s
> decision on how to do it).  Here is a patch to try that out.  No more
> weasel wording needed for the docs; basebackup and low-level file
> system backup should always see an atomic control file (and
> occasionally also copy a harmless pg_control.tmp file).  Then we only
> need the gross retry-until-stable hack for front-end programs.

I like the approach in these patches better than the last patch set. My 
only concern would be possible performance regression on standbys (when 
doing backup from standby) since pg_control can be written very 
frequently to update min recovery point.

I've made a first pass through the patches and they look generally 
reasonable (and back patch-able).

One thing:

+ sendFileWithContent(sink, XLOG_CONTROL_FILE,
+                     (char *) control_file, sizeof(*control_file),
+                     &manifest);

I wonder if we should pad pg_control out to 8k so it remains the same 
size as now? Postgres doesn't care, but might look odd to users, and is 
arguably a change in behavior that should not be back patched.
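
The buffer handling for that would be trivial, something along these lines (a fragment
only, reusing the length-taking sendFileWithContent() from the patch quoted above):

    char        padded[PG_CONTROL_FILE_SIZE];   /* 8192, the usual on-disk size */

    memset(padded, 0, sizeof(padded));
    memcpy(padded, control_file, sizeof(*control_file));
    sendFileWithContent(sink, XLOG_CONTROL_FILE, padded, sizeof(padded),
                        &manifest);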

> And the bad news:

Provided we can reasonably address the Windows issues this seems to be 
the way to go.

Regards,
-David



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
"Anton A. Melnikov"
Дата:
Hello!

On 26.07.2023 07:06, Thomas Munro wrote:
>  New patches
> attached.  Are they getting better?

It seems to me that it is worth focusing efforts on the second part of the patch,
as it is the most in demand, and trying to commit it first.

And it seems there is a way to simplify it by adding a parameter to get_controlfile()
that returns the calculated CRC, and moving the retry logic a level up.

A proposed algorithm is in the attached alg_level_up.pdf.
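
In outline, the idea is something like this (a hypothetical signature; the PDF has
the details):

    extern ControlFileData *get_controlfile(const char *DataDir, bool *crc_ok_p,
                                            pg_crc32c *computed_crc);  /* new out-param */

    /* caller side, e.g. in pg_controldata, instead of inside controldata_utils.c */
    for (int attempt = 0;; attempt++)
    {
        ControlFile = get_controlfile(DataDir, &crc_ok, &crc);
        if (crc_ok || attempt >= 10 ||
            (attempt > 0 && EQ_CRC32C(crc, prev_crc)))
            break;                  /* good copy, or a stably bad one */
        prev_crc = crc;
        pfree(ControlFile);
        pg_usleep(10 * 1000);
    }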

[Excuse me, for at least three days I will be in a place with no Internet access,
so I will not be able to read this thread before the evening of August 2.]

With the best wishes,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
"Anton A. Melnikov"
Дата:
Sorry, attached the wrong version of the file. Here is the right one.

Sincerely yours,

-- 
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Thomas Munro
Дата:
I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so.

I'll abandon the others for now, since we're now thinking bigger[1]
for backups, side stepping the problem.

[1] https://www.postgresql.org/message-id/flat/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Michael Paquier
Дата:
On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
> I'm planning to push 0002 (retries in frontend programs, which is
> where this thread began) and 0004 (add missing locks to SQL
> functions), including back-patches as far as 12, in a day or so.
>
> I'll abandon the others for now, since we're now thinking bigger[1]
> for backups, side stepping the problem.

FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
OK to use it at least for now on HEAD before seeing where the other
discussions lead.  0004 would be OK if applied to v11, as well, but I
also agree that it is not a big deal to let this branch be as it is
now at this stage if you feel strongly this way.
--
Michael

Вложения

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
David Steele
Дата:

On 10/11/23 21:10, Michael Paquier wrote:
> On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
>> I'm planning to push 0002 (retries in frontend programs, which is
>> where this thread began) and 0004 (add missing locks to SQL
>> functions), including back-patches as far as 12, in a day or so.
>>
>> I'll abandon the others for now, since we're now thinking bigger[1]
>> for backups, side stepping the problem.
> 
> FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
> OK to use it at least for now on HEAD before seeing where the other
> discussions lead.  0004 would be OK if applied to v11, as well, but I
> also agree that it is not a big deal to let this branch be as it is
> now at this stage if you feel strongly this way.

Agreed on 0002 and 0004, though I don't really think a back patch of 
0004 to 11 is necessary. I'd feel differently if there was a single 
field report of this issue.

I would prefer to hold off on applying 0003 to HEAD until we see how [1] 
pans out.

Having said that, I have a hard time seeing [1] as being something we 
could back patch. The manipulation of backup_label is simple enough, but 
starting a cluster without pg_control is definitely going to change some 
things. Also, the requirement that backup software skip copying 
pg_control after a minor release is not OK.

If we do back-patch 0001, is 0003 really needed? Surely if 0001 works
with other backup software it would work fine for pg_basebackup.

Regards,
-David

[1] 
https://www.postgresql.org/message-id/flat/1330cb48-4e47-03ca-f2fb-b144b49514d8%40pgmasters.net



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
David Steele
Дата:
On 10/12/23 09:58, David Steele wrote:
>> On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
>>> I'm planning to push 0002 (retries in frontend programs, which is
>>> where this thread began) and 0004 (add missing locks to SQL
>>> functions), including back-patches as far as 12, in a day or so.
>>>
>>> I'll abandon the others for now, since we're now thinking bigger[1]
>>> for backups, side stepping the problem.
>>
>> FWIW, 0003 looks like a low-risk improvement seen from here, so I'd be
>> OK to use it at least for now on HEAD before seeing where the other
>> discussions lead.  0004 would be OK if applied to v11, as well, but I
>> also agree that it is not a big deal to let this branch be as it is
>> now at this stage if you feel strongly this way.
> 
> Agreed on 0002 and 0004, though I don't really think a back patch of 
> 0004 to 11 is necessary. I'd feel differently if there was a single 
> field report of this issue.
> 
> I would prefer to hold off on applying 0003 to HEAD until we see how [1] 
> pans out.
> 
> Having said that, I have a hard time seeing [1] as being something we 
> could back patch. The manipulation of backup_label is simple enough, but 
> starting a cluster without pg_control is definitely going to change some 
> things. Also, the requirement that backup software skip copying 
> pg_control after a minor release is not OK.
> 

After some more thought, I think we could massage the "pg_control in 
backup_label" method into something that could be back patched, with 
more advanced features (e.g. error on backup_label and pg_control both 
present on initial cluster start) saved for HEAD.

Regards,
-David



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Michael Paquier
Дата:
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
> After some more thought, I think we could massage the "pg_control in
> backup_label" method into something that could be back patched, with more
> advanced features (e.g. error on backup_label and pg_control both present on
> initial cluster start) saved for HEAD.

I doubt that anything changed in this area would be in the
backpatchable zone, particularly as it would involve protocol changes
within the replication commands, so I'd recommend to focus on HEAD.
Backward-compatibility is not much of a concern as long as the backend
is involved.  The real problem here would be on the frontend side and
how much effort we should try to put in maintaining the code of
pg_basebackup compatible with older backends.
--
Michael

Вложения

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
David Steele
Дата:
On 10/12/23 19:15, Michael Paquier wrote:
> On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
>> After some more thought, I think we could massage the "pg_control in
>> backup_label" method into something that could be back patched, with more
>> advanced features (e.g. error on backup_label and pg_control both present on
>> initial cluster start) saved for HEAD.
> 
> I doubt that anything changed in this area would be in the
> backpatchable zone, particularly as it would involve protocol changes
> within the replication commands, so I'd recommend to focus on HEAD.

I can't see why there would be any protocol changes, but perhaps I am 
missing something?

One thing that does have to change, however, is the ordering of 
backup_label in the base tar file. Right now it is at the beginning but 
we need it to be at the end like pg_control is now.

I'm working up a POC patch now and hope to have something today or 
tomorrow. I think it makes sense to at least have a look at an 
alternative solution before going forward.

Regards,
-David




Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
David Steele
Дата:
On 10/13/23 10:40, David Steele wrote:
> On 10/12/23 19:15, Michael Paquier wrote:
>> On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
>>> After some more thought, I think we could massage the "pg_control in
>>> backup_label" method into something that could be back patched, with 
>>> more
>>> advanced features (e.g. error on backup_label and pg_control both 
>>> present on
>>> initial cluster start) saved for HEAD.
>>
>> I doubt that anything changed in this area would be in the
>> backpatchable zone, particularly as it would involve protocol changes
>> within the replication commands, so I'd recommend to focus on HEAD.
> 
> I can't see why there would be any protocol changes, but perhaps I am 
> missing something?
> 
> One thing that does have to change, however, is the ordering of 
> backup_label in the base tar file. Right now it is at the beginning but 
> we need it to be at the end like pg_control is now.

Well, no protocol changes, but overall this does not seem like a 
direction that would be even remotely back patch-able. See [1] for details.

For back branches that puts us back to committing some form of 0001 and 
0003. I'm still worried about the performance implications of 0001 on a 
standby when in backup mode, but I don't have any better ideas.

If we do commit 0001 and 0003 to the back branches I'd still like to 
hold off on HEAD to see if we can do something better there.

Regards,
-David

[1] 
https://www.postgresql.org/message-id/e05f20f9-6f91-9a70-efab-9a2ae472e65d%40pgmasters.net



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Thomas Munro
Дата:
I pushed the retry-loop-in-frontend-executables patch and the
missing-locking-in-SQL-functions patch yesterday.  That leaves the
backup ones, which I've rebased and attached, no change.  It sounds
like we need some more healthy debate about that backup label idea
that would mean we don't need these (two birds with one stone), so
I'll just leave these here to keep the cfbot happy in the meantime.

Вложения

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
Robert Haas
Дата:
On Mon, Oct 16, 2023 at 6:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I pushed the retry-loop-in-frontend-executables patch and the
> missing-locking-in-SQL-functions patch yesterday.  That leaves the
> backup ones, which I've rebased and attached, no change.  It sounds
> like we need some more healthy debate about that backup label idea
> that would mean we don't need these (two birds with one stone), so
> I'll just leave these here to keep the cfbot happy in the meantime.

0002 has no comments at all, and the commit message is not specific
enough for me to understand what problem it fixes. I suggest adding
some comments and fixing the commit message. I'm also not very sure
whether the change to the signature of sendFileWithContent is really
the best way to deal with the control file maybe containing a zero
byte ... but I'm quite sure that if we're going to do it that way, it
needs a comment. But maybe we should do something else that would
require less explanation, like having the caller always pass the
length.

Regarding 0001, the way you've hacked up update_controlfile() doesn't
fill me with joy. It's nice if code that is common to the frontend and
the backend does the same thing in both cases rather than, say, having
an extra argument that only works in one case but not the other. I bet
this could be refactored to make it nicer, e.g. have one function that
takes an exact pathname at which the control file is to be written and
then other functions that use it as a subroutine.
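
Something along these lines, perhaps (names are illustrative, not a concrete proposal):

    /* worker that writes the control file image to an exact path */
    static void
    write_controlfile_to_path(const char *path, ControlFileData *ControlFile,
                              bool do_sync)
    {
        /* the existing body of update_controlfile(), writing to "path" */
    }

    /* existing entry point keeps its signature and becomes a thin wrapper */
    void
    update_controlfile(const char *DataDir, ControlFileData *ControlFile,
                       bool do_sync)
    {
        char        path[MAXPGPATH];

        snprintf(path, sizeof(path), "%s/%s", DataDir, XLOG_CONTROL_FILE);
        write_controlfile_to_path(path, ControlFile, do_sync);
    }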

Personally, I think the general idea of 0001 is better than any
competing proposal on the table. In the case of pg_basebackup, we
could fix the server to perform appropriate locking around reading the
control file, so that the version sent to the client doesn't get torn.
But if a backup is made by calling pg_backup_start() and copying
$PGDATA, that isn't going to work. To fix that, we need to either make
the backup procedure more complicated and essentially treat the
control file as a special case, or we need to do something like this.
I think this is better; but as you mention, opinions vary on that.

Life would be a lot easier here if we could get rid of the low-level
backup API and just have pg_basebackup DTWT, but that seems like a
completely non-viable proposal.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
"David G. Johnston"
Дата:
On Tue, Oct 17, 2023 at 10:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
Life would be a lot easier here if we could get rid of the low-level
backup API and just have pg_basebackup DTWT, but that seems like a
completely non-viable proposal.

Yeah, my contribution to this area [1] is focusing on the API because I figured we've provided it and should do our best to have it do as much as possible for the dba or third-parties that build tooling on top of it.

I kinda think that adding a pg_backup_metadata directory that pg_backup_start|stop can use may help here.  I'm wondering whether those functions provide enough control guarantees that pg_control's "in_backup=true|false" flag proposed in that thread is reliable enough when copied to the root directory in the backup.  I kinda feel that so long as the flag is reliable it should be possible for the signal file processing code to implement whatever protocol we need.

David J.


Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
vignesh C
Дата:
On Tue, 17 Oct 2023 at 04:18, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I pushed the retry-loop-in-frontend-executables patch and the
> missing-locking-in-SQL-functions patch yesterday.  That leaves the
> backup ones, which I've rebased and attached, no change.  It sounds
> like we need some more healthy debate about that backup label idea
> that would mean we don't need these (two birds with one stone), so
> I'll just leave these here to keep the cfbot happy in the meantime.

I have changed the status of this to "Waiting on Author" as Robert's
comments have not yet been handled. Feel free to post an updated
version and change the status accordingly.

Regards,
Vignesh



Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

От
vignesh C
Дата:
On Thu, 11 Jan 2024 at 19:50, vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, 17 Oct 2023 at 04:18, Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > I pushed the retry-loop-in-frontend-executables patch and the
> > missing-locking-in-SQL-functions patch yesterday.  That leaves the
> > backup ones, which I've rebased and attached, no change.  It sounds
> > like we need some more healthy debate about that backup label idea
> > that would mean we don't need these (two birds with one stone), so
> > I'll just leave these here to keep the cfbot happy in the meantime.
>
> I have changed the status of this to "Waiting on Author" as Robert's
> comments have not yet been handled. Feel free to post an updated
> version and change the status accordingly.

The patch which you submitted has been awaiting your attention for
quite some time now.  As such, we have moved it to "Returned with
Feedback" and removed it from the reviewing queue. Depending on
timing, this may be reversible.  Kindly address the feedback you have
received, and resubmit the patch to the next CommitFest.

Regards,
Vignesh