Re: Cygwin cleanup

From Thomas Munro
Subject Re: Cygwin cleanup
Date
Msg-id CA+hUKG+J4jSFk=-hdoZdcx+p7ru6xuipzCZY-kiKoDc2FjsV7g@mail.gmail.com
In response to Re: Cygwin cleanup  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Wed, Feb 8, 2023 at 8:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jan 13, 2023 at 5:17 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > My patch used fsync_fname_ext() which would cause an ERROR rather than a
> > PANIC when failing to fsync the logical decoding pathname.
>
> FTR While analysing a lot of CI logs trying to debug something else I
> came across a plain Windows/MSVC (not Cygwin) build that panicked like
> this:
>
> https://cirrus-ci.com/task/6689224833892352
> https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/testrun/build/testrun/subscription/013_partition/log/013_partition_publisher.log
> https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/crashlog/crashlog-postgres.exe_0af4_2023-02-05_21-53-20-018.txt

Here are some more flapping CI failures due to this phenomenon
(nothing to do with Cygwin, this is just regular Windows):

 4509011781877760 | Windows - Server 2019, VS 2019 - Meson & ninja
 4525770962370560 | Windows - Server 2019, VS 2019 - Meson & ninja
 5664518341132288 | Windows - Server 2019, VS 2019 - Meson & ninja
 5689846694412288 | Windows - Server 2019, VS 2019 - Meson & ninja
 5853025126842368 | Windows - Server 2019, VS 2019 - Meson & ninja
 6639943179567104 | Windows - Server 2019, VS 2019 - Meson & ninja
 6727728217456640 | Windows - Server 2019, VS 2019 - Meson & ninja
 6740158104469504 | Windows - Server 2019, VS 2019 - Meson & ninja

They all say something like 'PANIC:  could not open file
"pg_logical/snapshots/0-1597938.snap": No such file or directory',
because they all do rename(some_temporary_file, that_name), then try
to re-open and sync it, but rename() on Windows fails to be atomic so
a concurrent process can see an intermediate ENOENT state.  I see a
few 'local' workarounds we could do to fix that, but ... there seems
to be a much better idea staring us in the face in the comments!
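
For readers following along, the rename-then-reopen sequence described above looks roughly like the sketch below. It is not the PostgreSQL code: write_then_rename() is a hypothetical helper written against plain POSIX calls, and it leaves out the directory fsync that a real durable rename would also need. The comment marks the step where, per the logs, the re-open can fail with ENOENT on Windows.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): write "data" to tmp_path,
 * flush it, rename it to final_path, then re-open final_path and fsync it.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
write_then_rename(const char *tmp_path, const char *final_path,
                  const void *data, size_t len)
{
    int     fd;
    ssize_t written;

    assert(tmp_path != NULL && final_path != NULL);

    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    written = write(fd, data, len);
    if (written < 0 || (size_t) written != len || fsync(fd) != 0)
    {
        /* A short write leaves errno unset, so report it as out of space. */
        int save_errno = (written >= 0 && (size_t) written != len) ? ENOSPC : errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    if (rename(tmp_path, final_path) != 0)
        return -1;

    /*
     * Re-open under the final name and sync.  This is the open() that the
     * CI logs show failing: if another process is renaming onto the same
     * name at this moment and rename() is not atomic, the name can briefly
     * be missing and open() returns ENOENT.
     */
    fd = open(final_path, O_RDWR);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        int save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    return close(fd);
}
```

A caller that treats a -1 from the final open() as fatal behaves like the PANIC in the logs; moving that last fsync out of this code path removes the re-open altogether.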

I think this would be fixed as a happy by-product of this TODO in
SnapBuildSerialize():

     * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
     * some noticeable overhead since it's performed synchronously during
     * decoding?

I have done no analysis myself of whether that is sound, but assuming
it is, I think the way to achieve that is to tweak FileTag so that it
can describe the file to be fsync'd, and use the sync.c machinery to
fsync the file in the background.  Presumably that would provide a
huge speed up for logical decoding, and people would rejoice.
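
To make the idea a bit more concrete, here is a toy model of "describe the file in a tag, queue it, fsync it later". Everything in it is an assumption for illustration: DemoFileTag, demo_register_sync() and demo_process_syncs() are made-up names, the tag just carries a path, and the fixed-size array stands in for whatever sync.c would really use. It says nothing about whether deferring the fsync is sound for snapshot files.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define DEMO_MAX_PENDING 64
#define DEMO_PATH_MAX    128

/* Hypothetical tag: identifies a file to sync by path only. */
typedef struct DemoFileTag
{
    char path[DEMO_PATH_MAX];
} DemoFileTag;

static DemoFileTag demo_pending[DEMO_MAX_PENDING];
static int         demo_npending = 0;

/*
 * Queue a file for a later fsync instead of syncing it right away.
 * Returns 0 if queued (or already queued), -1 if the request does not fit.
 */
static int
demo_register_sync(const char *path)
{
    assert(path != NULL);

    if (strlen(path) >= DEMO_PATH_MAX)
        return -1;
    for (int i = 0; i < demo_npending; i++)
    {
        if (strcmp(demo_pending[i].path, path) == 0)
            return 0;           /* duplicate request, nothing to add */
    }
    if (demo_npending >= DEMO_MAX_PENDING)
        return -1;
    strcpy(demo_pending[demo_npending].path, path);
    demo_npending++;
    return 0;
}

/*
 * Stand-in for the checkpoint-time pass: fsync every queued file.
 * Returns the number of files synced, or -1 if an fsync() fails.
 * Files that can no longer be opened are skipped in this sketch.
 */
static int
demo_process_syncs(void)
{
    int synced = 0;

    for (int i = 0; i < demo_npending; i++)
    {
        int fd = open(demo_pending[i].path, O_RDWR);

        if (fd < 0)
            continue;           /* e.g. the file was already removed */
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;          /* leave the queue intact for the caller */
        }
        close(fd);
        synced++;
    }
    demo_npending = 0;
    return synced;
}
```

In this model the decoding path would only call demo_register_sync() after its rename(), and the open-plus-fsync would happen later in demo_process_syncs(), away from the rename.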

Some other topics that came up in this thread:
 * Now that PostgreSQL seems to be stable enough on Cygwin to get
through the basic regression tests reliably, lorikeet might as well
run the full TAP test suite?
 * Justin complained about the weird effects of wal_sync_method, and I
finally got around to showing how I think that should be untangled, in
https://commitfest.postgresql.org/44/4453/


