Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
От | Thomas Munro |
---|---|
Тема | Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" |
Дата | |
Msg-id | CA+hUKGKZoCCn+6Z_dmwQsKu5PVLbDu6G7fD6JU1qok84r1ry0A@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt" ("Anton A. Melnikov" <aamelnikov@inbox.ru>) |
Ответы |
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
|
Список | pgsql-hackers |
On Wed, Feb 1, 2023 at 5:04 PM Anton A. Melnikov <aamelnikov@inbox.ru> wrote: > On 31.01.2023 14:38, Thomas Munro wrote: > > Here's an experimental patch for that alternative. I wonder if > > someone would want to be able to turn it off for some reason -- maybe > > some NFS problem? It's less back-patchable, but maybe more > > principled? > > It looks very strange to me that there may be cases where the cluster data > is stored in NFS. Can't figure out how this can be useful. Heh. There are many interesting failure modes, but people do it. I guess my more general question when introducing any new system call into this code is how some unusual system I can't even access is going to break. Maybe some obscure filesystem will fail with EOPNOTSUPP, or take 5 seconds and then fail because there is no lock server configured or whatever, so that's why I don't think we can back-patch it, and we probably need a way to turn it off. > i think this variant of the patch is a normal solution > of the problem, not workaround. Found no problems on Linux. > +1 for this variant. I prefer it too. > Might add a custom error message for EDEADLK > since it absent in errcode_for_file_access()? Ah, good thought. I think it shouldn't happen™, so it's OK that errcode_for_file_access() would classify it as ERRCODE_INTERNAL_ERROR. Other interesting errors are: ENOLCK: system limits exceeded; PANIC seems reasonable EOPNOTSUPP: this file doesn't support locking (seen on FreeBSD man pages, not on POSIX) > > I don't know if Windows suffers from this type of problem. > > Unfortunately its equivalent functionality LockFile() looks non-ideal > > for this purpose: if your program crashes, the documentation is very > > vague on when exactly it is released by the OS, but it's not > > immediately on process exit. That seems non-ideal for a control file > > you might want to access again very soon after a crash, to be able to > > recover. > > Unfortunately i've not had time to reproduce the problem and apply this patch on > Windows yet but i'm going to do it soon on windows 10. If a crc error > will occur there, then we might use the workaround from the first > variant of the patch. Thank you for investigating. I am afraid to read your results. One idea would be to proceed with LockFile() for Windows, with a note suggesting you file a bug with your OS vendor if you ever need it to get unstuck. Googling this subject, I read that MongoDB used to suffer from stuck lock files, until an OS bug report led to recent versions releasing locks more promptly. I find that sufficiently scary that I would want to default the feature to off on Windows, even if your testing shows that it does really need it. > > A thought in passing: if UpdateMinRecoveryPoint() performance is an > > issue, maybe we should figure out how to use fdatasync() instead of > > fsync(). > > May be choose it in accordance with GUC wal_sync_method? Here's a patch like that. I don't know if it's a good idea for wal_sync_method to affect other kinds of files or not, but, then, it already does (fsync_writethough changes this behaviour). I would only want to consider this if we also stop choosing "open_datasync" as a default on at least Windows.
Вложения
В списке pgsql-hackers по дате отправления: