BUG #17989: ERROR: could not open file "pg_logical/snapshots/6-1E182808.snap": Permission denied

From: PG Bug reporting form
Subject: BUG #17989: ERROR: could not open file "pg_logical/snapshots/6-1E182808.snap": Permission denied
Date:
Msg-id: 17989-6b9f058bad9cd31b@postgresql.org
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17989
Logged by:          Dmitry Ivanov
Email address:      dmivanov@ptsecurity.com
PostgreSQL version: 12.1
Operating system:   Windows Server 2019 Standard
Description:

Hi,

I am having a problem with logical replication on PostgreSQL 12 (later
versions as well).
I set up logical replication for 40 databases (40 subscriptions /
publications, plus a large number of replication slots).
After starting replication, at random times I get the following error on
the master:

ERROR: could not open file "pg_logical/snapshots/6-1E182808.snap":
Permission denied

This leads to a PostgreSQL restart and downtime for the services consuming
the databases.

There are no obvious permission problems. For some databases replication
works fine, and the error appears at random times.
I watched file/handle activity with Sysinternals FileMon/Process Monitor;
no third-party software (antivirus, etc.) touches files in
pg_logical/snapshots.

It looks like two backend workers are trying to open the same snapshot at
the same time:

Postgres log at DEBUG1 level:
2023-06-22 12:39:54 MSK db1 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db2 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db3 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db4 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db5 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db6 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db7 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db8 DEBUG:  updated xmin: 0 restart: 1
2023-06-22 12:39:54 MSK db9 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db10 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db13 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db8 DEBUG:  got new restart lsn 6/1ED17A98 at
6/1ED17A98
2023-06-22 12:39:54 MSK db8 DEBUG:  serializing snapshot to
pg_logical/snapshots/6-1ED17AD0.snap
2023-06-22 12:39:54 MSK db11 DEBUG:  updated xmin: 0 restart: 1
2023-06-22 12:39:54 MSK db10 DEBUG:  received replication command:
CREATE_REPLICATION_SLOT "subs_db10_190385_sync_16547" TEMPORARY LOGICAL
pgoutput USE_SNAPSHOT
2023-06-22 12:39:54 MSK db2 DEBUG:  failed to increase restart lsn: proposed
6/1ED17AD0, after 6/1ED17AD0, current candidate 6/1ED17AD0, current after
6/1ED17AD0, flushed up to 6/1ED17AD0
2023-06-22 12:39:54 MSK db7 DEBUG:  searching for logical decoding starting
point, starting at 6/1ED17AD0
2023-06-22 12:39:54 MSK db7 LOG:  logical decoding found consistent point at
6/1ED17AD0
2023-06-22 12:39:54 MSK db7 DETAIL:  There are no running transactions.
2023-06-22 12:39:54 MSK db7 STATEMENT:  CREATE_REPLICATION_SLOT
"subs_db7_190470_sync_17114" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT
2023-06-22 12:39:54 MSK db11 DEBUG:  failed to increase restart lsn:
proposed 6/1ED17A28, after 6/1ED17A28, current candidate 6/1ED17A28, current
after 6/1ED17A28, flushed up to 6/1ED17A28
2023-06-22 12:39:54 MSK db12 DEBUG:  updated xmin: 0 restart: 1
2023-06-22 12:39:54 MSK db2 DEBUG:  updated xmin: 0 restart: 1
2023-06-22 12:39:54 MSK db13 DEBUG:  failed to increase restart lsn:
proposed 6/1ED17AD0, after 6/1ED17AD0, current candidate 6/1ED17AD0, current
after 6/1ED17AD0, flushed up to 6/1ED17AD0
2023-06-22 12:39:54 MSK db12 DEBUG:  got new restart lsn 6/1ED17A98 at
6/1ED17A98
2023-06-22 12:39:54 MSK db14 DEBUG:  failed to increase restart lsn:
proposed 6/1ED17A28, after 6/1ED17A28, current candidate 6/1ED17A28, current
after 6/1ED17A28, flushed up to 6/1ED17A28
2023-06-22 12:39:54 MSK db10 DEBUG:  failed to increase restart lsn:
proposed 6/1ED17AD0, after 6/1ED17AD0, current candidate 6/1ED179F0, current
after 6/1ED179F0, flushed up to 6/1ED179B8
2023-06-22 12:39:54 MSK db11 ERROR:  could not open file
"pg_logical/snapshots/6-1ED17AD0.snap": Permission denied

I looked at the source code and found a comment about the race condition
and synchronization via atomic rename:

https://github.com/postgres/postgres/blob/596114164699bbe184969df374fd6a1d8a93a57c/src/backend/replication/logical/snapbuild.c#L1670

The PANIC occurs when fsync is called on the target file (which requires
opening it) after the .tmp file has been renamed:

https://github.com/postgres/postgres/blob/596114164699bbe184969df374fd6a1d8a93a57c/src/backend/replication/logical/snapbuild.c#L1815

https://github.com/postgres/postgres/blob/596114164699bbe184969df374fd6a1d8a93a57c/src/backend/storage/file/fd.c#LL3775C6-L3775C6

On POSIX systems rename() is an atomic operation, but AFAIK MoveFileEx on
Windows is not; in effect it can be a copy-then-delete.
Is it reasonable to rely here on the atomicity of rename for
synchronization?

Is there any way to fix this?
Unfortunately, for now I see no option other than disabling the PANIC on
fsync failures entirely for the duration of replication (data_sync_retry =
on).
Nothing appears to be damaged in that case, but in general this workaround
is VERY fragile and dangerous.
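For reference, the workaround amounts to this single setting in postgresql.conf (data_sync_retry defaults to off; turning it on makes the server retry after an fsync failure instead of PANICking, with exactly the risks described above):

```
# postgresql.conf -- workaround only, see caveats above
data_sync_retry = on    # retry fsync failures instead of PANIC; requires restart
```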

BTW: a similar issue was mentioned before (a different bug, same place),
but instead of a fix the test machines' FreeBSD was simply upgraded:
https://www.postgresql.org/message-id/flat/17827.1549866683%40sss.pgh.pa.us

