Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

Поиск
Список
Период
Сортировка
От Justin Pryzby
Тема Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Дата
Msg-id 20180329050031.GG28454@telsasoft.com
обсуждение исходный текст
Ответ на Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Michael Paquier <michael@paquier.xyz>)
Ответы Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Thomas Munro <thomas.munro@enterprisedb.com>)
Список pgsql-hackers
On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:
> On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:
> > Craig Ringer <craig@2ndquadrant.com> writes:
> >> TL;DR: Pg should PANIC on fsync() EIO return.
> > 
> > Surely you jest.
> 
> Any callers of pg_fsync in the backend code are careful enough to check
> the returned status, sometimes doing retries like in mdsync, so what is
> proposed here would be a regression.

The retries are the source of the problem ; the first fsync() can return EIO,
and also *clears the error* causing a 2nd fsync (of the same data) to return
success.

(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).

On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:
> Craig, is the phenomenon you described the same as the second issue
> "Reporting writeback errors" discussed in this article?
> https://lwn.net/Articles/724307/

Worse, the article acknowledges the behavior without apparently suggesting to
change it:

 "Storing that value in the file structure has an important benefit: it makes
it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT
CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR
OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."

I believe I reproduced the problem behavior using dmsetup "error" target, see
attached.

strace looks like this:

kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

     1    open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
     2    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     3    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     4    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     5    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     6    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     7    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
     8    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 2560
     9    write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 ENOSPC (No space
lefton device)
 
    10    dup(2)                                  = 4
    11    fcntl(4, F_GETFL)                       = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
    12    brk(NULL)                               = 0x1299000
    13    brk(0x12ba000)                          = 0x12ba000
    14    fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
    15    write(4, "write(1): No space left on devic"..., 34write(1): No space left on device
    16    ) = 34
    17    close(4)                                = 0
    18    fsync(3)                                = -1 EIO (Input/output error)
    19    dup(2)                                  = 4
    20    fcntl(4, F_GETFL)                       = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
    21    fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
    22    write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output error
    23    ) = 29
    24    close(4)                                = 0
    25    close(3)                                = 0
    26    open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
    27    fsync(3)                                = 0
    28    write(3, "\0", 1)                       = 1
    29    fsync(3)                                = 0
    30    exit_group(0)                           = ?

2: EIO isn't seen initially due to writeback page cache;
9: ENOSPC due to small device
18: original IO error reported by fsync, good
25: the original FD is closed
26: ..and file reopened
27: fsync on file with still-dirty data+EIO returns success BAD

10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror
should write to a separate FD (?)

Also note, close() ALSO returned success..which you might think exonerates the
2nd fsync(), but I think may itself be problematic, no?  In any case, the 2nd
byte certainly never got written to DM error, and the failure status was lost
following fsync().

I get the exact same behavior if I break after one write() loop, such as to
avoid ENOSPC.

Justin

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Kyotaro HORIGUCHI
Дата:
Сообщение: Re: PQHost() undefined behavior if connecting string contains bothhost and hostaddr types
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS