Re: BUG #16331: segfault in checkpointer with full disk

From Julien Rouhaud
Subject Re: BUG #16331: segfault in checkpointer with full disk
Msg-id CAOBaU_a0-FkNp4YHO_7nN7=NDN2R_xb-Ya-e3w9bB1SHEstYCQ@mail.gmail.com
In reply to Re: BUG #16331: segfault in checkpointer with full disk  (Jozef Mlich <jmlich83@gmail.com>)
List pgsql-bugs
On Wed, Apr 1, 2020 at 11:51 AM Jozef Mlich <jmlich83@gmail.com> wrote:
>
> On Wed, 2020-04-01 at 11:04 +0200, Julien Rouhaud wrote:
> > Hi,
> >
> > On Wed, Apr 01, 2020 at 08:51:56AM +0000, PG Bug reporting form
> > wrote:
> > >
> > > I can see segfaults on CentOS 7 with postgresql 12.2-2PGDG.rhel7
> > > (from yum.postgresql.org). I am using multiple extensions (cstore,
> > > postgres_fdw, pgcrypto, dblink, etc.). The crash seems to be related
> > > to the disk running out of space (I am using separate partitions
> > > for / and for /var/lib/pgsql). It occurs a few times a day.
> > > According to the backtrace, it seems to be related to the
> > > checkpointer. Replication is not configured.
> > >
> > >
> > > [New LWP 26290]
> > > [Thread debugging using libthread_db enabled]
> > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > Core was generated by `postgres: checkpointer '.
> > > Program terminated with signal 6, Aborted.
> > > #0  0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at
> > > ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> > > 55    return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> > >
> > > Thread 1 (Thread 0x7fe462e148c0 (LWP 26290)):
> > > #0  0x00007fe4604c1207 in __GI_raise (sig=sig@entry=6) at
> > > ../nptl/sysdeps/unix/sysv/linux/raise.c:55
> > >         resultvar = 0
> > >         pid = 26290
> > >         selftid = 26290
> > > #1  0x00007fe4604c28f8 in __GI_abort () at abort.c:90
> > >         save_stage = 2
> > >         act = {__sigaction_handler = {sa_handler = 0x0,
> > > sa_sigaction = 0x0},
> > > sa_mask = {__val = {0, 0, 0, 0, 0, 9268713, 70403103920717,
> > > 39808819211026438, 20126216749056, 70394513997832, 9268713,
> > > 70403103920719,
> > > 17316096998686159616, 20134806683648, 140618848608704,
> > > 140618848592800}},
> > > sa_flags = 1615828275, sa_restorer = 0x0}
> > >         sigs = {__val = {32, 0 <repeats 15 times>}}
> > > #2  0x000000000087840a in errfinish (dummy=<optimized out>) at
> > > elog.c:552
> > >         edata = 0xd47040 <errordata>
> > >         elevel = 22
> > >         oldcontext = 0x171a6d0
> > >         econtext = 0x0
> > >         __func__ = "errfinish"
> > > #3  0x0000000000706b24 in CheckPointReplicationOrigin () at
> > > origin.c:562
> > >         tmppath = 0x9e6fa8 "pg_logical/replorigin_checkpoint.tmp"
> > >         path = 0x9e6fd0 "pg_logical/replorigin_checkpoint"
> > >         tmpfd = <optimized out>
> > >         i = <optimized out>
> > >         magic = 307747550
> > >         crc = 4294967295
> > >         __func__ = "CheckPointReplicationOrigin"
> >
> > That's not a bug (nor a segfault) but the expected behavior if the
> > checkpointer is not able to do its work.  As data durability can't be
> > guaranteed in such a case, the checkpointer raises a PANIC level
> > message, which triggers an abort so that the whole instance does an
> > emergency restart cycle.
> >
> > Do you have monitoring for this filesystem?  Do you see spikes in
> > disk usage or other strange behavior?
>
> Then it is clear. Thanks for the explanation, and I apologize for the
> false bug report.
>
> I have probably misunderstood how a segfault is distinguished from an
> abort. I need to fix my kernel.core_pattern script.
>
> Attached is a screenshot from our Grafana monitoring with information
> about space on the /var/lib/pgsql partition.

So the main filesystem is full or almost full most of the time?  That's
unfortunately a good way to trigger that kind of outage.  Is that
because most of the data is in a different tablespace?  Even in that
case you need to ensure that you still have at least a reasonable
amount of free space on the main filesystem.
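
As a rough illustration of the kind of free-space check meant here, a
minimal Python sketch follows.  The function names and the 10% threshold
are assumptions chosen for the example, not a PostgreSQL recommendation;
pick whatever margin fits your workload.

```python
import shutil


def has_headroom(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return True if free/total is at least min_free_fraction.

    The 10% default is an arbitrary, hypothetical threshold.
    """
    if total_bytes <= 0:
        return False
    return free_bytes / total_bytes >= min_free_fraction


def filesystem_has_headroom(path, min_free_fraction=0.10):
    """Check the filesystem that holds `path` using shutil.disk_usage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return has_headroom(usage.total, usage.free, min_free_fraction)
```

Calling filesystem_has_headroom("/var/lib/pgsql") from a cron job or a
monitoring agent, and alerting when it returns False, would give some
warning before the checkpointer hits a full disk.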


