Re: Walsender waiting on SnapbuildSync

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Walsender waiting on SnapbuildSync
Дата
Msg-id 20180805143641.mjvgag4ck4wgadgk@alap3.anarazel.de
обсуждение исходный текст
Ответ на Walsender waiting on SnapbuildSync  (Brent Kerby <blkerby@gmail.com>)
Список pgsql-general
On 2018-08-04 13:34:04 -0600, Brent Kerby wrote:
> On Postgres 10.3 (on AWS RDS), I am running logical decoding using the
> test_decoding output plugin, and every few minutes I am seeing pauses in
> the stream, unrelated to any large transactions. About once every hour or
> two, the pause is long enough that the database disconnects my client due
> to exceeding wal_sender_timeout (30 seconds -- the RDS default value);
> after reconnecting it is able to make progress again. My client is using
> the streaming replication protocol via pgjdbc (with a status interval of 1
> second). What I'm seeing is that during such a pause, the server is not
> sending any data to the client:
> 
> - pg_stat_replication.sent_lsn stops advancing
> - My client is blocking in a call to PGReplicationStream.read()
> - pg_stat_activity shows that the walsender process has a wait_event of
> 'SnapbuildSync'.
> 
> In this scenario, it makes sense that the client would be timed out: pgjdbc
> only sends feedback to the server at the beginning of a call to
> PGReplicationStream.read(), so if a single call blocks a long time, never
> receiving any data from the server, then the client would stop sending
> feedback to the server, causing timeout.
> 
> My question is why might the server be spending so much time waiting on
> SnapbuildSync? The docs describe this event as follows:
> 
> "IO / SnapbuildSync / Waiting for a serialized historical catalog snapshot
> to reach stable storage."

> I gather that this is related to statistics collection, but I'm not
> understanding why a walsender process would wait on such an event nor why
> it would take such a long time. Any ideas?

It's *not* related to statistics collection. It's fsyncing a few *small*
files to disk. On a first blush this seems to indicate that your storage
system is quite overloaded?

The relevant piece of code is:

    /*
     * fsync the file before renaming so that even if we crash after this we
     * have either a fully valid file or nothing.
     *
     * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has
     * some noticeable overhead since it's performed synchronously during
     * decoding?
     */
    pgstat_report_wait_start(WAIT_EVENT_SNAPBUILD_SYNC);
    if (pg_fsync(fd) != 0)
    {
        int            save_errno = errno;

        CloseTransientFile(fd);
        errno = save_errno;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", tmppath)));
    }
    pgstat_report_wait_end();
    CloseTransientFile(fd);

Greetings,

Andres Freund


В списке pgsql-general по дате отправления:

Предыдущее
От: Ron
Дата:
Сообщение: ALTER TABLE .. SET STATISTICS
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: ALTER TABLE .. SET STATISTICS