Re: Show WAL write and fsync stats in pg_stat_io

Поиск
Список
Период
Сортировка
От Nazir Bilal Yavuz
Тема Re: Show WAL write and fsync stats in pg_stat_io
Дата
Msg-id CAN55FZ13ZViv+4V34TqQQ0aEzKbnPfDbYnBr=iJLPNNCPEuHfg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Show WAL write and fsync stats in pg_stat_io  (Melanie Plageman <melanieplageman@gmail.com>)
Ответы Re: Show WAL write and fsync stats in pg_stat_io  (Michael Paquier <michael@paquier.xyz>)
Re: Show WAL write and fsync stats in pg_stat_io  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Список pgsql-hackers
Hi,

Thanks for the review!

Current status of the patch is:
- IOOBJECT_WAL / IOCONTEXT_NORMAL read, write and fsync stats are added.
- IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
- IOOBJECT_WAL / IOCONTEXT_INIT stats are added.
- pg_stat_io shows different op_bytes for the IOOBJECT_WAL operations.
- Working on which 'BackendType / IOContext / IOOp' should be banned in pg_stat_io.
- PendingWalStats.wal_sync and PendingWalStats.wal_write_time / PendingWalStats.wal_sync_time are moved to pgstat_count_io_op_n() / pgstat_count_io_op_time() respectively.

TODOs:
- Documentation.
- Try to set op_bytes for BackendType / IOContext.
- Decide which 'BackendType / IOContext / IOOp' should not be tracked.
- Add IOOBJECT_WAL / IOCONTEXT_NORMAL read tests.
- Add IOOBJECT_WAL / IOCONTEXT_INIT tests.

I am adding tracking of BackendType / IOContext / IOOp as tables, empty cell means it is not decided yet:

IOCONTEXT_NORMAL / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╦═══════╗
║ IOCONTEXT_NORMAL    ║ read  ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum launcher ║ FALSE ║ FALSE ║ FALSE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ autovacuum worker   ║ FALSE ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ client backend      ║       ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background worker   ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ background writer   ║       ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ checkpointer        ║       ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ standalone backend  ║  TRUE ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ startup             ║  TRUE ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walreceiver         ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walsender           ║       ║       ║       ║
╠═════════════════════╬═══════╬═══════╬═══════╣
║ walwriter           ║       ║  TRUE ║  TRUE ║
╚═════════════════════╩═══════╩═══════╩═══════╝


IOCONTEXT_WAL_INIT / Backend / IOOp table:

╔═════════════════════╦═══════╦═══════╗
║ IOCONTEXT_WAL_INIT  ║ write ║ fsync ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum launcher ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ autovacuum worker   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ client backend      ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ background worker   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ background writer   ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ checkpointer        ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ standalone backend  ║  TRUE ║  TRUE ║
╠═════════════════════╬═══════╬═══════╣
║ startup             ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walreceiver         ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walsender           ║       ║       ║
╠═════════════════════╬═══════╬═══════╣
║ walwriter           ║       ║       ║
╚═════════════════════╩═══════╩═══════╝


On Wed, 9 Aug 2023 at 21:52, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> > On Sat, 22 Jul 2023 at 01:30, Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > > I think it would be good to count WAL reads even though they are not
> > > currently represented in pg_stat_wal. Here is a thread discussing this
> > > [1].
> >
> > I used the same implementation in the thread link [1]. I added 'WAL
> > read' to only xlogrecovery.c for now. I didn't add 'WAL read' to
> > xlogreader.c and walsender.c because they cause some failures on:
> > '!pgStatLocal.shmem->is_shutdown' asserts. I will spend more time on
> > these. Also, I added Bharath to CC. I have a question about 'WAL
> > read':
> > 1. There are two places where 'WAL read' happens.
> > a. In WALRead() in xlogreader.c, it reads 'count' bytes, most of the
> > time count is equal to XLOG_BLCKSZ but there are some cases it is not.
> > For example
> > - in XLogSendPhysical() in walsender.c WALRead() is called by nbytes
> > - in WALDumpReadPage() in pg_waldump.c WALRead() is called by count
> > These nbytes and count variables could be different from XLOG_BLCKSZ.
> >
> > b. in XLogPageRead() in xlogreader.c, it reads exactly XLOG_BLCKSZ bytes:
> > pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
> >
> > So, what should op_bytes be set to for 'WAL read' operations?
>
> If there is any combination of BackendType and IOContext which will
> always read XLOG_BLCKSZ bytes, we could use XLOG_BLCKSZ for that row's
> op_bytes. For other cases, we may have to consider using op_bytes 1 and
> tracking reads and write IOOps in number of bytes (instead of number of
> pages). I don't actually know if there is a clear separation by
> BackendType for these different cases.

I agree. I will edit that later, added to TODOs.

>
> The other alternative I see is to use XLOG_BLCKSZ as the op_bytes and
> treat op_bytes * number of reads as an approximation of the number of
> bytes read. I don't actually know what makes more sense. I don't think I
> would like having a number for bytes that is not accurate.

Yes, the prior one makes more sense to me.

>
> > Should I use IsBootstrapProcessingMode() to enable WAL io timings at
> > the initdb or are they not that much important?
>
> I don't have an opinion about this. I can see an argument for doing it
> either way. We do track other IO during initdb in pg_stat_io.

I didn't add it for now. It is an easy change, it could be added later.

>
> As an additional TODO, I would explore adding some tests to prevent
> accidental removal of the pg_stat_io WAL tracking.
>
> I think we can easily test IOCONTEXT_NORMAL WAL writes in
> src/test/regress/sql/stats.sql (perhaps it is worth checking that
> synchronous_commit is on in the test). IOCONTEXT_NORMAL WAL fsyncs
> should again be easy to test if synchronous_commit is on and fsync is
> on.
>
> I'm not sure how to reliably test WAL reads (given timing). Logically,
> you can sum WAL reads before a crash is initiated in one of the tests in
> the recovery suite, and then sum them after the db has restarted and
> there should definitely be an increase in WAL reads, but I don't know if
> we need to do something to guarantee that there will have been WAL reads
> (to avoid test flakes).
>
> I'm also not sure how to reliably test any IOCONTEXT_INIT operations. We
> need a before and after and I can't think of a cheap operation to ensure
> a new WAL segment is written to or fsyncd in between a before and after
> for the purposes of testing.

IOOBJECT_WAL / IOCONTEXT_NORMAL write and fsync tests are added.
For the IOCONTEXT_NORMAL reads and IOCONTEXT_INIT tests, I couldn't find a way to avoid test flakes. I am open to suggestions. I added these to TODOs.

>
> > ---
> > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> > index 60c0b7ec3af..ee7b85e18ca 100644
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -2245,6 +2229,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
> >                               startoffset += written;
> >                       } while (nleft > 0);
> >
>
> I'm not sure if the right location is here or in
> pgstat_count_io_op_time(), but I would explain why you did not move
> PendingWalStats.wal_writes counter into pg_stat_io code (and why you did
> move the other PendingWalStats counters there.
>
> > +                     pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
> > +                                                                     IOOP_WRITE, io_start, npages);
> > +
> >                       npages = 0;
> >
> >                       /*
> > @@ -2938,6 +2925,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
> >       int                     fd;
> >       int                     save_errno;
> >       int                     open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
> > +     instr_time      io_start;
> >
> >       Assert(logtli != 0);
> >
> > @@ -2981,6 +2969,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
> >                               (errcode_for_file_access(),
> >                                errmsg("could not create file \"%s\": %m", tmppath)));
> >
>
> Since you have two calls to pgstat_prepare_io_time() in this function, I
> think it would be nice to have a comment above each to the effect of
> "start timing writes for stats" and "start timing fsyncs for stats"

Done.

>
> > +     io_start = pgstat_prepare_io_time();
> > +
> >       pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
>
> > diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> > index becc2bda62e..ee850af5514 100644
> > --- a/src/backend/access/transam/xlogrecovery.c
> > +++ b/src/backend/access/transam/xlogrecovery.c
> > @@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
> >       XLogRecord *record;
> >       bool            reachedRecoveryTarget = false;
> >       TimeLineID      replayTLI;
> > +     uint32          pgstat_report_wal_frequency = 0;
> >
> >       /*
> >        * Initialize shared variables for tracking progress of WAL replay, as if
> > @@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
> >                        */
> >                       ApplyWalRecord(xlogreader, record, &replayTLI);
> >
> > +                     /*
> > +                      * Report pending statistics to the cumulative stats system once
> > +                      * every PGSTAT_REPORT_FREQUENCY times to not hinder performance.
> > +                      */
> > +                     if (pgstat_report_wal_frequency++ == PGSTAT_REPORT_FREQUENCY)
> > +                     {
> > +                             pgstat_report_wal(false);
> > +                             pgstat_report_wal_frequency = 0;
> > +                     }
> > +
>
> Is the above needed for your patch to work? What does it do? It should
> probably be in a separate commit and should definitely have an
> explanation.

Done, I omit that part.

>
> > --- a/src/backend/utils/activity/pgstat_io.c
> > +++ b/src/backend/utils/activity/pgstat_io.c
> > @@ -87,17 +87,25 @@ pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint3
> >       Assert((unsigned int) io_op < IOOP_NUM_TYPES);
> >       Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
>
> I would add a comment here explaining that pg_stat_wal doesn't count WAL
> init or WAL reads.

Done.

>
> > +     if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> > +        io_op == IOOP_FSYNC)
> > +             PendingWalStats.wal_sync += cnt;
> > +
> >       PendingIOStats.counts[io_object][io_context][io_op] += cnt;
> >
> >       have_iostats = true;
> >  }
>
> > +/*
> > + * Prepares io_time for pgstat_count_io_op_time() function. It needs to return
> > + * current time if there is a chance that any 'time' can be tracked.
> > + */
> >  instr_time
> >  pgstat_prepare_io_time(void)
> >  {
> >       instr_time      io_start;
> >
> > -     if (track_io_timing)
> > +     if(track_io_timing || track_wal_io_timing)
> >               INSTR_TIME_SET_CURRENT(io_start);
> >       else
> >               INSTR_TIME_SET_ZERO(io_start);
>
> Since you asked me off-list why we had to do INSTR_TIME_SET_ZERO() and I
> couldn't remember, it is probably worth a comment.

Done.

>
> >  pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> >                                               instr_time start_time, uint32 cnt)
> >  {
> > -     if (track_io_timing)
> > +     if (pgstat_should_track_io_time(io_object, io_context))
> >       {
> >               instr_time      io_time;
> >
> > @@ -124,6 +148,9 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> >                       pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
>
> Now that we are adding more if statements to this function, I think we
> should start adding more comments.
>
> We should explain what the different counters here are for e.g.
> pgBufferUsage for EXPLAIN, PendingWalStats for pg_stat_wal.
>
> We should also explain what is tracked for each and why it differs --
> e.g. some track time and some don't, some track only reads or writes,
> etc.
>
> Also we should mention why we are consolidating them here. That is, we
> want to eventually deduplicate these counters, so we are consolidating
> them first. This also makes it easy to compare what is tracked for which
> stats or instrumentation purpose.
>
> And for those IO counters that we haven't moved here, we should mention
> it is because they track at a different level of granularity or at a
> different point in the call stack.

Done.

>
> >                       if (io_object == IOOBJECT_RELATION)
> >                               INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> > +                     /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
> > +                     else if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
> > +                             INSTR_TIME_ADD(PendingWalStats.wal_write_time, io_time);
> >               }
>
>
> Also, I would reorder the if statements to be in order of the enum
> values (e.g. FSYNC, READ, WRITE).

Done.

>
> >               else if (io_op == IOOP_READ)
> >               {
> > @@ -131,6 +158,12 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
> >                       if (io_object == IOOBJECT_RELATION)
> >                               INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
> >               }
> > +             else if (io_op == IOOP_FSYNC)
> > +             {
> > +                     /* Track IOOBJECT_WAL/IOCONTEXT_NORMAL times on PendingWalStats */
>
> I wouldn't squeeze this comment here like this. It is hard to read

Done.

>
> > +                     if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL)
> > +                             INSTR_TIME_ADD(PendingWalStats.wal_sync_time, io_time);
>
>
> > + * op_bytes can change according to IOObject and IOContext.
> > + * Return BLCKSZ as default.
> > + */
> > +int
> > +pgstat_get_io_op_btyes(IOObject io_object, IOContext io_context)
> > +{
>
> Small typo in function name:
> pgstat_get_io_op_btyes -> pgstat_get_io_op_bytes
> I'd also mention why BLCKSZ is the default

Done.

>
> > +     if (io_object == IOOBJECT_WAL)
> > +     {
> > +             if (io_context == IOCONTEXT_NORMAL)
> > +                     return XLOG_BLCKSZ;
> > +             else if (io_context == IOCONTEXT_INIT)
> > +                     return wal_segment_size;
> > +     }
> > +
> > +     return BLCKSZ;
> > +}
>
> > @@ -350,6 +405,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
> >       if (!pgstat_tracks_io_bktype(bktype))
> >               return false;
> >
> > +     /*
> > +      * Currently, IO on IOOBJECT_WAL IOObject can only occur in the
> > +      * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContext.
> > +      */
> > +     if (io_object == IOOBJECT_WAL &&
> > +             (io_context != IOCONTEXT_NORMAL &&
>
> Little bit of errant whitespace here.

Done.

>
> >       /*
> >        * Currently, IO on temporary relations can only occur in the
> >        * IOCONTEXT_NORMAL IOContext.
> > @@ -439,6 +503,14 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
> >       if (io_context == IOCONTEXT_BULKREAD && io_op == IOOP_EXTEND)
> >               return false;
>
> I would expand on the comment to explain what NORMAL is for WAL -- what
> we consider normal to be and why. And why it is different than INIT.

Done.

>
> >
> > +     if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
> > +        !(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
> > +        return false;
> > +
> > +     if(io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
> > +        !(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
> > +        return false;
>
> These are the first "bans" that we have for an IOOp for a specific
> combination of io_context and io_object. We should add a new comment for
> this and perhaps consider what ordering makes most sense. I tried to
> organize the bans from most broad to most specific at the bottom.

Done.

>
> >
> > --- a/src/backend/utils/adt/pgstatfuncs.c
> > +++ b/src/backend/utils/adt/pgstatfuncs.c
> > @@ -1409,7 +1410,8 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
> >                                * and constant multipliers, once non-block-oriented IO (e.g.
> >                                * temporary file IO) is tracked.
> >                                */
> > -                             values[IO_COL_CONVERSION] = Int64GetDatum(BLCKSZ);
>
> There's a comment above this in the code that says this is hard-coded to
> BLCKSZ. That comment needs to be updated or removed (in lieu of the
> comment in your pgstat_get_io_op_bytes() function).

Done.

>
>
> > +                             op_bytes = pgstat_get_io_op_btyes(io_obj, io_context);
> > +                             values[IO_COL_CONVERSION] = Int64GetDatum(op_bytes);
> >
>
> > +extern PGDLLIMPORT bool track_wal_io_timing;
> > +extern PGDLLIMPORT int wal_segment_size;
>
> These shouldn't be in two places (i.e. they are already in xlog.h and
> you added them in pgstat.h. pg_stat_io.c includes bufmgr.h for
> track_io_timing, so you can probably justify including xlog.h.

Done.

Any kind of feedback would be appreciated.

Regards,
Nazir Bilal Yavuz
Microsoft
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrey Lepikhov
Дата:
Сообщение: Re: disfavoring unparameterized nested loops
Следующее
От: Fujii Masao
Дата:
Сообщение: Re: pg_rewind with cascade standby doesn't work well