Re: Syncrep and improving latency due to WAL throttling

From: Tomas Vondra
Subject: Re: Syncrep and improving latency due to WAL throttling
Date:
Msg-id: 8cb875d2-42c2-fa04-a041-631ebfe526a7@enterprisedb.com
In response to: Re: Syncrep and improving latency due to WAL throttling  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
Responses: Re: Syncrep and improving latency due to WAL throttling  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List: pgsql-hackers

On 2/1/23 11:04, Jakub Wartak wrote:
> On Mon, Jan 30, 2023 at 9:16 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> 
> Hi Bharath, thanks for reviewing.
> 
>> I think measuring the number of WAL flushes that postgres generates
>> with and without this feature is great to know this feature's effects
>> on IOPS. Probably it's even better with variations in
>> synchronous_commit_flush_wal_after or the throttling limits.
> 
> It's the same as point 19, so I covered it there.
> 
>> Although v2 is a WIP patch, I have some comments:
>> 1. Coding style doesn't conform to the Postgres standard:
> 
> Fixed all those that you mentioned, and I've removed elog() and the
> timing code. BTW: is there a way to pgindent only my changes (on a git
> diff)?
> 
>> 2. It'd be better to add a TAP test hitting the WAL throttling.
> 
> TODO, any hints on what that test should look like are welcome
> (checking pg_stat_wal?). I could spot only one place for it --
> src/test/recovery/007_sync_rep.pl
> 
>> 3. We generally don't need timings to be calculated explicitly when we
>> emit before and after log messages. If needed one can calculate the
>> wait time from timestamps of the log messages. If it still needs an
>> explicit time difference which I don't think is required, because we
>> have a special event and before/after log messages, I think it needs
>> to be under track_wal_io_timing to keep things simple.
> 
> Removed, it was just debugging aid to demonstrate the effect in the
> session waiting.
> 
>> 4. XLogInsertRecord() can be a hot path for high-write workload, so
>> effects of adding anything in it needs to be measured. So, it's better
>> to run benchmarks with this feature enabled and disabled.
> 
> When the GUC is off (0 / default), in my tests the impact is none (it's
> just a set of simple IFs); however, when the feature is enabled, the
> INSERT is slowed down, but that's intentional. I've repeated the initial
> test from the 1st post: a single-statement INSERT generating 50MB of WAL
> still takes 4-5s, and ~23s +/- 1s with the feature activated and an RTT
> of 10.1ms.
> 
>> 5. Missing documentation of this feature and the GUC. I think we can
>> describe extensively why we'd need a feature of this kind in the
>> documentation for better adoption or usability.
> 
> TODO, I'm planning to add documentation when we're a little bit closer
> to adding this to the CF.
> 
>> 6. Shouldn't the max limit be MAX_KILOBYTES?
>> +        &synchronous_commit_flush_wal_after,
>> +        0, 0, 1024L*1024L,
> 
> Fixed.
> 
>> 7. Can this GUC name be something like
>> synchronous_commit_wal_throttle_threshold to better reflect what it
>> does for a backend?
>> +        {"synchronous_commit_flush_wal_after", PGC_USERSET,
>> REPLICATION_SENDING,
> 
> Done.
> 
>> 8. A typo - s/confiration/confirmation
> [..]
>> 9. This
>> "Sets the maximum logged WAL in kbytes, after which wait for sync
>> commit confiration even without commit "
>> better be something like below?
>> "Sets the maximum amount of WAL in kilobytes a backend generates after
>> which it waits for synchronous commit confiration even without commit
>> "
> 
> Fixed as you have suggested.
> 
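Just to make sure we ended up in the same place on (6), (7) and (9)
combined, this is roughly the GUC entry I'd expect to see - my own sketch,
not necessarily what the patch has verbatim, and GUC_UNIT_KB is my
assumption (based on the '256kB' syntax used further down) rather than
something quoted here:

    {
        {"synchronous_commit_wal_throttle_threshold", PGC_USERSET, REPLICATION_SENDING,
            gettext_noop("Sets the maximum amount of WAL in kilobytes a backend "
                         "generates after which it waits for synchronous commit "
                         "confirmation even without commit."),
            NULL,
            GUC_UNIT_KB        /* assumed unit, not quoted in this thread */
        },
        &synchronous_commit_wal_throttle_threshold,
        0, 0, MAX_KILOBYTES,
        NULL, NULL, NULL
    },
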
>> 10. I think there's nothing wrong in adding some assertions in
>> HandleXLogDelayPending():
>> Assert(synchronous_commit_flush_wal_after > 0);
>> Assert(backendWalInserted > synchronous_commit_flush_wal_after * 1024L);
>> Assert(XactLastThrottledRecEnd is not InvalidXLogRecPtr);
> 
> Sure, added.
> 
>> 11. Specify the reason in the comments as to why we had to do the
>> following things:
>> Here:
>> +    /* flush only up to the last fully filled page */
>> +    XLogRecPtr     LastFullyWrittenXLogPage = XactLastThrottledRecEnd
>> - (XactLastThrottledRecEnd % XLOG_BLCKSZ);
>> Talk about how this avoids multiple-flushes for the same page.
>>
>> Here:
>> + * Called from ProcessMessageInterrupts() to avoid waiting while
>> being in critical section
>> + */
>> +void HandleXLogDelayPending()
>> Talk about how waiting in a critical section, that is in
>> XLogInsertRecord() causes locks to be held longer durations and other
>> effects.
> 
> Added.
> 
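FWIW, this is how I read the resulting HandleXLogDelayPending() logic - a
rough sketch only, using the renamed GUC from (7); anything not quoted in
this thread (includes, types, the exact call sequence) is my guess, not
the actual patch:

    #include "postgres.h"
    #include "access/xlog.h"
    #include "replication/syncrep.h"

    /* patch-provided globals (types are my guess) */
    extern bool         XLogDelayPending;
    extern int          backendWalInserted;
    extern XLogRecPtr   XactLastThrottledRecEnd;
    extern int          synchronous_commit_wal_throttle_threshold;

    void
    HandleXLogDelayPending(void)
    {
        XLogRecPtr  LastFullyWrittenXLogPage;

        /* only reachable when throttling is armed and the threshold was hit */
        Assert(synchronous_commit_wal_throttle_threshold > 0);
        Assert(backendWalInserted > synchronous_commit_wal_throttle_threshold * 1024L);
        Assert(!XLogRecPtrIsInvalid(XactLastThrottledRecEnd));

        /*
         * Flush only up to the last fully filled page, so we don't write
         * (and fsync) the same partially filled page over and over as it
         * keeps growing.
         */
        LastFullyWrittenXLogPage = XactLastThrottledRecEnd -
            (XactLastThrottledRecEnd % XLOG_BLCKSZ);

        XLogDelayPending = false;

        /*
         * We're called from the interrupt-processing path, i.e. outside the
         * critical section in XLogInsertRecord(), so it's OK to do the
         * potentially long flush and sync-rep wait here instead of while
         * holding WAL insertion locks.
         */
        XLogFlush(LastFullyWrittenXLogPage);
        SyncRepWaitForLSN(LastFullyWrittenXLogPage, false);

        /* start accounting from scratch for the next throttle point */
        backendWalInserted = 0;
    }
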
>> Here:
>> +        /* WAL throttling */
>> +        backendWalInserted += rechdr->xl_tot_len;
>> Be a bit more verbose about why we try to throttle WAL and why only
>> for sync replication, the advantages, effects etc.
> 
> Added.
> 
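And the corresponding arming side in XLogInsertRecord() I imagine looks
something like the fragment below (again just my sketch, sitting near the
end of the function where EndPos is known - the exact "sync rep enabled"
condition is a guess):

        /*
         * WAL throttling: account for the WAL this backend has inserted
         * since the last confirmed flush.  We're inside a critical section
         * here, so we must not wait - if the backend crossed the configured
         * threshold, just remember the LSN and set a flag, and let the
         * interrupt processing call HandleXLogDelayPending() once it's safe
         * to wait.  This only makes sense with synchronous replication,
         * which is the whole point of the feature.
         */
        backendWalInserted += rechdr->xl_tot_len;

        if (synchronous_commit_wal_throttle_threshold > 0 &&
            SyncRepRequested() &&
            backendWalInserted > synchronous_commit_wal_throttle_threshold * 1024L)
        {
            XactLastThrottledRecEnd = EndPos;
            XLogDelayPending = true;
            InterruptPending = true;
        }
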
>> 12. This better be a DEBUG1 or 2 message instead of WARNINGs to not
>> clutter server logs.
>> +    /* XXX Debug for now */
>> +    elog(WARNING, "throttling WAL down on this session
>> (backendWalInserted=%d, LSN=%X/%X flushingTo=%X/%X)",
>> +        backendWalInserted,
>> +        LSN_FORMAT_ARGS(XactLastThrottledRecEnd),
>> +        LSN_FORMAT_ARGS(LastFullyWrittenXLogPage));
>>
>> +    elog(WARNING, "throttling WAL down on this session - end (%f
>> ms)", timediff);
> 
> OK, that's entirely removed.
> 
>> 13. BTW, we don't need to hold interrupts while waiting for sync
>> replication ack as it may block query cancels or proc die pendings.
>> You can remove these, unless there's a strong reason.
>> +    //HOLD_INTERRUPTS();
>> +    //RESUME_INTERRUPTS();
> 
> Sure, removed. However, one problem I've discovered is that we were
> hitting Assert(InterruptHoldoffCount > 0) in SyncRepWaitForLSN, so
> I've fixed that too.
> 
>> 14. Add this wait event in the documentation.
>> +        case WAIT_EVENT_SYNC_REP_THROTTLED:
>> +            event_name = "SyncRepThrottled";
>> +            break;
>>
>> 15. Why can XLogDelayPending not be a volatile atomic variable? Is it
>> because it's not something being set within a signal handler?
>>  extern PGDLLIMPORT volatile sig_atomic_t IdleStatsUpdateTimeoutPending;
>> +extern PGDLLIMPORT bool XLogDelayPending;
> 
> Added a comment explaining that.
> 
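Right - for the archives, the rule of thumb is that volatile sig_atomic_t
is only needed for flags touched from a signal handler; this one is only
ever set and cleared by the backend itself in regular code paths, so a
plain bool is fine:

    /*
     * Set by XLogInsertRecord(), cleared by HandleXLogDelayPending().
     * Never touched from a signal handler, so plain bool is sufficient.
     */
    extern PGDLLIMPORT bool XLogDelayPending;
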
>> 16.
>> +    instr_time    wal_throttle_time_start, wal_throttle_time_end;
>> +    double        timediff;
>> +    XLogDelayPending = false;
>> An extra line needed after variable declaration and assignment.
> 
> Fixed.
> 
>> 17. I think adding how many times a backend throttled WAL to
>> pg_stat_activity is a good metric.
> 
> Good idea, added; catversion and pgstat format id were bumped. I've
> also added it to the per-query EXPLAIN (WAL), so it logs something like
> "WAL: records=500000 bytes=529500000 throttled=2016", however I would
> appreciate suggestions for a better name for that counter.
> 
>> 18. Can you avoid the need of new functions SyncRepWaitForLSNInternal
>> and SyncRepWaitForLSNThrottled by relying on the global throttling
>> state to determine the correct waitevent in SyncRepWaitForLSN?
> 
> Done.
> 
>> 19. I think measuring the number of WAL flushes that postgres
>> generates with and without this feature is great to know this
>> feature's effects on IOPS. Probably it's even better with variations
>> in synchronous_commit_flush_wal_after or the throttling limits.
> 
> default =>
> INSERT 0 500000
> Time: 6996.340 ms (00:06.996)
> -[ RECORD 1 ]----+------------------------------
> wal_records      | 500001
> wal_fpi          | 0
> wal_bytes        | 529500034
> wal_buffers_full | 44036
> wal_write        | 44056
> wal_sync         | 40
> wal_write_time   | 317.991
> wal_sync_time    | 2690.425
> wal_throttled    | 0
> stats_reset      | 2023-02-01 10:31:44.475651+01
> and 1 call to probe_postgres:XLogFlush
> 
> set synchronous_commit_wal_throttle_threshold to '256kB'; =>
> INSERT 0 500000
> Time: 25476.155 ms (00:25.476)
> -[ RECORD 1 ]----+------------------------------
> wal_records      | 500001
> wal_fpi          | 0
> wal_bytes        | 529500034
> wal_buffers_full | 0
> wal_write        | 2062
> wal_sync         | 2062
> wal_write_time   | 180.177
> wal_sync_time    | 1409.522
> wal_throttled    | 2016
> stats_reset      | 2023-02-01 10:32:01.955513+01
> and 2017 calls to probe_postgres:XLogFlush
> 
> set synchronous_commit_wal_throttle_threshold to '1MB'; =>
> INSERT 0 500000
> Time: 10113.278 ms (00:10.113)
> -[ RECORD 1 ]----+------------------------------
> wal_records      | 642864
> wal_fpi          | 0
> wal_bytes        | 537929315
> wal_buffers_full | 0
> wal_write        | 560
> wal_sync         | 559
> wal_write_time   | 182.586
> wal_sync_time    | 988.72
> wal_throttled    | 504
> stats_reset      | 2023-02-01 10:32:36.250678+01
> 

I'm not quite sure how to interpret these numbers and what conclusions
to draw. ~520MB of WAL is about 2000 x 256kB chunks, so it's not
surprising we do ~2017 XLogFlush calls.
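
A quick back-of-the-envelope check against the wal_bytes value above:
529,500,034 B / (256 * 1024 B) is roughly 2020 full chunks, so ~2017
XLogFlush calls is pretty much exactly what the accounting predicts.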

But does that mean we did more I/O operations (like fsyncs)? Not
necessarily, because something else might have done a flush too. For
example, if there are concurrent sessions doing commit, that'll trigger
fsync. And if we back off a bit (e.g. to the LSN of the last complete
page), we'll simply piggyback on that and won't do any extra fsync or
other IO.

Of course, if there is no other concurrent activity triggering flushes,
this will do extra (network) I/O and fsync. But that's expected and
flushing data more often naturally reduces throughput. The impact is
mostly proportional to network latency.

The way I see this is as a "latency guarantee" - not allowing more
unflushed (not confirmed by the sync replica) WAL than can be
sent/confirmed within some time limit. Of course, we don't have a way to
specify this in time units directly, so we have to specify an amount of
WAL instead. And stricter guarantees generally lead to lower throughput.
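
To put very rough numbers on that, using Jakub's measurements above (and
ignoring any overlap with other waits): 2016 throttle points x ~10.1 ms
RTT is roughly 20 s of added waiting, which is in the right ballpark for
the observed slowdown from ~7 s to ~25 s with the 256kB threshold.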

I don't think there's a way around this, except for disabling the
throttling by default - if someone enables it, they do so intentionally
because latency is more important to them than throughput.

> Maybe we should avoid calling fsyncs for WAL throttling? (by teaching
> HandleXLogDelayPending()->XLogFlush()->XLogWrite() to NOT sync when
> we are flushing just because of WAL throttling?) Would that still be
> safe?

It's not clear to me how this could work and still be safe. I mean, we
*must* flush the local WAL first, otherwise the replica could get ahead
of the primary (if we send unflushed WAL to the replica and then crash),
which would be really bad, obviously.

And we *have* to send the data to the sync replica, because that's the
whole point of this thread.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


