Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

From: Andres Freund
Subject: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Date:
Msg-id: 20230123202619.yckkxn56iavoq2rn@awork3.anarazel.de
In response to: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?  (David Geier <geidav.pg@gmail.com>)
Responses: Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?  (David Geier <geidav.pg@gmail.com>)
List: pgsql-hackers
Hi,

On 2023-01-23 18:49:37 +0100, David Geier wrote:
> On 1/21/23 05:12, Andres Freund wrote:
> > We do currently do the conversion quite frequently.  Admittedly I was
> > partially motivated by trying to get the per-loop overhead in pg_test_timing
> > down ;)
> > 
> > But I think it's a real issue. Places where we do, but shouldn't, convert:
> > 
> > - ExecReScan() - quite painful, we can end up with a lot of those
> > - InstrStopNode() - adds a good bit of overhead to simple
> InstrStopNode() doesn't convert in the general case but only for the first
> tuple or when async. So it goes somewhat hand in hand with ExecReScan().

I think even the first-scan portion is likely noticeable for quick queries -
you can quickly end up with 5-10 nodes, even for queries processed in the <
0.1ms range.

Of course it's way worse with rescans / loops.


> > - PendingWalStats.wal_write_time - this is particularly bad because it happens
> >    within very contended code
> > - calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() -
> >    they can be very frequent
> > - pgbench.c, as we already discussed
> > - pg_stat_statements.c
> > - ...
> > 
> > These all will get a bit slower when moving to a "variable" frequency.

> I wonder if we will be able to measure any of them easily. But given that
> it's many more places than I had realized and given that the optimized code
> is not too involved, let's give it a try.

I think at least some should be converted to just accumulate in an
instr_time...
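
For illustration, roughly the pattern I mean - a minimal sketch, assuming the
INSTR_TIME_* macros from src/include/portability/instr_time.h; the struct and
function names are made up, not actual code from the tree:

#include "portability/instr_time.h"

typedef struct WalWriteStats
{
    /* accumulated raw time, not microseconds; INSTR_TIME_SET_ZERO'd at startup */
    instr_time  wal_write_time;
} WalWriteStats;

static void
time_wal_write(WalWriteStats *stats)
{
    instr_time  start,
                end;

    INSTR_TIME_SET_CURRENT(start);
    /* ... perform the actual write ... */
    INSTR_TIME_SET_CURRENT(end);

    /* cheap: just accumulate the raw difference, no conversion here */
    INSTR_TIME_ACCUM_DIFF(stats->wal_write_time, end, start);
}

static double
report_wal_write_ms(const WalWriteStats *stats)
{
    /* the only conversion happens here, once per report */
    return INSTR_TIME_GET_MILLISEC(stats->wal_write_time);
}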



> > What was your approach for avoiding the costly operation?  I ended up with an
> > integer multiplication + shift approximation for the floating point
> > multiplication (which in turn uses the inverse of the division by the
> > frequency). To allow for sufficient precision while also avoiding overflows, I
> > had to make that branch conditional, with a slow path for large numbers of
> > nanoseconds.
> 
> It seems like we ended up with the same. I do:
> 
> sec = ticks / frequency_hz
> ns  = ticks / frequency_hz * 1,000,000,000
> ns  = ticks * (1,000,000,000 / frequency_hz)
> ns  = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
> 
> Now, the constant scaling factor in parentheses is typically a floating
> point number. For example, for a frequency of 2.5 GHz (2,500,000 kHz) it
> would be 0.4. To work around that we can do something like:
> 
> ns  = ticks * (1,000,000 * scaler / frequency_khz) / scaler
> 
> Where scaler is a power-of-2, big enough to maintain enough precision while
> allowing for a shift to implement the division.

Yep, at least quite similar.
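
For concreteness, a minimal sketch of that multiply/shift conversion,
hardcoding a 2.5 GHz TSC; the names and constants are illustrative only, not
what either patch actually does:

#include <stdint.h>

#define TICKS_TO_NS_SHIFT   29

/* (10^9 / frequency_hz) * 2^29 = 0.4 * 2^29, rounded; computed at startup in reality */
static const uint64_t ticks_to_ns_mul =
    (uint64_t) (1000000000.0 * (1 << TICKS_TO_NS_SHIFT) / 2500000000.0 + 0.5);

static inline uint64_t
ticks_to_ns(uint64_t ticks)
{
    /*
     * Integer multiply + shift instead of a floating point multiply.  This
     * overflows once ticks * ticks_to_ns_mul exceeds 64 bits, so a real
     * implementation needs a (rarely taken) slow path for large inputs.
     */
    return (ticks * ticks_to_ns_mul) >> TICKS_TO_NS_SHIFT;
}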


> The additional multiplication with scaler makes the maximum range go down,
> because we must ensure we never overflow. I'm wondering if we cannot pick
> scaler in such a way that the remaining range of cycles is large enough for
> our use case, so that we can live without bothering with the overflow case.
> What would be "enough"? 1 year? 10 years? ...

Depending on how low we want to keep the error, I don't think we can:

If I set the allowed deviation to 10**-9, we end up requiring a shift by 29
for common GHz ranges. Clearly 33 bits isn't an interesting range.

But even if you accept a higher error - we don't have *that* much range
available. Assuming a uint64, the range is ~584 years. If we want a 10 year
range, we end up with

  math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2)
  ~= 5.87

So that leaves ~5 bits we could "use" for the multiply/shift. For something
like 2.5 GHz, that'd be ~2% error, clearly not acceptable.  And even just a
year of range ends up allowing an error of ~30796s ≈ 8.5 hours over a year,
still too high.
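
(For anyone wanting to reproduce the numbers, a throwaway snippet along these
lines gives the same ballpark - purely illustrative, not patch code:)

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    const double ghz = 2.5;
    const double years = 10.0;
    const double ns_range = years * 365 * 24 * 60 * 60 * 1e9;

    /* bits left over in a uint64 once we reserve ns_range nanoseconds */
    int         shift = (int) floor(log2((double) UINT64_MAX / ns_range));

    /* best approximation of the ns-per-tick ratio with that shift */
    double      exact = 1.0 / ghz;
    double      approx = round(exact * (1 << shift)) / (1 << shift);

    printf("shift = %d bits, relative error = %.2f%%\n",
           shift, 100.0 * fabs(approx - exact) / exact);
    return 0;
}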


But I don't think it's really an issue - normally that branch will never be
taken (at least within the memory of the branch predictor), which on modern
CPUs means it'll just be predicted as not taken. So as long as we tell the
compiler what's the likely branch, it should be fine. At least as long as the
branch compares with a hardcoded number.
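
Roughly what I have in mind, as a sketch - the constants and names are made
up (PostgreSQL spells the hint likely()/unlikely() in c.h), and the limit
just has to be small enough that the multiply can't overflow:

#include <stdint.h>

#define TICKS_TO_NS_SHIFT   29
static const uint64_t ticks_to_ns_mul = 214748365;  /* 0.4 * 2^29, rounded */
static const double tsc_frequency_hz = 2500000000.0;

/* hardcoded bound keeping ticks * ticks_to_ns_mul below 2^64 */
#define TICKS_FAST_PATH_LIMIT   (UINT64_C(1) << 36)

static inline uint64_t
ticks_to_ns_checked(uint64_t ticks)
{
    /* tell the compiler the overflow check is the unlikely branch */
    if (__builtin_expect(ticks < TICKS_FAST_PATH_LIMIT, 1))
        return (ticks * ticks_to_ns_mul) >> TICKS_TO_NS_SHIFT;

    /* rarely-taken slow path, can afford full precision here */
    return (uint64_t) ((double) ticks * 1e9 / tsc_frequency_hz);
}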


> > I think it'd be great - but I'm not sure we're there yet, reliability and
> > code-complexity wise.

> Thanks to your commits, the diff of the new patch set will already be much
> smaller and easier to review. What's your biggest concern in terms of
> reliability?

- the restriction just to linux, that'll make testing harder for some, and
  ends up encoding too much OS dependency
- I think we need both the barrier and non-barrier variant (see the sketch
  after this list), otherwise I suspect we'll end up with inaccuracies we
  don't want
- needs lots more documentation about why certain cpuid registers are used
- cpu microarch dependencies - isn't there, e.g., the case that the scale on
  nehalem has to be different than on later architectures?
- lack of facility to evaluate how well the different time sources work
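
On the barrier point, roughly the distinction I mean - a sketch using the
x86 intrinsics, with made-up function names:

#include <stdint.h>
#include <x86intrin.h>          /* __rdtsc(), __rdtscp(), _mm_lfence() */

/* cheap read; the CPU may reorder it relative to surrounding instructions */
static inline uint64_t
read_tsc(void)
{
    return __rdtsc();
}

/*
 * Ordered read: rdtscp waits for prior instructions to finish, and the
 * trailing lfence keeps later instructions from starting before the read.
 */
static inline uint64_t
read_tsc_ordered(void)
{
    unsigned int aux;
    uint64_t    tsc = __rdtscp(&aux);

    _mm_lfence();
    return tsc;
}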


> > I think it might be worth making the rdtsc aspect somewhat
> > measurable. E.g. allowing pg_test_timing to use both at the same time, and
> > have it compare elapsed time with both sources of counters.
> I haven't yet looked into pg_test_timing. I'll do that while including your
> patches into the new patch set.

Cool.
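
Something as crude as the following would already be a start - a rough
standalone sketch of measuring the same interval with both sources, not
actual pg_test_timing code, and with the TSC frequency hardcoded instead of
measured:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <x86intrin.h>

int
main(void)
{
    const double tsc_ghz = 2.5;     /* would be calibrated in reality */
    struct timespec ts_start, ts_end;
    uint64_t    tsc_start, tsc_end;

    clock_gettime(CLOCK_MONOTONIC, &ts_start);
    tsc_start = __rdtsc();

    sleep(1);                       /* the interval under test */

    tsc_end = __rdtsc();
    clock_gettime(CLOCK_MONOTONIC, &ts_end);

    double      os_ns = (ts_end.tv_sec - ts_start.tv_sec) * 1e9 +
                        (ts_end.tv_nsec - ts_start.tv_nsec);
    double      tsc_ns = (double) (tsc_end - tsc_start) / tsc_ghz;

    printf("clock_gettime: %.0f ns, rdtsc: %.0f ns, delta: %.0f ns\n",
           os_ns, tsc_ns, os_ns - tsc_ns);
    return 0;
}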

Greetings,

Andres Freund


