Обсуждение: WAL CPU overhead/optimization (was Master-slave visibility order)

Поиск
Список
Период
Сортировка

WAL CPU overhead/optimization (was Master-slave visibility order)

От
Andres Freund
Дата:
On 2013-08-30 01:10:40 +0300, Ants Aasma wrote:
> On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
> > per CPU overhead actually minimally increased (at least in my tests), it
> > just scales noticeably better than before.
> 
> Interesting. Do you have any insight what is behind the CPU overhead?
> Maybe the solution is to make WAL insertion cheap enough to not
> matter. That won't be easy, but neither are the alternatives.

Funnily by far the biggest thing I have seen in benchmarks is the CRC32
computation. I plan to brush up my ~3 year old CRC32 reimplementation
patch sometime, but afair you had a much better one?

I have some doubts about weakening the hash function by also using FNV
or similar here, so I'd first like to try how much of a difference a
better CRC32 implementation can make with the current XLogInsert()
implementation.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: WAL CPU overhead/optimization (was Master-slave visibility order)

От
Ants Aasma
Дата:
On Fri, Aug 30, 2013 at 1:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-08-30 01:10:40 +0300, Ants Aasma wrote:
>> On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
>> > per CPU overhead actually minimally increased (at least in my tests), it
>> > just scales noticeably better than before.
>>
>> Interesting. Do you have any insight what is behind the CPU overhead?
>> Maybe the solution is to make WAL insertion cheap enough to not
>> matter. That won't be easy, but neither are the alternatives.
>
> Funnily by far the biggest thing I have seen in benchmarks is the CRC32
> computation. I plan to brush up my ~3 year old CRC32 reimplementation
> patch sometime, but afair you had a much better one?
>
> I have some doubts about weakening the hash function by also using FNV
> or similar here, so I'd first like to try how much of a difference a
> better CRC32 implementation can make with the current XLogInsert()
> implementation.

The CRC32 implementations mostly differ by the amount of lookups that
are done in parallel. Postgresql does 1 lookup, IIRC zlib
implementation does 4, Intel has a paper that recommends going up to
8. The tradeoff is that each level requires a 4KB lookup table - for
small records the additional cache misses will probably kill any
speedup.

A quick overview of the hot cache large buffer performance of a few
interesting options:
crc32 slice-by-1: 0.148 bytes/cycle
crc32 slice-by-4: 0.392 bytes/cycle
crc32 slice-by-8: 0.654 bytes/cycle
crc32c instruction pipelined by 3: 6.8 bytes/cycle (number from Intels paper)
FNV 1 byte at a time version: 0.333 bytes/cycle
md5: 0.159 bytes/cycle
Murmur3A: 1.019 bytes/cycle
CityHash64: 4.246 bytes/cycle

CityHash64 actually looks pretty good, there no known hash quality
issues. Compared to CRC, the only weakening is that single bit errors
are not guaranteed to be 100% detected. There's also the issue that
only a 64bit implementation exists, but I'm sure this can be resolved
(if necessary, we can just use Murmur3 on 32bit).

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: WAL CPU overhead/optimization (was Master-slave visibility order)

От
Andres Freund
Дата:
On 2013-08-30 02:53:54 +0300, Ants Aasma wrote:
> On Fri, Aug 30, 2013 at 1:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-08-30 01:10:40 +0300, Ants Aasma wrote:
> >> On Fri, Aug 30, 2013 at 12:33 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > FWIW, WAL is still the major bottleneck for INSERT heavy workloads. The
> >> > per CPU overhead actually minimally increased (at least in my tests), it
> >> > just scales noticeably better than before.
> >>
> >> Interesting. Do you have any insight what is behind the CPU overhead?
> >> Maybe the solution is to make WAL insertion cheap enough to not
> >> matter. That won't be easy, but neither are the alternatives.
> >
> > Funnily by far the biggest thing I have seen in benchmarks is the CRC32
> > computation. I plan to brush up my ~3 year old CRC32 reimplementation
> > patch sometime, but afair you had a much better one?
> >
> > I have some doubts about weakening the hash function by also using FNV
> > or similar here, so I'd first like to try how much of a difference a
> > better CRC32 implementation can make with the current XLogInsert()
> > implementation.
> 
> The CRC32 implementations mostly differ by the amount of lookups that
> are done in parallel. Postgresql does 1 lookup, IIRC zlib
> implementation does 4, Intel has a paper that recommends going up to
> 8. The tradeoff is that each level requires a 4KB lookup table - for
> small records the additional cache misses will probably kill any
> speedup.
> 
> A quick overview of the hot cache large buffer performance of a few
> interesting options:
> [interesting data]

I am not sure "hot cache large buffer performance" is really the
interesting case. Most of the XLogInsert()s are pretty small in the
common workloads. I vaguely recall trying 8 and getting worse
performance on many workloads, but that might have been a problem of my
implementation.

The reason I'd like to go for a faster CRC32 implementation as a first
step is that it's easy. Easy to verify, easy to analyze, easy to
backout. I personally don't have enough interest/time in the 9.4 cycle
to purse conversion to a different algorithm (I find the idea of using
different ones on 32/64bit pretty bad), but I obviously won't stop
somebody else ;)

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: WAL CPU overhead/optimization (was Master-slave visibility order)

От
Ants Aasma
Дата:
On Fri, Aug 30, 2013 at 3:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I am not sure "hot cache large buffer performance" is really the
> interesting case. Most of the XLogInsert()s are pretty small in the
> common workloads. I vaguely recall trying 8 and getting worse
> performance on many workloads, but that might have been a problem of my
> implementation.

Slice-by-8 doesn't have any overhead for small buffers besides the
lookup tables, so it most likely the cache misses that were the issue.
Murmur3, CityHash and SpookyHash don't have any lookup tables and are
excellent with small keys. Especially CityHash, 64 byte hash is quoted
at 9ns.

> The reason I'd like to go for a faster CRC32 implementation as a first
> step is that it's easy. Easy to verify, easy to analyze, easy to
> backout. I personally don't have enough interest/time in the 9.4 cycle
> to purse conversion to a different algorithm (I find the idea of using
> different ones on 32/64bit pretty bad), but I obviously won't stop
> somebody else ;)

I might give it a shot later this cycle as I have familiarized my self
with the problem domain anyway. I understand the appeal of staying
with what we have, but this would cap the speedup at 4x and has large
caveats with the extra lookup tables. A 28x speedup might be worth the
extra effort.

Regards,
Ants Aasma



Re: WAL CPU overhead/optimization (was Master-slave visibility order)

От
"ktm@rice.edu"
Дата:
On Fri, Aug 30, 2013 at 03:22:37AM +0300, Ants Aasma wrote:
> On Fri, Aug 30, 2013 at 3:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I am not sure "hot cache large buffer performance" is really the
> > interesting case. Most of the XLogInsert()s are pretty small in the
> > common workloads. I vaguely recall trying 8 and getting worse
> > performance on many workloads, but that might have been a problem of my
> > implementation.
> 
> Slice-by-8 doesn't have any overhead for small buffers besides the
> lookup tables, so it most likely the cache misses that were the issue.
> Murmur3, CityHash and SpookyHash don't have any lookup tables and are
> excellent with small keys. Especially CityHash, 64 byte hash is quoted
> at 9ns.
> 
> > The reason I'd like to go for a faster CRC32 implementation as a first
> > step is that it's easy. Easy to verify, easy to analyze, easy to
> > backout. I personally don't have enough interest/time in the 9.4 cycle
> > to purse conversion to a different algorithm (I find the idea of using
> > different ones on 32/64bit pretty bad), but I obviously won't stop
> > somebody else ;)
> 
> I might give it a shot later this cycle as I have familiarized my self
> with the problem domain anyway. I understand the appeal of staying
> with what we have, but this would cap the speedup at 4x and has large
> caveats with the extra lookup tables. A 28x speedup might be worth the
> extra effort.
> 
> Regards,
> Ants Aasma
> 

You may want to also check out xxhash with a BSD License and very fast
32-bit performance as well:

http://fastcompression.blogspot.com/2012/04/selecting-checksum-algorithm.html
http://code.google.com/p/xxhash/

FWIW I agree that a much faster function would be better for CPU overhead.

Regards,
Ken



Re: WAL CPU overhead/optimization (was Master-slave visibility order)

От
Robert Haas
Дата:
On Thu, Aug 29, 2013 at 8:22 PM, Ants Aasma <ants@cybertec.at> wrote:
> I might give it a shot later this cycle as I have familiarized my self
> with the problem domain anyway. I understand the appeal of staying
> with what we have, but this would cap the speedup at 4x and has large
> caveats with the extra lookup tables. A 28x speedup might be worth the
> extra effort.

I agree.  However, on the flip side, a bird in the hand is worth two
in the bush.  A patch that just does the same thing faster is likely
to be less controversial than changing the algorithm, and does not
preclude changing the algorithm later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL CPU overhead/optimization (was Master-slave visibility order)

От
Hannu Krosing
Дата:
On 09/03/2013 05:27 PM, Robert Haas wrote:
> On Thu, Aug 29, 2013 at 8:22 PM, Ants Aasma <ants@cybertec.at> wrote:
>> I might give it a shot later this cycle as I have familiarized my self
>> with the problem domain anyway. I understand the appeal of staying
>> with what we have, but this would cap the speedup at 4x and has large
>> caveats with the extra lookup tables. A 28x speedup might be worth the
>> extra effort.
> I agree.  However, on the flip side, a bird in the hand is worth two
> in the bush.  A patch that just does the same thing faster is likely
> to be less controversial than changing the algorithm, and does not
> preclude changing the algorithm later.
>
Has anybody looked at the recent i5/i7 added hrdware support for CRC32 ?

http://www.strchr.com/crc32_popcnt

From number there it is probably still slower than xxhash, but it might
be worth doing as
an interim optimisation.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




Re: WAL CPU overhead/optimization (was Master-slave visibility order)

От
Andres Freund
Дата:
On 2013-09-03 23:25:01 +0200, Hannu Krosing wrote:
> On 09/03/2013 05:27 PM, Robert Haas wrote:
> > On Thu, Aug 29, 2013 at 8:22 PM, Ants Aasma <ants@cybertec.at> wrote:
> >> I might give it a shot later this cycle as I have familiarized my self
> >> with the problem domain anyway. I understand the appeal of staying
> >> with what we have, but this would cap the speedup at 4x and has large
> >> caveats with the extra lookup tables. A 28x speedup might be worth the
> >> extra effort.
> > I agree.  However, on the flip side, a bird in the hand is worth two
> > in the bush.  A patch that just does the same thing faster is likely
> > to be less controversial than changing the algorithm, and does not
> > preclude changing the algorithm later.
> >
> Has anybody looked at the recent i5/i7 added hrdware support for CRC32 ?
> 
> http://www.strchr.com/crc32_popcnt
> 
> From number there it is probably still slower than xxhash, but it might
> be worth doing as
> an interim optimisation.

Yes. There's two problems: For one they use a different polynom
(castagnoli afair), so the results aren't compatible with ones we
compute so far. For another, you need runtime detection of whether the
instruction is available and you need some compiler specific things to
use it.
Neither is too hard to solve, but I think that amount of work is better
spent using a different hash.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services