Thread: [PATCH] LLVM tuple deforming improvements


[PATCH] LLVM tuple deforming improvements

From
Pierre Ducroquet
Date:
Hi

As reported in the «effect of JIT tuple deform?» thread, JIT tuple deforming
causes slowdowns in some cases.
I've played with the generated code and with the LLVM optimizer trying to fix
that issue; here are the results of my experiments, with the corresponding
patches.

All performance measurements are done following the test from
https://www.postgresql.org/message-id/CAFj8pRAOcSXNnykfH=M6mNaHo+g=FaUs=DLDZsOHdJbKujRFSg@mail.gmail.com

Base measurements:

No JIT : 850 ms
JIT without tuple deforming : 820 ms (0.2 ms optimizing)
JIT with tuple deforming, no opt : 1650 ms (1.5 ms)
JIT with tuple deforming, -O3 : 770 ms (105 ms)

1) force a -O1 when deforming

This is by far the best result I managed to get. With -O1, the queries are even
faster than with -O3: the optimizer runs faster while still generating
efficient code.
I have tried adding the right passes to the pass manager, but it looks like the
interesting ones are not available unless you enable -O1.

JIT with tuple deforming, -O1 : 725 ms (54 ms)

2) improve the LLVM IR code

The code generator in llvmjit_deform.c currently relies on the LLVM optimizer
to do the right thing. For instance, it can generate a lot of empty blocks
containing only a jump. If we don't want to enable the LLVM optimizer for all
generated code, we have to get rid of this kind of pattern. The attached patch
does that. When the optimizer is not used, this gains a few cycles, nothing
impressive.
I have tried to get closer to the optimized bitcode, but that requires building
phi nodes manually instead of using alloca, and it still isn't enough to bring
us to the performance level of -O1.

JIT with tuple deforming, no opt : 1560 ms (1.5 ms)
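To illustrate the pattern (a hypothetical IR fragment, not lifted from the
actual generated code): the unoptimized deforming function can contain blocks
whose only instruction is an unconditional branch, and the fix is to have the
predecessor branch to the final target directly:

```llvm
; before: %b.check exists only to forward control
entry:
  br label %b.check
b.check:                                ; preds = %entry
  br label %b.load
b.load:                                 ; preds = %b.check
  ret void

; after: entry branches straight to %b.load,
; and %b.check is never emitted
entry:
  br label %b.load
b.load:                                 ; preds = %entry
  ret void
```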

3) *experimental* : faster non-NULL handling

Currently, the generated code always looks at the tuple header bitmap to check
each field's null-ness, and-ing the result with the hasnulls bit afterwards.
Checking only hasnulls improves performance when tuples mostly contain no
nulls, but hurts performance when nulls are found.
I have not yet succeeded in implementing it, but I think that, using the
statistics collected for a given table, we could enable this fast path when we
know we are likely to benefit from it.

JIT with tuple deforming, no opt : 1520 ms (1.5 ms)
JIT with tuple deforming, -O1 : 690 ms (54 ms)

Attachments

Re: [PATCH] LLVM tuple deforming improvements

From
Andres Freund
Date:
Hi,

Thanks for looking at this!

On 2018-07-13 10:20:42 +0200, Pierre Ducroquet wrote:
> 2) improve the LLVM IR code
> 
> The code generator in llvmjit-deform.c currently rely on the LLVM optimizer to 
> do the right thing. For instance, it can generate a lot of empty blocks with 
> only a jump. If we don't want to enable the LLVM optimizer for every code, we 
> have to get rid of this kind of pattern. The attached patch does that. When 
> the optimizer is not used, this gets a few cycles boost, nothing impressive.
> I have tried to go closer to the optimized bitcode, but it requires building 
> phi nodes manually instead of using alloca, and this isn't enough to bring us 
> to the performance level of -O1.

Building phi blocks manually is too painful imo. But there's plenty
blocks we could easily skip entirely, even without creating phi nodes.



> From 4da278ee49b91d34120747c6763c248ad52da7b7 Mon Sep 17 00:00:00 2001
> From: Pierre Ducroquet <p.psql@pinaraf.info>
> Date: Mon, 2 Jul 2018 13:44:10 +0200
> Subject: [PATCH] Introduce opt1 in LLVM/JIT, and force it with deforming

I think I'd rather go for more explicit pipelines than defaulting to OX
pipelines.  This is too late for v11, and I suspect quite strongly that
we'll end up not relying on any of the default pipelines going forward -
they're just not targeted at our usecase.  I just didn't go there for
v11, because I was running out of steam / time.

Greetings,

Andres Freund


Re: [PATCH] LLVM tuple deforming improvements

From
Pierre Ducroquet
Date:
On Friday, July 13, 2018 11:08:45 PM CEST Andres Freund wrote:
> Hi,
> 
> Thanks for looking at this!
> 
> On 2018-07-13 10:20:42 +0200, Pierre Ducroquet wrote:
> > 2) improve the LLVM IR code
> > 
> > The code generator in llvmjit-deform.c currently rely on the LLVM
> > optimizer to do the right thing. For instance, it can generate a lot of
> > empty blocks with only a jump. If we don't want to enable the LLVM
> > optimizer for every code, we have to get rid of this kind of pattern. The
> > attached patch does that. When the optimizer is not used, this gets a few
> > cycles boost, nothing impressive. I have tried to go closer to the
> > optimized bitcode, but it requires building phi nodes manually instead of
> > using alloca, and this isn't enough to bring us to the performance level
> > of -O1.
> 
> Building phi blocks manually is too painful imo. But there's plenty
> blocks we could easily skip entirely, even without creating phi nodes.
> 
> > From 4da278ee49b91d34120747c6763c248ad52da7b7 Mon Sep 17 00:00:00 2001
> > From: Pierre Ducroquet <p.psql@pinaraf.info>
> > Date: Mon, 2 Jul 2018 13:44:10 +0200
> > Subject: [PATCH] Introduce opt1 in LLVM/JIT, and force it with deforming
> 
> I think I'd rather go for more explicit pipelines than defaulting to OX
> pipelines.  This is too late for v11, and I suspect quite strongly that
> we'll end up not relying on any of the default pipelines going forward -
> they're just not targeted at our usecase.  I just didn't go there for
> v11, because I was running out of steam / time.

Hi

After looking at the optimization passes, I noticed that, at least in this 
case, most of the benefit does not come from any of the bitcode-level passes.
Using a -O3 pipeline on the bitcode alone gives at best a 20% performance 
boost.
Using -O0 on the bitcode, but changing the codegen opt level from None (O0) 
to Less (O1), yields even better performance than a complete O1.
The attached patch alone gives a query time of 650 ms, vs 725 ms for 'full' O1 
and 770 ms for O3.

As far as I know (and from a quick look at the LLVM code, this doesn't seem to 
have changed recently), the CodeGen part doesn't expose a pass manager, making 
it impossible to have our own optimization pipeline there, and we do not 
control code generation directly since we rely on the ORC execution engine.


Regards

 Pierre

Attachments