Thread: Stampede of the JIT compilers

Stampede of the JIT compilers

From
James Coleman
Date:
Hello,

We recently brought online a new database cluster, and in the course
of ramping up traffic to it encountered a situation where a misplanned
query (analyzing helped with this, but I think the issue is still
relevant) resulted in that query being compiled with JIT, and soon a
large number of backends were running that same shape of query, all of
them JIT compiling it. Since each JIT compilation took ~2s, this
starved the server of resources.
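
To make the arithmetic of such a pileup concrete, here is a rough back-of-the-envelope sketch in Python. Only the ~2s compilation time comes from the report above; the arrival rate and core count are made-up assumptions, and the function name is hypothetical.

```python
def concurrent_compilations(arrival_rate_per_s: float, compile_time_s: float) -> float:
    """Little's law: average in-flight compilations = arrival rate * duration."""
    return arrival_rate_per_s * compile_time_s


# ~2 s per JIT compilation is from the report; the other numbers are assumed.
COMPILE_TIME_S = 2.0
ARRIVAL_RATE_PER_S = 40.0   # assumption: queries of this shape per second
CPU_CORES = 32              # assumption: cores available on the server

in_flight = concurrent_compilations(ARRIVAL_RATE_PER_S, COMPILE_TIME_S)
oversubscription = in_flight / CPU_CORES

print(f"backends compiling at once: {in_flight:.0f}")
print(f"CPU demand vs. cores: {oversubscription:.1f}x")
```

Under these assumed numbers the compilations alone ask for more CPU than the machine has. The estimate is optimistic, because once the CPUs are oversubscribed each compilation takes longer than 2s and the backlog grows further.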

There are a couple of issues here. I'm sure it's been discussed
before, and it's not the point of my thread, but I can't help but note
that the default value of jit_above_cost of 100000 seems absurdly low.
On good hardware like we have, even well-planned queries with costs
well above that won't take as long as JIT compilation does.

But on the topic of the thread: I'd like to know if anyone has ever
considered implementing a GUC/feature like
"max_concurrent_jit_compilations" to cap the number of backends that
may be compiling a query at any given point, so that we keep an
optimization from running amok and consuming all of a server's
resources?
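
As a rough illustration of what such a cap could mean, here is a minimal Python sketch. It is a toy model only: "max_concurrent_jit_compilations" does not exist in PostgreSQL, the class and callback names are hypothetical, and threads stand in for backends purely to keep the example self-contained.

```python
import threading


class JitCompilationCap:
    """Toy model of a cap on concurrent JIT compilations (hypothetical)."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run_query(self, compile_and_run, run_interpreted):
        # Try to take a compilation slot without blocking.
        if self._slots.acquire(blocking=False):
            try:
                return compile_and_run()
            finally:
                self._slots.release()
        # No slot free: skip JIT and execute the plan without it.
        return run_interpreted()
```

In this sketch a query that cannot get a slot does not wait; it simply runs without JIT. Real backends are separate processes, so an actual implementation would need the counter in shared memory rather than an in-process semaphore.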

Regards,
James Coleman



Re: Stampede of the JIT compilers

From
David Rowley
Date:
On Sat, 24 Jun 2023 at 02:28, James Coleman <jtc331@gmail.com> wrote:
> There are a couple of issues here. I'm sure it's been discussed
> before, and it's not the point of my thread, but I can't help but note
> that the default value of jit_above_cost of 100000 seems absurdly low.
> On good hardware like we have even well-planned queries with costs
> well above that won't be taking as long as JIT compilation does.

It would be good to know your evidence for thinking it's too low.

The main problem I see with it is that the costing does not account
for how many expressions will be compiled. It's quite different to
compile JIT expressions for a query to a single table with a simple
WHERE clause vs some query with many joins which scans a partitioned
table with 1000 partitions, for example.
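
A toy Python model of that difference, for illustration only. The 100000 threshold is the default discussed in this thread; the per-expression charge, the expression counts, and all names are invented assumptions, not PostgreSQL's actual costing code.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    total_cost: float      # planner's estimated total cost
    num_expressions: int   # expressions that would be JIT compiled


JIT_ABOVE_COST = 100_000  # default value discussed in this thread


def jit_by_total_cost(plan: Plan) -> bool:
    """Decide on the plan's total cost alone."""
    return plan.total_cost > JIT_ABOVE_COST


def jit_by_cost_per_expression(plan: Plan, per_expr_cost: float = 1_000.0) -> bool:
    """Hypothetical variant: also charge an assumed cost per compiled expression."""
    return plan.total_cost > JIT_ABOVE_COST + per_expr_cost * plan.num_expressions


# Same estimated cost, very different amounts of code to compile (assumed numbers).
simple = Plan(total_cost=150_000, num_expressions=5)
partitioned = Plan(total_cost=150_000, num_expressions=3_000)

for name, plan in (("simple", simple), ("partitioned", partitioned)):
    print(name, jit_by_total_cost(plan), jit_by_cost_per_expression(plan))
```

With a threshold on total cost alone both plans are compiled. Once the number of expressions is charged for, only the simple one is.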

> But on the topic of the thread: I'd like to know if anyone has ever
> considered implemented a GUC/feature like
> "max_concurrent_jit_compilations" to cap the number of backends that
> may be compiling a query at any given point so that we avoid an
> optimization from running amok and consuming all of a servers
> resources?

Why does the number of backends matter?  JIT compilation consumes the
same CPU resources that the JIT compilation is meant to save.  If the
JIT compilation in your query happened to be a net win rather than a
net loss in terms of CPU usage, then why would
max_concurrent_jit_compilations be useful? It would just restrict us
on what we could save. This idea just covers up the fact that the JIT
costing is disconnected from reality.  It's a bit like trying to tune
your radio with the volume control.

I think the JIT costs would be better taking into account how useful
each expression will be to JIT compile.  There were some ideas thrown
around in [1].

David

[1] https://www.postgresql.org/message-id/CAApHDvpQJqLrNOSi8P1JLM8YE2C%2BksKFpSdZg%3Dq6sTbtQ-v%3Daw%40mail.gmail.com



Re: Stampede of the JIT compilers

From
Tomas Vondra
Date:

On 6/24/23 02:33, David Rowley wrote:
> On Sat, 24 Jun 2023 at 02:28, James Coleman <jtc331@gmail.com> wrote:
>> There are a couple of issues here. I'm sure it's been discussed
>> before, and it's not the point of my thread, but I can't help but note
>> that the default value of jit_above_cost of 100000 seems absurdly low.
>> On good hardware like we have even well-planned queries with costs
>> well above that won't be taking as long as JIT compilation does.
> 
> It would be good to know your evidence for thinking it's too low.
> 
> The main problem I see with it is that the costing does not account
> for how many expressions will be compiled. It's quite different to
> compile JIT expressions for a query to a single table with a simple
> WHERE clause vs some query with many joins which scans a partitioned
> table with 1000 partitions, for example.
> 

I think it's both - as explained by James, there are queries with much
higher cost where the JIT compilation still takes much longer than just
running the query without JIT. So the idea that a cost of 100k is
sufficient to make up for the extra JIT compilation cost clearly does
not hold.

But it's true that's because the JIT costing is very crude, and there's
little effort to account for how expensive the compilation will be (say,
how many expressions, ...).

IMHO there's no "good" default that wouldn't hurt an awful lot of cases.

There's also a lot of bias - people are unlikely to notice/report cases
when the JIT (including costing) works fine. But they sure are annoyed
when it makes the wrong choice.

>> But on the topic of the thread: I'd like to know if anyone has ever
>> considered implemented a GUC/feature like
>> "max_concurrent_jit_compilations" to cap the number of backends that
>> may be compiling a query at any given point so that we avoid an
>> optimization from running amok and consuming all of a servers
>> resources?
> 
> Why do the number of backends matter?  JIT compilation consumes the
> same CPU resources that the JIT compilation is meant to save.  If the
> JIT compilation in your query happened to be a net win rather than a
> net loss in terms of CPU usage, then why would
> max_concurrent_jit_compilations be useful? It would just restrict us
> on what we could save. This idea just covers up the fact that the JIT
> costing is disconnected from reality.  It's a bit like trying to tune
> your radio with the volume control.
> 

Yeah, I don't quite get this point either. If JIT for a given query
helps (i.e. makes execution shorter), it'd be harmful to restrict the
maximum number of concurrent compilations. If we just disable JIT after
some threshold is reached, that'd make queries longer and just make the
pileup worse.

If it doesn't help for a given query, we shouldn't be doing it at all.
But that should be based on better costing, not some threshold.

In practice there'll be a mix of queries where JIT does/doesn't help,
and this threshold would just arbitrarily (and quite unpredictably)
enable/disable JIT, making it yet harder to investigate slow queries
(as if we didn't have enough trouble with that already).

> I think the JIT costs would be better taking into account how useful
> each expression will be to JIT compile.  There were some ideas thrown
> around in [1].
> 

+1 to that


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Stampede of the JIT compilers

From
James Coleman
Date:
On Sat, Jun 24, 2023 at 7:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 6/24/23 02:33, David Rowley wrote:
> > On Sat, 24 Jun 2023 at 02:28, James Coleman <jtc331@gmail.com> wrote:
> >> There are a couple of issues here. I'm sure it's been discussed
> >> before, and it's not the point of my thread, but I can't help but note
> >> that the default value of jit_above_cost of 100000 seems absurdly low.
> >> On good hardware like we have even well-planned queries with costs
> >> well above that won't be taking as long as JIT compilation does.
> >
> > It would be good to know your evidence for thinking it's too low.

It's definitely possible that I stated this much more emphatically
than I should have -- it was coming out of my frustration with this
situation after all.

I think, though, that my later comments here will provide some
philosophical justification for it.

> > The main problem I see with it is that the costing does not account
> > for how many expressions will be compiled. It's quite different to
> > compile JIT expressions for a query to a single table with a simple
> > WHERE clause vs some query with many joins which scans a partitioned
> > table with 1000 partitions, for example.
> >
>
> I think it's both - as explained by James, there are queries with much
> higher cost, but the JIT compilation takes much more than just running
> the query without JIT. So the idea that 100k difference is clearly not
> sufficient to make up for the extra JIT compilation cost.
>
> But it's true that's because the JIT costing is very crude, and there's
> little effort to account for how expensive the compilation will be (say,
> how many expressions, ...).
>
> IMHO there's no "good" default that wouldn't hurt an awful lot of cases.
>
> There's also a lot of bias - people are unlikely to notice/report cases
> when the JIT (including costing) works fine. But they sure are annoyed
> when it makes the wrong choice.
>
> >> But on the topic of the thread: I'd like to know if anyone has ever
> >> considered implemented a GUC/feature like
> >> "max_concurrent_jit_compilations" to cap the number of backends that
> >> may be compiling a query at any given point so that we avoid an
> >> optimization from running amok and consuming all of a servers
> >> resources?
> >
> > Why do the number of backends matter?  JIT compilation consumes the
> > same CPU resources that the JIT compilation is meant to save.  If the
> > JIT compilation in your query happened to be a net win rather than a
> > net loss in terms of CPU usage, then why would
> > max_concurrent_jit_compilations be useful? It would just restrict us
> > on what we could save. This idea just covers up the fact that the JIT
> > costing is disconnected from reality.  It's a bit like trying to tune
> > your radio with the volume control.
> >
>
> Yeah, I don't quite get this point either. If JIT for a given query
> helps (i.e. makes execution shorter), it'd be harmful to restrict the
> maximum number of concurrent compilations. It we just disable JIT after
> some threshold is reached, that'd make queries longer and just made the
> pileup worse.

My thought process here is that, given the poor modeling of JIT costing
you've both described, we're likely to estimate the cost of "easy"
JIT compilation acceptably well but also likely to estimate "complex"
JIT compilation far lower than its actual cost.

Another way of saying this is that our range of JIT compilation costs
may well be fine on the bottom end but clamped on the high end, and
that means that our failure modes will tend towards the worst
mis-costings being the most painful (e.g., 2s compilation time for a
100ms query). This is even more the case in an OLTP system where the
majority of queries are already known to be quite fast.

In that context capping the number of backends compiling, particularly
where plans (and JIT?) might be cached, could well save us (depending
on workload).

That being said, I could imagine an alternative approach solving a
similar problem -- a way of exiting early from compilation if it takes
longer than we expect.
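
A minimal Python sketch of that early-exit idea, under stated assumptions: compilation is modeled as a list of independent steps, the budget check runs only between steps, and every name here is hypothetical rather than anything in PostgreSQL or LLVM.

```python
import time
from typing import Callable, Iterable, List, Optional


def compile_with_budget(
    compile_steps: Iterable[Callable[[], object]],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[List[object]]:
    """Run compile steps in order; return None once the time budget is exceeded."""
    start = clock()
    compiled: List[object] = []
    for step in compile_steps:
        # Check the budget before starting the next step.
        if clock() - start > budget_s:
            return None  # give up; the caller falls back to running without JIT
        compiled.append(step())
    return compiled
```

Because the check happens only between steps, a single long-running step cannot be interrupted in this model; the work already spent on compilation is also lost when the budget runs out.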

> If it doesn't help for a given query, we shouldn't be doing it at all.
> But that should be based on better costing, not some threshold.
>
> In practice there'll be a mix of queries where JIT does/doesn't help,
> and this threshold would just arbitrarily (and quite unpredictably)
> enable/disable costing, making it yet harder to investigate slow queries
> (as if we didn't have enough trouble with that already).
>
> > I think the JIT costs would be better taking into account how useful
> > each expression will be to JIT compile.  There were some ideas thrown
> > around in [1].
> >
>
> +1 to that

That does sound like an improvement.

One thing about our JIT that is different from e.g. browser JS engine
JITing is that we don't substitute in the JIT code "on the fly" while
execution is already underway. That'd be another, albeit quite
difficult, way to solve these issues.

Regards,
James Coleman



Re: Stampede of the JIT compilers

From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Jun 24, 2023 at 7:40 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> On 6/24/23 02:33, David Rowley wrote:
>>> On Sat, 24 Jun 2023 at 02:28, James Coleman <jtc331@gmail.com> wrote:
>>>> There are a couple of issues here. I'm sure it's been discussed
>>>> before, and it's not the point of my thread, but I can't help but note
>>>> that the default value of jit_above_cost of 100000 seems absurdly low.
>>>> On good hardware like we have even well-planned queries with costs
>>>> well above that won't be taking as long as JIT compilation does.

>>> It would be good to know your evidence for thinking it's too low.

> It's definitely possible that I stated this much more emphatically
> than I should have -- it was coming out of my frustration with this
> situation after all.

I think there is *plenty* of evidence that it is too low, or at least
that for some reason we are too willing to invoke JIT when the result
is to make the overall cost of a query far higher than it is without.
Just see all the complaints on the mailing lists that have been
resolved by advice to turn off JIT.  You do not even have to look
further than our own regression tests: on my machine with current
HEAD, "time make installcheck-parallel" reports

real    0m8.544s
user    0m0.906s
sys     0m0.863s

for a build without --with-llvm, and

real    0m13.211s
user    0m0.917s
sys     0m0.811s

for a build with it (and all JIT settings left at defaults).  If you
do non-parallel "installcheck" the ratio is similar.  I don't see how
anyone can claim that 50% slowdown is just fine.
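
For reference, the ratio behind that figure can be worked out from the two "real" times above. This is plain arithmetic in Python; only the variable names are invented.

```python
without_llvm_s = 8.544   # "real" time, build without --with-llvm
with_llvm_s = 13.211     # "real" time, build with --with-llvm

# Relative increase in wall-clock time for the build with LLVM.
slowdown = with_llvm_s / without_llvm_s - 1.0
print(f"slowdown: {slowdown:.1%}")
```

The wall-clock time grows by a bit more than half, consistent with the roughly 50% quoted.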

I don't know whether raising the default would be enough to fix that
in a nice way, and I certainly don't pretend to have a specific value
to offer.  But it's undeniable that we have a serious problem here,
to the point where JIT is a net negative for quite a few people.


> In that context capping the number of backends compiling, particularly
> where plans (and JIT?) might be cached, could well save us (depending
> on workload).

TBH I do not find this proposal attractive in the least.  We have
a problem here even when you consider a single backend.  If we fixed
that, so that we don't invoke JIT unless it really helps, then it's
not going to help less just because you have a lot of backends.
Plus, the overhead of managing a system-wide limit is daunting.

            regards, tom lane



Re: Stampede of the JIT compilers

From
David Rowley
Date:
On Sun, 25 Jun 2023 at 05:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> James Coleman <jtc331@gmail.com> writes:
> > On Sat, Jun 24, 2023 at 7:40 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >> On 6/24/23 02:33, David Rowley wrote:
> >>> On Sat, 24 Jun 2023 at 02:28, James Coleman <jtc331@gmail.com> wrote:
> >>>> There are a couple of issues here. I'm sure it's been discussed
> >>>> before, and it's not the point of my thread, but I can't help but note
> >>>> that the default value of jit_above_cost of 100000 seems absurdly low.
> >>>> On good hardware like we have even well-planned queries with costs
> >>>> well above that won't be taking as long as JIT compilation does.
>
> >>> It would be good to know your evidence for thinking it's too low.
>
> > It's definitely possible that I stated this much more emphatically
> > than I should have -- it was coming out of my frustration with this
> > situation after all.
>
> I think there is *plenty* of evidence that it is too low, or at least
> that for some reason we are too willing to invoke JIT when the result
> is to make the overall cost of a query far higher than it is without.

I've seen plenty of other reports and I do agree there is a problem,
but I think you're jumping to conclusions in this particular case.
I've seen nothing here that couldn't equally indicate that the planner
overestimated the costs or some row estimate for the given
query.  The solution to those problems shouldn't be bumping up the
default JIT thresholds; it could be to fix the costs or tune/add
statistics to get better row estimates.

I don't think it's too big an ask to see a few more details so that we
can confirm what the actual problem is.

David



Re: Stampede of the JIT compilers

From
James Coleman
Date:
On Sat, Jun 24, 2023 at 1:54 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> James Coleman <jtc331@gmail.com> writes:
> > In that context capping the number of backends compiling, particularly
> > where plans (and JIT?) might be cached, could well save us (depending
> > on workload).
>
> TBH I do not find this proposal attractive in the least.  We have
> a problem here even when you consider a single backend.  If we fixed
> that, so that we don't invoke JIT unless it really helps, then it's
> not going to help less just because you have a lot of backends.
> Plus, the overhead of managing a system-wide limit is daunting.
>
>                         regards, tom lane

I'm happy to withdraw that particular idea. My mental model was along
the lines of "this is a startup cost, and then we'll have it cached, so
the higher than expected cost won't matter as much when the system
settles down", and in that scenario limiting the size of the herd can
make sense.

But that's not the broader problem, so...

Regards,
James Coleman



Re: Stampede of the JIT compilers

From
James Coleman
Date:
On Sat, Jun 24, 2023 at 8:14 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sun, 25 Jun 2023 at 05:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > James Coleman <jtc331@gmail.com> writes:
> > > On Sat, Jun 24, 2023 at 7:40 AM Tomas Vondra
> > > <tomas.vondra@enterprisedb.com> wrote:
> > >> On 6/24/23 02:33, David Rowley wrote:
> > >>> On Sat, 24 Jun 2023 at 02:28, James Coleman <jtc331@gmail.com> wrote:
> > >>>> There are a couple of issues here. I'm sure it's been discussed
> > >>>> before, and it's not the point of my thread, but I can't help but note
> > >>>> that the default value of jit_above_cost of 100000 seems absurdly low.
> > >>>> On good hardware like we have even well-planned queries with costs
> > >>>> well above that won't be taking as long as JIT compilation does.
> >
> > >>> It would be good to know your evidence for thinking it's too low.
> >
> > > It's definitely possible that I stated this much more emphatically
> > > than I should have -- it was coming out of my frustration with this
> > > situation after all.
> >
> > I think there is *plenty* of evidence that it is too low, or at least
> > that for some reason we are too willing to invoke JIT when the result
> > is to make the overall cost of a query far higher than it is without.
>
> I've seen plenty of other reports and I do agree there is a problem,
> but I think you're jumping to conclusions in this particular case.
> I've seen nothing here that couldn't equally indicate the planner
> didn't overestimate the costs or some row estimate for the given
> query.  The solution to those problems shouldn't be bumping up the
> default JIT thresholds it could be to fix the costs or tune/add
> statistics to get better row estimates.
>
> I don't think it's too big an ask to see a few more details so that we
> can confirm what the actual problem is.

I did say in the original email "encountered a situation where a
misplanned query (analyzing helped with this, but I think the issue is
still relevant)".

I'll look at specifics again on Monday, but what I do remember is that
there were a lot of joins, and we already know we have cases where
those are planned poorly too (even absent bad stats).

What I wanted to get at more broadly here was thinking along the lines
of how to prevent the misplanning from causing such a disaster.

Regards,
James Coleman



Re: Stampede of the JIT compilers

From
Michael Banck
Date:
Hi,

On Sat, Jun 24, 2023 at 01:54:53PM -0400, Tom Lane wrote:
> I don't know whether raising the default would be enough to fix that
> in a nice way, and I certainly don't pretend to have a specific value
> to offer.  But it's undeniable that we have a serious problem here,
> to the point where JIT is a net negative for quite a few people.

Some further data: to my knowledge, most major managed postgres
providers disable jit for their users. Azure certainly does, but I don't
have a Google Cloud SQL or RDS instance running right now to verify their
settings. I do seem to remember that they did as well though, at least a
while back.


Michael



Re: Stampede of the JIT compilers

From
James Coleman
Date:
On Sun, Jun 25, 2023 at 5:10 AM Michael Banck <mbanck@gmx.net> wrote:
>
> Hi,
>
> On Sat, Jun 24, 2023 at 01:54:53PM -0400, Tom Lane wrote:
> > I don't know whether raising the default would be enough to fix that
> > in a nice way, and I certainly don't pretend to have a specific value
> > to offer.  But it's undeniable that we have a serious problem here,
> > to the point where JIT is a net negative for quite a few people.
>
> Some further data: to my knowledge, most major managed postgres
> providers disable jit for their users. Azure certainly does, but I don't
> have a Google Cloud SQL or RDS instance running right to verify their
> settings. I do seem to remember that they did as well though, at least a
> while back.
>
>
> Michael

I believe it's off by default in Aurora Postgres also.

Regards,
James Coleman



Re: Stampede of the JIT compilers

From
Laurenz Albe
Date:
On Sun, 2023-06-25 at 11:10 +0200, Michael Banck wrote:
> On Sat, Jun 24, 2023 at 01:54:53PM -0400, Tom Lane wrote:
> > I don't know whether raising the default would be enough to fix that
> > in a nice way, and I certainly don't pretend to have a specific value
> > to offer.  But it's undeniable that we have a serious problem here,
> > to the point where JIT is a net negative for quite a few people.
>
> Some further data: to my knowledge, most major managed postgres
> providers disable jit for their users.

I have also started recommending jit=off for all but analytic workloads.

Yours,
Laurenz Albe



Re: Stampede of the JIT compilers

From
Andres Freund
Date:
Hi,

On 2023-06-24 13:54:53 -0400, Tom Lane wrote:
> I think there is *plenty* of evidence that it is too low, or at least
> that for some reason we are too willing to invoke JIT when the result
> is to make the overall cost of a query far higher than it is without.
> Just see all the complaints on the mailing lists that have been
> resolved by advice to turn off JIT.  You do not even have to look
> further than our own regression tests: on my machine with current
> HEAD, "time make installcheck-parallel" reports
> 
> real    0m8.544s
> user    0m0.906s
> sys     0m0.863s
> 
> for a build without --with-llvm, and
> 
> real    0m13.211s
> user    0m0.917s
> sys     0m0.811s

IIRC those are all, or nearly all, cases where we have no stats and the plans
have ridiculous costs (and other reasons like enable_seqscan = false and
using seqscans nonetheless). In those cases no cost based approach will work
:(.


> I don't know whether raising the default would be enough to fix that
> in a nice way, and I certainly don't pretend to have a specific value
> to offer.  But it's undeniable that we have a serious problem here,
> to the point where JIT is a net negative for quite a few people.

Yea, I think at the moment it's not working well enough to be worth having on
by default. Some of that is due to partitioning having become much more
common, leading to much bigger plan trees, some of it is just old stuff that I
had hoped could be addressed more easily.

FWIW, Daniel Gustafsson is hacking on an old patch of mine that was working
towards making the JIT result cacheable (and providing noticeably bigger
performance gains).


> > In that context capping the number of backends compiling, particularly
> > where plans (and JIT?) might be cached, could well save us (depending
> > on workload).
> 
> TBH I do not find this proposal attractive in the least.

Yea, me neither. It doesn't address any of the actual problems and will add
new contention.

Greetings,

Andres Freund