Обсуждение: Oom on temp (un-analyzed table caused by JIT) V16.1

Поиск
Список
Период
Сортировка

Oom on temp (un-analyzed table caused by JIT) V16.1

От
Kirk Wolak
Дата:
Daniel,
  You have a commit [1] that MIGHT fix this.
I have a script that recreates the problem, using random data in pg_temp.
And a nested cursor.

  It took me a few days to reduce this from actual code that was experiencing this.  If I turn off JIT, the problem goes away.  (if I don't FETCH the first row, the memory loss does not happen.  Maybe because opening a cursor is more decoration/prepare)

  I don't have an easy way to test this script right now against the commit.
I am hopeful that your fix fixes this.

  This was my first OOM issue in PG in 3yrs of working with it.

  The problem goes away if the TABLE is analyzed, or JIT is disabled.

  The current script, if run, will consume about 25% of my system memory (10GB).
Just call the function below until it dies if that's what you need.  The only way to get the memory back down is to close the connection.

SELECT pg_temp.fx(497);

Surprisingly, to me, the report from pg_get_backend_memory_contexts() doesn't really show "missing memory", which  I thought it would.  (FWIW, we caught this with multiple rounds of testing our code, slowing down, then crashing...  Is there ANY way to interrogate that we are above X% of system memory so we know to let this backend go?)

It takes about 18 minutes to run on my 4 CPU VM.

For now, we are going to add some ANALYZE statements to our code.
We will consider disabling JIT.

Thanks,
Kirk

  

Вложения

Re: Oom on temp (un-analyzed table caused by JIT) V16.1

От
Pavel Stehule
Дата:
Hi

po 15. 1. 2024 v 7:24 odesílatel Kirk Wolak <wolakk@gmail.com> napsal:
Daniel,
  You have a commit [1] that MIGHT fix this.
I have a script that recreates the problem, using random data in pg_temp.
And a nested cursor.

  It took me a few days to reduce this from actual code that was experiencing this.  If I turn off JIT, the problem goes away.  (if I don't FETCH the first row, the memory loss does not happen.  Maybe because opening a cursor is more decoration/prepare)

  I don't have an easy way to test this script right now against the commit.
I am hopeful that your fix fixes this.

  This was my first OOM issue in PG in 3yrs of working with it.

  The problem goes away if the TABLE is analyzed, or JIT is disabled.

  The current script, if run, will consume about 25% of my system memory (10GB).
Just call the function below until it dies if that's what you need.  The only way to get the memory back down is to close the connection.

SELECT pg_temp.fx(497);

Surprisingly, to me, the report from pg_get_backend_memory_contexts() doesn't really show "missing memory", which  I thought it would.  (FWIW, we caught this with multiple rounds of testing our code, slowing down, then crashing...  Is there ANY way to interrogate that we are above X% of system memory so we know to let this backend go?)

I wrote simple extension that can show memory allocation from system perspective


 

It takes about 18 minutes to run on my 4 CPU VM.

For now, we are going to add some ANALYZE statements to our code.

remember - don't run anything without VACUUM ANALYZE.

Without it, the queries can be slow - ANALYZE sets stats, VACUUM prepare visibility maps - without visibility maps index only scan cannot be used

autovacuum doesn't see into opened transactions, and autovacuum is executed in 1minute cycles. Autovacuum doesn't see temporary tables too. Temporary tables (data) are visible only from owner process.


 
We will consider disabling JIT.

Has sense only for bigger analytics queries.

Regards

Pavel
 

Re: Oom on temp (un-analyzed table caused by JIT) V16.1

От
Daniel Gustafsson
Дата:
> On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com> wrote:

>   You have a commit [1] that MIGHT fix this.
> I have a script that recreates the problem, using random data in pg_temp.
> And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on the
v16 HEAD I cannot see any memory growth, so it's plausible that the upcoming
16.2 will work better in your environment.

>   It took me a few days to reduce this from actual code that was experiencing this.  If I turn off JIT, the problem
goesaway.  (if I don't FETCH the first row, the memory loss does not happen.  Maybe because opening a cursor is more
decoration/prepare)
>
>   I don't have an easy way to test this script right now against the commit.

There are up to date snapshots of the upcoming v16 minor release which might
make testing easier than building postgres from source?

    https://www.postgresql.org/download/snapshots/

Admittedly I don't know whether those are built with LLVM support but I think
they might be.

> I am hopeful that your fix fixes this.

It seems likely, so it would be very valuable if you could try running the pre
-release version of v16 before 16.2 ships to verify this.

--
Daniel Gustafsson




Re: Oom on temp (un-analyzed table caused by JIT) V16.1

От
Kirk Wolak
Дата:
On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com> wrote:

>   You have a commit [1] that MIGHT fix this.
> I have a script that recreates the problem, using random data in pg_temp.
> And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on the
v16 HEAD I cannot see any memory growth, so it's plausible that the upcoming
16.2 will work better in your environment.

The script starts by creating 90 Million rows...  In my world that part of the script, plus the indexes, etc.  Takes about 8-9 minutes.
And has no memory loss. 

I used the memory reported by HTOP to track the problem.  I Forgot to mention this.
I am curious what you used?  (Because it doesn't display symptoms [running dog slow] until the backend has about 85% of the machines memory)
 
There are up to date snapshots of the upcoming v16 minor release which might
make testing easier than building postgres from source?

    https://www.postgresql.org/download/snapshots/

Thank you.  I have assigned that task to the guy who maintains our VMs/Environments. 
I will report back to you.

Kirk

Re: Oom on temp (un-analyzed table caused by JIT) V16.1

От
Pavel Stehule
Дата:


po 15. 1. 2024 v 15:03 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com> wrote:

>   You have a commit [1] that MIGHT fix this.
> I have a script that recreates the problem, using random data in pg_temp.
> And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on the
v16 HEAD I cannot see any memory growth, so it's plausible that the upcoming
16.2 will work better in your environment.

yes

 

>   It took me a few days to reduce this from actual code that was experiencing this.  If I turn off JIT, the problem goes away.  (if I don't FETCH the first row, the memory loss does not happen.  Maybe because opening a cursor is more decoration/prepare)
>
>   I don't have an easy way to test this script right now against the commit.

There are up to date snapshots of the upcoming v16 minor release which might
make testing easier than building postgres from source?

    https://www.postgresql.org/download/snapshots/

Admittedly I don't know whether those are built with LLVM support but I think
they might be.

> I am hopeful that your fix fixes this.

It seems likely, so it would be very valuable if you could try running the pre
-release version of v16 before 16.2 ships to verify this.

--
Daniel Gustafsson

Re: Oom on temp (un-analyzed table caused by JIT) V16.1

От
Daniel Gustafsson
Дата:
> On 15 Jan 2024, at 16:49, Kirk Wolak <wolakk@gmail.com> wrote:
> On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> wrote:

> The script starts by creating 90 Million rows...  In my world that part of the script, plus the indexes, etc.  Takes
about8-9 minutes. 
> And has no memory loss.

That's expected, the memory leak did not affect those operations.

> I used the memory reported by HTOP to track the problem.  I Forgot to mention this.
> I am curious what you used?  (Because it doesn't display symptoms [running dog slow] until the backend has about 85%
ofthe machines memory) 

I use a combination of tools, in thise case I analyzed a build with Instruments on macOS.

> There are up to date snapshots of the upcoming v16 minor release which might
> make testing easier than building postgres from source?
>
>     https://www.postgresql.org/download/snapshots/ <https://www.postgresql.org/download/snapshots/>
>
> Thank you.  I have assigned that task to the guy who maintains our VMs/Environments.
> I will report back to you.

Great!

--
Daniel Gustafsson




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Kirk Wolak
Дата:
On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com> wrote:

>   You have a commit [1] that MIGHT fix this.
> I have a script that recreates the problem, using random data in pg_temp.
> And a nested cursor.

Running your reproducer script in a tight loop for a fair bit of time on the
v16 HEAD I cannot see any memory growth, so it's plausible that the upcoming
16.2 will work better in your environment.

Okay, I took the latest source off of git (17devel) and got it to work there in a VM.

It appears this issue is fixed.  It must have been related to the issue originally tagged. 

Thanks!

Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Daniel Gustafsson
Дата:
> On 16 Jan 2024, at 02:53, Kirk Wolak <wolakk@gmail.com> wrote:
>
> On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> wrote:
> > On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com <mailto:wolakk@gmail.com>> wrote:
>
> >   You have a commit [1] that MIGHT fix this.
> > I have a script that recreates the problem, using random data in pg_temp.
> > And a nested cursor.
>
> Running your reproducer script in a tight loop for a fair bit of time on the
> v16 HEAD I cannot see any memory growth, so it's plausible that the upcoming
> 16.2 will work better in your environment.
>
> Okay, I took the latest source off of git (17devel) and got it to work there in a VM.
>
> It appears this issue is fixed.  It must have been related to the issue originally tagged.

Thanks for testing and confirming!  Testing pre-release builds on real life
workloads is invaluable for the development of Postgres so thank you taking the
time.

--
Daniel Gustafsson




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Kirk Wolak
Дата:
On Tue, Jan 16, 2024 at 3:43 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 16 Jan 2024, at 02:53, Kirk Wolak <wolakk@gmail.com> wrote:
>
> On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> wrote:
> > On 15 Jan 2024, at 07:24, Kirk Wolak <wolakk@gmail.com <mailto:wolakk@gmail.com>> wrote:
>...
> Okay, I took the latest source off of git (17devel) and got it to work there in a VM.
>
> It appears this issue is fixed.  It must have been related to the issue originally tagged.

Thanks for testing and confirming!  Testing pre-release builds on real life
workloads is invaluable for the development of Postgres so thank you taking the
time.
Daniel,
  I did a little more checking and the reason I did not see the link MIGHT be because EXPLAIN did not show a JIT attempt.
I tried to use settings that FORCE a JIT...  But to no avail.

  I am now concerned that the problem is more hidden in my use case.  Meaning I CANNOT conclude it is fixed.
But I know of NO WAY to force a JIT (I lowered costs to 1, etc.  ).

  You don't know a way to force at least the JIT analysis to happen?  (because I already knew if JIT was off, the leak wouldn't happen). 

Thanks,

Kirk Out! 
PS: I assume there is no pg_jit(1) function I can call. LOL

Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Laurenz Albe
Дата:
On Thu, 2024-01-18 at 19:50 -0500, Kirk Wolak wrote:
>   I did a little more checking and the reason I did not see the link MIGHT be because EXPLAIN did not show a JIT
attempt.
> I tried to use settings that FORCE a JIT...  But to no avail.
>
>   I am now concerned that the problem is more hidden in my use case.  Meaning I CANNOT conclude it is fixed.
> But I know of NO WAY to force a JIT (I lowered costs to 1, etc.  ).
>
>   You don't know a way to force at least the JIT analysis to happen?  (because I already knew if JIT was off, the
leakwouldn't happen).  

If you set the limits to 0, you can trigger it easily:

SET jit = on;
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;

EXPLAIN (ANALYZE) SELECT count(*) FROM foo;
                                                                QUERY PLAN
                  

══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Finalize Aggregate  (cost=58889.84..58889.85 rows=1 width=8) (actual time=400.462..418.214 rows=1 loops=1)
   ->  Gather  (cost=58889.62..58889.83 rows=2 width=8) (actual time=400.300..418.190 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=57889.62..57889.64 rows=1 width=8) (actual time=384.876..384.878 rows=1 loops=3)
               ->  Parallel Seq Scan on foo  (cost=0.00..52681.30 rows=2083330 width=0) (actual time=0.028..168.510
rows=1666667loops=3) 
 Planning Time: 0.133 ms
 JIT:
   Functions: 8
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.038 ms, Inlining 279.779 ms, Optimization 38.395 ms, Emission 73.105 ms, Total 392.316 ms
 Execution Time: 478.257 ms

Yours,
Laurenz Albe

Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Daniel Gustafsson
Дата:
> On 19 Jan 2024, at 01:50, Kirk Wolak <wolakk@gmail.com> wrote:

>   I did a little more checking and the reason I did not see the link MIGHT be because EXPLAIN did not show a JIT
attempt.
> I tried to use settings that FORCE a JIT...  But to no avail.

Are you sure you are running a JIT enabled server?  Did you compile it yourself
or install a snapshot?

>   You don't know a way to force at least the JIT analysis to happen?  (because I already knew if JIT was off, the
leakwouldn't happen).  

If you set jit_above_cost=0 then postgres will compile a JIT enabled execution
tree.  This does bring up an interesting point, I don't think there is a way
for a user to know whether the server is jit enabled or not (apart from
explaining a query with costs adjusted but that's not all that userfriendly).
Maybe we need a way to reliably tell if JIT is active or not.

--
Daniel Gustafsson




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Michael Banck
Дата:
Hi,

On Fri, Jan 19, 2024 at 10:48:12AM +0100, Daniel Gustafsson wrote:
> This does bring up an interesting point, I don't think there is a way
> for a user to know whether the server is jit enabled or not (apart
> from explaining a query with costs adjusted but that's not all that
> userfriendly).  Maybe we need a way to reliably tell if JIT is active
> or not.

I thought this is what pg_jit_available() is for?

postgres=> SHOW jit;
 jit 
-----
 on
(1 Zeile)

postgres=> SELECT pg_jit_available();
 pg_jit_available 
------------------
 f
(1 Zeile)


Michael



Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Daniel Gustafsson
Дата:
> On 19 Jan 2024, at 11:04, Michael Banck <mbanck@gmx.net> wrote:
> 
> Hi,
> 
> On Fri, Jan 19, 2024 at 10:48:12AM +0100, Daniel Gustafsson wrote:
>> This does bring up an interesting point, I don't think there is a way
>> for a user to know whether the server is jit enabled or not (apart
>> from explaining a query with costs adjusted but that's not all that
>> userfriendly).  Maybe we need a way to reliably tell if JIT is active
>> or not.
> 
> I thought this is what pg_jit_available() is for?

Ah, it is, I completely forgot we had that one. Thanks!

--
Daniel Gustafsson




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Kirk Wolak
Дата:
On Fri, Jan 19, 2024 at 4:20 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Thu, 2024-01-18 at 19:50 -0500, Kirk Wolak wrote:
>   I did a little more checking and the reason I did not see the link MIGHT be because EXPLAIN did not show a JIT attempt.
> I tried to use settings that FORCE a JIT...  But to no avail.
>
>   I am now concerned that the problem is more hidden in my use case.  Meaning I CANNOT conclude it is fixed.
> But I know of NO WAY to force a JIT (I lowered costs to 1, etc.  ).
>
>   You don't know a way to force at least the JIT analysis to happen?  (because I already knew if JIT was off, the leak wouldn't happen). 

If you set the limits to 0, you can trigger it easily:

SET jit = on;
SET jit_above_cost = 0;
SET jit_inline_above_cost = 0;
SET jit_optimize_above_cost = 0;

Okay,
  I Did exactly this (just now).  But the EXPLAIN does not contain the JIT?

-------------------------------------------------------------------------------
 Sort  (cost=5458075.88..5477861.00 rows=7914047 width=12)
   Sort Key: seid
   ->  HashAggregate  (cost=3910139.62..4280784.00 rows=7914047 width=12)
         Group Key: seid, fr_field_name, st_field_name
         Planned Partitions: 128
         ->  Seq Scan on parts  (cost=0.00..1923249.00 rows=29850000 width=12)
               Filter: ((seid <> 497) AND ((partnum)::text >= '1'::text))
(7 rows)
 
From a FUTURE email, I noticed pg_jit_available()

and it's set to f??

Okay, so does this require a special BUILD command?

postgres=# select pg_jit_available();
 pg_jit_available
------------------
 f
(1 row)

postgres=# \dconfig *jit*
 List of configuration parameters
        Parameter        |  Value  
-------------------------+---------
 jit                     | on
 jit_above_cost          | 100000
 jit_debugging_support   | off
 jit_dump_bitcode        | off
 jit_expressions         | on
 jit_inline_above_cost   | 500000
 jit_optimize_above_cost | 500000
 jit_profiling_support   | off
 jit_provider            | llvmjit
 jit_tuple_deforming     | on
(10 rows)

Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Daniel Gustafsson
Дата:
> On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:

>  From a FUTURE email, I noticed pg_jit_available() and it's set to f??

Right, then this installation does not contain the necessary library to JIT
compile the query.

> Okay, so does this require a special BUILD command?

Yes, it requires that you compile with --with-llvm.  If you don't have llvm in
the PATH you might need to set LLVM_CONFIG to point to your llvm-config binary.
With autotools that would be something like:

    ./configure <other params> --with-llvm LLVM_CONFIG=/path/to/llvm-config

--
Daniel Gustafsson




Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

От
Kirk Wolak
Дата:
On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:

>  From a FUTURE email, I noticed pg_jit_available() and it's set to f??

Right, then this installation does not contain the necessary library to JIT
compile the query.

> Okay, so does this require a special BUILD command?

Yes, it requires that you compile with --with-llvm.  If you don't have llvm in
the PATH you might need to set LLVM_CONFIG to point to your llvm-config binary.
With autotools that would be something like:

    ./configure <other params> --with-llvm LLVM_CONFIG=/path/to/llvm-config

--
Daniel Gustafsson

Thank you, that made it possible to build and run...
UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
I am watching it already consuming 6% of my system memory.

I am re-attaching my script.  WHICH includes the settings to FORCE JIT.
It also does an EXPLAIN so you can verify that JIT is on (this is what I added/noticed!)
And it takes over 20 minutes to get this far.  It's still running.
I am re-attaching the script. (as I tweaked it).

This creates 90 million rows of data, so it takes a while.
I BELIEVE that it consumes far less memory if you do not fetch any rows (I had problems reproducing it if no rows were fetched).
So, this may be beyond the planning stages.

Thanks,

Kirk Out!
Вложения

Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

От
Kirk Wolak
Дата:
On Mon, Jan 22, 2024 at 1:30 AM Kirk Wolak <wolakk@gmail.com> wrote:
On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:

...
    ./configure <other params> --with-llvm LLVM_CONFIG=/path/to/llvm-config

--
Daniel Gustafsson

Thank you, that made it possible to build and run...
UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
I am watching it already consuming 6% of my system memory.


Daniel,
  In the previous email, I made note that once the JIT was enabled, the problem exists in 17Devel.
I re-included my script, which forced the JIT to be used...

  I attached an updated script that forced the settings.
But this is still leaking memory (outside of the pg_backend_memory_context() calls).
Probably because it's at the LLVM level?  And it does NOT happen from planning/opening the query.  It appears I have to fetch the rows to see the problem.

Thanks in advance.  Let me know if I should be doing something different?

Kirk Out!
PS: I was wondering if we had a function that showed total memory of the backend.  For helping to determine if we might have a 3rd party leak? [increase in total memory consumed not noticed by pg_backend_memory_contexts)

#include "postgres.h"
#include <sys/resource.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_backend_memory_footprint);

Datum pg_backend_memory_footprint(PG_FUNCTION_ARGS) {
    long memory_usage_bytes = 0;
    struct rusage usage;

    getrusage(RUSAGE_SELF, &usage);
    memory_usage_bytes = usage.ru_maxrss * 1024;

    PG_RETURN_INT64(memory_usage_bytes);
}



Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

От
Thomas Munro
Дата:
On Thu, Jan 25, 2024 at 8:51 AM Kirk Wolak <wolakk@gmail.com> wrote:
>     getrusage(RUSAGE_SELF, &usage);
>     memory_usage_bytes = usage.ru_maxrss * 1024;

FWIW log_statement_stats = on shows that in the logs.  See ShowUsage()
in postgres.c.



Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

От
Kirk Wolak
Дата:
On Wed, Jan 24, 2024 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Jan 25, 2024 at 8:51 AM Kirk Wolak <wolakk@gmail.com> wrote:
>     getrusage(RUSAGE_SELF, &usage);
>     memory_usage_bytes = usage.ru_maxrss * 1024;

FWIW log_statement_stats = on shows that in the logs.  See ShowUsage()
in postgres.c.

Thank you for this, here is the TERMINAL (Below is the tail of the log).  Notice that the pg_backend_memory_contexts does NOT show the memory consumed.
But your logging sure did!  (I wonder if I enable logging during planning, but there is like 82,000 cursors being opened... (This removed the FETCH and still leaks)


7:01:08 kwolak@postgres= # select pg_temp.fx(497);
NOTICE:  ("9848 kB","10 MB","638 kB")
NOTICE:  -----------after close, Count a: 82636, count b: 82636
NOTICE:  ("9997 kB","10 MB","648 kB")
 fx
----
 
(1 row)

Time: 525870.117 ms (08:45.870)


Tail:

024-01-24 17:01:08.752 EST [28804] DETAIL:  ! system usage stats:
        !       0.001792 s user, 0.000000 s system, 0.005349 s elapsed
        !       [560.535969 s user, 31.441656 s system total]
        !       185300 kB max resident size
        !       232/0 [29219648/54937864] filesystem blocks in/out
        !       0/25 [0/1016519] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
        !       10/1 [62671/9660] voluntary/involuntary context switches
2024-01-24 17:01:08.752 EST [28804] STATEMENT:  explain SELECT DISTINCT seid, fr_field_name, st_field_name
                  FROM pg_temp.parts
                 WHERE seid <> 497 AND partnum >= '1'
                 ORDER BY seid;
2024-01-24 17:01:08.759 EST [28804] LOG:  QUERY STATISTICS
2024-01-24 17:01:08.759 EST [28804] DETAIL:  ! system usage stats:
        !       0.006207 s user, 0.000092 s system, 0.006306 s elapsed
        !       [560.542262 s user, 31.441748 s system total]
        !       185300 kB max resident size
        !       0/0 [29219648/54937864] filesystem blocks in/out
        !       0/4 [0/1016523] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
        !       0/1 [62672/9661] voluntary/involuntary context switches
2024-01-24 17:01:08.759 EST [28804] STATEMENT:  SELECT 'pg_temp.fx(497); -- Not run, do \dt+ parts';
2024-01-24 17:04:30.844 EST [28746] LOG:  checkpoint starting: time
2024-01-24 17:04:32.931 EST [28746] LOG:  checkpoint complete: wrote 21 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.008 s, sync=0.006 s, total=2.087 s; sync files=15, longest=0.001 s, average=0.001 s; distance=98 kB, estimate=134 kB; lsn=0/16304D8, redo lsn=0/1630480
2024-01-24 17:11:06.350 EST [28804] LOG:  QUERY STATISTICS
2024-01-24 17:11:06.350 EST [28804] DETAIL:  ! system usage stats:
        !       515.952870 s user, 6.688389 s system, 525.869933 s elapsed
        !       [1076.495280 s user, 38.130145 s system total]
        !       708104 kB max resident size
        !       370000/3840 [29589648/54941712] filesystem blocks in/out
        !       0/327338 [0/1343861] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
        !       22001/5216 [84675/14878] voluntary/involuntary context switches
2024-01-24 17:11:06.350 EST [28804] STATEMENT:  select pg_temp.fx(497);
2024-01-24 17:12:16.162 EST [28804] LOG:  QUERY STATISTICS
2024-01-24 17:12:16.162 EST [28804] DETAIL:  ! system usage stats:
        !       1.130029 s user, 0.007727 s system, 1.157486 s elapsed
        !       [1077.625396 s user, 38.137921 s system total]
        !       708104 kB max resident size
        !       992/0 [29590640/54941720] filesystem blocks in/out
        !       3/41 [3/1343902] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
        !       9/68 [84685/14946] voluntary/involuntary context switches
2024-01-24 17:12:16.162 EST [28804] STATEMENT:  select now();
2024-01-24 17:12:30.944 EST [28804] LOG:  QUERY STATISTICS
2024-01-24 17:12:30.944 EST [28804] DETAIL:  ! system usage stats:
        !       0.004561 s user, 0.000019 s system, 0.004580 s elapsed
        !       [1077.630064 s user, 38.137944 s system total]
        !       708104 kB max resident size
        !       0/0 [29590640/54941728] filesystem blocks in/out
        !       0/4 [3/1343906] page faults/reclaims, 0 [0] swaps
        !       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
        !       0/0 [84686/14947] voluntary/involuntary context switches
2024-01-24 17:12:30.944 EST [28804] STATEMENT:  select now();

 

Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

От
Michael Banck
Дата:
Hi,

On Wed, Jan 24, 2024 at 02:50:52PM -0500, Kirk Wolak wrote:
> On Mon, Jan 22, 2024 at 1:30 AM Kirk Wolak <wolakk@gmail.com> wrote:
> > On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> >> > On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:
> > Thank you, that made it possible to build and run...
> > UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
> > I am watching it already consuming 6% of my system memory.
> >
> Daniel,
>   In the previous email, I made note that once the JIT was enabled, the
> problem exists in 17Devel.
> I re-included my script, which forced the JIT to be used...
> 
>   I attached an updated script that forced the settings.
> But this is still leaking memory (outside of the
> pg_backend_memory_context() calls).
> Probably because it's at the LLVM level?  And it does NOT happen from
> planning/opening the query.  It appears I have to fetch the rows to
> see the problem.

I had a look at this (and blogged about it here[1]) and was also
wondering what was going on with 17devel and the recent back-branch
releases, cause I could also reproduce those continuing memory leaks.

Adding some debug logging when llvm_inline_reset_caches() is called
solves the mystery: as you are calling a function, the fix that is in
17devel and the back-branch releases is not applicable and only after
the function returns llvm_inline_reset_caches() is being called (as
llvm_jit_context_in_use_count is greater than zero, presumably, so it
never reaches the call-site of llvm_inline_reset_caches()).

If you instead run your SQL in a DO-loop (as in the blog post) and not
as a PL/PgSQL function, you should see that it no longer leaks. This
might be obvious to some (and Andres mentioned it in 
https://www.postgresql.org/message-id/20210421002056.gjd6rpe6toumiqd6%40alap3.anarazel.de)
but it took me a while to figure out/find.


Michael

[1] https://www.credativ.de/en/blog/postgresql-en/quick-benchmark-postgresql-2024q1-release-performance-improvements/



Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [ NOT Fixed ]

От
Kirk Wolak
Дата:
On Thu, Feb 22, 2024 at 4:49 PM Michael Banck <mbanck@gmx.net> wrote:
Hi,

On Wed, Jan 24, 2024 at 02:50:52PM -0500, Kirk Wolak wrote:
> On Mon, Jan 22, 2024 at 1:30 AM Kirk Wolak <wolakk@gmail.com> wrote:
> > On Fri, Jan 19, 2024 at 7:03 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> >> > On 19 Jan 2024, at 23:09, Kirk Wolak <wolakk@gmail.com> wrote:
> > Thank you, that made it possible to build and run...
> > UNFORTUNATELY this has a CLEAR memory leak (visible in htop)
> > I am watching it already consuming 6% of my system memory.
...
I had a look at this (and blogged about it here[1]) and was also
wondering what was going on with 17devel and the recent back-branch
releases, cause I could also reproduce those continuing memory leaks.

Adding some debug logging when llvm_inline_reset_caches() is called
solves the mystery: as you are calling a function, the fix that is in
17devel and the back-branch releases is not applicable and only after
the function returns llvm_inline_reset_caches() is being called (as
llvm_jit_context_in_use_count is greater than zero, presumably, so it
never reaches the call-site of llvm_inline_reset_caches()).

If you instead run your SQL in a DO-loop (as in the blog post) and not
as a PL/PgSQL function, you should see that it no longer leaks. This
might be obvious to some (and Andres mentioned it in
https://www.postgresql.org/message-id/20210421002056.gjd6rpe6toumiqd6%40alap3.anarazel.de)
but it took me a while to figure out/find.

Thanks for confirming.  Inside a routine is required.  But disabling JIT was good enough for us.

Curious if there was another way to end up calling the cleanup?  But it had that "feel" that the context of the cleanup was
being lost between iterations of the loop?