Re: Memory leak from ExecutorState context?

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: Memory leak from ExecutorState context?
Дата
Msg-id 45d453c8-b2d3-b477-36eb-32fdf4455f3c@enterprisedb.com
обсуждение исходный текст
Ответ на Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Ответы Re: Memory leak from ExecutorState context?  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Список pgsql-hackers
On 2/28/23 19:06, Jehan-Guillaume de Rorthais wrote:
> Hello all,
> 
> A customer is facing out of memory query which looks similar to this situation:
> 
>   https://www.postgresql.org/message-id/flat/12064.1555298699%40sss.pgh.pa.us#eb519865575bbc549007878a5fb7219b
> 
> This PostgreSQL version is 11.18. Some settings:
> 
> * shared_buffers: 8GB
> * work_mem: 64MB
> * effective_cache_size: 24GB
> * random/seq_page_cost are by default
> * physical memory: 32GB
> 
> The query is really large and actually update kind of a materialized view.
> 
> The customer records the plans of this query on a regular basis. The explain
> analyze of this query before running out of memory was:
> 
>   https://explain.depesz.com/s/sGOH
> 
> The customer is aware he should rewrite this query to optimize it, but it's a
> long time process he can not start immediately. To make it run in the meantime,
> he actually removed the top CTE to a dedicated table. According to their
> experience, it's not the first time they had to split a query this way to make
> it work.
> 
> I've been able to run this query on a standby myself.  I've "call
> MemoryContextStats(TopMemoryContext)" every 10s on a run, see the data parsed
> (best view with "less -S") and the graph associated with it in attachment. It
> shows:
> 
> * HashBatchContext goes up to 1441MB after 240s then stay flat until the end
>   (400s as the last record)

That's interesting. We're using HashBatchContext for very few things, so
what could it consume so much memory? But e.g. the number of buckets
should be limited by work_mem, so how could it get to 1.4GB?

Can you break at ExecHashIncreaseNumBatches/ExecHashIncreaseNumBuckets
and print how many batches/butches are there?

> * ALL other context are stable before 240s, but ExecutorState
> * ExecutorState keeps rising up to 13GB with no interruption until the memory
>   exhaustion
> 
> I did another run with interactive gdb session (see the messy log session in
> attachment, for what it worth). Looking at some backtraces during the memory
> inflation close to the end of the query, all of them were having these frames in
> common:
> 
>   [...]
>   #6  0x0000000000621ffc in ExecHashJoinImpl (parallel=false, pstate=0x31a3378)
>   at nodeHashjoin.c:398 [...]
> 
> ...which is not really helpful but at least, it seems to come from a hash join
> node or some other hash related code. See the gdb session log for more details.
> After the out of mem, pmap of this process shows:
> 
>   430:   postgres: postgres <dbname> [local] EXPLAIN
>   Address           Kbytes     RSS   Dirty Mode  Mapping
>   [...]
>   0000000002c5e000 13719620 8062376 8062376 rw---   [ anon ]
>   [...]
> 
> Is it usual a backend is requesting such large memory size (13 GB) and
> actually use less of 60% of it (7.7GB of RSS)?
> 

No idea. Interpreting this info is pretty tricky, in my experience. It
might mean the memory is no longer used but sbrk couldn't return it to
the OS yet, or something like that.

> Sadly, the database is 1.5TB large and I can not test on a newer major version.
> I did not try to check how large would be the required data set to reproduce
> this, but it moves 10s of million of rows from multiple tables anyway...
> 
> Any idea? How could I help to have a better idea if a leak is actually
> occurring and where exactly?
> 

Investigating memory leaks is tough, especially for generic memory
contexts like ExecutorState :-( Even more so when you can't reproduce it
on a machine with custom builds.

What I'd try is this:

1) attach breakpoints to all returns in AllocSetAlloc(), printing the
pointer and size for ExecutorState context, so something like

  break aset.c:783 if strcmp("ExecutorState",context->header.name) == 0
  commands
  print MemoryChunkGetPointer(chunk) size
  cont
  end

2) do the same for AllocSetFree()

3) Match the palloc/pfree calls (using the pointer address), to
determine which ones are not freed and do some stats on the size.
Usually there's only a couple distinct sizes that account for most of
the leaked memory.

4) Break AllocSetAlloc on those leaked sizes, to determine where the
calls come from.

This usually gives enough info about the leak or at least allows
focusing the investigation to a particular area of code.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jacob Champion
Дата:
Сообщение: Re: Auth extensions, with an LDAP/SCRAM example [was: Proposal: Support custom authentication methods using hooks]
Следующее
От: Nathan Bossart
Дата:
Сообщение: Re: Beautify pg_walinspect docs a bit