Re: Query on partitioned table needs memory n_partitions * work_mem

From: Dimitrios Apostolou
Subject: Re: Query on partitioned table needs memory n_partitions * work_mem
Date:
Msg-id: e5ae2d08-75e1-3726-8905-1328e86453cd@gmx.net
In response to: Re: Query on partitioned table needs memory n_partitions * work_mem  (David Rowley <dgrowleyml@gmail.com>)
Responses: Re: Query on partitioned table needs memory n_partitions * work_mem
List: pgsql-general
On Thu, 18 Jul 2024, David Rowley wrote:

> On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
>> * The memory is unnecessarily allocated early on, before any partitions
>>    are actually aggregated. I know this because I/O is slow on this device
>>    and the table sizes are huge, it's simply not possible that postgres
>>    went through all partitions and blew up the memory. That would take
>>    hours, but the OOM happens seconds after I start the query.
>
> That's interesting. Certainly, there is some memory allocated during
> executor startup, but that amount should be fairly small.  Are you
> able to provide a self-contained test case that shows the memory
> blowing up before execution begins?

I'm trying hard to create a self-contained way to reproduce the issue.
It's not easy, because the behaviour is somewhat unstable. So far I see
high memory usage (8-10GB), but I have not managed to trigger an OOM with
that alone (it did OOM once, but so unpredictably that I couldn't measure
it or reproduce it later).



-- I create a table with 2k partitions.

CREATE TABLE   partitioned_table1(
     run_n       bigint   GENERATED ALWAYS AS IDENTITY,
     workitem_n  integer  NOT NULL,
     label_n     smallint,
     result      smallint NOT NULL,
     PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

DO $$
for i in range(0, 2000):
     stmt = f'''
         CREATE TABLE part_max{i+1}M
             PARTITION OF partitioned_table1
             FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000})
     '''
     plpy.execute(stmt)
$$ LANGUAGE plpython3u;
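
-- (Sketch, not strictly needed: a quick sanity check that all 2000
-- partitions were actually created.)

SELECT count(*) AS n_partitions
FROM pg_inherits
WHERE inhparent = 'partitioned_table1'::regclass;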


-- I insert random data. First I insert to all partitions, 1M rows each:

DO $$
for i in range(0, 2000):
     stmt = f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j-j%4,
                     CAST(random()*1000 AS INTEGER),
                     CAST(random()*3    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;
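
-- (Sketch: the combined on-disk size of all partitions after the load can
-- be checked like this; pg_partition_tree needs PostgreSQL 12 or newer.)

SELECT pg_size_pretty(sum(pg_total_relation_size(relid)))
FROM pg_partition_tree('partitioned_table1');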


-- Disable parallel execution and group aggregate:

SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort TO off;

SET SESSION work_mem TO '8MB';
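
-- (Sketch, assuming the server is PostgreSQL 13 or newer: hash-based nodes
-- are allowed work_mem * hash_mem_multiplier bytes, so pinning that GUC as
-- well makes the per-HashAggregate memory budget more predictable.)

SET SESSION hash_mem_multiplier TO 1.0;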


-- Now the following query should do a HashAggregate:

SELECT
      workitem_n, label_n, bool_or(result IN (2,3))
FROM
      partitioned_table1
GROUP BY
      workitem_n, label_n
LIMIT 10;
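
-- (Sketch, assuming a Linux host: one way to watch the backend's RSS while
-- the query runs. The PID 12345 below is just a placeholder.)

SELECT pg_backend_pid();

-- then, in a shell on the database server:
--     watch -n1 'grep VmRSS /proc/12345/status'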


-- How much was the RSS of the backend while the previous query was
-- running? Not that high. But if we insert a few million more rows into
-- the 1st partition, it gets much higher.

DO $$
for i in range(0,2000):
     stmt = f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j%1000000,
                     CAST(random()*20000 AS INTEGER),
                     CAST(random()*4    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;


-- Now that same query consumes between 8GB and 10GB of RSS. The more
-- data I insert (into all partitions?), the more memory the query takes.
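
-- (Sketch, assuming PostgreSQL 14+ and a superuser: the running backend's
-- memory contexts can be dumped to the server log from a second session,
-- which should show which contexts hold the 8-10GB. PID is a placeholder.)

SELECT pg_log_backend_memory_contexts(12345);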


Overall:

* I don't see the RSS memory usage (8GB) growing proportionally to
   work_mem as I expected. If I increase work_mem from 4MB to 8MB I see
   roughly double the RSS (from ~4GB to ~8GB), but increasing it further
   makes only a minuscule difference and no OOM happens.

* Instead I notice the RSS growing slowly as I insert more and more data
   into the table (especially into the 1st partition, I think).

* Finally, I don't see the memory being freed by the backend after the
   SELECT finishes. The system is relieved only when I disconnect psql and
   the backend dies. Not sure if that's by design or not (see the sketch
   right after this list).
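
For the last point, one way to check (assuming PostgreSQL 14 or newer) is
to look at the backend's own memory contexts from the same session after
the SELECT has finished. If they are small while RSS stays high, PostgreSQL
has probably freed the memory but the allocator has not returned it to the
OS:

SELECT name, total_bytes, used_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 20;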

>
>> Having wasted long time in that, the minimum I can do is submit a
>> documentation patch. At enable_partitionwise_aggregate something like
>> "WARNING it can increase the memory usage by at least
>> n_partitions * work_mem". How do I move on for such a patch? Pointers
>> would be appreciated. :-)
>
> I think mentioning something about this in enable_partitionwise_join
> and enable_partitionwise_aggregate is probably wise. I'll propose a
> patch on pgsql-hackers.

David and Ashutosh, thank you both for your interest in improving the
documentation. Unfortunately I'm no longer sure what exactly is going on
here; I don't understand how the memory is growing. One thing I can verify
is that it's definitely caused by partitioning: I have another similarly
huge table that is unpartitioned, and no such issues show up there. Maybe
someone with knowledge of the HashAggregate algorithm and partitioning can
throw some ideas in.


Regards,
Dimitris



