Re: Query on partitioned table needs memory n_partitions * work_mem
From | Dimitrios Apostolou
---|---
Subject | Re: Query on partitioned table needs memory n_partitions * work_mem
Date |
Msg-id | e5ae2d08-75e1-3726-8905-1328e86453cd@gmx.net
In reply to | Re: Query on partitioned table needs memory n_partitions * work_mem (David Rowley <dgrowleyml@gmail.com>)
Responses | Re: Query on partitioned table needs memory n_partitions * work_mem
List | pgsql-general
On Thu, 18 Jul 2024, David Rowley wrote:

> On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
>> * The memory is unnecessarily allocated early on, before any partitions
>>   are actually aggregated. I know this because I/O is slow on this device
>>   and the table sizes are huge, it's simply not possible that postgres
>>   went through all partitions and blew up the memory. That would take
>>   hours, but the OOM happens seconds after I start the query.
>
> That's interesting. Certainly, there is some memory allocated during
> executor startup, but that amount should be fairly small. Are you
> able to provide a self-contained test case that shows the memory
> blowing up before execution begins?

I'm trying hard to create a self-contained way to reproduce the issue.
It's not easy; the behaviour is a bit unstable. So far I see high memory
usage (8-10 GB) but I'm not able to trigger the OOM with that much (it did
OOM once, but so unpredictably that I didn't manage to measure it or
reproduce it later).

-- I create a table with 2k partitions.

CREATE TABLE partitioned_table1(
    run_n       bigint   GENERATED ALWAYS AS IDENTITY,
    workitem_n  integer  NOT NULL,
    label_n     smallint,
    result      smallint NOT NULL,
    PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

DO $$
for i in range(0, 2000):
    stmt = f'''
        CREATE TABLE part_max{i+1}M PARTITION OF partitioned_table1
            FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000})
    '''
    plpy.execute(stmt)
$$ LANGUAGE plpython3u;

-- I insert random data. First I insert into all partitions, 1M rows each:

DO $$
for i in range(0, 2000):
    stmt = f'''
        INSERT INTO partitioned_table1(workitem_n, label_n, result)
            SELECT j - j%4,
                   CAST(random()*1000 AS INTEGER),
                   CAST(random()*3 AS INTEGER)
            FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) AS j
    '''
    plpy.info(stmt)
    plpy.execute(stmt)
    plpy.commit()
$$ LANGUAGE plpython3u;

-- Disable parallel execution and group aggregate:

SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort TO off;
SET SESSION work_mem TO '8MB';

-- Now the following query should do a HashAggregate:

SELECT workitem_n, label_n, bool_or(result IN (2,3))
    FROM partitioned_table1
    GROUP BY workitem_n, label_n
    LIMIT 10;

-- How much was the RSS of the backend while the previous query was
-- running? Not that high. But if we insert some million rows into the
-- 1st partition, then it will be much higher.

DO $$
for i in range(0, 2000):
    stmt = f'''
        INSERT INTO partitioned_table1(workitem_n, label_n, result)
            SELECT j % 1000000,
                   CAST(random()*20000 AS INTEGER),
                   CAST(random()*4 AS INTEGER)
            FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) AS j
    '''
    plpy.info(stmt)
    plpy.execute(stmt)
    plpy.commit()
$$ LANGUAGE plpython3u;

-- Now that same previous query consumes between 8 GB and 10 GB RSS. The
-- more data I insert (into all partitions?), the more memory the query
-- takes.

Overall:

* I don't see the RSS usage (8 GB) growing proportionally to work_mem as I
  expected. If I increase work_mem from 4MB to 8MB I see roughly double the
  RSS (from ~4 GB to ~8 GB), but increasing it further makes only a
  minuscule difference and no OOM happens.

* Instead, I notice the RSS growing slowly while I insert more and more
  data into the table (especially into the 1st partition, I think).

* Finally, I don't see the memory being freed by the backend after the
  SELECT finishes. The system is relieved only when I disconnect psql and
  the backend dies. Not sure if that's by design or not. (One way to take
  these measurements is sketched right after this list.)
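(For anyone repeating the measurements: the sketch below is one way to
watch the backend from a second session. The PID shown is just an example
value, and pg_log_backend_memory_contexts() needs PostgreSQL 14 or newer.)

-- In the session that runs the query:
SELECT pg_backend_pid();    -- suppose it returns 12345

-- From a second session, dump that backend's memory context usage into
-- the server log (PostgreSQL 14+):
SELECT pg_log_backend_memory_contexts(12345);

-- From a shell, watch the resident set size of that backend process:
--   watch -n1 'grep VmRSS /proc/12345/status'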
>> Having wasted a long time on that, the minimum I can do is submit a
>> documentation patch. At enable_partitionwise_aggregate something like:
>> "WARNING: it can increase the memory usage by at least
>> n_partitions * work_mem". How do I move on with such a patch? Pointers
>> would be appreciated. :-)
>
> I think mentioning something about this in enable_partitionwise_join
> and enable_partitionwise_aggregate is probably wise. I'll propose a
> patch on pgsql-hackers.

David and Ashutosh, thank you both for your interest in improving the
documentation.

Unfortunately I'm no longer positive about what exactly is going on here;
I don't understand how the memory is growing. One thing I can verify is
that it's definitely caused by partitioning: I have another, similarly
huge but unpartitioned table, and no such issues show up there. Maybe
someone with knowledge of the HashAggregate algorithm and partitioning
can throw some ideas in. (A sketch for comparing plans with the setting
on and off is in the P.S. below.)

Regards,
Dimitris
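P.S. For anyone who wants to see the effect of the setting on the plan
shape, something like the following should do, against the test table
above. I'm not claiming any particular plan output, but with the setting
on the plan should show one aggregate per partition (and hence one hash
table per partition at execution time), while with it off there is a
single aggregate over an Append of all partition scans.

SET enable_partitionwise_aggregate TO on;
EXPLAIN (COSTS OFF)
    SELECT workitem_n, label_n, bool_or(result IN (2,3))
        FROM partitioned_table1
        GROUP BY workitem_n, label_n;

SET enable_partitionwise_aggregate TO off;
EXPLAIN (COSTS OFF)
    SELECT workitem_n, label_n, bool_or(result IN (2,3))
        FROM partitioned_table1
        GROUP BY workitem_n, label_n;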