Discussion: PG18 GIN parallel index build crash - invalid memory alloc request size
Testing PostgreSQL 18.0 on Debian from the PGDG repo: 18.0-1.pgdg12+3, with PostGIS 3.6.0+dfsg-2.pgdg12+1. I'm running the osm2pgsql workload to load the entire OSM Planet data set on my home lab system.
I found a weird crash in the recently adjusted parallel GIN index building code. Two parallel workers spawn, one of them crashes, and then everything terminates. This is one of the last steps in OSM loading, and I can reproduce it just by running the one statement again:
gis=# CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags);
ERROR: invalid memory alloc request size 1113001620
I see that this area of the code was being triaged during early beta back in May; it may need another round.
The table is 215 GB. The server has 128GB and only 1/3 of that is nailed down, so there's plenty of RAM available.
Settings include:
work_mem=1GB
maintenance_work_mem=20GB
shared_buffers=48GB
max_parallel_workers_per_gather = 8
The log files show a number of similarly big allocations working before that point; here's an example:
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp161831.0.fileset/0.1", size 1073741824
STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING BTREE (osm_id)
ERROR: invalid memory alloc request size 1137667788
STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags)
CONTEXT: parallel worker
And here's another one to show the size at the crash is a little different each time:
ERROR: Database error: ERROR: invalid memory alloc request size 1115943018
Hooked into the error message and it gave this stack trace:
#0 errfinish (filename=0x5646de247420 "./build/../src/backend/utils/mmgr/mcxt.c",
lineno=1174, funcname=0x5646de2477d0 <__func__.3> "MemoryContextSizeFailure")
at ./build/../src/backend/utils/error/elog.c:476
#1 0x00005646ddb4ae9c in MemoryContextSizeFailure (
context=context@entry=0x56471ce98c90, size=size@entry=1136261136,
flags=flags@entry=0) at ./build/../src/backend/utils/mmgr/mcxt.c:1174
#2 0x00005646de05898d in MemoryContextCheckSize (flags=0, size=1136261136,
context=0x56471ce98c90) at ./build/../src/include/utils/memutils_internal.h:172
#3 MemoryContextCheckSize (flags=0, size=1136261136, context=0x56471ce98c90)
at ./build/../src/include/utils/memutils_internal.h:167
#4 AllocSetRealloc (pointer=0x7f34f558b040, size=1136261136, flags=0)
at ./build/../src/backend/utils/mmgr/aset.c:1203
#5 0x00005646ddb701c8 in GinBufferStoreTuple (buffer=0x56471cee0d10,
tup=0x7f34dfdd2030) at ./build/../src/backend/access/gin/gininsert.c:1497
#6 0x00005646ddb70503 in _gin_process_worker_data (progress=<optimized out>,
worker_sort=0x56471cf13638, state=0x7ffc288b0200)
at ./build/../src/backend/access/gin/gininsert.c:1926
#7 _gin_parallel_scan_and_build (state=state@entry=0x7ffc288b0200,
ginshared=ginshared@entry=0x7f4168a5d360,
sharedsort=sharedsort@entry=0x7f4168a5d300, heap=heap@entry=0x7f41686e5280,
index=index@entry=0x7f41686e4738, sortmem=<optimized out>,
progress=<optimized out>) at ./build/../src/backend/access/gin/gininsert.c:2046
#8 0x00005646ddb71ebf in _gin_parallel_build_main (seg=<optimized out>,
toc=0x7f4168a5d000) at ./build/../src/backend/access/gin/gininsert.c:2159
#9 0x00005646ddbdf882 in ParallelWorkerMain (main_arg=<optimized out>)
at ./build/../src/backend/access/transam/parallel.c:1563
#10 0x00005646dde40670 in BackgroundWorkerMain (startup_data=<optimized out>,
startup_data_len=<optimized out>)
at ./build/../src/backend/postmaster/bgworker.c:843
#11 0x00005646dde42a45 in postmaster_child_launch (
child_type=child_type@entry=B_BG_WORKER, child_slot=320,
startup_data=startup_data@entry=0x56471cdbc8f8,
startup_data_len=startup_data_len@entry=1472, client_sock=client_sock@entry=0x0)
at ./build/../src/backend/postmaster/launch_backend.c:290
#12 0x00005646dde44265 in StartBackgroundWorker (rw=0x56471cdbc8f8)
at ./build/../src/backend/postmaster/postmaster.c:4157
#13 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:4323
#14 0x00005646dde45b13 in LaunchMissingBackgroundProcesses ()
at ./build/../src/backend/postmaster/postmaster.c:3397
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1717
#16 0x00005646dde47f6d in PostmasterMain (argc=argc@entry=5,
argv=argv@entry=0x56471cd66dc0)
at ./build/../src/backend/postmaster/postmaster.c:1400
#17 0x00005646ddb4d56c in main (argc=5, argv=0x56471cd66dc0)
at ./build/../src/backend/main/main.c:227
I've frozen my testing at the spot where I can reproduce the problem. I was going to try dropping m_w_m next and turning off the parallel execution. I didn't want to touch anything until after asking if there's more data that should be collected from a crashing instance.
--
Greg Smith, Software Engineering
Snowflake - Where Data Does More
gregory.smith@snowflake.com
Hi,

On 10/24/25 05:03, Gregory Smith wrote:
> Settings include:
> work_mem=1GB
> maintenance_work_mem=20GB
> shared_buffers=48GB
> max_parallel_workers_per_gather = 8

Hmmm, I wonder if the m_w_m is high enough to confuse the trimming logic
in some way. Can you try if using smaller m_w_m (maybe 128MB-256MB) makes
the issue go away?

> The log files show a number of similarly big allocations working before
> that point; here's an example:
>
> LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp161831.0.fileset/0.1", size 1073741824
> STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING BTREE (osm_id)
> ERROR: invalid memory alloc request size 1137667788
> STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags)
> CONTEXT: parallel worker

But that btree allocation is exactly 1GB, which is the palloc limit. And
IIRC the tuplesort code is doing palloc_huge, so that's probably why it
works fine. While the GIN code does a plain repalloc(), so it's subject
to the MaxAllocSize limit.

> #4 AllocSetRealloc (pointer=0x7f34f558b040, size=1136261136, flags=0)
>    at ./build/../src/backend/utils/mmgr/aset.c:1203
> #5 0x00005646ddb701c8 in GinBufferStoreTuple (buffer=0x56471cee0d10,
>    tup=0x7f34dfdd2030) at ./build/../src/backend/access/gin/gininsert.c:1497

Hmm, so it's failing on the repalloc in GinBufferStoreTuple(), which is
merging the "GinTuple" into an in-memory buffer. I'll take a closer look
once I get back from pgconf.eu, but I guess I failed to consider that the
"parts" may be large enough to exceed MaxAlloc.

The code tries to flush the "frozen" part of the TID lists, the part that
can no longer change, but I think with m_w_m this large it could happen
that the first two buffers are already too large (and the trimming
happens only after the fact).

Can you show the contents of buffer and tup? I'm especially interested
in these fields:

buffer->nitems
buffer->maxitems
buffer->nfrozen
tup->nitems

If I'm right, I think there are two ways to fix this:

(1) apply the trimming earlier, i.e. try to freeze + flush before
    actually merging the data (essentially, update nfrozen earlier)

(2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple

Or we probably should do both.

regards

--
Tomas Vondra
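For illustration, option (2) amounts to something like the sketch below around the failing allocation in GinBufferStoreTuple(). The field names (items, nitems) are taken from the fields discussed in this thread, and the allocator calls are the stock PostgreSQL "huge" variants; this is a sketch of the idea under those assumptions, not the actual PG18 source or the eventual patch:

    /*
     * Sketch: let the in-memory TID buffer grow past MaxAllocSize by using
     * the huge allocator variants instead of plain palloc/repalloc.
     * Field names assumed from the discussion above; the real code lives
     * in src/backend/access/gin/gininsert.c (GinBufferStoreTuple).
     */
    Size    newsize = ((Size) buffer->nitems + tup->nitems) * sizeof(ItemPointerData);

    if (buffer->items == NULL)
        buffer->items = (ItemPointerData *) palloc_extended(newsize, MCXT_ALLOC_HUGE);
    else
        buffer->items = (ItemPointerData *) repalloc_huge(buffer->items, newsize);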
On Fri, Oct 24, 2025 at 8:38 AM Tomas Vondra <tomas@vondra.me> wrote:
> Hmmm, I wonder if the m_w_m is high enough to confuse the trimming logic
> in some way. Can you try if using smaller m_w_m (maybe 128MB-256MB)
> makes the issue go away?
The index builds at up to 4GB of m_w_m. 5GB and above crashes.
Now that I know roughly where the limits are, that de-escalates things a bit. The sort of customers deploying a month after release should be OK with just knowing to be careful about high m_w_m settings on PG18 until a fix is ready.
Hope everyone is enjoying Latvia. My obscure music collection includes a band from there I used to see in the NYC area, The Quags; https://www.youtube.com/watch?v=Bg3P4736CxM
> Can you show the contents of buffer and tup? I'm especially interested
> in these fields:
> buffer->nitems
> buffer->maxitems
> buffer->nfrozen
> tup->nitems
I'll see if I can grab that data at the crash point.
FYI for anyone who wants to replicate this: if you have a system with 128GB+ of RAM you could probably recreate the test case. You just have to download the Planet file and run osm2pgsql with the overly tweaked settings I use. I've published all the details of how I run this regression test now.
Settings: https://github.com/gregs1104/pgbent/tree/main/conf/18/conf.d
Script setup: https://github.com/gregs1104/pgbent/blob/main/wl/osm-import
Test runner: https://github.com/gregs1104/pgbent/blob/main/util/osm-importer
Parse results: https://github.com/gregs1104/pgbent/blob/main/util/pgbench-init-parse
> If I'm right, I think there are two ways to fix this:
> (1) apply the trimming earlier, i.e. try to freeze + flush before
> actually merging the data (essentially, update nfrozen earlier)
> (2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
> Or we probably should do both.
Sounds like (2) is probably mandatory and (1) is good hygiene.
On 10/24/25 22:22, Gregory Smith wrote:
> The index builds at up to 4GB of m_w_m. 5GB and above crashes.
>
> Now that I know roughly where the limits are, that de-escalates things a
> bit. The sort of customers deploying a month after release should be OK
> with just knowing to be careful about high m_w_m settings on PG18 until
> a fix is ready.
>
> Hope everyone is enjoying Latvia. My obscure music collection includes
> a band from there I used to see in the NYC area, The Quags;
> https://www.youtube.com/watch?v=Bg3P4736CxM

Nice!

> FYI for anyone who wants to replicate this: if you have a system with
> 128GB+ of RAM you could probably recreate the test case. You just have
> to download the Planet file and run osm2pgsql with the overly tweaked
> settings I use. I've published all the details of how I run this
> regression test now.

I did reproduce this using OSM, although I used different settings, but
that only affects loading. Setting maintenance_work_mem=20GB is more than
enough to trigger the error during parallel index build.

So I don't need the data.

> If I'm right, I think there are two ways to fix this:
> (1) apply the trimming earlier, i.e. try to freeze + flush before
> actually merging the data (essentially, update nfrozen earlier)
> (2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
> Or we probably should do both.
>
> Sounds like (2) is probably mandatory and (1) is good hygiene.

Yes, (2) is mandatory to fix this, and it's also sufficient. See the
attached fix. I'll clean this up and push soon.

AFAICS (1) is not really needed. I was concerned we might end up with
each worker producing a TID buffer close to maintenance_work_mem, and
then the leader would have to use twice as much memory when merging. But
it turns out I already thought about that, and the workers use a fair
share of maintenance_work_mem, not a new limit. So they produce smaller
chunks, and those should not exceed maintenance_work_mem when merging.

I tried "freezing" the existing buffer more eagerly (before merging the
tuple), but that made no difference. The workers produce data with a lot
of overlaps (simply because that's how the parallel builds divide the
data), and the amount of trimmed data is tiny. Something like 10k TIDs
from a buffer of 1M TIDs. So a tiny difference, and it'd still fail.

I'm not against maybe experimenting with this, but it's going to be a
master-only thing, not for backpatching.

Maybe we should split the data into smaller chunks while building tuples
in ginFlushBuildState. That'd probably allow enforcing the memory limit
more strictly, because we sometimes hold multiple copies of the TID
arrays. But that's for master too.

regards

--
Tomas Vondra
On 10/26/25 16:16, Tomas Vondra wrote:
> Yes, (2) is mandatory to fix this, and it's also sufficient. See the
> attached fix. I'll clean this up and push soon.

I spoke too soon, apparently :-(

(2) is not actually a fix. It does fix some cases of the invalid alloc
size failures, but the following call to ginMergeItemPointers() can hit
that too, because it does palloc() internally.

I didn't notice this before because of the other experimental changes,
and because it seems to depend on which of the OSM indexes is being
built, with how many workers, etc.

I was a bit puzzled how come we don't hit this with serial builds too,
because that calls ginMergeItemPointers() too. I guess that's just luck,
because with serial builds we're likely flushing the TID list in smaller
chunks, appending to an existing tuple, and it seems unlikely to cross
the alloc limit for any of those. But for parallel builds we're pretty
much guaranteed to see all TIDs for a key at once.

I see two ways to fix this:

a) Do the (re)palloc_huge change, but then also change the palloc call
in ginMergeItemPointers. I'm not sure if we want to change the existing
function, or create a static copy in gininsert.c with this tweak (it
doesn't need anything else, so it's not that bad).

b) Do the data splitting in ginFlushBuildState, so that workers don't
generate chunks larger than MaxAllocSize/nworkers (for any key). The
leader then merges at most one chunk per worker at a time, so it still
fits into the alloc limit.

Both seem to work. I like (a) more, because it's more consistent with how
I understand m_w_m. It's weird to say "use up to 20GB of memory" and then
the system overrides that with "1GB". I don't think it affects
performance, though.

I'll experiment with this a bit more, I just wanted to mention that the
fix I posted earlier does not actually fix the issue.

I also wonder how far we are from hitting the uint32 limits. AFAICS with
m_w_m=24GB we might end up with too many elements, even with serial
index builds. It'd have to be a quite weird data set, though.

regards

--
Tomas Vondra
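To make option (a) concrete, a local copy of the merge routine could look roughly like the sketch below. The function name is hypothetical, the signature mirrors ginMergeItemPointers() as described in this thread, and this is only an illustration of the idea rather than the patch that was actually attached:

    /*
     * Hypothetical static copy of ginMergeItemPointers() for gininsert.c,
     * differing only in that the result is allocated with MCXT_ALLOC_HUGE
     * so it may exceed MaxAllocSize.  Assumes postgres.h,
     * storage/itemptr.h and utils/memutils.h.
     */
    static ItemPointer
    _gin_merge_item_pointers_huge(ItemPointerData *a, uint32 na,
                                  ItemPointerData *b, uint32 nb,
                                  int *nmerged)
    {
        ItemPointerData *dst;
        uint32      i = 0,
                    j = 0,
                    n = 0;

        /* the one change: allow allocations above MaxAllocSize */
        dst = (ItemPointerData *)
            palloc_extended(((Size) na + nb) * sizeof(ItemPointerData),
                            MCXT_ALLOC_HUGE);

        /* merge the two sorted TID arrays, dropping duplicates */
        while (i < na && j < nb)
        {
            int         cmp = ItemPointerCompare(&a[i], &b[j]);

            if (cmp < 0)
                dst[n++] = a[i++];
            else if (cmp > 0)
                dst[n++] = b[j++];
            else
            {
                dst[n++] = a[i++];
                j++;
            }
        }

        while (i < na)
            dst[n++] = a[i++];
        while (j < nb)
            dst[n++] = b[j++];

        *nmerged = (int) n;

        return dst;
    }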
On Sun, Oct 26, 2025 at 5:52 PM Tomas Vondra <tomas@vondra.me> wrote:
> I like (a) more, because it's more consistent with how I understand m_w_m. It's weird
> to say "use up to 20GB of memory" and then the system overrides that with "1GB".
> I don't think it affects performance, though.
There wasn't really that much gain from 1GB -> 20GB; I was using that setting for QA purposes more than measured performance. During the early parts of an OSM build, you need to have a big Node Cache to hit max speed, 1/2 or more of a ~90GB file. Once that part finishes, the 45GB+ cache block frees up and index building starts. I just looked at how much was just freed and thought "ehhh...split it in half and maybe 20GB maintenance mem?" Results seemed a little better than the 1GB setting I started at, so I've run with that 20GB setting since.
That was back in PG14, and so many bottlenecks have moved around. Since reporting this bug I've done a set of PG18 tests with m_w_m=256MB, and one of them just broke my previous record time running PG17. So even that size setting seems fine.
> I also wonder how far we are from hitting the uint32 limits. AFAICS with
> m_w_m=24GB we might end up with too many elements, even with serial
> index builds. It'd have to be a quite weird data set, though.
Since I'm starting to doubt I ever really needed even 20GB, I wouldn't stress about supporting that much being important. I'll see if I can trigger an overflow with a test case though, maybe it's worth protecting against even if it's not a functional setting.
On 10/28/25 21:54, Gregory Smith wrote:
> That was back in PG14, and so many bottlenecks have moved around. Since
> reporting this bug I've done a set of PG18 tests with m_w_m=256MB, and
> one of them just broke my previous record time running PG17. So even
> that size setting seems fine.

Right, that matches my observations from testing the fixes. I'd attribute
this to caching effects when the accumulated GIN entries fit into L3.

> Since I'm starting to doubt I ever really needed even 20GB, I wouldn't
> stress about supporting that much being important. I'll see if I can
> trigger an overflow with a test case though, maybe it's worth protecting
> against even if it's not a functional setting.

Yeah, I definitely want to protect against this. I believe similar
failures can happen even with much lower m_w_m values (possibly ~2-3GB),
although only with weird/skewed data sets. AFAICS a constant
single-element array would trigger this, but I haven't tested that.

Serial builds can fail with large maintenance_work_mem too, like this:

ERROR: posting list is too long
HINT: Reduce "maintenance_work_mem".

but it's deterministic, and it's actually a proper error message, not
just some weird "invalid alloc size".

Attached is a v3 of the patch series. 0001 and 0002 were already posted,
and I believe either of those would address the issue. 0003 is more of
an optimization, further reducing the memory usage.

I'm putting this through additional testing, which takes time. But it
seems there's still some loose end in 0001, as I just got the "invalid
alloc request" failure with it applied ... I'll take a look tomorrow.

regards

--
Tomas Vondra
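As a rough back-of-the-envelope for that uint32 concern, assuming the accumulated buffer is a plain array of 6-byte ItemPointerData entries and ignoring per-allocation overhead:

    sizeof(ItemPointerData)          = 6 bytes
    MaxAllocSize (~1 GiB) / 6 bytes  ≈ 179 million TIDs per ordinary palloc
    2^32 TIDs * 6 bytes              = 24 GiB

So a uint32 item count only overflows once a single buffer holds on the order of 4 billion TIDs, i.e. roughly 24GB of TID data, which is why maintenance_work_mem in that ballpark is where it could start to matter even for serial builds.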
On 10/29/25 01:05, Tomas Vondra wrote:
> I'm putting this through additional testing, which takes time. But it
> seems there's still some loose end in 0001, as I just got the "invalid
> alloc request" failure with it applied ... I'll take a look tomorrow.

Unsurprisingly, there were a couple more palloc/repalloc calls (in
ginPostingListDecodeAllSegments) that could fail with long TID lists
produced when merging worker data. The attached v4 fixes this.

However, I see this as a sign that allowing huge allocations is not the
right way to fix this. The GIN code generally assumes allocations stay
below MaxAllocSize, and reworking that in a bugfix seems a bit too
invasive. And I'm not really certain this is the last place that could
hit this.

Another argument against 0001 is that using more memory does not really
help anything. It's not any faster or simpler. It's more like "let's use
the memory we have" rather than "let's use the memory we need".

So I'm planning to get rid of 0001, and fix this with 0002 or 0002+0003.
That seems like a better and (unexpectedly) less invasive fix.

regards

--
Tomas Vondra
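To make the splitting approach (the 0002 direction) similarly concrete, here is a rough sketch of the kind of loop it implies when a worker flushes one key's TID list. Everything except MaxAllocSize, ItemPointerData and Min() is hypothetical naming, and the posted patches remain the authoritative version of the logic:

    /*
     * Hypothetical sketch: split one key's sorted TID list into chunks
     * before a worker writes them out, so the leader never merges a
     * single chunk larger than MaxAllocSize / nworkers.  The helper
     * write_gin_chunk() is illustrative only.
     */
    static void
    flush_key_in_chunks(ItemPointerData *tids, uint32 ntids, int nworkers)
    {
        /* largest number of TIDs one chunk may carry */
        uint32      max_chunk = (uint32) ((MaxAllocSize / nworkers) /
                                          sizeof(ItemPointerData));
        uint32      offset = 0;

        while (offset < ntids)
        {
            uint32      chunk = Min(max_chunk, ntids - offset);

            /*
             * Hypothetical helper: wrap this slice in a GinTuple and feed
             * it to the worker's tuplesort.
             */
            write_gin_chunk(&tids[offset], chunk);

            offset += chunk;
        }
    }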