Thread: brininsert optimization opportunity
Hello hackers,

My colleague, Ashwin, pointed out to me that brininsert's per-tuple init of the revmap access struct can have non-trivial overhead.

Turns out he is right. We are saving 24 bytes of memory per-call for the access struct, and a bit on buffer/locking overhead, with the attached patch.

The implementation registers the revmap cleanup as a MemoryContext callback on the IndexInfo struct's MemoryContext, as the index AM provides no teardown function for end-of-insert-command.

Test setup (local Ubuntu workstation):

# Drop caches and restart between each run:
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches;"
pg_ctl -D /usr/local/pgsql/data/ -l /tmp/logfile restart

\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);

Results: We see an improvement for 100M tuples and an even bigger improvement for 200M tuples.

Master (29cf61ade3f245aa40f427a1d6345287ef77e622):

test=# INSERT INTO heap SELECT 1 FROM generate_series(1, 100000000);
INSERT 0 100000000
Time: 222762.159 ms (03:42.762)

-- 3 runs
test=# INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);
INSERT 0 200000000
Time: 471168.181 ms (07:51.168)
Time: 457071.883 ms (07:37.072)
Time: 486969.205 ms (08:06.969)

Branch:

test2=# INSERT INTO heap SELECT 1 FROM generate_series(1, 100000000);
INSERT 0 100000000
Time: 200046.519 ms (03:20.047)

-- 3 runs
test2=# INSERT INTO heap SELECT 1 FROM generate_series(1, 200000000);
INSERT 0 200000000
Time: 369041.832 ms (06:09.042)
Time: 365483.382 ms (06:05.483)
Time: 375506.144 ms (06:15.506)

# Profiled backend running INSERT of 100000000 rows
sudo perf record -p 11951 --call-graph fp sleep 180

Please see attached perf diff between master and branch. We see that we save a bit of overhead from brinRevmapInitialize(), brinRevmapTerminate() and the lock routines.

Regards,
Soumyadeep (VMware)
Attachments
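For readers following along, the mechanism described in the mail above can be pictured roughly as follows: brininsert() caches a small state struct in indexInfo->ii_AmCache on the first call, and a MemoryContext reset callback registered on indexInfo->ii_Context releases the revmap at end of command. This is only an illustrative sketch under assumed, simplified signatures (brinRevmapInitialize() in particular is abbreviated), not the attached patch itself; the bs_* field names and function names follow the discussion.

/*
 * Illustrative sketch only -- simplified from the approach described above;
 * field and helper names follow the thread, other details are assumptions.
 */
typedef struct BrinInsertState
{
    BrinRevmap *bs_rmAccess;        /* cached revmap access struct */
    BrinDesc   *bs_desc;            /* cached BRIN descriptor */
    BlockNumber bs_pages_per_range;
} BrinInsertState;

/* MemoryContext reset callback: drop the revmap pins at end of command */
static void
brininsertCleanupCallback(void *arg)
{
    BrinInsertState *bistate = (BrinInsertState *) arg;

    brinRevmapTerminate(bistate->bs_rmAccess);
    bistate->bs_rmAccess = NULL;
}

/* Called once, on the first brininsert() of the command */
static BrinInsertState *
initialize_brin_insertstate(Relation idxRel, IndexInfo *indexInfo)
{
    BrinInsertState *bistate;
    MemoryContextCallback *cb;

    bistate = MemoryContextAllocZero(indexInfo->ii_Context,
                                     sizeof(BrinInsertState));
    bistate->bs_desc = brin_build_desc(idxRel);
    /* signature abbreviated for this sketch */
    bistate->bs_rmAccess = brinRevmapInitialize(idxRel,
                                                &bistate->bs_pages_per_range);

    /* run the cleanup when ii_Context is reset or deleted */
    cb = MemoryContextAlloc(indexInfo->ii_Context,
                            sizeof(MemoryContextCallback));
    cb->func = brininsertCleanupCallback;
    cb->arg = bistate;
    MemoryContextRegisterResetCallback(indexInfo->ii_Context, cb);

    indexInfo->ii_AmCache = bistate;
    return bistate;
}

Subsequent brininsert() calls then simply pick up ii_AmCache instead of re-reading the metapage and re-pinning revmap buffers for every tuple.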
On 2023-Jul-03, Soumyadeep Chakraborty wrote: > My colleague, Ashwin, pointed out to me that brininsert's per-tuple init > of the revmap access struct can have non-trivial overhead. > > Turns out he is right. We are saving 24 bytes of memory per-call for > the access struct, and a bit on buffer/locking overhead, with the > attached patch. Hmm, yeah, I remember being bit bothered by this repeated initialization. Your patch looks reasonable to me. I would set bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure. Also, please add comments atop these two new functions, to explain what they are. Nice results. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On 7/4/23 13:23, Alvaro Herrera wrote: > On 2023-Jul-03, Soumyadeep Chakraborty wrote: > >> My colleague, Ashwin, pointed out to me that brininsert's per-tuple init >> of the revmap access struct can have non-trivial overhead. >> >> Turns out he is right. We are saving 24 bytes of memory per-call for >> the access struct, and a bit on buffer/locking overhead, with the >> attached patch. > > Hmm, yeah, I remember being bit bothered by this repeated > initialization. Your patch looks reasonable to me. I would set > bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure. > Also, please add comments atop these two new functions, to explain what > they are. > > Nice results. > Yeah. I wonder how much of that runtime is the generate_series(), though. What's the speedup if that part is subtracted. It's guaranteed to be even more significant, but by how much? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thank you both for reviewing!

On Tue, Jul 4, 2023 at 4:24 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> Hmm, yeah, I remember being bit bothered by this repeated
> initialization. Your patch looks reasonable to me. I would set
> bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure.
> Also, please add comments atop these two new functions, to explain what
> they are.

Done. Set bistate->bs_desc = NULL; as well. Added comments.

On Tue, Jul 4, 2023 at 4:59 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> Yeah. I wonder how much of that runtime is the generate_series(),
> though. What's the speedup if that part is subtracted. It's guaranteed
> to be even more significant, but by how much?

When trying COPY, I got tripped by the following:

We get a buffer leak WARNING for the meta page and a revmap page.

WARNING: buffer refcount leak: [094] (rel=base/156912/206068, blockNum=1, flags=0x83000000, refcount=1 1)
WARNING: buffer refcount leak: [093] (rel=base/156912/206068, blockNum=0, flags=0x83000000, refcount=1 1)

PrintBufferLeakWarning bufmgr.c:3240
ResourceOwnerReleaseInternal resowner.c:554
ResourceOwnerRelease resowner.c:494
PortalDrop portalmem.c:563
exec_simple_query postgres.c:1284

We release the buffer during this resowner release and then we crash with:

TRAP: failed Assert("bufnum <= NBuffers"), File: "../../../../src/include/storage/bufmgr.h", Line: 305, PID: 86833
postgres: pivotal test4 [local] COPY(ExceptionalCondition+0xbb)[0x5572b55bcc79]
postgres: pivotal test4 [local] COPY(+0x61ccfc)[0x5572b537dcfc]
postgres: pivotal test4 [local] COPY(ReleaseBuffer+0x19)[0x5572b5384db2]
postgres: pivotal test4 [local] COPY(brinRevmapTerminate+0x1e)[0x5572b4e3fd39]
postgres: pivotal test4 [local] COPY(+0xcfc44)[0x5572b4e30c44]
postgres: pivotal test4 [local] COPY(+0x89e7f2)[0x5572b55ff7f2]
postgres: pivotal test4 [local] COPY(MemoryContextDelete+0xd7)[0x5572b55ff683]
postgres: pivotal test4 [local] COPY(PortalDrop+0x374)[0x5572b5602dc7]

Unfortunately, when we do COPY, the MemoryContext where makeIndexInfo gets called is PortalContext and that is what is set in ii_Context. Furthermore, we clean up the resource owner stuff before we can clean up the MemoryContexts in PortalDrop().

The CurrentMemoryContext when initialize_brin_insertstate() is called varies: for CopyMultiInsertBufferFlush() -> ExecInsertIndexTuples() it is PortalContext, and for CopyFrom() -> ExecInsertIndexTuples() it is ExecutorState/ExprContext. We can't rely on it to register the callback either.

What we can do is create a new MemoryContext for holding the BrinInsertState, and we tie the callback to that so that cleanup is not affected by all of these variables. See v2 patch attached. Passes make installcheck-world and make installcheck -C src/test/modules/brin.
However, we do still have 1 issue with the v2 patch: when we try to cancel (Ctrl-c) a running COPY command:

ERROR: buffer 151 is not owned by resource owner TopTransaction

#4 0x0000559cbc54a934 in ResourceOwnerForgetBuffer (owner=0x559cbd6fcf28, buffer=143) at resowner.c:997
#5 0x0000559cbc2c45e7 in UnpinBuffer (buf=0x7f8d4a8f3f80) at bufmgr.c:2390
#6 0x0000559cbc2c7e49 in ReleaseBuffer (buffer=143) at bufmgr.c:4488
#7 0x0000559cbbd82d53 in brinRevmapTerminate (revmap=0x559cbd7a03b8) at brin_revmap.c:105
#8 0x0000559cbbd73c44 in brininsertCleanupCallback (arg=0x559cbd7a5b68) at brin.c:168
#9 0x0000559cbc54280c in MemoryContextCallResetCallbacks (context=0x559cbd7a5a50) at mcxt.c:506
#10 0x0000559cbc54269d in MemoryContextDelete (context=0x559cbd7a5a50) at mcxt.c:421
#11 0x0000559cbc54273e in MemoryContextDeleteChildren (context=0x559cbd69ae90) at mcxt.c:457
#12 0x0000559cbc54625c in AtAbort_Portals () at portalmem.c:850

Haven't found a way to fix this ^ yet. Maybe there is a better way of doing our cleanup? I'm not sure. Would love your input!

The other alternative for all this is to introduce new AM callbacks for insert_begin and insert_end. That might be a tougher sell?

Now, to finally answer your question about the speedup without generate_series(). We do see an even higher speedup!

seq 1 200000000 > /tmp/data.csv
\timing
DROP TABLE heap;
CREATE TABLE heap(i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
COPY heap FROM '/tmp/data.csv';

-- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
COPY 200000000
Time: 205072.444 ms (03:25.072)
Time: 215380.369 ms (03:35.380)
Time: 203492.347 ms (03:23.492)

-- 3 runs (branch v2)
COPY 200000000
Time: 135052.752 ms (02:15.053)
Time: 135093.131 ms (02:15.093)
Time: 138737.048 ms (02:18.737)

Regards,
Soumyadeep (VMware)
Attachments
On 7/4/23 21:25, Soumyadeep Chakraborty wrote: > Thank you both for reviewing! > > On Tue, Jul 4, 2023 at 4:24AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > >> Hmm, yeah, I remember being bit bothered by this repeated >> initialization. Your patch looks reasonable to me. I would set >> bistate->bs_rmAccess to NULL in the cleanup callback, just to be sure. >> Also, please add comments atop these two new functions, to explain what >> they are. > > Done. Set bistate->bs_desc = NULL; as well. Added comments. > > > On Tue, Jul 4, 2023 at 4:59AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> Yeah. I wonder how much of that runtime is the generate_series(), >> though. What's the speedup if that part is subtracted. It's guaranteed >> to be even more significant, but by how much? > > When trying COPY, I got tripped by the following: > > We get a buffer leak WARNING for the meta page and a revmap page. > > WARNING: buffer refcount leak: [094] (rel=base/156912/206068, > blockNum=1, flags=0x83000000, refcount=1 1) > WARNING: buffer refcount leak: [093] (rel=base/156912/206068, > blockNum=0, flags=0x83000000, refcount=1 1) > > PrintBufferLeakWarning bufmgr.c:3240 > ResourceOwnerReleaseInternal resowner.c:554 > ResourceOwnerRelease resowner.c:494 > PortalDrop portalmem.c:563 > exec_simple_query postgres.c:1284 > > We release the buffer during this resowner release and then we crash > with: > > TRAP: failed Assert("bufnum <= NBuffers"), File: > "../../../../src/include/storage/bufmgr.h", Line: 305, PID: 86833 > postgres: pivotal test4 [local] COPY(ExceptionalCondition+0xbb)[0x5572b55bcc79] > postgres: pivotal test4 [local] COPY(+0x61ccfc)[0x5572b537dcfc] > postgres: pivotal test4 [local] COPY(ReleaseBuffer+0x19)[0x5572b5384db2] > postgres: pivotal test4 [local] COPY(brinRevmapTerminate+0x1e)[0x5572b4e3fd39] > postgres: pivotal test4 [local] COPY(+0xcfc44)[0x5572b4e30c44] > postgres: pivotal test4 [local] COPY(+0x89e7f2)[0x5572b55ff7f2] > postgres: pivotal test4 [local] COPY(MemoryContextDelete+0xd7)[0x5572b55ff683] > postgres: pivotal test4 [local] COPY(PortalDrop+0x374)[0x5572b5602dc7] > > Unfortunately, when we do COPY, the MemoryContext where makeIndexInfo > gets called is PortalContext and that is what is set in ii_Context. > Furthermore, we clean up the resource owner stuff before we can clean > up the MemoryContexts in PortalDrop(). > > The CurrentMemoryContext when initialize_brin_insertstate() is called > depends. For CopyMultiInsertBufferFlush() -> ExecInsertIndexTuples() > it is PortalContext, and for CopyFrom() -> ExecInsertIndexTuples() it is > ExecutorState/ExprContext. We can't rely on it to register the callback > neither. > > What we can do is create a new MemoryContext for holding the > BrinInsertState, and we tie the callback to that so that cleanup is not > affected by all of these variables. See v2 patch attached. Passes make > installcheck-world and make installcheck -C src/test/modules/brin. >> However, we do still have 1 issue with the v2 patch: > When we try to cancel (Ctrl-c) a running COPY command: > ERROR: buffer 151 is not owned by resource owner TopTransaction > I'm not sure if memory context callbacks are the right way to rely on for this purpose. The primary purpose of memory contexts is to track memory, so using them for this seems a bit weird. 
There are cases that do something similar, like expandedrecord.c where we track a refcounted tuple slot, but IMHO there's a big difference between tracking one slot allocated right there, and an unknown number of buffers allocated much later.

The fact that even with the extra context it still doesn't handle query cancellations is another argument against that approach (I wonder how expandedrecord.c handles that, but I haven't checked).

> Maybe there is a better way of doing our cleanup? I'm not sure. Would
> love your input!
>
> The other alternative for all this is to introduce new AM callbacks for
> insert_begin and insert_end. That might be a tougher sell?

That's the approach I wanted to suggest, more or less - to do the cleanup from ExecCloseIndices() before index_close(). I wonder if it's even correct to do that later, once we release the locks etc.

I don't think ii_AmCache was intended for stuff like this - GIN and GiST only use it to cache stuff that can be just pfree-d, but for buffers that's not enough. It's not surprising we need to improve this.

FWIW while debugging this (breakpoint on MemoryContextDelete) I was rather annoyed that COPY keeps dropping and recreating the two BRIN contexts - brininsert cxt / brin dtuple. I wonder if we could keep and reuse those too, but I don't know how much it'd help.

> Now, to finally answer your question about the speedup without
> generate_series(). We do see an even higher speedup!
>
> seq 1 200000000 > /tmp/data.csv
> \timing
> DROP TABLE heap;
> CREATE TABLE heap(i int);
> CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
> COPY heap FROM '/tmp/data.csv';
>
> -- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
> COPY 200000000
> Time: 205072.444 ms (03:25.072)
> Time: 215380.369 ms (03:35.380)
> Time: 203492.347 ms (03:23.492)
>
> -- 3 runs (branch v2)
> COPY 200000000
> Time: 135052.752 ms (02:15.053)
> Time: 135093.131 ms (02:15.093)
> Time: 138737.048 ms (02:18.737)

That's nice, but it still doesn't say how much of that is reading the data. If you do just copy into a table without any indexes, how long does it take?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
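The ExecCloseIndices() placement suggested in the mail above could look roughly like the sketch below. index_insert_cleanup() is a name assumed here for whatever new entry point ends up being added; the ResultRelInfo fields and index_close() call are the existing ones, but the exact shape is an assumption, not a patch.

void
ExecCloseIndices(ResultRelInfo *resultRelInfo)
{
    int         i;
    int         numIndices = resultRelInfo->ri_NumIndices;
    RelationPtr indexDescs = resultRelInfo->ri_IndexRelationDescs;
    IndexInfo **indexInfos = resultRelInfo->ri_IndexRelationInfo;

    for (i = 0; i < numIndices; i++)
    {
        if (indexDescs[i] == NULL)
            continue;

        /*
         * Let the index AM release buffers etc. cached in ii_AmCache while
         * we still hold the lock taken by ExecOpenIndices().
         * (index_insert_cleanup() is a hypothetical wrapper at this point
         * in the thread.)
         */
        index_insert_cleanup(indexDescs[i], indexInfos[i]);

        /* Drop the lock acquired by ExecOpenIndices() */
        index_close(indexDescs[i], RowExclusiveLock);
    }
}

Doing it here keeps the cleanup next to the place that opened the indexes, and it happens before the locks are released, which addresses the ordering concern raised above.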
On Tue, Jul 4, 2023 at 2:54 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> I'm not sure if memory context callbacks are the right way to rely on
> for this purpose. The primary purpose of memory contexts is to track
> memory, so using them for this seems a bit weird.

Yeah, this just kept getting dirtier and dirtier.

> There are cases that do something similar, like expandendrecord.c where
> we track refcounted tuple slot, but IMHO there's a big difference
> between tracking one slot allocated right there, and unknown number of
> buffers allocated much later.

Yeah, the following code in ER_mc_callback is there I think to prevent double freeing the tupdesc (since it might be freed in ResourceOwnerReleaseInternal()) (the part about: /* Ditto for tupdesc references */).

if (tupdesc->tdrefcount > 0)
{
    if (--tupdesc->tdrefcount == 0)
        FreeTupleDesc(tupdesc);
}

Plus the above code doesn't try anything with Resource owner stuff, whereas releasing a buffer means:
ReleaseBuffer() -> UnpinBuffer() -> ResourceOwnerForgetBuffer(CurrentResourceOwner, b);

> The fact that even with the extra context is still doesn't handle query
> cancellations is another argument against that approach (I wonder how
> expandedrecord.c handles that, but I haven't checked).
>
> > Maybe there is a better way of doing our cleanup? I'm not sure. Would
> > love your input!
> >
> > The other alternative for all this is to introduce new AM callbacks for
> > insert_begin and insert_end. That might be a tougher sell?
>
> That's the approach I wanted to suggest, more or less - to do the
> cleanup from ExecCloseIndices() before index_close(). I wonder if it's
> even correct to do that later, once we release the locks etc.

I'll try this out and introduce a couple of new index AM callbacks. I think it's best to do it before releasing the locks - otherwise it might be weird to manipulate buffers of an index relation without having some sort of lock on it. I'll think about it some more.

> I don't think ii_AmCache was intended for stuff like this - GIN and GiST
> only use it to cache stuff that can be just pfree-d, but for buffers
> that's no enough. It's not surprising we need to improve this.

Hmmm, yes, although the docs state:

"If the index AM wishes to cache data across successive index insertions within an SQL statement, it can allocate space in indexInfo->ii_Context and store a pointer to the data in indexInfo->ii_AmCache (which will be NULL initially)."

they don't mention anything about buffer usage. Well, we will fix it!

PS: It should be possible to make GIN and GiST use the new index AM APIs as well.

> FWIW while debugging this (breakpoint on MemoryContextDelete) I was
> rather annoyed the COPY keeps dropping and recreating the two BRIN
> contexts - brininsert cxt / brin dtuple.

Interesting, I will investigate that.

> > Now, to finally answer your question about the speedup without
> > generate_series(). We do see an even higher speedup!
> >
> > seq 1 200000000 > /tmp/data.csv
> > \timing
> > DROP TABLE heap;
> > CREATE TABLE heap(i int);
> > CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
> > COPY heap FROM '/tmp/data.csv';
> >
> > -- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622)
> > COPY 200000000
> > Time: 205072.444 ms (03:25.072)
> > Time: 215380.369 ms (03:35.380)
> > Time: 203492.347 ms (03:23.492)
> >
> > -- 3 runs (branch v2)
> > COPY 200000000
> > Time: 135052.752 ms (02:15.053)
> > Time: 135093.131 ms (02:15.093)
> > Time: 138737.048 ms (02:18.737)
>
> That's nice, but it still doesn't say how much of that is reading the
> data. If you do just copy into a table without any indexes, how long
> does it take?

So, I loaded the same heap table without any indexes and at the same scale. I got:

postgres=# COPY heap FROM '/tmp/data.csv';
COPY 200000000
Time: 116161.545 ms (01:56.162)
Time: 114182.745 ms (01:54.183)
Time: 114975.368 ms (01:54.975)

perf diff also attached between the three: w/ no indexes (baseline), master and v2.

Regards,
Soumyadeep (VMware)
Attachments
On 7/5/23 02:35, Soumyadeep Chakraborty wrote: > On Tue, Jul 4, 2023 at 2:54 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > >> I'm not sure if memory context callbacks are the right way to rely on >> for this purpose. The primary purpose of memory contexts is to track >> memory, so using them for this seems a bit weird. > > Yeah, this just kept getting dirtier and dirtier. > >> There are cases that do something similar, like expandendrecord.c where >> we track refcounted tuple slot, but IMHO there's a big difference >> between tracking one slot allocated right there, and unknown number of >> buffers allocated much later. > > Yeah, the following code in ER_mc_callbackis is there I think to prevent double > freeing the tupdesc (since it might be freed in ResourceOwnerReleaseInternal()) > (The part about: /* Ditto for tupdesc references */). > > if (tupdesc->tdrefcount > 0) > { > if (--tupdesc->tdrefcount == 0) > FreeTupleDesc(tupdesc); > } > Plus the above code doesn't try anything with Resource owner stuff, whereas > releasing a buffer means: > ReleaseBuffer() -> UnpinBuffer() -> > ResourceOwnerForgetBuffer(CurrentResourceOwner, b); > Well, my understanding is the expandedrecord.c code increments the refcount exactly to prevent the resource owner to release the slot too early. My assumption is we'd need to do something similar for the revmap buffers by calling IncrBufferRefCount or something. But that's going to be messy, because the buffers are read elsewhere. >> The fact that even with the extra context is still doesn't handle query >> cancellations is another argument against that approach (I wonder how >> expandedrecord.c handles that, but I haven't checked). >> >>> >>> Maybe there is a better way of doing our cleanup? I'm not sure. Would >>> love your input! >>> >>> The other alternative for all this is to introduce new AM callbacks for >>> insert_begin and insert_end. That might be a tougher sell? >>> >> >> That's the approach I wanted to suggest, more or less - to do the >> cleanup from ExecCloseIndices() before index_close(). I wonder if it's >> even correct to do that later, once we release the locks etc. > > I'll try this out and introduce a couple of new index AM callbacks. I > think it's best to do it before releasing the locks - otherwise it > might be weird > to manipulate buffers of an index relation, without having some sort of lock on > it. I'll think about it some more. > I don't understand why would this need more than just a callback to release the cache. >> I don't think ii_AmCache was intended for stuff like this - GIN and GiST >> only use it to cache stuff that can be just pfree-d, but for buffers >> that's no enough. It's not surprising we need to improve this. > > Hmmm, yes, although the docs state: > "If the index AM wishes to cache data across successive index insertions within > an SQL statement, it can allocate space in indexInfo->ii_Context and > store a pointer > to the data in indexInfo->ii_AmCache (which will be NULL initially)." > they don't mention anything about buffer usage. Well we will fix it! > > PS: It should be possible to make GIN and GiST use the new index AM APIs > as well. > Why should GIN/GiST use the new API? I think it's perfectly sensible to only require the "cleanup callback" when just pfree() is not enough. >> FWIW while debugging this (breakpoint on MemoryContextDelete) I was >> rather annoyed the COPY keeps dropping and recreating the two BRIN >> contexts - brininsert cxt / brin dtuple. 
I wonder if we could keep and >> reuse those too, but I don't know how much it'd help. >> > > Interesting, I will investigate that. > >>> Now, to finally answer your question about the speedup without >>> generate_series(). We do see an even higher speedup! >>> >>> seq 1 200000000 > /tmp/data.csv >>> \timing >>> DROP TABLE heap; >>> CREATE TABLE heap(i int); >>> CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1); >>> COPY heap FROM '/tmp/data.csv'; >>> >>> -- 3 runs (master 29cf61ade3f245aa40f427a1d6345287ef77e622) >>> COPY 200000000 >>> Time: 205072.444 ms (03:25.072) >>> Time: 215380.369 ms (03:35.380) >>> Time: 203492.347 ms (03:23.492) >>> >>> -- 3 runs (branch v2) >>> >>> COPY 200000000 >>> Time: 135052.752 ms (02:15.053) >>> Time: 135093.131 ms (02:15.093) >>> Time: 138737.048 ms (02:18.737) >>> >> >> That's nice, but it still doesn't say how much of that is reading the >> data. If you do just copy into a table without any indexes, how long >> does it take? > > So, I loaded the same heap table without any indexes and at the same > scale. I got: > > postgres=# COPY heap FROM '/tmp/data.csv'; > COPY 200000000 > Time: 116161.545 ms (01:56.162) > Time: 114182.745 ms (01:54.183) > Time: 114975.368 ms (01:54.975) > OK, so the baseline is 115 seconds. With the current code, an index means +100 seconds. With the optimization, it's just +20. Nice. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 5, 2023 at 3:16 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > I'll try this out and introduce a couple of new index AM callbacks. I > > think it's best to do it before releasing the locks - otherwise it > > might be weird > > to manipulate buffers of an index relation, without having some sort of lock on > > it. I'll think about it some more. > > > > I don't understand why would this need more than just a callback to > release the cache. We wouldn't. I thought that it would be slightly cleaner and slightly more performant if we moved the (if !state) branches out of the XXXinsert() functions. But I guess, let's minimize the changes here. One cleanup callback is enough. > > PS: It should be possible to make GIN and GiST use the new index AM APIs > > as well. > > > > Why should GIN/GiST use the new API? I think it's perfectly sensible to > only require the "cleanup callback" when just pfree() is not enough. Yeah no need. Attached v3 of the patch w/ a single index AM callback. Regards, Soumyadeep (VMware)
Attachments
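For context, the single-callback shape in v3 is roughly the sketch below (signatures assumed, the real insertion work elided). brininsert() lazily builds and caches the insert state on its first call; initialize_brin_insertstate() here would simply allocate the state in ii_Context without registering any memory-context callback, and the new cleanup callback releases the revmap when the cache is torn down.

bool
brininsert(Relation idxRel, Datum *values, bool *nulls,
           ItemPointer heaptid, Relation heapRel,
           IndexUniqueCheck checkUnique, bool indexUnchanged,
           IndexInfo *indexInfo)
{
    BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;

    /* First insert of this command: build and cache the insert state. */
    if (bistate == NULL)
        bistate = initialize_brin_insertstate(idxRel, indexInfo);

    /*
     * ... the usual brininsert() work, now using bistate->bs_rmAccess and
     * bistate->bs_desc instead of a per-tuple brinRevmapInitialize() ...
     */
    return false;
}

/* New index AM callback: release resources cached by brininsert() */
void
brininsertcleanup(IndexInfo *indexInfo)
{
    BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;

    Assert(bistate);

    /* drop the revmap pins held since the first brininsert() call */
    brinRevmapTerminate(bistate->bs_rmAccess);
    bistate->bs_rmAccess = NULL;
    bistate->bs_desc = NULL;    /* the memory itself goes away with ii_Context */
}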
Attached v4 of the patch, rebased against latest HEAD. Regards, Soumyadeep (VMware)
Attachments
Created an entry for the Sep CF: https://commitfest.postgresql.org/44/4484/ Regards, Soumyadeep (VMware) On Sat, Jul 29, 2023 at 9:28 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote: > > Attached v4 of the patch, rebased against latest HEAD. > > Regards, > Soumyadeep (VMware)
Rebased against latest HEAD. Regards, Soumyadeep (VMware)
Attachments
Hi,

I took a look at this patch today. I had to rebase the patch, due to some minor bitrot related to 9f0602539d (but nothing major). I also did a couple tiny cosmetic tweaks, but other than that the patch seems OK. See the attached v6.

I did some simple performance tests too, similar to those in the initial message:

CREATE UNLOGGED TABLE heap (i int);
CREATE INDEX ON heap USING brin(i) WITH (pages_per_range=1);
-- TRUNCATE heap;
INSERT INTO heap SELECT 1 FROM generate_series(1, 20000000);

And the results look like this (5 runs each):

master:  16448.338 16066.473 16039.166 16067.420 16080.066
patched: 13260.065 13229.800 13254.454 13265.479 13273.693

So that's a nice improvement, even though enabling WAL will make the relative speedup somewhat smaller.

The one thing I'm not entirely sure about is adding new stuff to the IndexAmRoutine. I don't think we want to end up with too many callbacks that all AMs have to initialize etc. I can't think of a different/better way to do this, though.

Barring objections, I'll try to push this early next week, after another round of cleanup.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
On Fri, 3 Nov 2023 at 19:37, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > Hi, > > I took a look at this patch today. I had to rebase the patch, due to > some minor bitrot related to 9f0602539d (but nothing major). I also did > a couple tiny cosmetic tweaks, but other than that the patch seems OK. > See the attached v6. > [...] > Barring objections, I'll try to push this early next week, after another > round of cleanup. No hard objections: The principle looks fine. I do think we should choose a better namespace than bs_* for the fields of BrinInsertState, as BrinBuildState already uses the bs_* namespace for its fields in the same file, but that's only cosmetic. Kind regards, Matthias van de Meent Neon (https://neon.tech)
On Fri, 3 Nov 2023 at 19:37, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > The one thing I'm not entirely sure about is adding new stuff to the > IndexAmRoutine. I don't think we want to end up with too many callbacks > that all AMs have to initialize etc. I can't think of a different/better > way to do this, though. Yes there is not really an alternative. Also, aminsertcleanup() is very similar to amvacuumcleanup(), so it is not awkward. Why should vacuum be an exclusive VIP? :) And there are other indexam callbacks that not every AM implements. So this addition is not unprecedented in that sense. > Barring objections, I'll try to push this early next week, after another > round of cleanup. Many thanks for resurrecting this patch! On Fri, Nov 3, 2023 at 12:16PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > I do think we should choose a better namespace than bs_* for the > fields of BrinInsertState, as BrinBuildState already uses the bs_* > namespace for its fields in the same file, but that's only cosmetic. > bis_* then. Regards, Soumyadeep (VMware)
I've done a bit more cleanup on the last version of the patch (renamed the fields to start with bis_ as agreed, rephrased the comments / docs / commit message a bit) and pushed. Thanks for the patch! -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Nov 25, 2023 at 12:06 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
I've done a bit more cleanup on the last version of the patch (renamed
the fields to start with bis_ as agreed, rephrased the comments / docs /
commit message a bit) and pushed.
Thanks a lot Tomas for helping to drive the patch to completion iteratively and realizing the benefits.
- Ashwin
Thanks a lot for reviewing and pushing! :) Regards, Soumyadeep (VMware)
On Sun, Nov 26, 2023 at 4:06 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
I've done a bit more cleanup on the last version of the patch (renamed
the fields to start with bis_ as agreed, rephrased the comments / docs /
commit message a bit) and pushed.
It seems that we have an oversight in this commit. If there is no tuple
that has been inserted, we wouldn't have an available insert state in
the clean up phase. So the Assert in brininsertcleanup() is not always
right. For example:
regression=# update brin_summarize set value = brin_summarize.value;
server closed the connection unexpectedly
So I wonder if we should check 'bistate' and do the clean up only if
there is an available one, something like below.
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -359,7 +359,9 @@ brininsertcleanup(IndexInfo *indexInfo)
{
BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;
- Assert(bistate);
+ /* We don't have an available insert state, nothing to do */
+ if (!bistate)
+ return;
Thanks
Richard
On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com> wrote:
>
> On Sun, Nov 26, 2023 at 4:06 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>>
>> I've done a bit more cleanup on the last version of the patch (renamed
>> the fields to start with bis_ as agreed, rephrased the comments / docs /
>> commit message a bit) and pushed.
>
> It seems that we have an oversight in this commit. If there is no tuple
> that has been inserted, we wouldn't have an available insert state in
> the clean up phase. So the Assert in brininsertcleanup() is not always
> right. For example:
>
> regression=# update brin_summarize set value = brin_summarize.value;
> server closed the connection unexpectedly

I wasn't able to repro the issue on 86b64bafc19c4c60136a4038d2a8d1e6eecc59f2 with UPDATE/INSERT:

postgres=# drop table a;
DROP TABLE
postgres=# create table a(i int);
CREATE TABLE
postgres=# create index on a using brin(i);
CREATE INDEX
postgres=# insert into a select 1 where 1!=1;
INSERT 0 0
postgres=# update a set i = 2 where i = 0;
UPDATE 0
postgres=# update a set i = a.i;
UPDATE 0

This could be because since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7, we have moved ExecOpenIndices() from ExecInitModifyTable() to ExecInsert(). Since we never open the indices if nothing is inserted, we would never attempt to close them with ExecCloseIndices() while the ii_AmCache is NULL (which is what causes this assertion failure).

However, it is possible to repro the issue with:

# create empty file
touch /tmp/a.csv

postgres=# create table a(i int);
CREATE TABLE
postgres=# create index on a using brin(i);
CREATE INDEX
postgres=# copy a from '/tmp/a.csv';
TRAP: failed Assert("bistate"), File: "brin.c", Line: 362, PID: 995511

> So I wonder if we should check 'bistate' and do the clean up only if there is an available one, something like below.

Yes, this is the right way to go. Thanks for reporting!

Regards,
Soumyadeep (VMware)
On Mon, Nov 27, 2023 at 1:53 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com> wrote:
> It seems that we have an oversight in this commit. If there is no tuple
> that has been inserted, we wouldn't have an available insert state in
> the clean up phase. So the Assert in brininsertcleanup() is not always
> right. For example:
>
> regression=# update brin_summarize set value = brin_summarize.value;
> server closed the connection unexpectedly
I wasn't able to repro the issue on 86b64bafc19c4c60136a4038d2a8d1e6eecc59f2.
with UPDATE/INSERT:
This could be because since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7,
we have moved ExecOpenIndices()
from ExecInitModifyTable() to ExecInsert(). Since we never open the
indices if nothing is
inserted, we would never attempt to close them with ExecCloseIndices()
while the ii_AmCache
is NULL (which is what causes this assertion failure).
AFAICS we would also open the indices from ExecUpdate(). So if we
update the table in a way that no new tuples are inserted, we will have
this issue. As I showed previously, the query below crashes for me on
latest master (dc9f8a7983).
regression=# update brin_summarize set value = brin_summarize.value;
server closed the connection unexpectedly
There are other code paths that call ExecOpenIndices(), such as
ExecMerge(). I believe it's not hard to create queries that trigger this
Assert for those cases.
Thanks
Richard
On 11/27/23 08:37, Richard Guo wrote:
>
> On Mon, Nov 27, 2023 at 1:53 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
>
> On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com> wrote:
> > It seems that we have an oversight in this commit. If there is no tuple
> > that has been inserted, we wouldn't have an available insert state in
> > the clean up phase. So the Assert in brininsertcleanup() is not always
> > right. For example:
> >
> > regression=# update brin_summarize set value = brin_summarize.value;
> > server closed the connection unexpectedly
>
> I wasn't able to repro the issue on 86b64bafc19c4c60136a4038d2a8d1e6eecc59f2. with UPDATE/INSERT:
>
> This could be because since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7,
> we have moved ExecOpenIndices() from ExecInitModifyTable() to ExecInsert().
> Since we never open the indices if nothing is inserted, we would never
> attempt to close them with ExecCloseIndices() while the ii_AmCache
> is NULL (which is what causes this assertion failure).
>
> AFAICS we would also open the indices from ExecUpdate(). So if we
> update the table in a way that no new tuples are inserted, we will have
> this issue. As I showed previously, the query below crashes for me on
> latest master (dc9f8a7983).
>
> regression=# update brin_summarize set value = brin_summarize.value;
> server closed the connection unexpectedly
>
> There are other code paths that call ExecOpenIndices(), such as
> ExecMerge(). I believe it's not hard to create queries that trigger
> this Assert for those cases.

FWIW I can readily reproduce it like this:

drop table t;
create table t (a int);
insert into t values (1);
create index on t using brin (a);
update t set a = a;

I however wonder if maybe we should do the check in index_insert_cleanup and not in the AM callback. That seems simpler / better, because the AM callbacks then can't make this mistake.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
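A sketch of what the indexam.c-side check wondered about above might look like; the index_insert_cleanup() signature here is an assumption based on the discussion. The wrapper skips AMs without a callback and skips the call when no cache was ever built, so the AM callback never sees a NULL ii_AmCache.

void
index_insert_cleanup(Relation indexRelation, IndexInfo *indexInfo)
{
    Assert(indexInfo != NULL);

    /* no-op for AMs without a cleanup callback, or if nothing was cached */
    if (indexRelation->rd_indam->aminsertcleanup != NULL &&
        indexInfo->ii_AmCache != NULL)
        indexRelation->rd_indam->aminsertcleanup(indexInfo);
}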
On 11/27/23 11:34, Tomas Vondra wrote: > > > On 11/27/23 08:37, Richard Guo wrote: >> >> On Mon, Nov 27, 2023 at 1:53 PM Soumyadeep Chakraborty >> <soumyadeep2007@gmail.com <mailto:soumyadeep2007@gmail.com>> wrote: >> >> On Sun, Nov 26, 2023 at 9:28 PM Richard Guo <guofenglinux@gmail.com >> <mailto:guofenglinux@gmail.com>> wrote: >> > It seems that we have an oversight in this commit. If there is no >> tuple >> > that has been inserted, we wouldn't have an available insert state in >> > the clean up phase. So the Assert in brininsertcleanup() is not >> always >> > right. For example: >> > >> > regression=# update brin_summarize set value = brin_summarize.value; >> > server closed the connection unexpectedly >> >> I wasn't able to repro the issue on >> 86b64bafc19c4c60136a4038d2a8d1e6eecc59f2. >> with UPDATE/INSERT: >> >> This could be because since c5b7ba4e67aeb5d6f824b74f94114d99ed6e42b7, >> we have moved ExecOpenIndices() >> from ExecInitModifyTable() to ExecInsert(). Since we never open the >> indices if nothing is >> inserted, we would never attempt to close them with ExecCloseIndices() >> while the ii_AmCache >> is NULL (which is what causes this assertion failure). >> >> >> AFAICS we would also open the indices from ExecUpdate(). So if we >> update the table in a way that no new tuples are inserted, we will have >> this issue. As I showed previously, the query below crashes for me on >> latest master (dc9f8a7983). >> >> regression=# update brin_summarize set value = brin_summarize.value; >> server closed the connection unexpectedly >> >> There are other code paths that call ExecOpenIndices(), such as >> ExecMerge(). I believe it's not hard to create queries that trigger >> this Assert for those cases. >> > > FWIW I can readily reproduce it like this: > > drop table t; > create table t (a int); > insert into t values (1); > create index on t using brin (a); > update t set a = a; > > I however wonder if maybe we should do the check in index_insert_cleanup > and not in the AM callback. That seems simpler / better, because the AM > callbacks then can't make this mistake. > I did it this way (check in index_insert_cleanup) and pushed. Thanks for the report! regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello Tomas and Soumyadeep,

25.11.2023 23:06, Tomas Vondra wrote:
> I've done a bit more cleanup on the last version of the patch (renamed
> the fields to start with bis_ as agreed, rephrased the comments / docs /
> commit message a bit) and pushed.

Please look at a query which produces warnings similar to the ones observed upthread:

CREATE TABLE t(a INT);
INSERT INTO t SELECT x FROM generate_series(1,10000) x;
CREATE INDEX idx ON t USING brin (a);
REINDEX index CONCURRENTLY idx;

WARNING: resource was not closed: [1863] (rel=base/16384/16389, blockNum=1, flags=0x93800000, refcount=1 1)
WARNING: resource was not closed: [1862] (rel=base/16384/16389, blockNum=0, flags=0x93800000, refcount=1 1)

The first bad commit for this anomaly is c1ec02be1.

Maybe you would also want to fix in passing some typos/inconsistencies introduced with recent brin-related commits:
s/bs_blkno/bt_blkno/
s/emptry/empty/
s/firt/first/
s/indexinsertcleanup/aminsertcleanup/
s/ maxRange/ nextRange/
s/paga /page /

Best regards,
Alexander
On 12/11/23 16:00, Alexander Lakhin wrote: > Hello Tomas and Soumyadeep, > > 25.11.2023 23:06, Tomas Vondra wrote: >> I've done a bit more cleanup on the last version of the patch (renamed >> the fields to start with bis_ as agreed, rephrased the comments / docs / >> commit message a bit) and pushed. > > Please look at a query, which produces warnings similar to the ones > observed upthread: > CREATE TABLE t(a INT); > INSERT INTO t SELECT x FROM generate_series(1,10000) x; > CREATE INDEX idx ON t USING brin (a); > REINDEX index CONCURRENTLY idx; > > WARNING: resource was not closed: [1863] (rel=base/16384/16389, > blockNum=1, flags=0x93800000, refcount=1 1) > WARNING: resource was not closed: [1862] (rel=base/16384/16389, > blockNum=0, flags=0x93800000, refcount=1 1) > > The first bad commit for this anomaly is c1ec02be1. > Thanks for the report. I haven't investigated what the issue is, but it seems we fail to release the buffers in some situations - I'd bet we fail to call the cleanup callback in some place, or something like that. I'll take a look tomorrow. > May be you would also want to fix in passing some typos/inconsistencies > introduced with recent brin-related commits: > s/bs_blkno/bt_blkno/ > s/emptry/empty/ > s/firt/first/ > s/indexinsertcleanup/aminsertcleanup/ > s/ maxRange/ nextRange/ > s/paga /page / > Definitely. Thanks for noticing those! regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/11/23 16:41, Tomas Vondra wrote: > On 12/11/23 16:00, Alexander Lakhin wrote: >> Hello Tomas and Soumyadeep, >> >> 25.11.2023 23:06, Tomas Vondra wrote: >>> I've done a bit more cleanup on the last version of the patch (renamed >>> the fields to start with bis_ as agreed, rephrased the comments / docs / >>> commit message a bit) and pushed. >> >> Please look at a query, which produces warnings similar to the ones >> observed upthread: >> CREATE TABLE t(a INT); >> INSERT INTO t SELECT x FROM generate_series(1,10000) x; >> CREATE INDEX idx ON t USING brin (a); >> REINDEX index CONCURRENTLY idx; >> >> WARNING: resource was not closed: [1863] (rel=base/16384/16389, >> blockNum=1, flags=0x93800000, refcount=1 1) >> WARNING: resource was not closed: [1862] (rel=base/16384/16389, >> blockNum=0, flags=0x93800000, refcount=1 1) >> >> The first bad commit for this anomaly is c1ec02be1. >> > > Thanks for the report. I haven't investigated what the issue is, but it > seems we fail to release the buffers in some situations - I'd bet we > fail to call the cleanup callback in some place, or something like that. > I'll take a look tomorrow. > Yeah, just as I expected this seems to be a case of failing to call the index_insert_cleanup after doing some inserts - in this case the inserts happen in table_index_validate_scan, but validate_index has no idea it needs to do the cleanup. The attached 0001 patch fixes this by adding the call to validate_index, which seems like the proper place as it's where the indexInfo is allocated and destroyed. But this makes me think - are there any other places that might call index_insert without then also doing the cleanup? I did grep for the index_insert() calls, and it seems OK except for this one. There's a call in toast_internals.c, but that seems OK because that only deals with btree indexes (and those don't need any cleanup). The same logic applies to unique_key_recheck(). The rest goes through execIndexing.c, which should do the cleanup in ExecCloseIndices(). Note: We should probably call the cleanup even in the btree cases, even if only to make it clear it needs to be called after index_insert(). I was thinking maybe we should have some explicit call to destroy the IndexInfo, but that seems rather inconvenient - it'd force everyone to carefully track lifetimes of the IndexInfo instead of just relying on memory context reset/destruction. That seems quite error-prone. I propose we do a much simpler thing instead - allow the cache may be initialized / cleaned up repeatedly, and make sure it gets reset at convenient place (typically after index_insert calls that don't go through execIndexing). That'd mean the cleanup does not need to happen very far from the index_insert(), which makes the reasoning much easier. 0002 does this. But maybe there's a better way to ensure the cleanup gets called even when not using execIndexing. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
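To make the 0001 idea concrete, the relevant part of validate_index() would end up shaped roughly like the sketch below. This is a standalone illustration rather than the attached patch hunk, and the variable names are assumptions.

static void
validate_index_scan_sketch(Relation heapRelation, Relation indexRelation,
                           IndexInfo *indexInfo, Snapshot snapshot,
                           ValidateIndexState *state)
{
    /* Phase 3 of CREATE INDEX CONCURRENTLY: insert any missing tuples. */
    table_index_validate_scan(heapRelation, indexRelation, indexInfo,
                              snapshot, state);

    /*
     * The scan above may have gone through index_insert(), so let the
     * index AM release whatever it cached (revmap pins, in BRIN's case)
     * while the index is still open and locked.  Without this, REINDEX
     * CONCURRENTLY on a BRIN index reports the "resource was not closed"
     * warnings shown earlier in the thread.
     */
    index_insert_cleanup(indexRelation, indexInfo);
}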
On Tue, Dec 12, 2023 at 3:25 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > The attached 0001 patch fixes this by adding the call to validate_index, which seems like the proper place as it's where the indexInfo is allocated and destroyed. Yes, and by reading validate_index's header comment, there is a clear expectation that we will be adding tuples to the index in the table AM call table_index_validate_scan (albeit "validating" doesn't necessarily convey this side effect). So, it makes perfect sense to call it here. > But this makes me think - are there any other places that might call > index_insert without then also doing the cleanup? I did grep for the > index_insert() calls, and it seems OK except for this one. > > There's a call in toast_internals.c, but that seems OK because that only > deals with btree indexes (and those don't need any cleanup). The same > logic applies to unique_key_recheck(). The rest goes through > execIndexing.c, which should do the cleanup in ExecCloseIndices(). > > Note: We should probably call the cleanup even in the btree cases, even > if only to make it clear it needs to be called after index_insert(). Agreed. Doesn't feel great, but yeah all of these btree specific code does call through index_* functions, instead of calling btree functions directly. So, ideally we should follow through with that pattern and call the cleanup explicitly. But we are introducing per-tuple overhead that is totally wasted. Maybe we can add a comment instead like: void toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock) { int i; /* * Save a few cycles by choosing not to call index_insert_cleanup(). Toast * indexes are btree, which don't have a aminsertcleanup() anyway. */ /* Close relations and clean up things */ ... } And add something similar for unique_key_recheck()? That said, I am also not opposed to just accepting these wasted cycles, if the commenting seems wonky. > I propose we do a much simpler thing instead - allow the cache may be > initialized / cleaned up repeatedly, and make sure it gets reset at > convenient place (typically after index_insert calls that don't go > through execIndexing). That'd mean the cleanup does not need to happen > very far from the index_insert(), which makes the reasoning much easier. > 0002 does this. That kind of goes against the ethos of the ii_AmCache, which is meant to capture state to be used across multiple index inserts. Also, quoting the current docs: "If the index AM wishes to cache data across successive index insertions within an SQL statement, it can allocate space in <literal>indexInfo->ii_Context</literal> and store a pointer to the data in <literal>indexInfo->ii_AmCache</literal> (which will be NULL initially). After the index insertions complete, the memory will be freed automatically. If additional cleanup is required (e.g. if the cache contains buffers and tuple descriptors), the AM may define an optional function <literal>indexinsertcleanup</literal>, called before the memory is released." The memory will be freed automatically (as soon as ii_Context goes away). So, why would we explicitly free it, like in the attached 0002 patch? And the whole point of the cleanup function is to do something other than freeing memory, as the docs note highlights so well. Also, the 0002 patch does introduce per-tuple function call overhead in heapam_index_validate_scan(). 
Also, now we have split brain...in certain situations we want to call index_insert_cleanup() per index insert and in certain cases we would like it called once for all inserts. Not very easy to understand IMO. Finally, I don't think that any index AM would have anything to clean up after every insert. I had tried (abused) an approach with MemoryContextCallbacks upthread and that seems a no-go. And yes I agree, having a dual to makeIndexInfo() (like destroyIndexInfo()) means that we lose the benefits of ii_Context. That could be disruptive to existing AMs in-core and outside. All said and done, v1 has my vote. Regards, Soumyadeep (VMware)
On 12/13/23 09:12, Soumyadeep Chakraborty wrote: > On Tue, Dec 12, 2023 at 3:25 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: > > >> The attached 0001 patch fixes this by adding the call to validate_index, > which seems like the proper place as it's where the indexInfo is > allocated and destroyed. > > Yes, and by reading validate_index's header comment, there is a clear > expectation that we will be adding tuples to the index in the table AM call > table_index_validate_scan (albeit "validating" doesn't necessarily convey this > side effect). So, it makes perfect sense to call it here. > OK > >> But this makes me think - are there any other places that might call >> index_insert without then also doing the cleanup? I did grep for the >> index_insert() calls, and it seems OK except for this one. >> >> There's a call in toast_internals.c, but that seems OK because that only >> deals with btree indexes (and those don't need any cleanup). The same >> logic applies to unique_key_recheck(). The rest goes through >> execIndexing.c, which should do the cleanup in ExecCloseIndices(). >> >> Note: We should probably call the cleanup even in the btree cases, even >> if only to make it clear it needs to be called after index_insert(). > > Agreed. Doesn't feel great, but yeah all of these btree specific code does call > through index_* functions, instead of calling btree functions directly. So, > ideally we should follow through with that pattern and call the cleanup > explicitly. But we are introducing per-tuple overhead that is totally wasted. I haven't tried but I very much doubt this will be measurable. It's just a trivial check if a pointer is NULL. We do far more expensive stuff in this code path. > Maybe we can add a comment instead like: > > void > toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock) > { > int i; > > /* > * Save a few cycles by choosing not to call index_insert_cleanup(). Toast > * indexes are btree, which don't have a aminsertcleanup() anyway. > */ > > /* Close relations and clean up things */ > ... > } > > And add something similar for unique_key_recheck()? That said, I am also not > opposed to just accepting these wasted cycles, if the commenting seems wonky. > I really don't want to do this sort of stuff unless we know it actually saves something. >> I propose we do a much simpler thing instead - allow the cache may be >> initialized / cleaned up repeatedly, and make sure it gets reset at >> convenient place (typically after index_insert calls that don't go >> through execIndexing). That'd mean the cleanup does not need to happen >> very far from the index_insert(), which makes the reasoning much easier. >> 0002 does this. > > That kind of goes against the ethos of the ii_AmCache, which is meant > to capture state to be used across multiple index inserts. Why would it be against the ethos? The point is that we reuse stuff over multiple index_insert() calls. If we can do that for all inserts, cool. But if that's difficult, it's maybe better to cache for smaller batches of inserts (instead of making it more complex for all index AMs, even those not doing any caching). > Also, quoting the current docs: > > "If the index AM wishes to cache data across successive index insertions > within an SQL statement, it can allocate space > in <literal>indexInfo->ii_Context</literal> and store a pointer to the > data in <literal>indexInfo->ii_AmCache</literal> (which will be NULL > initially). 
After the index insertions complete, the memory will be freed > automatically. If additional cleanup is required (e.g. if the cache contains > buffers and tuple descriptors), the AM may define an optional function > <literal>indexinsertcleanup</literal>, called before the memory is released." > > The memory will be freed automatically (as soon as ii_Context goes away). So, > why would we explicitly free it, like in the attached 0002 patch? And the whole > point of the cleanup function is to do something other than freeing memory, as > the docs note highlights so well. > Not sure I follow. The whole reason for having the index_insert_cleanup callback is we can't rely on ii_Context going away, exactly because that just throws away the memory and we need to release buffers etc. The only reason why the 0002 patch does pfree() is that it clearly indicates whether the cache is initialized. We could have a third state "allocated but not initialized", but that doesn't seem worth it. If you're saying the docs are misleading in some way, then maybe we need to clarify that. > Also, the 0002 patch does introduce per-tuple function call overhead in > heapam_index_validate_scan(). > How come? The cleanup is done only once after the scan completes. Isn't that exactly what we want to do? > Also, now we have split brain...in certain situations we want to call > index_insert_cleanup() per index insert and in certain cases we would like it > called once for all inserts. Not very easy to understand IMO. > > Finally, I don't think that any index AM would have anything to clean up after > every insert. > But I didn't suggest to call the cleanup after every index insert. If a function does a bunch of index_insert calls, it'd do the cleanup only once. The point is that it'd happen "close" to the inserts, when we know it needs to be done. Yes, it might happen multiple times for the same query, but that still likely saves quite a bit of work (compared to per-insert init+cleanup). We need to call the cleanup at some point, and the only alternative I can think of is to call it in every place that calls BuildIndexInfo (unless we can guarantee the place can't do index_insert). > I had tried (abused) an approach with MemoryContextCallbacks upthread and that > seems a no-go. And yes I agree, having a dual to makeIndexInfo() (like > destroyIndexInfo()) means that we lose the benefits of ii_Context. That could > be disruptive to existing AMs in-core and outside. > What do you mean by "dual makeIndexInfo"? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi All, not sure how to "Specify thread msgid" - I chose one which I think is close to my new feature request.

Query:

SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t1.a_indexed_col='some_value' OR t2.a_indexed_col='some_vable';

Can the server automatically replace the OR logic above with UNION please? i.e. replace it with:

(SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t1.a_indexed_col='some_value' )
UNION
(SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t2.a_indexed_col='some_vable');

Thanks
On 2023-Dec-21, James Wang wrote: > Hi All, not sure how to "Specify thread msgid" - choose one which i think is close to my new feature request. Hello James, based on the "Specify thread msgid" message it looks like you were trying to request a feature using the Commitfest website? That won't work; the commitfest website is only intended as a tracker of in-progress development work. Without a Postgres code patch, that website doesn't help you any. What you have done amounts to hijacking an unrelated mailing list thread, which is discouraged and frowned upon. That said, sadly we don't have any official feature request system, Please start a new thread by composing an entirely new message to pgsql-hackers@lists.postgresql.org, and don't use the commitfest website for it. That said, > query: > > SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t1.a_indexed_col='some_value' OR t2.a_indexed_col='some_vable'; > > can the server automatically replace the OR logic above with UNION please? i.e. replace it with: > > (SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t1.a_indexed_col='some_value' ) > UNION > (SELECT count(1) FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id WHERE t2.a_indexed_col='some_vable'); I have the feeling that this has already been discussed, but I can't find anything useful in the mailing list archives. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "Saca el libro que tu religión considere como el indicado para encontrar la oración que traiga paz a tu alma. Luego rebootea el computador y ve si funciona" (Carlos Duclós)
On 2023-Dec-12, Tomas Vondra wrote: > I propose we do a much simpler thing instead - allow the cache may be > initialized / cleaned up repeatedly, and make sure it gets reset at > convenient place (typically after index_insert calls that don't go > through execIndexing). That'd mean the cleanup does not need to happen > very far from the index_insert(), which makes the reasoning much easier. > 0002 does this. I'm not in love with this 0002 patch; I think the layering after 0001 is correct in that the insert_cleanup call should remain in validate_index and called after the whole thing is done, but 0002 changes things so that now every table AM has to worry about doing this correctly; and a bug of omission will not be detected unless you have a BRIN index on such a table and happen to use CREATE INDEX CONCURRENTLY. So a developer has essentially zero chance to do things correctly, which I think we'd rather avoid. So I think we should do 0001 and perhaps some further tweaks to the original brininsert optimization commit: I think the aminsertcleanup callback should receive the indexRelation as first argument; and also I think it's not index_insert_cleanup() job to worry about whether ii_AmCache is NULL or not, but instead the callback should be invoked always, and then it's aminsertcleanup job to do nothing if ii_AmCache is NULL. That way, the index AM API doesn't have to worry about which parts of IndexInfo (or the indexRelation) is aminsertcleanup going to care about. If we decide to change this, then the docs also need a bit of tweaking I think. Lastly, I kinda disagree with the notion that only some of the callers of aminsert should call aminsertcleanup, even though btree doesn't have an aminsertcleanup and thus it can't affect TOAST or catalogs. Maybe we can turn index_insert_cleanup into an inline function, which can quickly do nothing if aminsertcleanup isn't defined. Then we no longer have the layering violation where we assume that btree doesn't care. But the proposed change in this paragraph can be maybe handled separately to avoid confusing things with the bugfix in the two paragraphs above. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ Essentially, you're proposing Kevlar shoes as a solution for the problem that you want to walk around carrying a loaded gun aimed at your foot. (Tom Lane)
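Put concretely, the tweaks suggested above would make the pair look roughly like this sketch (signatures are assumptions): the wrapper always invokes the AM callback while the index is open, and the AM itself tolerates a never-populated cache, so an UPDATE or COPY that inserted nothing is harmless.

void
index_insert_cleanup(Relation indexRelation, IndexInfo *indexInfo)
{
    Assert(indexInfo != NULL);

    if (indexRelation->rd_indam->aminsertcleanup)
        indexRelation->rd_indam->aminsertcleanup(indexRelation, indexInfo);
}

void
brininsertcleanup(Relation index, IndexInfo *indexInfo)
{
    BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;

    /* nothing to do if no tuple ever reached brininsert() */
    if (bistate == NULL)
        return;

    brinRevmapTerminate(bistate->bis_rmAccess);
    bistate->bis_rmAccess = NULL;
}

With this split, the generic code no longer needs to know which parts of IndexInfo a particular AM cares about.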
On 2024-Jan-08, Alvaro Herrera wrote: > So I think we should do 0001 and perhaps some further tweaks to the > original brininsert optimization commit: [...] So I propose the attached patch, which should fix the reported bug and the things I mentioned above, and also the typos Alexander mentioned elsewhere in the thread. > Lastly, I kinda disagree with the notion that only some of the callers > of aminsert should call aminsertcleanup, even though btree doesn't have > an aminsertcleanup and thus it can't affect TOAST or catalogs. [...] I didn't do anything about this. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Attachments
On 1/8/24 16:51, Alvaro Herrera wrote: > On 2023-Dec-12, Tomas Vondra wrote: > >> I propose we do a much simpler thing instead - allow the cache to be >> initialized / cleaned up repeatedly, and make sure it gets reset at a >> convenient place (typically after index_insert calls that don't go >> through execIndexing). That'd mean the cleanup does not need to happen >> very far from the index_insert(), which makes the reasoning much easier. >> 0002 does this. > > I'm not in love with this 0002 patch; I think the layering after 0001 is > correct in that the insert_cleanup call should remain in validate_index > and be called after the whole thing is done, but 0002 changes things so > that now every table AM has to worry about doing this correctly; and a > bug of omission will not be detected unless you have a BRIN index on > such a table and happen to use CREATE INDEX CONCURRENTLY. So a > developer has essentially zero chance to do things correctly, which I > think we'd rather avoid. > True. If the AM code does not need to worry about this kind of stuff, that would be good / less error prone. One thing that is not very clear to me: I don't think there's a very good way to determine which places need the cleanup call. Because it depends on (a) what kind of index is used and (b) what happens in the code called earlier (which may easily do arbitrary stuff). Which means we have to call the cleanup whenever the code *might* have done inserts into the index. Maybe it's not such an issue in practice, though. > So I think we should do 0001 and perhaps some further tweaks to the > original brininsert optimization commit: I think the aminsertcleanup > callback should receive the indexRelation as first argument; and also I > think it's not index_insert_cleanup()'s job to worry about whether > ii_AmCache is NULL or not, but instead the callback should be invoked > always, and then it's aminsertcleanup's job to do nothing if ii_AmCache is > NULL. That way, the index AM API doesn't have to worry about which > parts of IndexInfo (or the indexRelation) aminsertcleanup is going to > care about. If we decide to change this, then the docs also need a bit > of tweaking I think. > Yeah, passing the indexRelation to the am callback seems reasonable. It's more consistent with what we do for other callbacks, and perhaps the callback might need the indexRelation. I don't quite see why we should invoke the callback with ii_AmCache=NULL though. If there's nothing cached, why bother? It just means all cleanup callbacks have to do this NULL check on their own. > Lastly, I kinda disagree with the notion that only some of the callers > of aminsert should call aminsertcleanup, even though btree doesn't have > an aminsertcleanup and thus it can't affect TOAST or catalogs. Maybe we > can turn index_insert_cleanup into an inline function, which can quickly > do nothing if aminsertcleanup isn't defined. Then we no longer have the > layering violation where we assume that btree doesn't care. But the > proposed change in this paragraph can maybe be handled separately to > avoid confusing things with the bugfix in the two paragraphs above. > After thinking about this a bit more I agree with you - we should call the cleanup from each place calling aminsert, even if it's for nbtree (or other index types that don't require cleanup at the moment). I wonder if there's a nice way to check this in assert-enabled builds? Could we tweak nbtree (or index AM in general) to check that all places that called aminsert also called aminsertcleanup? 
For example, I was thinking we might add a flag to IndexInfo (separate from the ii_AmCache), tracking if aminsert() was called, and then later check that aminsertcleanup() got called too. The problem, however, is that there's no obviously convenient place for this check, because IndexInfo is not freed explicitly ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
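To make that concrete, here is a sketch of what such an assert-only tracking flag could look like; the field name ii_AmInsertCalled and where the assignments go are invented for this illustration and were not part of any posted patch.

/* hypothetical addition to IndexInfo, compiled only with asserts enabled */
#ifdef USE_ASSERT_CHECKING
    bool    ii_AmInsertCalled;  /* invented: set by index_insert(),
                                 * cleared by index_insert_cleanup() */
#endif

/* index_insert() would record that an insert happened ... */
#ifdef USE_ASSERT_CHECKING
    indexInfo->ii_AmInsertCalled = true;
#endif

/* ... and index_insert_cleanup() would clear the flag again */
#ifdef USE_ASSERT_CHECKING
    indexInfo->ii_AmInsertCalled = false;
#endif

The missing piece is exactly the problem noted above: IndexInfo is not released through any single choke point, so there is no obvious spot for an Assert(!indexInfo->ii_AmInsertCalled) that would catch a forgotten cleanup call.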
On 2024-Feb-13, Tomas Vondra wrote: > One thing that is not very clear to me: I don't think there's a > very good way to determine which places need the cleanup call. Because > it depends on (a) what kind of index is used and (b) what happens in the > code called earlier (which may easily do arbitrary stuff). Which means > we have to call the cleanup whenever the code *might* have done inserts > into the index. Maybe it's not such an issue in practice, though. I think it's not an issue, or rather that we should not try to guess. Instead make it a simple rule: if aminsert is called, then aminsertcleanup must be called afterwards, period. > On 1/8/24 16:51, Alvaro Herrera wrote: > > So I think we should do 0001 and perhaps some further tweaks to the > > original brininsert optimization commit: I think the aminsertcleanup > > callback should receive the indexRelation as first argument; and also I > > think it's not index_insert_cleanup()'s job to worry about whether > > ii_AmCache is NULL or not, but instead the callback should be invoked > > always, and then it's aminsertcleanup's job to do nothing if ii_AmCache is > > NULL. [...] > I don't quite see why we should invoke the callback with ii_AmCache=NULL > though. If there's nothing cached, why bother? It just means all cleanup > callbacks have to do this NULL check on their own. Guessing that aminsertcleanup is not needed when ii_AmCache is NULL seems like a leaky abstraction. I propose to have only the AM know whether the cleanup call is important or not, without index_insert_cleanup assuming that it's related to ii_AmCache. Somebody could decide to have something completely different during insert cleanup, something that is not in ii_AmCache. > I wonder if there's a nice way to check this in assert-enabled builds? > Could we tweak nbtree (or index AM in general) to check that all places > that called aminsert also called aminsertcleanup? > > For example, I was thinking we might add a flag to IndexInfo (separate > from the ii_AmCache), tracking if aminsert() was called, and then later > check that aminsertcleanup() got called too. The problem, however, is that > there's no obviously convenient place for this check, because IndexInfo > is not freed explicitly ... I agree it would be nice to have a way to verify, but it doesn't seem 100% essential. After all, it's not very common to add new calls to aminsert. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
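As a hypothetical example of that last point: an index AM could keep insert-time state somewhere other than ii_AmCache, in which case skipping the callback whenever ii_AmCache is NULL would silently leak that state. Everything named "foo" below is invented purely to illustrate the abstraction argument.

/*
 * Invented AM: it pins a per-backend slot on the first insert and never
 * touches ii_AmCache.  Only its own cleanup callback knows that the slot
 * must be released, so index_insert_cleanup() should not second-guess it
 * based on ii_AmCache being NULL.
 */
void
fooinsertcleanup(Relation index, IndexInfo *indexInfo)
{
    foo_release_insert_slot(index);     /* invented helper */

    if (indexInfo->ii_AmCache != NULL)
        pfree(indexInfo->ii_AmCache);
}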
On Thu, Feb 29, 2024 at 01:20:39PM +0100, Alvaro Herrera wrote: > I think it's not an issue, or rather that we should not try to guess. > Instead make it a simple rule: if aminsert is called, then > aminsertcleanup must be called afterwards, period. > > I agree it would be nice to have a way to verify, but it doesn't seem > 100% essential. After all, it's not very common to add new calls to > aminsert. This thread is listed as an open item. What's the follow-up plan? The last email of this thread is dated as of the 29th of February, which was 6 weeks ago. -- Michael
Attachments
On 4/18/24 09:07, Michael Paquier wrote: > On Thu, Feb 29, 2024 at 01:20:39PM +0100, Alvaro Herrera wrote: >> I think it's not an issue, or rather that we should not try to guess. >> Instead make it a simple rule: if aminsert is called, then >> aminsertcleanup must be called afterwards, period. >> >> I agree it would be nice to have a way to verify, but it doesn't seem >> 100% essential. After all, it's not very common to add new calls to >> aminsert. > > This thread is listed as an open item. What's the follow-up plan? > The last email of this thread is dated as of the 29th of February, > which was 6 weeks ago. Apologies, I got distracted by the other patches. The bug is still there; I believe the patch shared by Alvaro in [1] is the right way to deal with it. I'll take care of that today/tomorrow. [1] https://www.postgresql.org/message-id/202401091043.e3nrqiad6gb7@alvherre.pgsql regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, Here are two patches to deal with this open item. 0001 is a trivial fix of typos and wording; I moved it into a separate commit for clarity. 0002 does the actual fix - adding the missing index_insert_cleanup() call. It's 99% the patch Alvaro shared some time ago, with only some minor formatting tweaks by me. I've also returned to this comment of Alvaro's: > Lastly, I kinda disagree with the notion that only some of the callers > of aminsert should call aminsertcleanup, even though btree doesn't > have an aminsertcleanup and thus it can't affect TOAST or catalogs. which was a reaction to my earlier statement about places calling index_insert(): > There's a call in toast_internals.c, but that seems OK because that > only deals with btree indexes (and those don't need any cleanup). The > same logic applies to unique_key_recheck(). The rest goes through > execIndexing.c, which should do the cleanup in ExecCloseIndices(). I think Alvaro is right, so I went through all index_insert() callers and checked which need the cleanup. Luckily there are not that many of them; only 5 in total call index_insert() directly: 1) toast_save_datum (src/backend/access/common/toast_internals.c) This is safe, because the index_insert() passes indexInfo=NULL, so there can't possibly be any cache. If we ever decide to pass a valid indexInfo, we can add the cleanup; for now it seems pointless. Note: If someone created a BRIN index on a TOAST table, that'd already crash, because BRIN blindly dereferences the indexInfo. Maybe that should be fixed, but we don't support CREATE INDEX on TOAST tables. 2) heapam_index_validate_scan (src/backend/access/heap/heapam_handler.c) Covered by the committed fix, adding cleanup to validate_index. 3) CatalogIndexInsert (src/backend/catalog/indexing.c) Covered because all callers also call CatalogCloseIndexes, which in turn calls ExecCloseIndices and the cleanup. 4) unique_key_recheck (src/backend/commands/constraint.c) This seems like the only place missing the cleanup call. 5) ExecInsertIndexTuples (src/backend/executor/execIndexing.c) Should be covered by ExecCloseIndices, called after the insertions. So it seems only (4) unique_key_recheck needs the extra call (it can't really happen higher up because the indexInfo is a local variable). So the 0002 patch adds the call. The patch also adds a test for this (or rather tweaks an existing one). It's a bit too late for me to push this now; I'll do so early tomorrow. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
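To picture the fix for (4): 0002 essentially pairs the index_insert() in unique_key_recheck() with an immediate cleanup call, along these lines (a sketch with approximate variable names, not the literal patch):

    /* in unique_key_recheck(), re-verify the constraint via the index AM */
    index_insert(indexRel, values, isnull, &checktid,
                 trigdata->tg_relation, UNIQUE_CHECK_EXISTING,
                 false, indexInfo);

    /*
     * indexInfo is a local variable here, so no caller higher up could
     * release whatever the AM may have cached in it -- clean up right away.
     */
    index_insert_cleanup(indexRel, indexInfo);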
On 4/19/24 00:13, Tomas Vondra wrote: > ... > > It's a bit too late for me to push this now; I'll do so early tomorrow. > FWIW I've pushed both patches, which resolves the open item, so I've moved it to the "resolved" part on wiki. thanks -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Apr 19, 2024 at 04:14:00PM +0200, Tomas Vondra wrote: > FWIW I've pushed both patches, which resolves the open item, so I've > moved it to the "resolved" part on wiki. Thanks, Tomas! -- Michael