Re: Shared detoast Datum proposal
From: Andy Fan
Subject: Re: Shared detoast Datum proposal
Msg-id: 87jzms0xza.fsf@163.com
In reply to: Re: Shared detoast Datum proposal (Andy Fan <zhihuifan1213@163.com>)
Responses: Re: Shared detoast Datum proposal (Nikita Malakhov <hukutoc@gmail.com>)
List: pgsql-hackers
Hi,

>> Good idea. Either that (a separate README), or a comment in a header of
>> some suitable .c/.h file (I prefer that, because that's kinda obvious
>> when reading the code; I often don't notice a README exists next to it).
>
> Great, I'll try this starting tomorrow.

I have done it. I chose a README because this feature changes
createplan.c, setrefs.c, execExpr.c and execExprInterp.c, so putting the
high-level design into any single one of them looked inappropriate. So
the high-level design is here, and the detailed design for each step is
in the comments around the code. Hope this is helpful!

The problem:
------------

In the current expression engine, a toasted datum is detoasted when
required, but the result is discarded immediately, either by pfree-ing
it right away or by leaving it for ResetExprContext; which of the two to
use is occasionally debated. The more serious problem is that detoasting
is expensive, especially for data types like jsonb or arrays, whose
values can be very large. In the example below, the detoasting happens
twice:

  SELECT jb_col->'a', jb_col->'b' FROM t;

With shared-detoast-datum, we only need to detoast once per tuple, and
the result is discarded as soon as the tuple is no longer needed. FWIW,
this issue may exist for small numeric and text values as well, because
of the short-header TOAST format, where the varlena length header uses
1 byte rather than 4 bytes.

Current Design
--------------

The high-level design is to let createplan.c and setrefs.c decide which
Vars can use this feature, and then have the executor save the detoasted
datum back into slot->tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons include:

- The existing expression engine reads datums from tts_values[*], so no
  extra work needs to be done there.

- We reuse the lifespan of the TupleTableSlot machinery to manage
  memory. It is natural to think of the detoasted datum as a tts_value
  that just happens to be in detoasted form. Since TupleTableSlot
  already has a clear lifespan, with operations like ExecClearTuple and
  ExecCopySlot, it is easy to reuse them for managing the detoasted
  datum's memory.

- The existing projection method copies the detoasted datum (a
  pointer-sized Datum) to the next node's slot automatically, but keeps
  the ownership unchanged, so only the slot where the detoasting really
  happened is in charge of its lifespan.

So the final data-structure change is adding the field below to
TupleTableSlot:

typedef struct TupleTableSlot
{
	..

	/*
	 * The attributes whose values in tts_values[*] are the detoasted
	 * versions; that memory needs some extra cleanup.  It can't be put
	 * into ecxt_per_tuple_memory, since much of it needs a longer
	 * lifespan.
	 *
	 * This memory is put into TupleTableSlot.tts_mcxt and is cleared
	 * whenever tts_values[*] is invalidated.
	 */
	Bitmapset  *pre_detoast_attrs;
};

Assuming the decision about which Vars should use this feature has
already been made in createplan.c and setrefs.c, the 3 new ExprEvalSteps
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST are used. While evaluating these
steps, the code below runs:

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
	if (!slot->tts_isnull[attnum] &&
		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
	{
		Datum		oldDatum;
		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

		oldDatum = slot->tts_values[attnum];
		slot->tts_values[attnum] = PointerGetDatum(
			detoast_attr((struct varlena *) oldDatum));
		Assert(slot->tts_nvalid > attnum);
		Assert(oldDatum != slot->tts_values[attnum]);
		slot->pre_detoast_attrs = bms_add_member(slot->pre_detoast_attrs,
												 attnum);
		MemoryContextSwitchTo(old);
	}
}

Since I don't want an extra run-time check to see whether a detoast
should happen, I introduced the 3 new steps.

When should the detoasted datum be freed?
It depends on when the slot's tts_values[*] is invalidated.
ExecClearTuple is the obvious place, but any TupleTableSlotOps callback
that sets tts_nvalid = 0 also tells us that no one will use the datums
in tts_values[*] any more, so that is the time to release them as well,
based on slot->pre_detoast_attrs.

Now we come to the createplan.c/setrefs.c part, which decides which Vars
should use the shared-detoast feature. The guidelines are:

1. A detoast must actually be needed for some expression.

2. It should not break the CP_SMALL_TLIST design. Since we save the
   detoasted datum back into tts_values[*], the tuple becomes bigger;
   if we did this blindly, it would be harmful to the SORT/HASH style
   nodes.

The high-level data flow is:

1. In createplan.c, we walk the plan tree to gather the CP_SMALL_TLIST
   information caused by SORT/HASH style nodes, and save it to
   Plan.forbid_pre_detoast_vars via the function
   set_plan_forbid_pre_detoast_vars_recurse.

2. In setrefs.c, fix_{scan|join}_expr recurses down to the Vars of each
   expression, so that is a good time to track the attribute numbers and
   see whether each Var is accessed directly or indirectly. Indirect
   access of a Var usually means a detoast will happen, for example in
   an expression like "a > 3". However, some known expressions, such as
   "Var IS NULL", are ignored, since they never detoast. The output is
   {Scan|Join}.xxx_reference_attrs.

As a result, the following is added to the Scan and Join plan nodes:

typedef struct Scan
{
	/*
	 * Records (varattno - 1) for each Var that is accessed indirectly
	 * by any expression, like "a > 3".  However, "a IS [NOT] NULL" is
	 * not included, since it doesn't access tts_values[*] at all.
	 *
	 * This is essential information for figuring out which attrs should
	 * use the pre-detoast-attrs logic.
	 */
	Bitmapset  *reference_attrs;
} Scan;

typedef struct Join
{
	..

	/*
	 * Same as Scan.reference_attrs, but tracked separately for the
	 * outer and inner side.
	 */
	Bitmapset  *outer_reference_attrs;
	Bitmapset  *inner_reference_attrs;
} Join;

Note that I used '_reference_' rather than '_detoast_' here because at
this point we still don't know whether the attribute is toastable; that
is only known at the MakeTupleTableSlot stage.

3. At the InitPlan stage, we compute the final xxx_pre_detoast_attrs in
   ScanState & JoinState. These are passed into the expression engine at
   the ExecInitExprRec stage, the EEOP_{INNER|OUTER|SCAN}_VAR_TOAST
   steps are generated, and finally everything is connected through
   ExecSlotDetoastDatum!

Testing
-------

Case 1:

  create table t (a numeric);
  insert into t select i from generate_series(1, 100000)i;

  cat 1.sql
  select * from t where a > 0;

In this test, current master runs detoast twice for each datum: once in
numeric_gt and once in numeric_out. This feature makes it detoast only
once.

  pgbench -f 1.sql -n postgres -T 10 -M prepared

  master:             30.218 ms
  patched(Bitmapset): 30.881 ms

Then we can see the perf report as below:

  - 7.34% 0.00% postgres postgres [.] ExecSlotDetoastDatum (inlined)
     - ExecSlotDetoastDatum (inlined)
        - 3.47% bms_add_member
           - 3.06% bms_make_singleton (inlined)
              - palloc0
                   1.30% AllocSetAlloc
  - 5.99% 0.00% postgres postgres [.] ExecFreePreDetoastDatum (inlined)
     - ExecFreePreDetoastDatum (inlined)
          2.64% bms_next_member
          1.17% bms_del_members
          0.94% AllocSetFree

One of the reasons is that a Bitmapset deallocates its memory when all
its bits are deleted, due to commit 00b41463c, so we have to allocate
memory again the next time a member is added. This happens 100000 times
in the above workload.

So I introduced a bitset data structure (commit 0002), which is much
like Bitmapset but does not deallocate its memory when all the bits are
unset, and used it in this feature (commit 0003). The result then became
28.715 ms:

  - 5.22% 0.00% postgres postgres [.] ExecFreePreDetoastDatum (inlined)
     - ExecFreePreDetoastDatum (inlined)
        - 2.82% bitset_next_member
             1.69% bms_next_member_internal (inlined)
             0.95% bitset_next_member
          0.66% AllocSetFree

Here we can see that the expensive calls are bitset_next_member on
slot->pre_detoast_attrs and pfree. If we put the detoasted datums into a
dedicated memory context, we can save the cost of bitset_next_member,
since we can discard all the memory at once with MemoryContextReset
instead of freeing chunk by chunk with AllocSetFree (commit 0004). The
result then became 27.840 ms.

So the final results for case 1:

  master:                           30.218 ms
  patched(Bitmapset):               30.881 ms
  patched(bitset):                  28.715 ms
  patched(bitset + tts_value_mctx): 27.840 ms

Big jsonb test:

  create table b(big jsonb);
  insert into b
  select jsonb_object_agg(x::text,
                          random()::text || random()::text || random()::text)
  from generate_series(1, 600) f(x);
  insert into b select (select big from b) from generate_series(1, 1000)i;

  explain analyze
  select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

  master:  702.224 ms
  patched: 133.306 ms

Memory usage test:

I ran the TPC-H scale 10 workload against both master and patched
versions; the memory usage looks stable.

In-progress work:

I'm still running TPC-H at scale 100 to see whether anything interesting
shows up; that is in progress. As for scale 10:

  master:  55 s
  patched: 56 s

The difference is that the q9 plan changed a bit; finding the real
reason needs some more time. Since this patch has no impact on best-path
generation, the plan change should not be caused by it.

-- 
Best Regards
Andy Fan