Re: Shared detoast Datum proposal

From: Andy Fan
Subject: Re: Shared detoast Datum proposal
Date:
Msg-id: 87jzms0xza.fsf@163.com
In reply to: Re: Shared detoast Datum proposal  (Andy Fan <zhihuifan1213@163.com>)
Responses: Re: Shared detoast Datum proposal  (Nikita Malakhov <hukutoc@gmail.com>)
List: pgsql-hackers

Hi,

>>
>> Good idea. Either that (a separate README), or a comment in a header of
>> some suitable .c/.h file (I prefer that, because that's kinda obvious
>> when reading the code, I often not notice a README exists next to it).
>
> Great, I'd try this from tomorrow. 

I have done that.  I chose a README because this feature changes
createplan.c, setrefs.c, execExpr.c and execExprInterp.c, so putting the
high-level design into any one of them looked inappropriate.  So the
high-level design is here, and the detailed design for each step is in the
comments around the code.  Hope this is helpful!

The problem:
-------------

In the current expression engine, a toasted datum is detoasted when
required, but the result is discarded right away, either by pfree'ing it
immediately or by leaving it to ResetExprContext.  There are occasional
arguments about which of the two to use.  The more serious problem is that
detoasting is expensive, especially for data types like jsonb or arrays,
whose values can be very large.  In the example below, the detoasting
happens twice.

SELECT jb_col->'a', jb_col->'b' FROM t;

With shared-detoast-datum, we only need to detoast once per tuple, and the
result is discarded as soon as the tuple is no longer needed.  FWIW this
issue may exist for small numeric and text values as well, because of the
short-varlena format where the length header uses 1 byte rather than
4 bytes.

Current Design
--------------

The high-level design is to let createplan.c and setrefs.c decide which
Vars can use this feature, and then have the executor save the detoasted
datum back into slot->tts_values[*] during the new ExprEvalSteps
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST.  The reasons include:

- The existing expression engine already reads datums from tts_values[*],
  so no extra work is needed there.
- We reuse the TupleTableSlot lifespan machinery to manage the memory.  It
  is natural to think of the detoasted datum as just another tts_value that
  happens to be in detoasted form.  Since the TupleTableSlot already has a
  clear lifecycle (ExecClearTuple, ExecCopySlot, etc.), it is easy to reuse
  it for managing the detoasted datum's memory.
- The existing projection code automatically copies the detoasted datum (an
  8-byte Datum, i.e. just the pointer) to the next node's slot while
  keeping ownership unchanged, so only the slot where the detoast actually
  happened is in charge of its lifespan.

So the data structure change is just adding the field below to TupleTableSlot.

typedef struct TupleTableSlot
{
..

    /*
     * The attributes whose values in tts_values[*] are pre-detoasted
     * versions; that memory needs some extra clean-up.  It cannot go into
     * ecxt_per_tuple_memory since many of these values need a longer
     * lifespan.
     *
     * The memory is allocated in TupleTableSlot.tts_mcxt and is released
     * whenever tts_values[*] is invalidated.
     */
    Bitmapset  *pre_detoast_attrs;
};

Assume the decision of which Vars should use this feature has already been
made in createplan.c and setrefs.c.  The 3 new ExprEvalSteps
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST are then used.  When evaluating these
steps, the code below runs.

/*
 * Detoast the given attribute of the slot in place: replace the toasted
 * value in tts_values[attnum] with its detoasted version (allocated in
 * tts_mcxt) and remember it in pre_detoast_attrs so that the memory can be
 * released once tts_values[*] is invalidated.
 */
static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
    if (!slot->tts_isnull[attnum] &&
        VARATT_IS_EXTENDED(slot->tts_values[attnum]))
    {
        Datum        oldDatum;
        MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

        oldDatum = slot->tts_values[attnum];
        slot->tts_values[attnum] =
            PointerGetDatum(detoast_attr((struct varlena *) oldDatum));
        Assert(slot->tts_nvalid > attnum);
        Assert(oldDatum != slot->tts_values[attnum]);
        slot->pre_detoast_attrs = bms_add_member(slot->pre_detoast_attrs, attnum);
        MemoryContextSwitchTo(old);
    }
}

Since I don't want an extra run-time check to decide whether a detoast
should happen, I introduced 3 dedicated steps rather than adding a check to
the existing VAR steps.
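
For illustration, here is roughly what the scan variant could look like in
ExecInterpExpr() in execExprInterp.c.  This is only a sketch, not the exact
patch code; the INNER/OUTER variants would differ only in which slot they
read from.

        EEO_CASE(EEOP_SCAN_VAR_TOAST)
        {
            int         attnum = op->d.var.attnum;

            /* like EEOP_SCAN_VAR, but detoast into the slot first */
            ExecSlotDetoastDatum(scanslot, attnum);
            *op->resvalue = scanslot->tts_values[attnum];
            *op->resnull = scanslot->tts_isnull[attnum];

            EEO_NEXT();
        }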

When should the detoasted datum be freed?  It depends on when the slot's
tts_values[*] are invalidated.  ExecClearTuple is the obvious place, but
any TupleTableSlotOps callback that sets tts_nvalid = 0 tells us that no
one will use the datums in tts_values[*] any more, so that is also the time
to release them, based on slot->pre_detoast_attrs.
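
To make that concrete, the release step could look roughly like the sketch
below.  This is not the exact patch code, just an illustration consistent
with the perf profile shown later, where ExecFreePreDetoastDatum shows up
calling bms_next_member and pfree.

static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
    int         attnum = -1;

    /* free every datum we detoasted into tts_values[*] ... */
    while ((attnum = bms_next_member(slot->pre_detoast_attrs, attnum)) >= 0)
        pfree(DatumGetPointer(slot->tts_values[attnum]));

    /* ... and forget about them */
    bms_free(slot->pre_detoast_attrs);
    slot->pre_detoast_attrs = NULL;
}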

Now for the createplan.c/setrefs.c part, which decides which Vars should
use the shared detoast feature.  The guidelines are:

1. A detoast must actually be needed for some expression using the Var.
2. It must not break the CP_SMALL_TLIST design.  Since we save the
   detoasted datum back into tts_values[*], the tuple becomes bigger; doing
   this blindly would be harmful for SORT / HASH style nodes.

The high-level data flow is:

1. In createplan.c, we walk the plan tree to gather the Vars that must stay
   small because of CP_SMALL_TLIST (i.e. SORT/HASH style nodes), and save
   that information into Plan.forbid_pre_detoast_vars via the function
   set_plan_forbid_pre_detoast_vars_recurse.

2. In setrefs.c, fix_{scan|join}_expr recurses down to each Var in every
   expression, so that is a good place to track the attribute numbers and
   see whether a Var is accessed directly or indirectly.  Indirect access
   to a Var usually means a detoast will happen, for example in an
   expression like a > 3.  However, some known expressions such as
   "Var IS NULL" are ignored, since they never look at the value (see the
   sketch after this list).  The output is {Scan|Join}.xxx_reference_attrs.
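
To illustrate the rule in step 2, the check could look roughly like the
sketch below; the function name and the parent-node parameter are purely
illustrative and not the actual patch code:

/*
 * Sketch only: does this expression really access the Var's value (and so
 * may trigger a detoast)?
 */
static bool
var_value_is_accessed(Node *parent_expr, Var *var)
{
    /* a bare Var in a targetlist is just copied, not evaluated here */
    if (parent_expr == NULL)
        return false;

    /* IS [NOT] NULL only looks at the null flag, never at the value */
    if (IsA(parent_expr, NullTest))
        return false;

    /* anything else, e.g. "a > 3", will fetch and possibly detoast it */
    return true;
}

When it returns true, the attribute can be recorded with something like
reference_attrs = bms_add_member(reference_attrs, var->varattno - 1).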

As a result, the collected information is added to the Scan and Join plan
nodes.

typedef struct Scan
{
    ..
    /*
     * Records the (varattno - 1) of every Var that is accessed indirectly
     * by some expression, like a > 3.  However, IS [NOT] NULL is not
     * counted, since it doesn't access tts_values[*] at all.
     *
     * This is essential information for figuring out which attrs should
     * use the pre-detoast-attrs logic.
     */
    Bitmapset  *reference_attrs;
} Scan;

typedef struct Join
{
    ..
    /*
     * Same as Scan.reference_attrs, but tracked separately for the outer
     * and inner side of the join.
     */
    Bitmapset  *outer_reference_attrs;
    Bitmapset  *inner_reference_attrs;
} Join;


Note that I used '_reference_' rather than '_detoast_' here because at this
point we still don't know whether the attribute is toastable at all; that
only becomes known at the MakeTupleTableSlot stage.

3. At the InitPlan stage, we compute the final xxx_pre_detoast_attrs in
   ScanState & JoinState.  These are passed into the expression engine at
   the ExecInitExprRec stage, the EEOP_{INNER|OUTER|SCAN}_VAR_TOAST steps
   are generated, and finally everything is connected through
   ExecSlotDetoastDatum!  A sketch of the InitPlan-time computation is
   shown below.
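
For illustration (the helper name and its exact placement are assumptions,
not taken from the patch), the InitPlan-time computation is essentially an
intersection of the planner's reference_attrs with the attributes that are
actually toastable according to the slot's tuple descriptor:

/*
 * Sketch only: keep just those referenced attributes that can actually be
 * toasted, i.e. varlena attributes whose storage is not PLAIN.
 */
static Bitmapset *
build_pre_detoast_attrs(TupleDesc tupdesc, Bitmapset *reference_attrs)
{
    Bitmapset  *result = NULL;
    int         attno = -1;

    while ((attno = bms_next_member(reference_attrs, attno)) >= 0)
    {
        Form_pg_attribute attr = TupleDescAttr(tupdesc, attno);

        if (attr->attlen == -1 && attr->attstorage != TYPSTORAGE_PLAIN)
            result = bms_add_member(result, attno);
    }

    return result;
}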


Testing
-------

Case 1:

create table t (a numeric);
insert into t select i from generate_series(1, 100000)i;

cat 1.sql

select * from t where a > 0;

In this test, current master runs detoast twice for each datum: once in
numeric_gt and once in numeric_out.  With this feature the detoast happens
only once.

pgbench -f 1.sql -n postgres -T 10 -M prepared

master: 30.218 ms
patched (Bitmapset): 30.881 ms


The perf report then looks like this:

-    7.34%     0.00%  postgres  postgres           [.] ExecSlotDetoastDatum (inlined)
   - ExecSlotDetoastDatum (inlined)
      - 3.47% bms_add_member
       - 3.06% bms_make_singleton (inlined)
        - palloc0
         1.30% AllocSetAlloc

-    5.99%     0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
    2.64% bms_next_member
    1.17% bms_del_members
    0.94% AllocSetFree

One of the reasons is that, since commit 00b41463c, Bitmapset deallocates
its memory as soon as all bits have been deleted, so we have to allocate
memory again the next time a member is added.  This alloc/free cycle is
executed 100000 times in the above workload.

So I introduced a bitset data structure (commit 0002), which is much like
Bitmapset but does not deallocate its memory when all bits are cleared, and
used it in this feature (commit 0003).  With that change the result became
28.715 ms; the new perf profile is shown after the interface sketch below.
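
For reference, the bitset interface could look roughly like the sketch
below; only bitset_next_member is confirmed by the perf output, the rest of
the names are assumptions:

/*
 * Sketch only: a fixed-size bitset.  Unlike Bitmapset, clearing all the
 * bits keeps the allocation around, so hot per-tuple loops don't have to
 * palloc/pfree over and over.
 */
typedef struct Bitset
{
    int         nwords;         /* length of the words[] array */
    bitmapword  words[FLEXIBLE_ARRAY_MEMBER];
} Bitset;

extern Bitset *bitset_init(size_t size);
extern void bitset_clear(Bitset *a);
extern void bitset_add_member(Bitset *a, int x);
extern void bitset_del_member(Bitset *a, int x);
extern int  bitset_next_member(const Bitset *a, int prevbit);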

-    5.22%     0.00%  postgres  postgres  [.] ExecFreePreDetoastDatum (inlined)
   - ExecFreePreDetoastDatum (inlined)
      - 2.82% bitset_next_member
       1.69% bms_next_member_internal (inlined)
       0.95% bitset_next_member
    0.66% AllocSetFree

Here we can see that the expensive calls are bitset_next_member on
slot->pre_detoast_attrs and pfree.  If we put the detoasted datums into a
dedicated memory context, we can avoid the cost of bitset_next_member,
since all the memory can be discarded at once with MemoryContextReset
instead of AllocSetFree (commit 0004).  With that, the result became
27.840 ms; see the simplified cleanup sketch below.
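
With that change, the earlier ExecFreePreDetoastDatum sketch would reduce
to roughly the following (tts_value_mctx is the per-slot context named in
the results below; the bitset call is the assumed API from above):

static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
    /* forget which attrs were detoasted ... */
    bitset_clear(slot->pre_detoast_attrs);

    /* ... and drop all of their memory in one go */
    MemoryContextReset(slot->tts_value_mctx);
}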

So the final results for case 1:

master: 30.218 ms
patched (Bitmapset): 30.881 ms
patched (bitset): 28.715 ms
patched (bitset + tts_value_mctx): 27.840 ms


Big jsonb test:

create table b(big jsonb);

insert into b select
jsonb_object_agg(x::text,
random()::text || random()::text || random()::text)
from generate_series(1,600) f(x);

insert into b select (select big from b) from generate_series(1, 1000)i;

explain analyze
select big->'1', big->'2', big->'3', big->'5', big->'10' from b;

master: 702.224 ms
patched: 133.306 ms

Memory usage test:

I ran the TPC-H scale 10 workload against both the master and patched
versions; the memory usage looks stable.

In-progress work:

I'm still running TPC-H scale 100 to see if anything interesting shows up;
that is in progress.  As for scale 10:

master: 55s
patched: 56s

The difference comes from the q9 plan changing a bit; the real reason needs
some more time to track down.  Since this patch doesn't affect best-path
generation, the plan change doesn't look reasonable to me yet.

-- 
Best Regards
Andy Fan

