Обсуждение: Make TID Scans recalculate the TIDs less often

Поиск
Список
Период
Сортировка

Make TID Scans recalculate the TIDs less often

От
David Rowley
Дата:
Over on [1], there was some concern about having to recalculate the
TID Scan's TidList on every recheck call.  This work entails
evaluating all the TID expressions, sorting the resulting list of TIDs
and deduplicating it (so that we don't scan the same TID twice).

As it turns out, it certainly is possible that doing that work could
take up quite a bit of time, and having to do it as often as every
rescan *could* be noticed *if* the list is big enough and the rescans
are frequent enough. The following case demonstrates this:

set max_parallel_Workers_per_gather=0;
set enable_seqscan=0;
set enable_material=0;
set jit=0;
select sum(c) from million m left join lateral (select count(*) c from
empty where ctid in

('(1,1)','(1,2)','(1,3)','(1,4)','(1,5)','(1,6)','(1,7)','(1,8)','(1,9)','(1,10)','(1,11)','(1,12)','(1,13)','(1,14)','(1,15)','(1,16)','(1,17)','(1,18)','(1,19)','(1,20)','(1,21)','(1,22)','(1,23)','(1,24)','(1,25)','(1,26)','(1,27)','(1,28)','(1,29)','(1,30)','(1,31)','(1,32)','(1,33)','(1,34)','(1,35)','(1,36)','(1,37)','(1,38)','(1,39)','(1,40)','(1,41)','(1,42)','(1,43)','(1,44)','(1,45)','(1,46)','(1,47)','(1,48)','(1,49)','(1,50)','(1,51)','(1,52)','(1,53)','(1,54)','(1,55)','(1,56)','(1,57)','(1,58)','(1,59)','(1,60)','(1,61)','(1,62)','(1,63)','(1,64)','(1,65)','(1,66)','(1,67)','(1,68)','(1,69)','(1,70)','(1,71)','(1,72)','(1,73)','(1,74)','(1,75)','(1,76)','(1,77)','(1,78)','(1,79)','(1,80)','(1,81)','(1,82)','(1,83)','(1,84)','(1,85)','(1,86)','(1,87)','(1,88)','(1,89)','(1,90)','(1,91)','(1,92)','(1,93)','(1,94)','(1,95)','(1,96)','(1,97)','(1,98)','(1,99)','(1,100)'))
on 1=1;

master:

Time: 613.541 ms
Time: 621.037 ms
Time: 623.430 ms

patched:

Time: 298.863 ms
Time: 298.015 ms
Time: 297.172 ms

The part I don't know is if it's at all likely that someone would ever
hit this. We've added TID scan, so filtering on ctid must be common
enough to warrant having that code (yes, I know it's required for
WHERE CURRENT OF too), I just don't know how common rescans are in
those queries.

The patch optimises the recalc by changing things so the recalc is
only done when a parameter has changed that's mentioned somewhere in
TID quals. If no such parameter has changed, we use the same list as
we did on the last scan.

Does anyone think this is worth pursuing further?

Patch attached.

David

[1] https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com

Вложения

Re: Make TID Scans recalculate the TIDs less often

От
Andrey Borodin
Дата:

> On 17 Sep 2025, at 09:59, David Rowley <dgrowleyml@gmail.com> wrote:
>
> Does anyone think this is worth pursuing further?

I heard of following use-case: data transferring system partition big tables by ctid ranges to mimic parallel secscan,
butwith many network connections. Some performance improvement was claimed and connection failure resistance (when one
connectionwas broken only one partition is rescanned with same snapshot). 

I'm not entirely sure this is a safe approach (I have a gut feeling that tid scan is not MVCC safe). But for retrieving
verybig tables without strong guarantees this might make some sense. 

Would your patch improve performance of such case?


Best regards, Andrey Borodin.


Re: Make TID Scans recalculate the TIDs less often

От
David Rowley
Дата:
On Wed, 17 Sept 2025 at 18:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I heard of following use-case: data transferring system partition big tables by ctid ranges to mimic parallel
secscan,but with many network connections. Some performance improvement was claimed and connection failure resistance
(whenone connection was broken only one partition is rescanned with same snapshot). 

> Would your patch improve performance of such case?

I suspect they're just running a SELECT * to a single table "WHERE
ctid BETWEEN" some fixed range of TIDs.  If that's the case then this
won't help as there are no rescans, therefore the TID Range is only
calculated once. Also, I imagine TID Range Scans are less affected
than TID Scans simply because TidExprListCreate() is more complex than
TidRangeEval().

What you'd need for this to ever be slow is lots of rescans, so
something like TID Scan on the inside of a nested loop. If you look at
ExecReScanTidScan() you'll see it pfrees the tss_TidList, which
requires that the list gets built all over again on the next call to
TidNext(). It's the call to TidListEval() that is potentially
expensive due to the expression evaluation, memory allocation, sorting
and distinct work.

David



Re: Make TID Scans recalculate the TIDs less often

От
Andrey Borodin
Дата:

> On 17 Sep 2025, at 14:51, David Rowley <dgrowleyml@gmail.com> wrote:
>
> What you'd need for this to ever be slow is lots of rescans, so
> something like TID Scan on the inside of a nested loop. If you look at
> ExecReScanTidScan() you'll see it pfrees the tss_TidList, which
> requires that the list gets built all over again on the next call to
> TidNext(). It's the call to TidListEval() that is potentially
> expensive due to the expression evaluation, memory allocation, sorting
> and distinct work.

Occasionally (when dealing with corruption) I do stuff like

begin;
update public.tablename set description = description where ctid in (select
('('||b.blkno::text||','||(x::text)||')')::tidfrom generate_series(1,300) x, blocks b); 

in some forms they are actually joins. Also, pageinspecting things out is always a join (CTAS a copy of table rows that
haveparticular infomask bits). But, fortunately, it's not that frequent case. It's always "plumbing", not a "regular
databaseusage". 


Best regards, Andrey Borodin.


Re: Make TID Scans recalculate the TIDs less often

От
David Rowley
Дата:
On Wed, 17 Sept 2025 at 22:13, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Occasionally (when dealing with corruption) I do stuff like
>
> begin;
> update public.tablename set description = description where ctid in (select
('('||b.blkno::text||','||(x::text)||')')::tidfrom generate_series(1,300) x, blocks b);
 
>
> in some forms they are actually joins. Also, pageinspecting things out is always a join (CTAS a copy of table rows
thathave particular infomask bits). But, fortunately, it's not that frequent case. It's always "plumbing", not a
"regulardatabase usage".
 

Thanks for sharing that one.  If that UPDATE did do a Nested Loop join
with a TID Scan on the inner side, the optimisation I have in the
patch *wouldn't* be applied as a parameter is changing that genuinely
does need the TidList to be recalculated over again.

David



Re: Make TID Scans recalculate the TIDs less often

От
Chao Li
Дата:

On Sep 17, 2025, at 18:31, David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 17 Sept 2025 at 22:13, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Occasionally (when dealing with corruption) I do stuff like

begin;
update public.tablename set description = description where ctid in (select ('('||b.blkno::text||','||(x::text)||')')::tid from generate_series(1,300) x, blocks b);

in some forms they are actually joins. Also, pageinspecting things out is always a join (CTAS a copy of table rows that have particular infomask bits). But, fortunately, it's not that frequent case. It's always "plumbing", not a "regular database usage".

Thanks for sharing that one.  If that UPDATE did do a Nested Loop join
with a TID Scan on the inner side, the optimisation I have in the
patch *wouldn't* be applied as a parameter is changing that genuinely
does need the TidList to be recalculated over again.

David


I tried to trace this patch again today.

With David’s example with “million" and “empty" tables, TidScan.tidparamids is NULL, so that in ExecRescanTidScan(), bms_overlap() will return false, and TidList is not free-ed.

Same thing for Andrey’s example.

Based on David’s example, I build this SQL:

```
# Insert tuples to empty first
select sum(c) from million m left join lateral (select a c from empty e where e.ctid = m.ctid) on 1=1;
```

With this SQL, outer scan passes a parameter to inter TidScan, so that “tidparamids” is not NULL now. Then I noticed one thing: now it needs to re-evaluate TidList. However, the new TidList always has the same length of the old TidList, so that we don’t need to free the old TidList, instead, we can actually reuse memory of the old TidList, which could be a small optimization.

For that, I created 0002. But TBH, with 0002, and with psql timing on, I don’t see obvious improvement wrt execution time.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Вложения