Обсуждение: Make TID Scans recalculate the TIDs less often
Over on [1], there was some concern about having to recalculate the TID Scan's TidList on every recheck call. This work entails evaluating all the TID expressions, sorting the resulting list of TIDs and deduplicating it (so that we don't scan the same TID twice). As it turns out, it certainly is possible that doing that work could take up quite a bit of time, and having to do it as often as every rescan *could* be noticed *if* the list is big enough and the rescans are frequent enough. The following case demonstrates this: set max_parallel_Workers_per_gather=0; set enable_seqscan=0; set enable_material=0; set jit=0; select sum(c) from million m left join lateral (select count(*) c from empty where ctid in ('(1,1)','(1,2)','(1,3)','(1,4)','(1,5)','(1,6)','(1,7)','(1,8)','(1,9)','(1,10)','(1,11)','(1,12)','(1,13)','(1,14)','(1,15)','(1,16)','(1,17)','(1,18)','(1,19)','(1,20)','(1,21)','(1,22)','(1,23)','(1,24)','(1,25)','(1,26)','(1,27)','(1,28)','(1,29)','(1,30)','(1,31)','(1,32)','(1,33)','(1,34)','(1,35)','(1,36)','(1,37)','(1,38)','(1,39)','(1,40)','(1,41)','(1,42)','(1,43)','(1,44)','(1,45)','(1,46)','(1,47)','(1,48)','(1,49)','(1,50)','(1,51)','(1,52)','(1,53)','(1,54)','(1,55)','(1,56)','(1,57)','(1,58)','(1,59)','(1,60)','(1,61)','(1,62)','(1,63)','(1,64)','(1,65)','(1,66)','(1,67)','(1,68)','(1,69)','(1,70)','(1,71)','(1,72)','(1,73)','(1,74)','(1,75)','(1,76)','(1,77)','(1,78)','(1,79)','(1,80)','(1,81)','(1,82)','(1,83)','(1,84)','(1,85)','(1,86)','(1,87)','(1,88)','(1,89)','(1,90)','(1,91)','(1,92)','(1,93)','(1,94)','(1,95)','(1,96)','(1,97)','(1,98)','(1,99)','(1,100)')) on 1=1; master: Time: 613.541 ms Time: 621.037 ms Time: 623.430 ms patched: Time: 298.863 ms Time: 298.015 ms Time: 297.172 ms The part I don't know is if it's at all likely that someone would ever hit this. We've added TID scan, so filtering on ctid must be common enough to warrant having that code (yes, I know it's required for WHERE CURRENT OF too), I just don't know how common rescans are in those queries. The patch optimises the recalc by changing things so the recalc is only done when a parameter has changed that's mentioned somewhere in TID quals. If no such parameter has changed, we use the same list as we did on the last scan. Does anyone think this is worth pursuing further? Patch attached. David [1] https://postgr.es/m/4a6268ff-3340-453a-9bf5-c98d51a6f729@app.fastmail.com
Вложения
> On 17 Sep 2025, at 09:59, David Rowley <dgrowleyml@gmail.com> wrote: > > Does anyone think this is worth pursuing further? I heard of following use-case: data transferring system partition big tables by ctid ranges to mimic parallel secscan, butwith many network connections. Some performance improvement was claimed and connection failure resistance (when one connectionwas broken only one partition is rescanned with same snapshot). I'm not entirely sure this is a safe approach (I have a gut feeling that tid scan is not MVCC safe). But for retrieving verybig tables without strong guarantees this might make some sense. Would your patch improve performance of such case? Best regards, Andrey Borodin.
On Wed, 17 Sept 2025 at 18:29, Andrey Borodin <x4mmm@yandex-team.ru> wrote: > I heard of following use-case: data transferring system partition big tables by ctid ranges to mimic parallel secscan,but with many network connections. Some performance improvement was claimed and connection failure resistance (whenone connection was broken only one partition is rescanned with same snapshot). > Would your patch improve performance of such case? I suspect they're just running a SELECT * to a single table "WHERE ctid BETWEEN" some fixed range of TIDs. If that's the case then this won't help as there are no rescans, therefore the TID Range is only calculated once. Also, I imagine TID Range Scans are less affected than TID Scans simply because TidExprListCreate() is more complex than TidRangeEval(). What you'd need for this to ever be slow is lots of rescans, so something like TID Scan on the inside of a nested loop. If you look at ExecReScanTidScan() you'll see it pfrees the tss_TidList, which requires that the list gets built all over again on the next call to TidNext(). It's the call to TidListEval() that is potentially expensive due to the expression evaluation, memory allocation, sorting and distinct work. David
> On 17 Sep 2025, at 14:51, David Rowley <dgrowleyml@gmail.com> wrote: > > What you'd need for this to ever be slow is lots of rescans, so > something like TID Scan on the inside of a nested loop. If you look at > ExecReScanTidScan() you'll see it pfrees the tss_TidList, which > requires that the list gets built all over again on the next call to > TidNext(). It's the call to TidListEval() that is potentially > expensive due to the expression evaluation, memory allocation, sorting > and distinct work. Occasionally (when dealing with corruption) I do stuff like begin; update public.tablename set description = description where ctid in (select ('('||b.blkno::text||','||(x::text)||')')::tidfrom generate_series(1,300) x, blocks b); in some forms they are actually joins. Also, pageinspecting things out is always a join (CTAS a copy of table rows that haveparticular infomask bits). But, fortunately, it's not that frequent case. It's always "plumbing", not a "regular databaseusage". Best regards, Andrey Borodin.
On Wed, 17 Sept 2025 at 22:13, Andrey Borodin <x4mmm@yandex-team.ru> wrote: > Occasionally (when dealing with corruption) I do stuff like > > begin; > update public.tablename set description = description where ctid in (select ('('||b.blkno::text||','||(x::text)||')')::tidfrom generate_series(1,300) x, blocks b); > > in some forms they are actually joins. Also, pageinspecting things out is always a join (CTAS a copy of table rows thathave particular infomask bits). But, fortunately, it's not that frequent case. It's always "plumbing", not a "regulardatabase usage". Thanks for sharing that one. If that UPDATE did do a Nested Loop join with a TID Scan on the inner side, the optimisation I have in the patch *wouldn't* be applied as a parameter is changing that genuinely does need the TidList to be recalculated over again. David
On Sep 17, 2025, at 18:31, David Rowley <dgrowleyml@gmail.com> wrote:On Wed, 17 Sept 2025 at 22:13, Andrey Borodin <x4mmm@yandex-team.ru> wrote:Occasionally (when dealing with corruption) I do stuff like
begin;
update public.tablename set description = description where ctid in (select ('('||b.blkno::text||','||(x::text)||')')::tid from generate_series(1,300) x, blocks b);
in some forms they are actually joins. Also, pageinspecting things out is always a join (CTAS a copy of table rows that have particular infomask bits). But, fortunately, it's not that frequent case. It's always "plumbing", not a "regular database usage".
Thanks for sharing that one. If that UPDATE did do a Nested Loop join
with a TID Scan on the inner side, the optimisation I have in the
patch *wouldn't* be applied as a parameter is changing that genuinely
does need the TidList to be recalculated over again.
David
I tried to trace this patch again today.
With David’s example with “million" and “empty" tables, TidScan.tidparamids is NULL, so that in ExecRescanTidScan(), bms_overlap() will return false, and TidList is not free-ed.
Same thing for Andrey’s example.
Based on David’s example, I build this SQL:
```
# Insert tuples to empty first
select sum(c) from million m left join lateral (select a c from empty e where e.ctid = m.ctid) on 1=1;```
With this SQL, outer scan passes a parameter to inter TidScan, so that “tidparamids” is not NULL now. Then I noticed one thing: now it needs to re-evaluate TidList. However, the new TidList always has the same length of the old TidList, so that we don’t need to free the old TidList, instead, we can actually reuse memory of the old TidList, which could be a small optimization.
For that, I created 0002. But TBH, with 0002, and with psql timing on, I don’t see obvious improvement wrt execution time.