Discussion: Fix missing EvalPlanQual recheck for TID scans
Hi all,

I learned that the TID Scan node does not properly implement EvalPlanQual recheck, which has already been noted in a comment added in 2002 (commit 6799a6ca21). This does seem to have the potential to manifest in observable and surprising ways when queries filter on `ctid`, for example for optimistic concurrency control as in https://stackoverflow.com/q/78487441.

Attached is an attempt from me at fixing this issue for TID Scan as well as TID Range Scan, which appears to have the same issue. This is my first time looking at the Postgres source (nor do I work with C on a regular basis), so I will not be surprised to hear that I've done things wrong, but I'm hopeful that this is usable as written. Running `meson test` passes for me, including the added isolation tests, and I confirmed that both of the tests that now result in a negative recheck were previously failing.

In my implementation I am calling TidRangeEval every time TidRangeRecheck is called; let me know if this is a performance concern. It didn't seem that any of the existing TidRangeScanState flags are quite right for knowing whether the bounds have been initialized, but I am happy to add one.

The original issue appears to have been present since the introduction of TID scans in 1999 (commit 6f9ff92cc0), so I imagine it may make sense to backport the fix, although evidently not many people are running into this.

Thanks,
Sophie
Attachments
On Mon, 8 Sept 2025 at 10:31, Sophie Alpert <pg@sophiebits.com> wrote:
> I learned that the TID Scan node does not properly implement EvalPlanQual recheck, which has already been noted in a comment added in 2002 (commit 6799a6ca21). This does seem to have the potential to manifest in observable and surprising ways when queries filter on `ctid`, for example for optimistic concurrency control as in https://stackoverflow.com/q/78487441.

Thanks for the report and the patch. I'll summarise this issue, as you provided an external link which might one day disappear:

drop table if exists test_optimistic_lock;
create table public.test_optimistic_lock (
    id bigserial primary key,
    name varchar,
    rest_count integer,
    update_user varchar,
    update_time timestamp without time zone,
    create_time timestamp without time zone
);

insert into public.test_optimistic_lock (name, rest_count, update_time, create_time)
values ('Product1', 1, current_timestamp, current_timestamp);

S1: begin;
S1: select ctid, rest_count from public.test_optimistic_lock where name = 'Product1';
S2: begin;
S2: select ctid, rest_count from public.test_optimistic_lock where name = 'Product1';
S1: update public.test_optimistic_lock set rest_count = rest_count - 1, update_user = 'A', update_time = current_timestamp where name = 'Product1' and ctid = '(0,1)';
S2: update public.test_optimistic_lock set rest_count = rest_count - 1, update_user = 'A', update_time = current_timestamp where name = 'Product1' and ctid = '(0,1)'; -- waits for S1
S1: commit;
S2: Gets UPDATE 1 when it should get UPDATE 0!!

I agree that this is a bug and we should fix it. The behaviour should match what you'd get if you ran it with "SET enable_tidscan = 0;". When that's done, the Seq Scan will have the ctid-related expressions within the qual and it will correctly filter out the non-matching tuple.

> Attached is an attempt from me at fixing this issue for TID Scan as well as TID Range Scan, which appears to have the same issue.
I think this is generally pretty good. Here's a quick review:

1. This part is quite annoying:

+ if (node->tss_TidList == NULL)
+     TidListEval(node);

Of course, it's required since ExecReScanTidScan() wiped out that list.

Maybe we can add a boolean variable named tts_inScan to TidScanState after tss_isCurrentOf (so that it consumes the existing padding bytes and does not enlarge that struct), and have ExecReScanTidScan() set that to false rather than wipe out the tss_TidList. Then just reuse the existing list in TidRecheck(). That should save on the extra overhead of having to rebuild the list each time there's an EPQ recheck.

2. For TidRangeRecheck, I don't see why this part is needed:

+ if (!TidRangeEval(node))
+     return false;

The TID range is preserved already and shouldn't need to be recalculated.

3. In TidRecheck(), can you make "match" an "ItemPointer" and do:

match = (ItemPointer) bsearch(...);

4. Can you put this comment above the "if"?

+ if (node->tss_isCurrentOf)
+     /* WHERE CURRENT OF always intends to resolve to the latest tuple */
+     return true;

What you've got there doesn't meet the style we use. Technically there should be braces because there are multiple lines (per our style, not C's), but it's a bit needless to do that as the comment makes the same amount of sense if it goes before the "if".

5. Can you make TidRangeRecheck() look like this?

static bool
TidRangeRecheck(TidRangeScanState *node, TupleTableSlot *slot)
{
    /* Recheck the ctid is still within range */
    if (ItemPointerCompare(&slot->tts_tid, &node->trss_mintid) < 0 ||
        ItemPointerCompare(&slot->tts_tid, &node->trss_maxtid) > 0)
        return false;

    return true;
}

The reason being that it's closer to how heap_getnextslot_tidrange() does it, and I'd rather the conditions were kept as similar as possible to that function.

6. For the tests, it should be ok to make the TID Range Scan test do ctid BETWEEN '(0,1)' AND '(0,1)'. I feel this is more aligned to the TID Range Scan version of what you're doing in the TID Scan test.

I'll need to give the tests a closer inspection later.

David
Thanks for taking a look.

On Sun, Sep 7, 2025 at 9:51 PM, David Rowley <dgrowleyml@gmail.com> wrote:
> I agree that this is a bug and we should fix it. The behaviour should
> match what you'd get if you ran it with "SET enable_tidscan = 0;".

I've confirmed that my new tests already pass on current master if `SET enable_tidscan = off;` is added to both session setups.

> 1. This part is quite annoying:
>
> + if (node->tss_TidList == NULL)
> +     TidListEval(node);
>
> Of course, it's required since ExecReScanTidScan() wiped out that list.
>
> Maybe we can add a boolean variable named tts_inScan to TidScanState
> after tss_isCurrentOf (so as it consumes the existing padding bytes
> and does not enlarge that struct), and have ExecReScanTidScan() set
> that to false rather than wipe out the tss_TidList. Then just reuse
> the existing list in TidRecheck(). That should save on the extra
> overhead of having to rebuild the list each time there's an EPQ
> recheck.
>
> 2. For TidRangeRecheck, I don't see why this part is needed:
>
> + if (!TidRangeEval(node))
> +     return false;
>
> The TID range is preserved already and shouldn't need to be recalculated.

I am a bit unclear on the exact logic around what order Next, ReScan, and Recheck are called in, so I may have written the code too defensively. Concretely: for ExecReScanTidScan wiping out tss_TidList, I was thinking that when ExecReScanTidScan is called, there may be changed params that require the list to be recalculated. I also wasn't sure if Recheck can be called without a preceding Next and/or ReScan having been called with the same params, or if it can always rely on those having been called immediately prior. (I had reviewed IndexRecheck as I figured it seems the most likely to be a correct example here, but I wasn't able to infer from how it's written.) If you could help clarify the guarantees here I'd appreciate it.
For what it's worth, if I test removing the tss_TidList clearing in ExecReScanTidScan and the recomputation in TidRecheck (of course this is not correct in general), then my `tid1 tidsucceed2 c1 c2` test now fails due to TidRecheck incorrectly returning false.

I'm actually digging in more now and if I'm reading the code correctly, EvalPlanQualStart calls ExecInitNode to create an entirely separate PlanState tree for EPQ, so I'm unsure if any of the state is carried over to the Recheck calls? If I'm understanding correctly, then it seems the best we can do in this patch is to leave TidRecheck as I had it, but for TidRangeRecheck I can add a node->trss_boundsInitialized flag so that we recompute the bounds once per EPQ instead of per tuple checked.

(I'll be happy to make the stylistic changes you mentioned.)

Appreciate your time,
Sophie
On Mon, 8 Sept 2025 at 18:57, Sophie Alpert <pg@sophiebits.com> wrote:
> I'm actually digging in more now and if I'm reading the code correctly, EvalPlanQualStart calls ExecInitNode to create an entirely separate PlanState tree for EPQ, so I'm unsure if any of the state is carried over to the Recheck calls?

Oh, you're right. The EPQ executor state isn't the same as the normal executor state, so the recalc seems to be needed.

David
Updated patch attached.

On Sun, Sep 7, 2025 at 9:51 PM, David Rowley <dgrowleyml@gmail.com> wrote:
> 1. This part is quite annoying:
>
> + if (node->tss_TidList == NULL)
> +     TidListEval(node);
>
> Of course, it's required since ExecReScanTidScan() wiped out that list.

Given that EPQ uses a separate PlanState, I've left this as is.

> 2. For TidRangeRecheck, I don't see why this part is needed:
>
> + if (!TidRangeEval(node))
> +     return false;
>
> The TID range is preserved already and shouldn't need to be recalculated.

I've added a new trss_boundsInitialized flag such that we calculate the range once per EPQ rescan. In order to preserve the semantics when the min or max is NULL, I'm setting trss_mintid/trss_maxtid to invalid ItemPointers as a sentinel in the cases where TidRangeEval returns false. I added an ItemPointerIsValid assertion given that it's now more relevant to correctness, but I can remove it if it feels superfluous. Let me know if there is a more idiomatic way to treat this.

> 3. In TidRecheck(), can you make "match" an "ItemPointer" and do:
> match = (ItemPointer) bsearch(...);
> 4. Can you put this comment above the "if"?
> 5. Can you make TidRangeRecheck() look like this?
> 6. For the tests. It should be ok to make the Tid range scan test do
> ctid BETWEEN '(0,1)' AND '(0,1)'. I feel this is more aligned to the
> TID Range Scan version of what you're doing in the TID Scan test.

All done. Do let me know if other changes would be helpful.

Sophie
Attachments
HighGo Software Co., Ltd.
https://www.highgo.com/
Hi Chao,

Thanks for taking a look.

On Tue, Sep 9, 2025 at 12:52 AM, Chao Li <li.evan.chao@gmail.com> wrote:
> However, I noticed that, when "TidRecheck()" is called, it is actually passed "epqstate->recheckplanstate", which is always newly created, which means we repeatedly call "TidListEval(node)" and run "bsearch()" when there are multiple rows involved in the update. If the update qual is just "ctid = <a single ctid>", that should be fine. But if we do "update .. where ctid in (a list of ctids)", then repeatedly calling "TidListEval()" might not be a good thing.

Based on my current understanding, the EPQ PlanState is shared across all EPQ checks for a given plan, and ReScan is called at similar times as it is for the initial access path, with ExecScanFetch delegating to the appropriate *Next or *Recheck method as each row is processed. Therefore, my patch does not recompute the TidList for every row.

> As "TidRecheck(TidScanState *node, TupleTableSlot *slot)" gets the second parameter "slot" with the updated slot, I am thinking the other solution could be that we just check if the updated slot is visible to the current transaction. So I made a local change like this:
>
> ...
> tuple = ExecCopySlotHeapTuple(slot);
> visible = HeapTupleSatisfiesVisibility(tuple, GetActiveSnapshot(), slot->tts_tableOid);

This is semantically different from the patch I've proposed. My test `permutation tid1 tidsucceed2 c1 c2 read` fails against your code, despite passing with the `SET enable_tidscan = OFF;` flag that I understand we want to match the behavior of. (In addition, I understand HeapTupleSatisfiesVisibility is specific to the heap access method and can't be used here, though perhaps tuple_fetch_row_version could be used instead.)

Sophie
On Sep 10, 2025, at 13:18, Chao Li <li.evan.chao@gmail.com> wrote:
> With my proposal, my theory is that TidNext() fetches a tuple by ctid; if another concurrent transaction updates/deletes the tuple, the tuple's visibility changes.
On Tue, Sep 9, 2025 at 10:18 PM, Chao Li <li.evan.chao@gmail.com> wrote:
> No, that's not true. If you extend David's procedure and use 3 sessions to reproduce the problem (s1 updates (0,1), s2 updates (0,2), and s3 updates (0,1) or (0,2)) and trace the backend process of s3, you will see that every time TidRecheck() is called, "node" is brand new, so it has to call TidListEval(node) every time.

After more investigation I agree you are correct here; in a single query, ReScan is called once for each subsequent tuple being rechecked. (ExecUpdate calls EvalPlanQualBegin, which always sets rcplanstate->chgParam.)

On Wed, Sep 10, 2025 at 10:30 PM, Chao Li <li.evan.chao@gmail.com> wrote:
> But why can't we make TidRecheck() simply return FALSE?
>
> It looks like the only case where TidRecheck() is called is when a concurrent transaction updated the row, and in that case, the ctid must have changed.

Although returning FALSE is clever, it is only correct if several other unstated preconditions for the function hold, which I am loath to rely on.

And indeed, like I mentioned in my previous message, my isolation test `permutation tid1 tidsucceed2 c1 c2 read` from eval-plan-qual.spec in my patch will fail if Recheck were to return false in this case. Though somewhat contrived, you can imagine this happening with multiple sessions driven by the same application:

setup: two rows exist with ctid=(0,1) and (0,2)
S1: BEGIN;
S2: BEGIN;
S1: UPDATE WHERE ctid=(0,1) RETURNING ctid;
-- returns (0,3), which the application uses in the next query from another session:
S2: UPDATE WHERE ctid=(0,1) OR ctid=(0,3); -- statement snapshot sees (0,1); recheck will see (0,3) and should pass
S1: COMMIT;
S2: COMMIT;

I would not defend this as good code from an application developer, but the behavior is observable. So I understand it would be best to match the enable_tidscan = off behavior, which my existing strategy more verifiably does. Of course, if the team disagrees with me then I will defer to everyone's better judgement.

I hope it is rare that many tuples need to be rechecked in a single query, but if this is common, then I would suggest it would be better to rework the generic EPQ machinery so that EPQ nodes do not need to be reset for every row, rather than depending on subtle implicit guarantees in the TidRecheck code.

Sophie
On Sat, Sep 13, 2025 at 3:12 PM, Sophie Alpert <pg@sophiebits.com> wrote:
> And indeed, like I mentioned in my previous message, my isolation test
> `permutation tid1 tidsucceed2 c1 c2 read` from eval-plan-qual.spec in
> my patch will fail if Recheck were to return false in this case. Though
> somewhat contrived, you can imagine this happening with multiple
> sessions driven by the same application:

Another case where returning FALSE does not give the correct behavior is when two relations are involved, only one of which is modified:

S1: BEGIN;
S2: BEGIN;
S1: UPDATE accounts SET balance = balance + 100 WHERE ctid = '(0,1)' RETURNING accountid, balance;
S2: SELECT * FROM accounts JOIN accounts_ext USING (accountid) WHERE accounts_ext.ctid = '(0,1)' FOR UPDATE OF accounts;
S1: COMMIT;
S2: COMMIT;

In my patch the S2 query correctly returns one row, whereas with your proposed change it incorrectly returns none.

accountid|balance|balance2|balance|other|newcol|newcol2
---------+-------+--------+-------+-----+------+-------
checking |    700|   1400|    600|other|    42|

Sophie
On Sun, Sep 14, 2025 at 6:49 PM, Chao Li <li.evan.chao@gmail.com> wrote:
> This is a wrong example and (0,3) should NOT be updated. According to the definition of "read committed":
> https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED
> "A query sees only data committed before the query began".

You paraphrased the docs here but did so incorrectly: the actual quote is "a SELECT query (without a FOR UPDATE/SHARE clause) sees only data committed before the query began". We are not discussing the behavior of a plain SELECT query, so this description is not relevant. For Update and LockRows, the expected EvalPlanQual behavior is that rows are checked against the predicate twice — once as of the statement snapshot and once as of locking time — and the rows that match both times are used.

In my example with ctid (0,3), the row matches the 'ctid = (0,1) OR ctid = (0,3)' predicate both times. The row is not newly created, so the newly-created row in your example is not analogous.

I continue to believe that my implementation of TidRecheck plainly satisfies the contract for what the scan recheck is meant to do; the fact that it matches the enable_tidscan=OFF behavior is further corroboration of that fact.

Sophie
On Mon, 15 Sept 2025 at 13:49, Chao Li <li.evan.chao@gmail.com> wrote:
> This is a wrong example and (0,3) should NOT be updated. According to the definition of "read committed":
>
> 13.2. Transaction Isolation
> postgresql.org
>
> "A query sees only data committed before the query began".

I'm afraid your understanding of the documentation isn't correct. Maybe you've observed some translated version which isn't correct, or you've made a mistake in your translation back to English. As for the quoted line, I'm not quite sure where that comes from, but I don't see it as-is in the documents.

The particular part of the documentation that's relevant is in [1]. The following is the relevant text:

"If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row. The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version of the row still matches the search condition. If so, the second updater proceeds with its operation using the updated version of the row."

> And this example also proves my solution of checking visibility works.

The specific part that Sophie aims to fix is the incorrect behaviour regarding the "re-evaluation" of the updated row. The correct way to fix that is to ensure the updated tuple matches the WHERE clause of the statement. Adding some visibility checks in place of that is complete nonsense. What does need to happen is validation that the ctid of the updated tuple matches the WHERE clause of the UPDATE statement.

The reason anything special must happen for TID Scan and TID Range Scan and not Seq Scan is that in Seq Scan the ctid qual will be part of the scan's qual, whereas in the TID Scan variants, that's been extracted, as it's assumed the scan itself will handle fetching only the required rows. It's just that that's currently not enforced during the recheck.

If you review the ExecScan() comments, you'll see the recheck function's role is the following:

 * A 'recheck method' must also be provided that can check an
 * arbitrary tuple of the relation against any qual conditions
 * that are implemented internal to the access method.

If you need guidance about how this should be behaving, try with "SET enable_tidscan = 0;" and see what that does.

David

[1] https://www.postgresql.org/docs/current/transaction-iso.html#XACT-READ-COMMITTED
Read Committed is the default isolation level in PostgreSQL. When a transaction uses this isolation level, a SELECT query (without a FOR UPDATE/SHARE clause) sees only data committed before the query began; it never sees either uncommitted data or changes committed by concurrent transactions during the query's execution. In effect, a SELECT query sees a snapshot of the database as of the instant the query begins to run. However, SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed. Also note that two successive SELECT commands can see different data, even though they are within a single transaction, if other transactions commit changes after the first SELECT starts and before the second SELECT starts.

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time.
It says that UPDATE will only find target rows that were committed as of the command start time. I think the statement implies that an “update” statement will never update a “future” tuple.
evantest=*# update t set b=30 where id = 1;
evantest=*# update t set b = 30;   -- blocks here

 1 | 5 | 30   <=== only the old tuple is updated
On Mon, 15 Sept 2025 at 17:32, Chao Li <li.evan.chao@gmail.com> wrote:
> UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time.
>
> It says that UPDATE will only find target rows that were committed as of the command start time. I think the statement implies that an "update" statement will never update a "future" tuple.

I think you've only read the first sentence in that paragraph. Please read the entire paragraph. You'll see it goes on to explain "it will attempt to apply its operation to the updated version of the row".

David
On Sep 15, 2025, at 16:12, Chao Li <li.evan.chao@gmail.com> wrote:
> Then the solution of TidListEval() is correct, and the only issue is repeatedly calling TidListEval() on every recheck.
On Tue, 9 Sept 2025 at 04:36, Sophie Alpert <pg@sophiebits.com> wrote:
> I've added a new trss_boundsInitialized flag such that we calculate the range once per EPQ rescan. In order to preserve the semantics when the min or max is NULL, I'm setting trss_mintid/trss_maxtid to invalid ItemPointers as a sentinel in the cases where TidRangeEval returns false. I added an ItemPointerIsValid assertion given that it's now more relevant to correctness, but I can remove it if it feels superfluous. Let me know if there is a more idiomatic way to treat this.

This part seems a bit pointless, as EPQ queries will only return 1 row anyway (check EvalPlanQual() where it calls EvalPlanQualNext()). There will be a rescan, which will wipe out the range before the recheck function is ever called again.

I think if we want to do better at not continuously recalculating the TID List or range, then something more generic might be better. If the planner were to note down which ParamIds exist in the TID exprs, we could just mark that the TID List / Range needs to be recalculated whenever one of those parameters changes. That should help in non-EPQ cases too.

For the v2 patch, I've hacked on that a bit and stripped out the trss_boundsInitialized stuff and just made it so we recalculate the TID List/Range on every recheck. I also added another isolation test permutation to have s1 roll back and ensure that s2 updates the ctid = '(0,1)' tuple.

The attached v3-0001 is the updated v2 patch, and v3-0002 is a POC of what I described above. Seems there is something to it, as the performance is better. It is a very contrived test case, however:

create table empty(a int);
create table million (a int);
insert into million select generate_series(1,1000000);
set max_parallel_workers_per_gather=0;
set enable_seqscan=0;
set enable_material=0;
set jit=0;
select sum(c) from million m left join lateral (select count(*) c from empty where ctid in ('(1,1)','(1,2)','(1,3)','(1,4)','(1,5)','(1,6)','(1,7)','(1,8)','(1,9)','(1,10)','(1,11)','(1,12)','(1,13)','(1,14)','(1,15)','(1,16)','(1,17)','(1,18)','(1,19)','(1,20)','(1,21)','(1,22)','(1,23)','(1,24)','(1,25)','(1,26)','(1,27)','(1,28)','(1,29)','(1,30)','(1,31)','(1,32)','(1,33)','(1,34)','(1,35)','(1,36)','(1,37)','(1,38)','(1,39)','(1,40)','(1,41)','(1,42)','(1,43)','(1,44)','(1,45)','(1,46)','(1,47)','(1,48)','(1,49)','(1,50)','(1,51)','(1,52)','(1,53)','(1,54)','(1,55)','(1,56)','(1,57)','(1,58)','(1,59)','(1,60)','(1,61)','(1,62)','(1,63)','(1,64)','(1,65)','(1,66)','(1,67)','(1,68)','(1,69)','(1,70)','(1,71)','(1,72)','(1,73)','(1,74)','(1,75)','(1,76)','(1,77)','(1,78)','(1,79)','(1,80)','(1,81)','(1,82)','(1,83)','(1,84)','(1,85)','(1,86)','(1,87)','(1,88)','(1,89)','(1,90)','(1,91)','(1,92)','(1,93)','(1,94)','(1,95)','(1,96)','(1,97)','(1,98)','(1,99)','(1,100)')) on 1=1;

master:
Time: 613.541 ms
Time: 621.037 ms
Time: 623.430 ms

master + v3-0002:
Time: 298.863 ms
Time: 298.015 ms
Time: 297.172 ms

Overall, I'm keen to get moving on fixing the initially reported problem. Maybe the 0002 part can be done in master or not at all. 0002 modifies the TidScan and TidRangeScan, which we can't really do in the back branches anyway.

David
Attachments
On Mon, Sep 15, 2025 at 4:23 AM, David Rowley <dgrowleyml@gmail.com> wrote:
> For the v2 patch, I've hacked on that a bit and stripped out the
> trss_boundsInitialized stuff and just make it so we recalculate the
> TID List/Range on every recheck. I also added another isolation test
> permutation to have s1 rollback and ensure that s2 updates the ctid =
> '(0,1)' tuple.

Thanks, this seems sensible given the reality that rescan happens once per tuple. The `if (node->tss_TidList == NULL)` check you kept from me seems likewise pointless in v3-0001 but not particularly intrusive (though of course you use it in v3-0002). Otherwise v3-0001 lgtm.

(I also took a look at v3-0002 and don't see any immediate issues, though neither would I really consider myself qualified to review it.)

Sophie
On Tue, 16 Sept 2025 at 05:42, Sophie Alpert <pg@sophiebits.com> wrote:
> Thanks, this seems sensible given the reality that rescan happens once per tuple. The `if (node->tss_TidList == NULL)` check you kept from me seems likewise pointless in v3-0001 but not particularly intrusive (though of course you use it in v3-0002).

That was a defensive coding choice. It could have been an Assert(node->tss_TidList == NULL);, but I just prefer it as an "if".

> Otherwise v3-0001 lgtm. (I also took a look at v3-0002 and don't see any immediate issues, though neither would I really consider myself qualified to review it.)

Thanks for looking, and for the report and patches. I've pushed and backpatched the fixes. v13 doesn't have TID Range Scans, so I opted to break the fix into 2 parts and apply them to the relevant branches to try and keep the commit messages as accurate as possible.

David