Обсуждение: When creating index, why pointing to old version of tuple

Поиск
Список
Период
Сортировка

When creating index, why pointing to old version of tuple

От
Chao Li
Дата:
Hi Community,

Let me use a small example to demonstrate my observation.

Step 1: create a simple table, insert a tuple and update it.

create table ta (id int, name varchar(32), age int);
insert into ta values(1, 'aa', 4);
update ta set age=99 where id=1;

Step 2: with pageinspect, we can the 2 version of the tuple:

SELECT * FROM heap_page_items(get_raw_page('ta', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |           t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------
  1 |   8152 |        1 |     36 |    765 |    765 |        0 | (0,2)  |       16387 |         34 |     24 |        |       | \x010000000761610004000000
  2 |   8112 |        1 |     36 |    765 |      0 |        2 | (0,2)  |       32771 |      10242 |     24 |        |       | \x010000000761610063000000
(2 rows)

The old version's ctid now points to (0,2) which is expected.

Step 3: create a index on the table

create index idx_ta_age on ta(age);

Step 4: view the index page

evantest=# SELECT * FROM bt_page_items('idx_ta_age', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           | dead | htid  | tids
------------+-------+---------+-------+------+-------------------------+------+-------+------
          1 | (0,1) |      16 | f     | f    | 63 00 00 00 00 00 00 00 | f    | (0,1) |
(1 row)

Here comes my question, why the index entry's ctid points to the old version tuple?

I understand that, for updated tuples, old version's ctid points to new version, that builds a chain of all versions. But my confusion is that, when an index is created, older transactions and in-progress transactions won't see the newly created index. So, it should be ok for the index to point to the newest version of tuple version that is visible to the index.

Can anyone please explain me about that? 

Thanks and regards,

Chao Li (Evan)
------------------------------
HighGo Software Inc.
https://www.highgo.com/

Re: When creating index, why pointing to old version of tuple

От
Tender Wang
Дата:


Chao Li <li.evan.chao@gmail.com> 于2025年8月1日周五 14:16写道:
Hi Community,

Let me use a small example to demonstrate my observation.

Step 1: create a simple table, insert a tuple and update it.

create table ta (id int, name varchar(32), age int);
insert into ta values(1, 'aa', 4);
update ta set age=99 where id=1;

Step 2: with pageinspect, we can the 2 version of the tuple:

SELECT * FROM heap_page_items(get_raw_page('ta', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |           t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------
  1 |   8152 |        1 |     36 |    765 |    765 |        0 | (0,2)  |       16387 |         34 |     24 |        |       | \x010000000761610004000000
  2 |   8112 |        1 |     36 |    765 |      0 |        2 | (0,2)  |       32771 |      10242 |     24 |        |       | \x010000000761610063000000
(2 rows)

The old version's ctid now points to (0,2) which is expected.

Step 3: create a index on the table

create index idx_ta_age on ta(age);

Step 4: view the index page

evantest=# SELECT * FROM bt_page_items('idx_ta_age', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           | dead | htid  | tids
------------+-------+---------+-------+------+-------------------------+------+-------+------
          1 | (0,1) |      16 | f     | f    | 63 00 00 00 00 00 00 00 | f    | (0,1) |
(1 row)

Here comes my question, why the index entry's ctid points to the old version tuple?

I understand that, for updated tuples, old version's ctid points to new version, that builds a chain of all versions. But my confusion is that, when an index is created, older transactions and in-progress transactions won't see the newly created index. So, it should be ok for the index to point to the newest version of tuple version that is visible to the index.

Can anyone please explain me about that? 


If the index points to the newest version of the tuple, how do old transactions read the old version of the tuple using an index scan for old transactions?
Pointing to the old version is friendly if the table is often updated. This way, we don't need to update the index tuple.


--
Thanks,
Tender Wang

Re: When creating index, why pointing to old version of tuple

От
Chao Li
Дата:
> If the index points to the newest version of the tuple, how do old transactions read the old version of the tuple using an index scan for old transactions?

Say there is a long transaction x1, it is on-going.

And transaction x2 started later than x1 started, and x2 created an index. Should x1 be visible to the new index?

My understanding is, the old transaction x1 cannot use the new index to scan for the old version of the tuple. If you read my example, in the index, the key is the new age value (99), while the old age value is 4, thus using the old value will not hit the index entry.

Chao Li (Evan)
------------------------------
HighGo Software Inc.
https://www.highgo.com/


Tender Wang <tndrwang@gmail.com> 于2025年8月1日周五 14:37写道:


Chao Li <li.evan.chao@gmail.com> 于2025年8月1日周五 14:16写道:
Hi Community,

Let me use a small example to demonstrate my observation.

Step 1: create a simple table, insert a tuple and update it.

create table ta (id int, name varchar(32), age int);
insert into ta values(1, 'aa', 4);
update ta set age=99 where id=1;

Step 2: with pageinspect, we can the 2 version of the tuple:

SELECT * FROM heap_page_items(get_raw_page('ta', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |           t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------
  1 |   8152 |        1 |     36 |    765 |    765 |        0 | (0,2)  |       16387 |         34 |     24 |        |       | \x010000000761610004000000
  2 |   8112 |        1 |     36 |    765 |      0 |        2 | (0,2)  |       32771 |      10242 |     24 |        |       | \x010000000761610063000000
(2 rows)

The old version's ctid now points to (0,2) which is expected.

Step 3: create a index on the table

create index idx_ta_age on ta(age);

Step 4: view the index page

evantest=# SELECT * FROM bt_page_items('idx_ta_age', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           | dead | htid  | tids
------------+-------+---------+-------+------+-------------------------+------+-------+------
          1 | (0,1) |      16 | f     | f    | 63 00 00 00 00 00 00 00 | f    | (0,1) |
(1 row)

Here comes my question, why the index entry's ctid points to the old version tuple?

I understand that, for updated tuples, old version's ctid points to new version, that builds a chain of all versions. But my confusion is that, when an index is created, older transactions and in-progress transactions won't see the newly created index. So, it should be ok for the index to point to the newest version of tuple version that is visible to the index.

Can anyone please explain me about that? 


If the index points to the newest version of the tuple, how do old transactions read the old version of the tuple using an index scan for old transactions?
Pointing to the old version is friendly if the table is often updated. This way, we don't need to update the index tuple.


--
Thanks,
Tender Wang

Re: When creating index, why pointing to old version of tuple

От
Tender Wang
Дата:


Chao Li <li.evan.chao@gmail.com> 于2025年8月1日周五 14:47写道:
> If the index points to the newest version of the tuple, how do old transactions read the old version of the tuple using an index scan for old transactions?

Say there is a long transaction x1, it is on-going.

And transaction x2 started later than x1 started, and x2 created an index. Should x1 be visible to the new index?

My understanding is, the old transaction x1 cannot use the new index to scan for the old version of the tuple. If you read my example, in the index, the key is the new age value (99), while the old age value is 4, thus using the old value will not hit the index entry.

Yeah, in your case, the index is non-visible for x1. Your saying is correct.  But what if the table already has an index? 
--
Thanks,
Tender Wang

Re: When creating index, why pointing to old version of tuple

От
Chao Li
Дата:

> But what if the table already has an index? 

I have tested that, if I create the index first, then update the tuple, the index entry will only point to the new version of data. That's why my question was specifically about creating the index after updating the tuple.


Chao Li (Evan)
------------------------------
HighGo Software Inc.
https://www.highgo.com/


Tender Wang <tndrwang@gmail.com> 于2025年8月1日周五 14:58写道:


Chao Li <li.evan.chao@gmail.com> 于2025年8月1日周五 14:47写道:
> If the index points to the newest version of the tuple, how do old transactions read the old version of the tuple using an index scan for old transactions?

Say there is a long transaction x1, it is on-going.

And transaction x2 started later than x1 started, and x2 created an index. Should x1 be visible to the new index?

My understanding is, the old transaction x1 cannot use the new index to scan for the old version of the tuple. If you read my example, in the index, the key is the new age value (99), while the old age value is 4, thus using the old value will not hit the index entry.

Yeah, in your case, the index is non-visible for x1. Your saying is correct.  But what if the table already has an index? 
--
Thanks,
Tender Wang

When creating index, why pointing to old version of tuple

От
"David G. Johnston"
Дата:
On Friday, August 1, 2025, Chao Li <li.evan.chao@gmail.com> wrote:

> But what if the table already has an index? 

I have tested that, if I create the index first, then update the tuple, the index entry will only point to the new version of data. That's why my question was specifically about creating the index after updating the tuple.

IIUC it’s just a seemingly low-value optimization that no one has bothered to implement. The code path in question handles both initial creation and reindexing and the later needs to keep the chain intact for concurrent readers.  It just doesn’t seem worth it to offer 10% off new index creations and then charging full price thereafter.  It could actually be a bit counter-productive since your initial evaluation period would be skewed toward the positive.

David J.

Re: When creating index, why pointing to old version of tuple

От
Chao Li
Дата:
This explanation sounds reasonable to me. I didn't mean to "optimize" the logic, I was just trying to understand the behavior. 

Chao Li (Evan)
------------------------------
HighGo Software Inc.
https://www.highgo.com/


David G. Johnston <david.g.johnston@gmail.com> 于2025年8月1日周五 15:50写道:
On Friday, August 1, 2025, Chao Li <li.evan.chao@gmail.com> wrote:

> But what if the table already has an index? 

I have tested that, if I create the index first, then update the tuple, the index entry will only point to the new version of data. That's why my question was specifically about creating the index after updating the tuple.

IIUC it’s just a seemingly low-value optimization that no one has bothered to implement. The code path in question handles both initial creation and reindexing and the later needs to keep the chain intact for concurrent readers.  It just doesn’t seem worth it to offer 10% off new index creations and then charging full price thereafter.  It could actually be a bit counter-productive since your initial evaluation period would be skewed toward the positive.

David J.

Re: When creating index, why pointing to old version of tuple

От
Tom Lane
Дата:
Chao Li <li.evan.chao@gmail.com> writes:
> I understand that, for updated tuples, old version's ctid points to new
> version, that builds a chain of all versions. But my confusion is that,
> when an index is created, older transactions and in-progress transactions
> won't see the newly created index.

I think this misunderstanding is the root of your confusion.  Yes,
transactions with older snapshots can see and use such an index.
In fact, they *must* see it, because they had better update it
if they make any new insertions in the table.

This means that index creation has to index any row version that
is even potentially still-visible to any open transaction.

Moreover, since our tracking of visibility is approximate (OldestXmin
certainly doesn't capture everything), that will result in sometimes
indexing things that an omniscient observer could know aren't visible
to any remaining transaction.  I'm not sure if that applies to your
test case, but maybe it does, depending on what else was happening in
the system.

Also, there are a bunch of edge cases concerning HOT updates, which
are described in src/backend/access/heap/README.HOT.  I'm not
sure whether your test case is affected by those rules, but
that could be an independent reason for index entries that
point somewhere other than where you expected.

            regards, tom lane



Re: When creating index, why pointing to old version of tuple

От
龙小龙
Дата:
On read committed isolation level, x1 should see the new index when it execute the next sql after x2 is committed.

Best Regards,
Heisenberg

2025年8月1日 14:47,Chao Li <li.evan.chao@gmail.com> 写道:

> If the index points to the newest version of the tuple, how do old transactions read the old version of the tuple using an index scan for old transactions?

Say there is a long transaction x1, it is on-going.

And transaction x2 started later than x1 started, and x2 created an index. Should x1 be visible to the new index?

My understanding is, the old transaction x1 cannot use the new index to scan for the old version of the tuple. If you read my example, in the index, the key is the new age value (99), while the old age value is 4, thus using the old value will not hit the index entry.

Chao Li (Evan)
------------------------------
HighGo Software Inc.
https://www.highgo.com/


Tender Wang <tndrwang@gmail.com> 于2025年8月1日周五 14:37写道:


Chao Li <li.evan.chao@gmail.com> 于2025年8月1日周五 14:16写道:
Hi Community,

Let me use a small example to demonstrate my observation.

Step 1: create a simple table, insert a tuple and update it.

create table ta (id int, name varchar(32), age int);
insert into ta values(1, 'aa', 4);
update ta set age=99 where id=1;

Step 2: with pageinspect, we can the 2 version of the tuple:

SELECT * FROM heap_page_items(get_raw_page('ta', 0));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |           t_data
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+----------------------------
  1 |   8152 |        1 |     36 |    765 |    765 |        0 | (0,2)  |       16387 |         34 |     24 |        |       | \x010000000761610004000000
  2 |   8112 |        1 |     36 |    765 |      0 |        2 | (0,2)  |       32771 |      10242 |     24 |        |       | \x010000000761610063000000
(2 rows)

The old version's ctid now points to (0,2) which is expected.

Step 3: create a index on the table

create index idx_ta_age on ta(age);

Step 4: view the index page

evantest=# SELECT * FROM bt_page_items('idx_ta_age', 1);
 itemoffset | ctid  | itemlen | nulls | vars |          data           | dead | htid  | tids
------------+-------+---------+-------+------+-------------------------+------+-------+------
          1 | (0,1) |      16 | f     | f    | 63 00 00 00 00 00 00 00 | f    | (0,1) |
(1 row)

Here comes my question, why the index entry's ctid points to the old version tuple?

I understand that, for updated tuples, old version's ctid points to new version, that builds a chain of all versions. But my confusion is that, when an index is created, older transactions and in-progress transactions won't see the newly created index. So, it should be ok for the index to point to the newest version of tuple version that is visible to the index.

Can anyone please explain me about that? 


If the index points to the newest version of the tuple, how do old transactions read the old version of the tuple using an index scan for old transactions?
Pointing to the old version is friendly if the table is often updated. This way, we don't need to update the index tuple.


--
Thanks,
Tender Wang