Обсуждение: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

Поиск

Список

Период

Сортировка

eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

23 июня 2025 г., 23:25:16

Hi,

The attached patch set eliminates xl_heap_visible, the WAL record
emitted when a block of the heap is set all-visible/frozen in the
visibility map. Instead, it includes the information needed to update
the VM in the WAL record already emitted by the operation modifying
the heap page.

Currently COPY FREEZE and vacuum are the only operations that set the
VM. So, this patch modifies the xl_heap_multi_insert and xl_heap_prune
records.

The result is a dramatic reduction in WAL volume for these operations.
I've included numbers below.

I also think that it makes more sense to include changes to the VM in
the same WAL record as the changes that rendered the page all-visible.
In some cases, we will only set the page all-visible, but that is in
the context of the operation on the heap page which discovered that it
was all-visible. Therefore, I find this to be a clarity as well as a
performance improvement.

This project is also the first step toward setting the VM on-access
for queries which do not modify the page. There are a few design
issues that must be sorted out for that project which I will detail
separately. Note that this patch set currently does not implement
setting the VM on-access.

The attached patch set isn't 100% polished. I think some of the
variable names and comments could use work, but I'd like to validate
the idea of doing this before doing a full polish. This is a summary
of what is in the set:

Patches:
0001 - 0002: cleanup
0003 - 0004: refactoring
0005: COPY FREEZE changes
0006: refactoring
0007: vacuum phase III changes
0008: vacuum phase I empty page changes
0009 - 0012: refactoring
0013: vacuum phase I normal page changes
0014: cleanup

Performance benefits of eliminating xl_heap_visible:

vacuum of table with index (DDL at bottom of email)
--
master -> patch
WAL bytes: 405346 -> 303088 = 25% reduction
WAL records: 6682 -> 4459 = 33% reduction

vacuum of table without index
--
master -> patch
WAL records: 4452 -> 2231 = 50% reduction
WAL bytes: 289016 -> 177978 = 38% reduction

COPY FREEZE of table without index
--
master -> patch
WAL records: 3672777 -> 1854589 = 50% reduction
WAL bytes: 841340339 -> 748545732  = 11% reduction (new pages need a
copy of the whole page)

table for vacuum example:
--
create table foo(a int, b numeric, c numeric) with (autovacuum_enabled= false);
insert into foo select i % 18, repeat('1', 400)::numeric, repeat('2',
400)::numeric from generate_series(1,40000)i;
-- don't make index for no-index case
create index on foo(a);
delete from foo where a = 1;
vacuum (verbose, process_toast false) foo;


copy freeze example:
--
-- create a data file
create table large(a int, b int) with (autovacuum_enabled = false,
fillfactor = 10);
insert into large SELECT generate_series(1,40000000)i, 1;
copy large to 'large.data';

-- example
BEGIN;
create table large(a int, b int) with (autovacuum_enabled = false,
fillfactor = 10);
COPY large FROM 'large.data' WITH (FREEZE);
COMMIT;

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

27 июня 2025 г., 01:04:34

On Mon, Jun 23, 2025 at 4:25 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> The attached patch set eliminates xl_heap_visible, the WAL record
> emitted when a block of the heap is set all-visible/frozen in the
> visibility map. Instead, it includes the information needed to update
> the VM in the WAL record already emitted by the operation modifying
> the heap page.

Rebased in light of recent changes on master:

0001: cleanup
0002: preparatory work
0003: eliminate xl_heap_visible for COPY FREEZE
0004 - 0005: eliminate xl_heap_visible for vacuum's phase III
0006: eliminate xl_heap_visible for vacuum phase I empty pages
0007 - 0010: preparatory refactoring
0011: eliminate xl_heap_visible from vacuum phase I prune/freeze
0012: remove xl_heap_visible

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

10 июля 2025 г., 00:59:26

On Thu, Jun 26, 2025 at 6:04 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Rebased in light of recent changes on master:

This needed another rebase, and, in light of the discussion in [1],
I've also removed the patch to add heap wrappers for setting pages
all-visible.

More notably, the final patch (0012) in attached v3 allows on-access
pruning to set the VM.

To do this, it plumbs some information down from the executor to the
table scan about whether or not the table is modified by the query. We
don't want to set the VM only to clear it while scanning pages for an
UPDATE or while locking rows in a SELECT FOR UPDATE.

Because we only do on-access pruning when pd_prune_xid is valid, we
shouldn't need much of a heuristic for deciding when to set the VM
on-access -- but I've included one anyway: we only do it if we are
actually pruning or if the page is already dirty and no FPI would be
emitted.

You can see it in action with the following:

create extension pg_visibility;
create table foo (a int, b int) with (autovacuum_enabled=false, fillfactor=90);
insert into foo select generate_series(1,300), generate_series(1,300);
create index on foo (a);
update foo set b = 51 where b = 50;
select * from foo where a = 50;
select * from pg_visibility_map_summary('foo');

The SELECT will set a page all-visible in the VM.
In this patch set, on-access pruning is enabled for sequential scans
and the underlying heap relation in index scans and bitmap heap scans.
This example can exercise any of the three if you toggle
enable_indexscan and enable_bitmapscan appropriately.

From a performance perspective, If you run a trivial pgbench, you can
see far more all-visible pages set in the pgbench_[x] relations with
no noticeable overhead. But, I'm planning to do some performance
experiments to show how this affects our ability to choose index only
scan plans in realistic workloads.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Yj%3DyrL%2BgGGsqfYVQcYn7rDp6hDeoF1vN453JDp8dEY%2Bw%40mail.gmail.com

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

12 июля 2025 г., 01:19:15

On Wed, Jul 9, 2025 at 5:59 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Jun 26, 2025 at 6:04 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > Rebased in light of recent changes on master:
>
> This needed another rebase, and, in light of the discussion in [1],
> I've also removed the patch to add heap wrappers for setting pages
> all-visible.

Andrey Borodin made the excellent point off-list that I forgot to
remove the xl_heap_visible struct itself -- which is rather important
to a patch set purporting to eliminate xl_heap_visible! New version
attached.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Andrey Borodin

Дата:

13 июля 2025 г., 21:34:26

> On 12 Jul 2025, at 03:19, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> remove the xl_heap_visible struct

Same goes for VISIBILITYMAP_XLOG_CATALOG_REL and XLOG_HEAP2_VISIBLE. But please do not rush to remove it, perhaps I
willhave a more exhaustive list later. Currently the patch set is expected to be unpolished. 
I just need to absorb all effects to have a high-level evaluation of the patch set effect.

I'm still trying to grasp connection of first patch with Assert(prstate->cutoffs) to other patches;

Also, I'd prefer "page is not marked all-visible but visibility map bit is set in relation" to emit XX001 for
monitoringreasons, but again, this is small note, while I need a broader picture. 

So far I do not see any general problems in delegating redo work from xl_heap_visible to other record. FWIW I observed
severalcases of VM corruptions that might be connected to the fact that we log VM changes independently of data changes
thatcaused VM to change. But I have no real evidence or understanding what happened. 

Best regards, Andrey Borodin.

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

13 июля 2025 г., 22:15:22

On Sun, Jul 13, 2025 at 2:34 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 12 Jul 2025, at 03:19, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > remove the xl_heap_visible struct
>
> Same goes for VISIBILITYMAP_XLOG_CATALOG_REL and XLOG_HEAP2_VISIBLE. But please do not rush to remove it, perhaps I
willhave a more exhaustive list later. Currently the patch set is expected to be unpolished. 
> I just need to absorb all effects to have a high-level evaluation of the patch set effect.

I actually did remove those if you check the last version posted. I
did notice there is one remaining comment referring to
XLOG_HEAP2_VISIBLE I missed somehow, but the actual enums/macros were
removed already.

> I'm still trying to grasp connection of first patch with Assert(prstate->cutoffs) to other patches;

I added this because I noticed that it was used without validating it
was provided in that location. The last patch in the set which sets
the VM on access changes where cutoffs are used, so I noticed what I
felt was a missing assert in master while developing that page.

> Also, I'd prefer "page is not marked all-visible but visibility map bit is set in relation" to emit XX001 for
monitoringreasons, but again, this is small note, while I need a broader picture. 

Could you clarify what you mean by this? Are you talking about the
string representation of the visibility map bits in the WAL record
representations in heapdesc.c?

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Andrey Borodin

Дата:

14 июля 2025 г., 09:37:47

> On 14 Jul 2025, at 00:15, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
>>
>> Also, I'd prefer "page is not marked all-visible but visibility map bit is set in relation" to emit XX001 for
monitoringreasons, but again, this is small note, while I need a broader picture. 
>
> Could you clarify what you mean by this? Are you talking about the
> string representation of the visibility map bits in the WAL record
> representations in heapdesc.c?

This might be a bit off-topic for this thread, but as long as the patch touches that code we can look into this too.

If VM bit all-visible is set while page is not all-visible IndexOnlyScan will show incorrect results. I observed this
inconsistencyfew times on production. 

Two persistent subsystems (VM and heap) contradict each other, that's why I think this is a data corruption. Yes, we
canrepair the VM by assuming heap to be the source of truth in this case. But we must also emit ERRCODE_DATA_CORRUPTED
XX001code into the logs. In many cases this will alert on-call SRE. 

To do so I propose to replace elog(WARNING,...) with ereport(WARNING,(errcode(ERRCODE_DATA_CORRUPTED),..).

Best regards, Andrey Borodin.

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

01 августа 2025 г., 01:58:11

Thanks for continuing to take a look, Andrey.

On Mon, Jul 14, 2025 at 2:37 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> This might be a bit off-topic for this thread, but as long as the patch touches that code we can look into this too.
>
> If VM bit all-visible is set while page is not all-visible IndexOnlyScan will show incorrect results. I observed this
inconsistencyfew times on production. 

That's very unfortunate. I wonder what could be causing this. Do you
suspect a bug in Postgres? Or something wrong with the disk, etc?

> Two persistent subsystems (VM and heap) contradict each other, that's why I think this is a data corruption. Yes, we
canrepair the VM by assuming heap to be the source of truth in this case. But we must also emit ERRCODE_DATA_CORRUPTED
XX001code into the logs. In many cases this will alert on-call SRE. 
>
> To do so I propose to replace elog(WARNING,...) with ereport(WARNING,(errcode(ERRCODE_DATA_CORRUPTED),..).

Ah, you mean the warnings currently in lazy_scan_prune(). To me this
suggestion makes sense. I see at least one other example with
ERRCODE_DATA_CORRUPTED that is an error level below ERROR.

I have attached a cleaned up and updated version of the patch set (it
doesn't yet include your suggested error message change).

What's new in this version
-----
In addition to general code, comment, and commit message improvements,
notable changes are as follows:

- I have used the GlobalVisState for determining if the whole page is
visible in a more natural way.

- I micro-benchmarked and identified some sources of regression in the
additional code SELECT queries would do to set the VM. So, there are
several new commits addressing these (for example inlining several
functions and unsetting all-visible when we see a dead tuple if we
won't attempt freezing).

- Because heap_page_prune_and_freeze() was getting long, I added some
helper functions.

Performance impact of setting the VM on-access
-------
I found that with the patch set applied, we set many pages all-visible
in the VM on access, resulting in a higher overall number of pages set
all-visible, reducing load for vacuum, and dramatically decreasing
heap fetches by index-only scans.

I devised a simple benchmark -- with 8 workers inserting 20 rows at a
time into a table with a few columns and updating a single row that
they just inserted. Another worker queries the table 1x second using
an index.

After running the benchmark for a few minutes, though the table was
autovacuumed several times in both cases, with the patchset applied,
15% more blocks were all-visible at the end of the benchmark.

And with my patch applied, index-only scans did far fewer heap
fetches. A SELECT count(*) of the table at the same point in the
benchmark did 10,000 heap fetches on master and 500 with the patch
applied (I used auto_explain to determine this).

With my patch applied, autovacuum workers write half as much WAL as on
master. Some of this is courtesy of other patches in the set which
eliminate separate WAL records for setting the page all-visible. But,
vacuum is also scanning fewer pages and dirtying fewer buffers because
they are being set all-visible on-access.

There are more details about the benchmark at the end of the email.

Setting pd_prune_xid on insert
------
The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
patch in the set. It sets pd_prune_xid on insert (so pages filled by
COPY or insert can also be set all-visible in the VM before they are
vacuumed). I gave it a .txt extension because it currently fails
035_standby_logical_decoding due to a recovery conflict. I need to
investigate more to see if this is a bug in my patch set or elsewhere
in Postgres.

Besides the failing test, I have a feeling that my current heuristic
for whether or not to set the VM on-access is not quite right for
pages that have only been inserted to -- and if we get it wrong, we've
wasted those CPU cycles because we didn't otherwise need to prune the
page.

- Melanie

Benchmark
-------
psql -c "
DROP TABLE IF EXISTS simple_table;

CREATE TABLE simple_table (
    id SERIAL PRIMARY KEY,
    group_id INT NOT NULL,
    data TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);

create index on simple_table(group_id);
"

pgbench \
  --no-vacuum \
  --random-seed=0 \
  -c 8 \
  -j 8 \
  -M prepared \
  -T 200 \
  > "pgbench_run_summary_update_${version}" \
-f- <<EOF &
\set gid random(1,1000)

INSERT INTO simple_table (group_id, data)
  SELECT :gid, 'inserted'
  RETURNING id \gset

update simple_table set data = 'updated' where id = :id;

insert into simple_table (group_id, data)
  select :gid, 'inserted'
  from generate_series(1,20);
EOF
insert_pid=$!

pgbench \
  --no-vacuum \
  --random-seed=0 \
  -c 1 \
  -j 1 \
  --rate=1 \
  -M prepared \
  -T 200 \
  > "pgbench_run_summary_select_${version}" \
-f- <<EOF
\set gid random(1, 1000)
select max(created_at) from simple_table where group_id = :gid;
select count(*) from simple_table where group_id = :gid;
EOF

wait $insert_pid

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

02 августа 2025 г., 00:36:19

On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> patch in the set. It sets pd_prune_xid on insert (so pages filled by
> COPY or insert can also be set all-visible in the VM before they are
> vacuumed). I gave it a .txt extension because it currently fails
> 035_standby_logical_decoding due to a recovery conflict. I need to
> investigate more to see if this is a bug in my patch set or elsewhere
> in Postgres.

I figured out that if we set the VM on-access, we need to enable
hot_standby_feedback in more places in 035_standby_logical_decoding.pl
to avoid recovery conflicts. I've done that in the attached updated
version 6. There are a few other issues in
035_standby_logical_decoding.pl that I reported here [1]. With these
changes, setting pd_prune_xid on insert passes tests. Whether or not
we want to do it (and what the heuristic should be for deciding when
to do it) is another question.

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

26 августа 2025 г., 12:58:28

On Sat, 2 Aug 2025 at 02:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> > patch in the set. It sets pd_prune_xid on insert (so pages filled by
> > COPY or insert can also be set all-visible in the VM before they are
> > vacuumed). I gave it a .txt extension because it currently fails
> > 035_standby_logical_decoding due to a recovery conflict. I need to
> > investigate more to see if this is a bug in my patch set or elsewhere
> > in Postgres.
>
> I figured out that if we set the VM on-access, we need to enable
> hot_standby_feedback in more places in 035_standby_logical_decoding.pl
> to avoid recovery conflicts. I've done that in the attached updated
> version 6. There are a few other issues in
> 035_standby_logical_decoding.pl that I reported here [1]. With these
> changes, setting pd_prune_xid on insert passes tests. Whether or not
> we want to do it (and what the heuristic should be for deciding when
> to do it) is another question.
>
> - Melanie
>
> [1]
https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com

Hi!

Andrey told me off-list about this thread and I decided to take a look.

I tried to play with each patch in this patchset and find a
corruption, but I was unsuccessful. I will conduct further tests
later. I am not implying that I suspect this patchset causes any
corruption; I am merely attempting to verify it.

I also have few comments and questions. Here is my (very limited)
review of 0001, because I believe that removing xl_heap_visible from
COPY FREEZE is pure win, so this patch can be very beneficial by
itself.

visibilitymap_set_vmbyte is introduced in 0001 and removed in 0012.
This is strange to me, maybe we can avoid visibilitymap_set_vmbyte in
first place?

In 0001:

1)
should we add "Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
LW_EXCLUSIVE));" in visibilitymap_set_vmbyte?

Also here  `Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer),
vmbuffer));` can be beneficial:

>/*
>+ * If we're only adding already frozen rows to a previously empty
>+ * page, mark it as all-frozen and update the visibility map. We're
>+ * already holding a pin on the vmbuffer.
>+ */
>   else if (all_frozen_set)
>+ {
>    PageSetAllVisible(page);
>+ LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
>+ visibilitymap_set_vmbyte(relation,
>+ BufferGetBlockNumber(buffer),
>+ vmbuffer,
>+ VISIBILITYMAP_ALL_VISIBLE |
>+ VISIBILITYMAP_ALL_FROZEN);
>+ }

2)
in heap_xlog_multi_insert:

+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_set_vmbyte(....)

Do we need to pin vmbuffer here? Looks like
XLogReadBufferForRedoExtended already pins vmbuffer. I verified this
with CheckBufferIsPinnedOnce(vmbuffer) just before visibilitymap_pin
and COPY ... WITH (FREEZE true) test.

3)
>+
> +#ifdef TRACE_VISIBILITYMAP
> + elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
> +#endif

I can see this merely copy-pasted from visibilitymap_set, but maybe
display "flags" also?

4) visibilitymap_set receives  XLogRecPtr recptr parameters, which is
set to WAL record lsn during recovery and to InvalidXLogRecPtr
otherwise. visibilitymap_set manages VM page LSN bases on this recptr
value (inside function logic). visibilitymap_set_vmbyte behaves
vise-versa and makes its caller responsible for page LSN management.
Maybe we should keep these two functions akin to each other?

--
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

26 августа 2025 г., 23:01:24

On Sat, 2 Aug 2025 at 02:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> > patch in the set. It sets pd_prune_xid on insert (so pages filled by
> > COPY or insert can also be set all-visible in the VM before they are
> > vacuumed). I gave it a .txt extension because it currently fails
> > 035_standby_logical_decoding due to a recovery conflict. I need to
> > investigate more to see if this is a bug in my patch set or elsewhere
> > in Postgres.
>
> I figured out that if we set the VM on-access, we need to enable
> hot_standby_feedback in more places in 035_standby_logical_decoding.pl
> to avoid recovery conflicts. I've done that in the attached updated
> version 6. There are a few other issues in
> 035_standby_logical_decoding.pl that I reported here [1]. With these
> changes, setting pd_prune_xid on insert passes tests. Whether or not
> we want to do it (and what the heuristic should be for deciding when
> to do it) is another question.
>
> - Melanie
>
> [1]
https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com


0002 No comments from me. Looks straightforward.

Few comments on 0003.

1) This patch introduces XLHP_HAS_VMFLAGS. However it lacks some
helpful comments about this new status bit.

a) In heapam_xlog.h, in xl_heap_prune struct definition:


/*
* If XLHP_HAS_CONFLICT_HORIZON is set, the conflict horizon XID follows,
* unaligned
*/
+ /* If XLHP_HAS_VMFLAGS is set, newly set visibility map bits comes,
unaligned */

b)

we can add here comment why we use  memcpy assignment, akin to /*
memcpy() because snapshot_conflict_horizon is stored unaligned */

+ /* Next are the optionally included vmflags. Copy them out for later use. */
+ if ((xlrec.flags & XLHP_HAS_VMFLAGS) != 0)
+ {
+ memcpy(&vmflags, maindataptr, sizeof(uint8));
+ maindataptr += sizeof(uint8);

2) Should we move conflict_xid = visibility_cutoff_xid; assignment
just after heap_page_is_all_visible_except_lpdead call in
lazy_vacuum_heap_page?

3) Looking at this diff, do not comprehend one bit: how are we
protected from passing an all-visible page to lazy_vacuum_heap_page. I
did not manage to reproduce such behaviour though.

+ if ((vmflags & VISIBILITYMAP_VALID_BITS) != 0)
+ {
+ Assert(!PageIsAllVisible(page));
+ set_pd_all_vis = true;
+ LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
+ PageSetAllVisible(page);
+ visibilitymap_set_vmbyte(vacrel->rel,
+ blkno,
+



--
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

27 августа 2025 г., 15:55:27

On Sat, 2 Aug 2025 at 02:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> > patch in the set. It sets pd_prune_xid on insert (so pages filled by
> > COPY or insert can also be set all-visible in the VM before they are
> > vacuumed). I gave it a .txt extension because it currently fails
> > 035_standby_logical_decoding due to a recovery conflict. I need to
> > investigate more to see if this is a bug in my patch set or elsewhere
> > in Postgres.
>
> I figured out that if we set the VM on-access, we need to enable
> hot_standby_feedback in more places in 035_standby_logical_decoding.pl
> to avoid recovery conflicts. I've done that in the attached updated
> version 6. There are a few other issues in
> 035_standby_logical_decoding.pl that I reported here [1]. With these
> changes, setting pd_prune_xid on insert passes tests. Whether or not
> we want to do it (and what the heuristic should be for deciding when
> to do it) is another question.
>
> - Melanie
>
> [1]
https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com

v6-0015:
I chose to verify whether this single modification would be beneficial
on the HEAD.

Benchmark I did:

```

\timing
CREATE TABLE zz(i int);
alter table zz set (autovacuum_enabled = false);
TRUNCATE zz;
copy zz from program 'yes 2 | head -n 180000000';
copy zz from program 'yes 2 | head -n 180000000';

delete from zz where (REPLACE(REPLACE(ctid::text, '(', '{'), ')',
'}')::int[])[2] = 7 ;

VACUUM FREEZE zz;
```

And I checked perf top footprint for last statement (vacuum).  My
detailed results are attached. It is a HEAD vs HEAD+v6-0015 benchmark.

TLDR: function inlining is indeed beneficial, TransactionIdPrecedes
function disappears from perf top footprint, though query runtime is
not changed much. So, while not resulting in query speedup, this can
save CPU.
Maybe we can derive an artificial benchmark, which will show query
speed up, but for now I dont have one.

--
Best regards,
Kirill Reshke

Вложения

attach.txt

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

27 августа 2025 г., 16:08:28

Thanks for all the reviews. I'm working on responding to your previous
mails with a new version.

On Wed, Aug 27, 2025 at 8:55 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> v6-0015:
> I chose to verify whether this single modification would be beneficial
> on the HEAD.
>
> Benchmark I did:
>
> ```
>
> \timing
> CREATE TABLE zz(i int);
> alter table zz set (autovacuum_enabled = false);
> TRUNCATE zz;
> copy zz from program 'yes 2 | head -n 180000000';
> copy zz from program 'yes 2 | head -n 180000000';
>
> delete from zz where (REPLACE(REPLACE(ctid::text, '(', '{'), ')',
> '}')::int[])[2] = 7 ;
>
> VACUUM FREEZE zz;
> ```
>
> And I checked perf top footprint for last statement (vacuum).  My
> detailed results are attached. It is a HEAD vs HEAD+v6-0015 benchmark.
>
> TLDR: function inlining is indeed beneficial, TransactionIdPrecedes
> function disappears from perf top footprint, though query runtime is
> not changed much. So, while not resulting in query speedup, this can
> save CPU.
> Maybe we can derive an artificial benchmark, which will show query
> speed up, but for now I dont have one.

I'm not surprised that vacuum freeze does not show a speed up from the
function inlining.

This patch was key for avoiding a regression in the most contrived
worst case scenario example of setting the VM on-access. That is, if
you are pruning only a single tuple on the page as part of a SELECT
query that returns no tuples (think SELECT * FROM foo OFFSET N where N
is greater than the number of rows in the table), and I add
determining if the page is all visible, then the overhead of these
extra function calls in heap_prune_record_unchanged_lp_normal() is
noticeable.

We might be able to come up with a similar example in vacuum without
freeze since it will try to determine if the page is all-visible. Your
example is still running on my machine, though, so I haven't verified
this yet :)

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

27 августа 2025 г., 22:02:01

Thanks for the review! Updates are in attached v7.

One note on 0022 in the set, which sets pd_prune_xid on insert: the
recently added index-killtuples isolation test was failing with this
patch applied. With the patch, the "access" step reports more heap
page hits than before. After some analysis, it seems one of the heap
pages in kill_prior_tuples table is now being pruned in an earlier
step. Somehow this leads to 4 hits counted instead of 3 (even though
there are only 4 blocks in the relation). I recall Bertrand mentioning
something in some other thread about hits being double counted with
AIO reads, so I'm going to try and go dig that up. But, for now, I've
modified the test -- I believe the patch is only revealing an issue
with instrumentation, not causing a bug.

On Tue, Aug 26, 2025 at 5:58 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> visibilitymap_set_vmbyte is introduced in 0001 and removed in 0012.
> This is strange to me, maybe we can avoid visibilitymap_set_vmbyte in
> first place?

The reason for this is that in the earlier patch I introduce
visibilitymap_set_vmbyte() for one user while other users still use
visibilitymap_set(). But, by the end of the set, all users use
visibillitymap_set_vmbyte(). So I think it makes most sense for it to
be named visibilitymap_set() at that point. Until all users are
committed, the two functions both have to exist and need different
names.

> In 0001:
> should we add "Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
> LW_EXCLUSIVE));" in visibilitymap_set_vmbyte?

I don't want any operations on the heap buffer (including asserts) in
visibilitymap_set_vmbyte(). The heap block is only provided to look up
the VM bits.

I think your idea is a good one for the existing visibilitymap_set(),
though, so I've added a new patch to the set (0002) which does this. I
also added a similar assertion for the vmbuffer to
visibilitymap_set_vmbyte().

> Also here  `Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer),
> vmbuffer));` can be beneficial:

I had omitted this because the same logic is checked inside of
visiblitymap_set_vmbyte() with an error occurring if conditions are
not met. However, since the same is true in visibilitymap_set() and
heap_multi_insert() still asserted visiblitymap_pin_ok(), I've added
it back to my patch set.

> in heap_xlog_multi_insert:
> +
> + visibilitymap_pin(reln, blkno, &vmbuffer);
> + visibilitymap_set_vmbyte(....)
>
> Do we need to pin vmbuffer here? Looks like
> XLogReadBufferForRedoExtended already pins vmbuffer. I verified this
> with CheckBufferIsPinnedOnce(vmbuffer) just before visibilitymap_pin
> and COPY ... WITH (FREEZE true) test.

I thought the reason visibilitymap_set() did it was that it was
possible for the block of the VM corresponding to the block of the
heap to be different during recovery than it was when emitting the
record, and thus we needed the part of visiblitymap_pin() that
released the old vmbuffer and got the new one corresponding to the
heap block.

I can't quite think of how this could happen though.

Assuming it can't happen, then we can get rid of visiblitymap_pin()
(and add visibilitymap_pin_ok()) in both visiblitymap_set_vmbyte() and
visibilitymap_set(). I've done this to visibilitymap_set() in a
separate patch 0001. I would like other opinions/confirmation that the
block of the VM corresponding to the heap block cannot differ during
recovery from that what it was when the record was emitted during
normal operation, though.

> > +#ifdef TRACE_VISIBILITYMAP
> > + elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
> > +#endif
>
> I can see this merely copy-pasted from visibilitymap_set, but maybe
> display "flags" also?

Done in attached.

> 4) visibilitymap_set receives  XLogRecPtr recptr parameters, which is
> set to WAL record lsn during recovery and to InvalidXLogRecPtr
> otherwise. visibilitymap_set manages VM page LSN bases on this recptr
> value (inside function logic). visibilitymap_set_vmbyte behaves
> vise-versa and makes its caller responsible for page LSN management.
> Maybe we should keep these two functions akin to each other?

So, with visibilitymap_set_vmbyte(), the whole idea is to just update
the VM and then leave the WAL logging and other changes to the caller
(like marking the buffer dirty, setting the page LSN, etc). The series
of operations needed to make a persistent change are up to the caller.
visibilitymap_set() is meant to just make sure that the correct bits
in the VM are set for the given heap block.

I looked at ways of making the current visibilitymap_set() API cleaner
-- like setting the heap page LSN with the VM recptr in the caller of
visibilitymap_set() instead. There wasn't a way of doing it that
seemed like enough of an improvement to merit the change.

Not to mention, the goal of the patchset is to remove the current
visibilitymap_set(), so I'm not too worried about parity between the
two functions. They may coexist for awhile, but hopefully today's
visibilitymap_set() will eventually be removed.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

27 августа 2025 г., 22:08:41

On Tue, Aug 26, 2025 at 4:01 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> Few comments on 0003.
>
> 1) This patch introduces XLHP_HAS_VMFLAGS. However it lacks some
> helpful comments about this new status bit.

I added the ones you suggested in my v7 posted here [1].

> 2) Should we move conflict_xid = visibility_cutoff_xid; assignment
> just after heap_page_is_all_visible_except_lpdead call in
> lazy_vacuum_heap_page?

Why would we want to do that? We only want to set it if the page is
all visible, so we would have to guard it similarly.

> 3) Looking at this diff, do not comprehend one bit: how are we
> protected from passing an all-visible page to lazy_vacuum_heap_page. I
> did not manage to reproduce such behaviour though.
>
> + if ((vmflags & VISIBILITYMAP_VALID_BITS) != 0)
> + {
> + Assert(!PageIsAllVisible(page));
> + set_pd_all_vis = true;
> + LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> + PageSetAllVisible(page);
> + visibilitymap_set_vmbyte(vacrel->rel,
> + blkno,

So, for one, there is an assert just above this code in
lazy_vacuum_heap_page() that nunused > 0 -- so we know that the page
couldn't have been all-visible already because it had unused line
pointers.

Otherwise, if it was possible for an already all-visible page to get
here, the same thing would happen that happens on master --
heap_page_is_all_visible[_except_lpdead()] would return true and we
would try to set the VM which would end up being a no-op.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_YD0ecXeAh%2BDmJpzQOJwcRzmMyGdcc5W_0pEF78rYSJkQ%40mail.gmail.com

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

28 августа 2025 г., 12:11:48

On Thu, 28 Aug 2025 at 00:02, Melanie Plageman
<melanieplageman@gmail.com> wrote:

> > Do we need to pin vmbuffer here? Looks like
> > XLogReadBufferForRedoExtended already pins vmbuffer. I verified this
> > with CheckBufferIsPinnedOnce(vmbuffer) just before visibilitymap_pin
> > and COPY ... WITH (FREEZE true) test.
>
> I thought the reason visibilitymap_set() did it was that it was
> possible for the block of the VM corresponding to the block of the
> heap to be different during recovery than it was when emitting the
> record, and thus we needed the part of visiblitymap_pin() that
> released the old vmbuffer and got the new one corresponding to the
> heap block.
>
> I can't quite think of how this could happen though.
>
> Assuming it can't happen, then we can get rid of visiblitymap_pin()
> (and add visibilitymap_pin_ok()) in both visiblitymap_set_vmbyte() and
> visibilitymap_set(). I've done this to visibilitymap_set() in a
> separate patch 0001. I would like other opinions/confirmation that the
> block of the VM corresponding to the heap block cannot differ during
> recovery from that what it was when the record was emitted during
> normal operation, though.

I did micro git-blame research here. I spotted only one related change
[0]. Looks like before this change pin was indeed needed.
But not after this change, so this visibilitymap_pin is just an oversight?
Related thread is [1]. I quickly checked the discussion in this
thread, and it looks like no one was bothered about these lines or VM
logging changes (in this exact pin buffer aspect). The discussion was
of other aspects of this commit.

[0] https://github.com/postgres/postgres/commit/2c03216d8311
[1] https://www.postgresql.org/message-id/533D6CBF.6080203%40vmware.com


-- 
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

03 сентября 2025 г., 00:52:37

On Thu, Aug 28, 2025 at 5:12 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> I did micro git-blame research here. I spotted only one related change
> [0]. Looks like before this change pin was indeed needed.
> But not after this change, so this visibilitymap_pin is just an oversight?
> Related thread is [1]. I quickly checked the discussion in this
> thread, and it looks like no one was bothered about these lines or VM
> logging changes (in this exact pin buffer aspect). The discussion was
> of other aspects of this commit.

Wow, thanks so much for doing that research. Looking at it myself, it
does indeed seem like just an oversight. It isn't harmful since it
won't take another pin, but it is confusing, so I think we should at
least remove it in master. I'm not as sure about back branches.

I would like someone to confirm that there is no way we could end up
with a different block of the VM containing the vm bits for a heap
block during recovery than during normal operation.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

03 сентября 2025 г., 02:11:01

On Tue, Sep 2, 2025 at 5:52 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Aug 28, 2025 at 5:12 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > I did micro git-blame research here. I spotted only one related change
> > [0]. Looks like before this change pin was indeed needed.
> > But not after this change, so this visibilitymap_pin is just an oversight?
> > Related thread is [1]. I quickly checked the discussion in this
> > thread, and it looks like no one was bothered about these lines or VM
> > logging changes (in this exact pin buffer aspect). The discussion was
> > of other aspects of this commit.
>
> Wow, thanks so much for doing that research. Looking at it myself, it
> does indeed seem like just an oversight. It isn't harmful since it
> won't take another pin, but it is confusing, so I think we should at
> least remove it in master. I'm not as sure about back branches.

I've updated the commit message in the patch set to reflect the
research you did in attached v8.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Andres Freund

Дата:

03 сентября 2025 г., 02:54:07

Hi,

On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 27 Aug 2025 08:50:15 -0400
> Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay

LGTM.


> From 7c5cb3edf89735eaa8bee9ca46111bd6c554720b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 27 Aug 2025 10:07:29 -0400
> Subject: [PATCH v8 02/22] Add assert and log message to visibilitymap_set

LGTM.


> From 07f31099754636ec9dabf6cca06c33c4b19c230c Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 17 Jun 2025 17:22:10 -0400
> Subject: [PATCH v8 03/22] Eliminate xl_heap_visible in COPY FREEZE
>
> Instead of emitting a separate WAL record for setting the VM bits in
> xl_heap_visible, specify the changes to make to the VM block in the
> xl_heap_multi_insert record instead.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com


> +        /*
> +         * If we're only adding already frozen rows to a previously empty
> +         * page, mark it as all-frozen and update the visibility map. We're
> +         * already holding a pin on the vmbuffer.
> +         */
>          else if (all_frozen_set)
> +        {
> +            Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer), vmbuffer));
>              PageSetAllVisible(page);
> +            LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> +            visibilitymap_set_vmbyte(relation,
> +                                     BufferGetBlockNumber(buffer),
> +                                     vmbuffer,
> +                                     VISIBILITYMAP_ALL_VISIBLE |
> +                                     VISIBILITYMAP_ALL_FROZEN);
> +        }

From an abstraction POV I don't love that heapam now is responsible for
acquiring and releasing the lock. But that ship already kind of has sailed, as
heapam.c is already responsible for releasing the vm buffer etc...

I've wondered about splitting the responsibilities up into multiple
visibilitymap_set_* functions, so that heapam.c wouldn't need to acquire the
lock and set the LSN. But it's probably not worth it.


> +    /*
> +     * Now read and update the VM block. Even if we skipped updating the heap
> +     * page due to the file being dropped or truncated later in recovery, it's
> +     * still safe to update the visibility map.  Any WAL record that clears
> +     * the visibility map bit does so before checking the page LSN, so any
> +     * bits that need to be cleared will still be cleared.
> +     *
> +     * It is only okay to set the VM bits without holding the heap page lock
> +     * because we can expect no other writers of this page.
> +     */
> +    if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> +        XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> +                                      &vmbuffer) == BLK_NEEDS_REDO)
> +    {
> +        Relation    reln = CreateFakeRelcacheEntry(rlocator);
> +
> +        Assert(visibilitymap_pin_ok(blkno, vmbuffer));
> +        visibilitymap_set_vmbyte(reln, blkno,
> +                                 vmbuffer,
> +                                 VISIBILITYMAP_ALL_VISIBLE |
> +                                 VISIBILITYMAP_ALL_FROZEN);
> +
> +        /*
> +         * It is not possible that the VM was already set for this heap page,
> +         * so the vmbuffer must have been modified and marked dirty.
> +         */
> +        Assert(BufferIsDirty(vmbuffer));

How about making visibilitymap_set_vmbyte() return whether it needed to do
something? This seems somewhat indirect...

I think it might be good to encapsulate this code into a helper in
visibilitymap.c, there will be more callers in the subsequent patches.


> +/*
> + * Set flags in the VM block contained in the passed in vmBuf.
> + *
> + * This function is for callers which include the VM changes in the same WAL
> + * record as the modifications of the heap page which rendered it all-visible.
> + * Callers separately logging the VM changes should invoke visibilitymap_set()
> + * instead.
> + *
> + * Caller must have pinned and exclusive locked the correct block of the VM in
> + * vmBuf. This block should contain the VM bits for the given heapBlk.
> + *
> + * During normal operation (i.e. not recovery), this should be called in a
> + * critical section which also makes any necessary changes to the heap page
> + * and, if relevant, emits WAL.
> + *
> + * Caller is responsible for WAL logging the changes to the VM buffer and for
> + * making any changes needed to the associated heap page. This includes
> + * maintaining any invariants such as ensuring the buffer containing heapBlk
> + * is pinned and exclusive locked.
> + */
> +uint8
> +visibilitymap_set_vmbyte(Relation rel, BlockNumber heapBlk,
> +                         Buffer vmBuf, uint8 flags)

Why is it named vmbyte? This actually just sets the two bits corresponding to
the buffer, not the entire byte. So it seems somewhat misleading to reference
byte.




> From dc318358572f61efbd0e05aae2b9a077b422bcf5 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 18 Jun 2025 12:42:13 -0400
> Subject: [PATCH v8 05/22] Eliminate xl_heap_visible from vacuum phase III
>
> Instead of emitting a separate xl_heap_visible record for each page that
> is rendered all-visible by vacuum's third phase, include the updates to
> the VM in the already emitted xl_heap_prune record.

Reading through the change I didn't particularly like that there's another
optional field in xl_heap_prune, as it seemed liked something that should be
encoded in flags.  Of course there aren't enough flag bits available.  But
that made me look at the rest of the record: Uh, what do we use the reason
field for?  As far as I can tell f83d709760d8 added it without introducing any
users? It doesn't even seem to be set.


> @@ -51,10 +52,15 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>             (xlrec.flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
>
>      /*
> -     * We are about to remove and/or freeze tuples.  In Hot Standby mode,
> -     * ensure that there are no queries running for which the removed tuples
> -     * are still visible or which still consider the frozen xids as running.
> -     * The conflict horizon XID comes after xl_heap_prune.
> +     * After xl_heap_prune is the optional snapshot conflict horizon.
> +     *
> +     * In Hot Standby mode, we must ensure that there are no running queries
> +     * which would conflict with the changes in this record. If pruning, that
> +     * means we cannot remove tuples still visible to transactions on the
> +     * standby. If freezing, that means we cannot freeze tuples with xids that
> +     * are still considered running on the standby. And for setting the VM, we
> +     * cannot do so if the page isn't all-visible to all transactions on the
> +     * standby.
>       */

I'm a bit confused by this new comment - it sounds like we're deciding whether
to remove tuple versions, but that decision has long been made, no?



> @@ -2846,8 +2848,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>      OffsetNumber unused[MaxHeapTuplesPerPage];
>      int            nunused = 0;
>      TransactionId visibility_cutoff_xid;
> +    TransactionId conflict_xid = InvalidTransactionId;
>      bool        all_frozen;
>      LVSavedErrInfo saved_err_info;
> +    uint8        vmflags = 0;
> +    bool        set_pd_all_vis = false;
>
>      Assert(vacrel->do_index_vacuuming);
>
> @@ -2858,6 +2863,20 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>                               VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
>                               InvalidOffsetNumber);
>
> +    if (heap_page_is_all_visible_except_lpdead(vacrel->rel, buffer,
> +                                               vacrel->cutoffs.OldestXmin,
> +                                               deadoffsets, num_offsets,
> +                                               &all_frozen, &visibility_cutoff_xid,
> +                                               &vacrel->offnum))
> +    {
> +        vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> +        if (all_frozen)
> +        {
> +            vmflags |= VISIBILITYMAP_ALL_FROZEN;
> +            Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> +        }
> +    }
> +
>      START_CRIT_SECTION();

I am rather confused - we never can set all-visible if there are any LP_DEAD
items left. If the idea is that we are removing the LP_DEAD items in
lazy_vacuum_heap_page() - what guarantees that all LP_DEAD items are being
removed? Couldn't some tuples get marked LP_DEAD by on-access pruning, after
vacuum visited the page and collected dead items?

Ugh, I see - it works because we pass in the set of dead items.  I think that
makes the name *really* misleading, it's not except LP_DEAD, it's except the
offsets passed in, no?

But then you actually check that the set of dead items didn't change - what
guarantees that?


I didn't look at the later patches, except that I did notice this:

> @@ -268,7 +264,7 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          Relation    reln = CreateFakeRelcacheEntry(rlocator);
>
>          visibilitymap_pin(reln, blkno, &vmbuffer);
> -        old_vmbits = visibilitymap_set_vmbyte(reln, blkno, vmbuffer, vmflags);
> +        old_vmbits = visibilitymap_set(reln, blkno, vmbuffer, vmflags);
>          /* Only set VM page LSN if we modified the page */
>          if (old_vmbits != vmflags)
>              PageSetLSN(BufferGetPage(vmbuffer), lsn);
> @@ -279,143 +275,6 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          UnlockReleaseBuffer(vmbuffer);
>  }

Why are we manually pinning the vm buffer here? Shouldn't the xlog machinery
have done so, as you noticed in one of the early on patches?

Greetings,

Andres Freund

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

03 сентября 2025 г., 12:06:40

On Wed, 3 Sept 2025 at 04:11, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Sep 2, 2025 at 5:52 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Thu, Aug 28, 2025 at 5:12 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > >
> > > I did micro git-blame research here. I spotted only one related change
> > > [0]. Looks like before this change pin was indeed needed.
> > > But not after this change, so this visibilitymap_pin is just an oversight?
> > > Related thread is [1]. I quickly checked the discussion in this
> > > thread, and it looks like no one was bothered about these lines or VM
> > > logging changes (in this exact pin buffer aspect). The discussion was
> > > of other aspects of this commit.
> >
> > Wow, thanks so much for doing that research. Looking at it myself, it
> > does indeed seem like just an oversight. It isn't harmful since it
> > won't take another pin, but it is confusing, so I think we should at
> > least remove it in master. I'm not as sure about back branches.
>
> I've updated the commit message in the patch set to reflect the
> research you did in attached v8.
>
> - Melanie



Hi!

small comments regarding new series

0001, 0002, 0017 LGTM


In 0015:

```
reshke@yezzey-cbdb-bench:~/postgres$ git diff
src/backend/access/heap/pruneheap.c
diff --git a/src/backend/access/heap/pruneheap.c
b/src/backend/access/heap/pruneheap.c
index 05b51bd8d25..0794af9ae89 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -1398,7 +1398,7 @@ heap_prune_record_unchanged_lp_normal(Page page,
PruneState *prstate, OffsetNumb
                                /*
                                 * For now always use prstate->cutoffs
for this test, because
                                 * we only update 'all_visible' when
freezing is requested. We
-                                * could use
GlobalVisTestIsRemovableXid instead, if a
+                                * could use GlobalVisXidVisibleToAll
instead, if a
                                 * non-freezing caller wanted to set the VM bit.
                                 */
                                Assert(prstate->cutoffs);
```

Also, maybe GlobalVisXidTestAllVisible is a slightly better name? (The
term 'all-visible' is one that we occasionally utilize)


--
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

06 сентября 2025 г., 01:20:21

Thanks for the review!

On Tue, Sep 2, 2025 at 7:54 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> > From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 27 Aug 2025 08:50:15 -0400
> > Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay

I didn't push it yet because I did a new version that actually
eliminates the asserts in heap_multi_insert() before calling
visibilitymap_set() -- since they are redundant with checks inside
visibilitymap_set(). 0001 of attached v9 is what I plan to push,
barring any objections.

> > From 7c5cb3edf89735eaa8bee9ca46111bd6c554720b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 27 Aug 2025 10:07:29 -0400
> > Subject: [PATCH v8 02/22] Add assert and log message to visibilitymap_set

I pushed this.

> From an abstraction POV I don't love that heapam now is responsible for
> acquiring and releasing the lock. But that ship already kind of has sailed, as
> heapam.c is already responsible for releasing the vm buffer etc...
>
> I've wondered about splitting the responsibilities up into multiple
> visibilitymap_set_* functions, so that heapam.c wouldn't need to acquire the
> lock and set the LSN. But it's probably not worth it.

Yea, I explored heap wrappers coupling heap operations related to
setting the VM along with the VM updates [1], but the results weren't
appealing. Setting the heap LSN and marking the heap buffer dirty and
such happens in a different place in different callers because it is
happening as part of the operations that actually end up rendering the
page all-visible.

And a VM-only helper would literally just acquire and release the lock
and set the LSN on the vm page -- which I don't think is worth it.

> > +     /*
> > +      * Now read and update the VM block. Even if we skipped updating the heap
> > +      * page due to the file being dropped or truncated later in recovery, it's
> > +      * still safe to update the visibility map.  Any WAL record that clears
> > +      * the visibility map bit does so before checking the page LSN, so any
> > +      * bits that need to be cleared will still be cleared.
> > +      *
> > +      * It is only okay to set the VM bits without holding the heap page lock
> > +      * because we can expect no other writers of this page.
> > +      */
> > +     if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> > +             XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> > +                                                                       &vmbuffer) == BLK_NEEDS_REDO)
> > +     {
> > +             Relation        reln = CreateFakeRelcacheEntry(rlocator);
> > +
> > +             Assert(visibilitymap_pin_ok(blkno, vmbuffer));
> > +             visibilitymap_set_vmbyte(reln, blkno,
> > +                                                              vmbuffer,
> > +                                                              VISIBILITYMAP_ALL_VISIBLE |
> > +                                                              VISIBILITYMAP_ALL_FROZEN);
> > +
> > +             /*
> > +              * It is not possible that the VM was already set for this heap page,
> > +              * so the vmbuffer must have been modified and marked dirty.
> > +              */
> > +             Assert(BufferIsDirty(vmbuffer));
>
> How about making visibilitymap_set_vmbyte() return whether it needed to do
> something? This seems somewhat indirect...

It does return the state of the previous bits. But, I am specifically
asserting that the buffer is dirty because I am about to set the page
LSN. So I don't just care that changes were made, I care that we
remembered to mark the buffer dirty.

> I think it might be good to encapsulate this code into a helper in
> visibilitymap.c, there will be more callers in the subsequent patches.

By the end of the set, the different callers have different
expectations (some don't expect the buffer to have been dirtied
necessarily) and where they do the various related operations is
spread out depending on the caller. I just couldn't come up with a
helper solution I liked.

That being said, I definitely don't think it's needed for this patch
(logging setting the VM in xl_heap_multi_insert()).

> > +uint8
> > +visibilitymap_set_vmbyte(Relation rel, BlockNumber heapBlk,
> > +                                              Buffer vmBuf, uint8 flags)
>
> Why is it named vmbyte? This actually just sets the two bits corresponding to
> the buffer, not the entire byte. So it seems somewhat misleading to reference
> byte.

Renamed it to visibilitymap_set_vmbits.

> > Instead of emitting a separate xl_heap_visible record for each page that
> > is rendered all-visible by vacuum's third phase, include the updates to
> > the VM in the already emitted xl_heap_prune record.
>
> Reading through the change I didn't particularly like that there's another
> optional field in xl_heap_prune, as it seemed liked something that should be
> encoded in flags.  Of course there aren't enough flag bits available.  But
> that made me look at the rest of the record: Uh, what do we use the reason
> field for?  As far as I can tell f83d709760d8 added it without introducing any
> users? It doesn't even seem to be set.

yikes, you are right about the "reason" member. Attached 0002 removes
it, and I'll go ahead and fix it in the back branches too. I can't
fathom how that slipped through the cracks. We do pass the PruneReason
for setting the rmgr info about what type of record it is (i.e. if it
is one emitted by vacuum phase I, phase III, or on-access pruning).
But we don't need or use a separate member.. I went back and tried to
figure out what the rationale was, but I couldn't find anything.

As for the VM flags being an optional unaligned member -- in v9, I've
expanded the flags member to a uint16 to make room for the extra
flags. Seems we've been surviving with using up 2 bytes this long.

> > @@ -51,10 +52,15 @@ heap_xlog_prune_freeze(XLogReaderState *record)
> >                  (xlrec.flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
> >
> >       /*
> > -      * We are about to remove and/or freeze tuples.  In Hot Standby mode,
> > -      * ensure that there are no queries running for which the removed tuples
> > -      * are still visible or which still consider the frozen xids as running.
> > -      * The conflict horizon XID comes after xl_heap_prune.
> > +      * After xl_heap_prune is the optional snapshot conflict horizon.
> > +      *
> > +      * In Hot Standby mode, we must ensure that there are no running queries
> > +      * which would conflict with the changes in this record. If pruning, that
> > +      * means we cannot remove tuples still visible to transactions on the
> > +      * standby. If freezing, that means we cannot freeze tuples with xids that
> > +      * are still considered running on the standby. And for setting the VM, we
> > +      * cannot do so if the page isn't all-visible to all transactions on the
> > +      * standby.
> >        */
>
> I'm a bit confused by this new comment - it sounds like we're deciding whether
> to remove tuple versions, but that decision has long been made, no?

Well, the comment is a revision of a comment that was already there on
essentially why replaying this record could cause recovery conflicts.
It mentioned pruning and freezing, so I expanded it to mention setting
the VM. Taking into account your confusion, I tried rewording it in
attached v9.

> > +     if (heap_page_is_all_visible_except_lpdead(vacrel->rel, buffer,
> > +
vacrel->cutoffs.OldestXmin,
> > +                                                                                        deadoffsets, num_offsets,
> > +                                                                                        &all_frozen,
&visibility_cutoff_xid,
> > +                                                                                        &vacrel->offnum))
>
> I am rather confused - we never can set all-visible if there are any LP_DEAD
> items left. If the idea is that we are removing the LP_DEAD items in
> lazy_vacuum_heap_page() - what guarantees that all LP_DEAD items are being
> removed? Couldn't some tuples get marked LP_DEAD by on-access pruning, after
> vacuum visited the page and collected dead items?
>
> Ugh, I see - it works because we pass in the set of dead items.  I think that
> makes the name *really* misleading, it's not except LP_DEAD, it's except the
> offsets passed in, no?
>
> But then you actually check that the set of dead items didn't change - what
> guarantees that?

So, I pass in the deadoffsets we got from the TIDStore. If the only
dead items on the page are exactly those dead items, then the page
will be all-visible as soon as we set those LP_UNUSED -- which we do
unconditionally. And we have the lock on the page, so no one can
on-access prune and make new dead items while we are in
lazy_vacuum_heap_page().

Given your confusion, I've refactored this and used a different
approach -- I explicitly check the passed-in deadoffsets array when I
encounter a dead item and see if it is there. That should hopefully
make it more clear.

> I didn't look at the later patches, except that I did notice this:
<--snip-->
> Why are we manually pinning the vm buffer here? Shouldn't the xlog machinery
> have done so, as you noticed in one of the early on patches?

Fixed. Thanks!

- Melanie

[1] [1]
https://www.postgresql.org/message-id/flat/CAAKRu_Yj%3DyrL%2BgGGsqfYVQcYn7rDp6hDeoF1vN453JDp8dEY%2Bw%40mail.gmail.com#94602c599abdc8dfc5e438bd24bd8d50

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

06 сентября 2025 г., 01:27:05

On Wed, Sep 3, 2025 at 5:06 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> small comments regarding new series
>
> 0001, 0002, 0017 LGTM

Thanks for continuing to review!

> In 0015:
>
> Also, maybe GlobalVisXidTestAllVisible is a slightly better name? (The
> term 'all-visible' is one that we occasionally utilize)

Actually, I was trying to distinguish it from all-visible because I
interpret that to mean every thing is visible -- as in, every tuple on
a page is visible to everyone. And here we are referring to one xid
and want to know if it is visible to everyone as no longer running. I
don't think my name  ("visible-to-all") is good, but I'm hesitant to
co-opt "all-visible" here.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

08 сентября 2025 г., 18:44:24

On Fri, Sep 5, 2025 at 6:20 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> > On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> > > From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Wed, 27 Aug 2025 08:50:15 -0400
> > > Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay
>
> I didn't push it yet because I did a new version that actually
> eliminates the asserts in heap_multi_insert() before calling
> visibilitymap_set() -- since they are redundant with checks inside
> visibilitymap_set(). 0001 of attached v9 is what I plan to push,
> barring any objections.

I pushed this, so rebased v10 is  attached. I've added one new patch:
0002 adds ERRCODE_DATA_CORRUPTED to the existing log messages about
VM/data corruption in vacuum. Andrey Borodin earlier suggested this,
and I had neglected to include it.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Robert Haas

Дата:

08 сентября 2025 г., 19:41:00

On Fri, Sep 5, 2025 at 6:20 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> yikes, you are right about the "reason" member. Attached 0002 removes
> it, and I'll go ahead and fix it in the back branches too.

I think changing this in the back-branches is a super-bad idea. If you
want, you can add a comment in the back-branches saying "oops, we
shipped a field that isn't used for anything", but changing the struct
definition is very likely to make 0 people happy and >0 people
unhappy. On the other hand, changing this in master is a good idea and
you should go ahead and do that before this creates any more
confusion.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

08 сентября 2025 г., 21:32:29

On Mon, Sep 8, 2025 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Sep 5, 2025 at 6:20 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > yikes, you are right about the "reason" member. Attached 0002 removes
> > it, and I'll go ahead and fix it in the back branches too.
>
> I think changing this in the back-branches is a super-bad idea. If you
> want, you can add a comment in the back-branches saying "oops, we
> shipped a field that isn't used for anything", but changing the struct
> definition is very likely to make 0 people happy and >0 people
> unhappy. On the other hand, changing this in master is a good idea and
> you should go ahead and do that before this creates any more
> confusion.

Yes, that makes 100% sense. It should have occurred to me. I've pushed
the commit to master. I didn't put an updated set of patches here in
case someone was already reviewing them, as nothing else has changed.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Robert Haas

Дата:

08 сентября 2025 г., 21:54:34

On Mon, Sep 8, 2025 at 11:44 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I pushed this, so rebased v10 is  attached. I've added one new patch:
> 0002 adds ERRCODE_DATA_CORRUPTED to the existing log messages about
> VM/data corruption in vacuum. Andrey Borodin earlier suggested this,
> and I had neglected to include it.

Writing "ereport(WARNING, (errcode(ERRCODE_DATA_CORRUPTED)" is very
much a minority position. Generally the call to errcode() is on the
following line. I think the commit message could use a bit of work,
too. The first sentence heavily duplicates the second and the fourth,
and the third sentence isn't sufficiently well-connected to the rest
to make it clear why you're restating this general principle in this
commit message.

Perhaps something like:

Add error codes when VACUUM discovers VM corruption

Commit fd6ec93bf890314ac694dc8a7f3c45702ecc1bbd and other previous
work has established the principle that when an error is potentially
reachable in case of on-disk corruption, but is not expected to be
reached otherwise, ERRCODE_DATA_CORRUPTED should be used. This allows
log monitoring software to search for evidence of corruption by
filtering on the error code.

That kibitzing aside, I think this is pretty clearly the right thing to do.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

08 сентября 2025 г., 22:14:01

On Mon, Sep 8, 2025 at 2:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Commit fd6ec93bf890314ac694dc8a7f3c45702ecc1bbd and other previous
> work has established the principle that when an error is potentially
> reachable in case of on-disk corruption, but is not expected to be
> reached otherwise, ERRCODE_DATA_CORRUPTED should be used. This allows
> log monitoring software to search for evidence of corruption by
> filtering on the error code.
>
> That kibitzing aside, I think this is pretty clearly the right thing to do.

Thanks for the suggested wording and the pointer to that thread.

I noticed that in that thread they decided to use errmsg_internal()
instead of errmsg() for a few different reasons -- one of which was
that the situation is not supposed to happen/cannot happen -- which I
don't really understand. It is a reachable code path. Another is that
it is extra work for translators, which I'm also not sure how to apply
to my situation. Are the VM corruption cases worth extra work to the
translators?

I think the most compelling reason is that people will want to search
for the error message in English online. So, for that reason, my
instinct is to use errmsg_internal() in my case as well.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Robert Haas

Дата:

08 сентября 2025 г., 22:53:35

On Mon, Sep 8, 2025 at 3:14 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I noticed that in that thread they decided to use errmsg_internal()
> instead of errmsg() for a few different reasons -- one of which was
> that the situation is not supposed to happen/cannot happen -- which I
> don't really understand. It is a reachable code path. Another is that
> it is extra work for translators, which I'm also not sure how to apply
> to my situation. Are the VM corruption cases worth extra work to the
> translators?
>
> I think the most compelling reason is that people will want to search
> for the error message in English online. So, for that reason, my
> instinct is to use errmsg_internal() in my case as well.

I don't find that reason particularly compelling -- people could want
to search for any error message, or they could equally want to be able
to read it without Google translate. Guessing which messages are
obscure enough that we need not translate them exceeds my powers. If I
were doing it, I'd make it errmsg() rather than errmsg_internal() and
let the translations team change it if they don't think it's worth
bothering with, because if you make it errmsg_internal() then they
won't see it until somebody complains about it not getting translated.
However, I suspect different committers would pursue different
strategies here.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Robert Haas

Дата:

08 сентября 2025 г., 23:14:47

Reviewing 0003:

+               /*
+                * If we're only adding already frozen rows to a
previously empty
+                * page, mark it as all-frozen and update the
visibility map. We're
+                * already holding a pin on the vmbuffer.
+                */
                else if (all_frozen_set)
+               {
                        PageSetAllVisible(page);
+                       LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
+                       visibilitymap_set_vmbits(relation,
+
  BufferGetBlockNumber(buffer),
+
  vmbuffer,
+
  VISIBILITYMAP_ALL_VISIBLE |
+
  VISIBILITYMAP_ALL_FROZEN);

Locking a buffer in a critical section violates the order of
operations proposed in the 'Write-Ahead Log Coding' section of
src/backend/access/transam/README.

+        * Now read and update the VM block. Even if we skipped
updating the heap
+        * page due to the file being dropped or truncated later in
recovery, it's
+        * still safe to update the visibility map.  Any WAL record that clears
+        * the visibility map bit does so before checking the page LSN, so any
+        * bits that need to be cleared will still be cleared.
+        *
+        * It is only okay to set the VM bits without holding the heap page lock
+        * because we can expect no other writers of this page.

The first paragraph of this paraphrases a similar content in
xlog_heap_visible(), but I don't see the variation in phrasing as an
improvement.

The second paragraph does not convince me at all. I see no reason to
believe that this is safe, or that it is a good idea. The code in
xlog_heap_visible() thinks its OK to unlock and relock the page to
make visibilitymap_set() happy, which is cringy but probably safe for
lack of concurrent writers, but skipping locking altogether seems
deeply unwise.

- *             visibilitymap_set        - set a bit in a previously pinned page
+ *             visibilitymap_set        - set bit(s) in a previously
pinned page and log
+ *      visibilitymap_set_vmbits - set bit(s) in a pinned page

I suspect the indentation was done with a different mix of spaces and
tabs here, because this doesn't align for me.

In general, this idea makes some sense to me -- there doesn't seem to
be any particularly good reason why the visibility-map update should
be handled by a different WAL record than the all-visible flag on the
page itself. It's a little hard for me to make that statement too
conclusively without studying more of the patches than I've had time
to do today, but off the top of my head it seems to make sense.
However, I'm not sure you've taken enough care with the details here.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

09 сентября 2025 г., 01:28:46

On Mon, Sep 8, 2025 at 4:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Reviewing 0003:
>
> Locking a buffer in a critical section violates the order of
> operations proposed in the 'Write-Ahead Log Coding' section of
> src/backend/access/transam/README.

Right, I noticed some other callers of visibiltymap_set() (like
lazy_scan_new_or_empty()) did call it in a critical section (and it
exclusive locks the VM page), so I thought perhaps it was better to
keep this operation as close as possible to where we update the VM
(similar to how it is in master in visibilitymap_set()).

But, I think you're right that maintaining the order of operations
proposed in transam/README is more important. As such, in attached
v11, I've modified this patch and the other patches where I replace
visibilitymap_set() with visibilitymap_set_vmbits() to exclusively
lock the vmbuffer before the critical section.
visibilitymap_set_vmbits() asserts that we have the vmbuffer
exclusively locked, so we should be good.

> +        * Now read and update the VM block. Even if we skipped
> updating the heap
> +        * page due to the file being dropped or truncated later in
> recovery, it's
> +        * still safe to update the visibility map.  Any WAL record that clears
> +        * the visibility map bit does so before checking the page LSN, so any
> +        * bits that need to be cleared will still be cleared.
> +        *
> +        * It is only okay to set the VM bits without holding the heap page lock
> +        * because we can expect no other writers of this page.
>
> The first paragraph of this paraphrases a similar content in
> xlog_heap_visible(), but I don't see the variation in phrasing as an
> improvement.

The only difference is I replaced the phrase "LSN interlock" with
"being dropped or truncated later in recovery" -- which is more
specific and, I thought, more clear. Without this comment, it took me
some time to understand the scenarios that might lead us to skip
updating the heap block. heap_xlog_visible() has cause to describe
this situation in an earlier comment -- which is why I think the LSN
interlock comment is less confusing there.

Anyway, I'm open to changing the comment. I could:
1) copy-paste the same comment as heap_xlog_visible()
2) refer to the comment in heap_xlog_visible() (comment seemed a bit
short for that)
3) diverge the comments further by improving the new comment in
heap_xlog_multi_insert() in some way
4) something else?

> The second paragraph does not convince me at all. I see no reason to
> believe that this is safe, or that it is a good idea. The code in
> xlog_heap_visible() thinks its OK to unlock and relock the page to
> make visibilitymap_set() happy, which is cringy but probably safe for
> lack of concurrent writers, but skipping locking altogether seems
> deeply unwise.

Actually in master, heap_xlog_visible() has no lock on the heap page
when it calls visibiltymap_set(). It releases that lock before
recording the freespace in the FSM and doesn't take it again.

It does unlock and relock the VM page -- because visibilitymap_set()
expects to take the lock on the VM.

I agree that not holding the heap lock while updating the VM is
unsatisfying. We can't hold it while doing the IO to read in the VM
block in XLogReadBufferForRedoExtended(). So, we could take it again
before calling visibilitymap_set(). But we don't always have the heap
buffer, though. I suspect this is partially why heap_xlog_visible()
unconditionally passes InvalidBuffer to visibilitymap_set() as the
heap buffer and has special case handling for recovery when we don't
have the heap buffer.

In any case, it isn't an active bug, and I don't think future-proofing
VM replay (i.e. against parallel recovery) is a prerequisite for
committing this patch since it is also that way on master.

> - *             visibilitymap_set        - set a bit in a previously pinned page
> + *             visibilitymap_set        - set bit(s) in a previously
> pinned page and log
> + *      visibilitymap_set_vmbits - set bit(s) in a pinned page
>
> I suspect the indentation was done with a different mix of spaces and
> tabs here, because this doesn't align for me.

oops, fixed.

I pushed the ERRCODE_DATA_CORRUPTED patch, so attached v11 is rebased
and also has the changes mentioned above.

Since you've started reviewing the set, I'll note that patches
0005-0011 are split up for ease of review and it may not necessarily
make sense to keep that separation for eventual commit. They are a
series of steps to move VM updates from lazy_scan_prune() into
pruneheap.c.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Robert Haas

Дата:

09 сентября 2025 г., 17:00:04

On Mon, Sep 8, 2025 at 6:29 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> But, I think you're right that maintaining the order of operations
> proposed in transam/README is more important. As such, in attached
> v11, I've modified this patch and the other patches where I replace
> visibilitymap_set() with visibilitymap_set_vmbits() to exclusively
> lock the vmbuffer before the critical section.
> visibilitymap_set_vmbits() asserts that we have the vmbuffer
> exclusively locked, so we should be good.

That sounds good. I think it is OK to keep some of the odd things that
we're currently doing if they're hard to eliminate, but if they're not
really needed then I'd rather see us standardize the code. I feel (and
I think you may agree, based on other conversations that we've had)
that the visibility map code is somewhat oddly structured, and I'd
like to see us push the amount of oddness down rather than up, if we
can reasonably do so without breaking everything.

> The only difference is I replaced the phrase "LSN interlock" with
> "being dropped or truncated later in recovery" -- which is more
> specific and, I thought, more clear. Without this comment, it took me
> some time to understand the scenarios that might lead us to skip
> updating the heap block. heap_xlog_visible() has cause to describe
> this situation in an earlier comment -- which is why I think the LSN
> interlock comment is less confusing there.
>
> Anyway, I'm open to changing the comment. I could:
> 1) copy-paste the same comment as heap_xlog_visible()
> 2) refer to the comment in heap_xlog_visible() (comment seemed a bit
> short for that)
> 3) diverge the comments further by improving the new comment in
> heap_xlog_multi_insert() in some way
> 4) something else?

IMHO, copying and pasting comments is not great, and comments with
identical intent and divergent wording are also not great. The former
is not great because having a whole bunch of copies of the same
comment, especially if it's a block comment rather than a 1-liner,
uses up a bunch of space and creates a maintenance hazard in the sense
that future updates might not get propagated to all copies. The latter
is not great because it makes it hard to grep for other instances that
should be adjusted when you adjust one, and also because if one
version really is better than the other than ideally we'd like to have
the good version everywhere. Of course, there's some tension between
these two goals. In this particular case, thinking a little harder
about your proposed change, it seems to me that "LSN interlock" is
more clear about what the immediate test is that would cause us to
skip updating the heap page, and "being dropped or truncated later in
recovery" is more clear about what the larger state of the world that
would lead to that situation is. But whatever preference anyone might
have about which way to go with that choice, it is hard to see why the
preference should go one way in one case and the other way in another
case. Therefore, I favor an approach that leads either to an identical
comment in both places, or to one comment referring to the other.

> > The second paragraph does not convince me at all. I see no reason to
> > believe that this is safe, or that it is a good idea. The code in
> > xlog_heap_visible() thinks its OK to unlock and relock the page to
> > make visibilitymap_set() happy, which is cringy but probably safe for
> > lack of concurrent writers, but skipping locking altogether seems
> > deeply unwise.
>
> Actually in master, heap_xlog_visible() has no lock on the heap page
> when it calls visibiltymap_set(). It releases that lock before
> recording the freespace in the FSM and doesn't take it again.
>
> It does unlock and relock the VM page -- because visibilitymap_set()
> expects to take the lock on the VM.
>
> I agree that not holding the heap lock while updating the VM is
> unsatisfying. We can't hold it while doing the IO to read in the VM
> block in XLogReadBufferForRedoExtended(). So, we could take it again
> before calling visibilitymap_set(). But we don't always have the heap
> buffer, though. I suspect this is partially why heap_xlog_visible()
> unconditionally passes InvalidBuffer to visibilitymap_set() as the
> heap buffer and has special case handling for recovery when we don't
> have the heap buffer.

You know, I wasn't thinking carefully enough about the distinction
between the heap page and the visibility map page here. I thought you
were saying that you were modifying a page without a lock on that
page, but you aren't: you're saying you're modifying a page without a
lock on another page to which it is related. The former seems
disastrous, but the latter might be OK. However, I'm sort of confused
about what the comment is trying to say to justify that:

+        * It is only okay to set the VM bits without holding the heap page lock
+        * because we can expect no other writers of this page.

It is not exactly clear to me whether "this page" here refers to the
heap page or the VM page. If it means the heap page, why should that
be so if we haven't got any kind of lock? If it means the VM page,
then why is the heap page even relevant?

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

09 сентября 2025 г., 19:24:20

On Tue, Sep 9, 2025 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 6:29 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>
> > The only difference is I replaced the phrase "LSN interlock" with
> > "being dropped or truncated later in recovery" -- which is more
> > specific and, I thought, more clear. Without this comment, it took me
> > some time to understand the scenarios that might lead us to skip
> > updating the heap block. heap_xlog_visible() has cause to describe
> > this situation in an earlier comment -- which is why I think the LSN
> > interlock comment is less confusing there.
> >
> > Anyway, I'm open to changing the comment. I could:
> > 1) copy-paste the same comment as heap_xlog_visible()
> > 2) refer to the comment in heap_xlog_visible() (comment seemed a bit
> > short for that)
> > 3) diverge the comments further by improving the new comment in
> > heap_xlog_multi_insert() in some way
> > 4) something else?
>
> IMHO, copying and pasting comments is not great, and comments with
> identical intent and divergent wording are also not great. The former
> is not great because having a whole bunch of copies of the same
> comment, especially if it's a block comment rather than a 1-liner,
> uses up a bunch of space and creates a maintenance hazard in the sense
> that future updates might not get propagated to all copies. The latter
> is not great because it makes it hard to grep for other instances that
> should be adjusted when you adjust one, and also because if one
> version really is better than the other than ideally we'd like to have
> the good version everywhere. Of course, there's some tension between
> these two goals. In this particular case, thinking a little harder
> about your proposed change, it seems to me that "LSN interlock" is
> more clear about what the immediate test is that would cause us to
> skip updating the heap page, and "being dropped or truncated later in
> recovery" is more clear about what the larger state of the world that
> would lead to that situation is. But whatever preference anyone might
> have about which way to go with that choice, it is hard to see why the
> preference should go one way in one case and the other way in another
> case. Therefore, I favor an approach that leads either to an identical
> comment in both places, or to one comment referring to the other.

I see what you are saying.

For heap_xlog_visible() the LSN interlock comment is easier to parse
because of an earlier comment before reading the heap page:

    /*
     * Read the heap page, if it still exists. If the heap file has dropped or
     * truncated later in recovery, we don't need to update the page, but we'd
     * better still update the visibility map.
     */

I've gone with the direct copy-paste of the LSN interlock paragraph in
attached v12. I think referring to the other comment is too confusing
in context here. However, I also added a line about what could cause
the LSN interlock -- but above it, so as to retain grep-ability of the
other comment.

> > > The second paragraph does not convince me at all. I see no reason to
> > > believe that this is safe, or that it is a good idea. The code in
> > > xlog_heap_visible() thinks its OK to unlock and relock the page to
> > > make visibilitymap_set() happy, which is cringy but probably safe for
> > > lack of concurrent writers, but skipping locking altogether seems
> > > deeply unwise.
> >
> > Actually in master, heap_xlog_visible() has no lock on the heap page
> > when it calls visibiltymap_set(). It releases that lock before
> > recording the freespace in the FSM and doesn't take it again.
> >
> > It does unlock and relock the VM page -- because visibilitymap_set()
> > expects to take the lock on the VM.
> >
> > I agree that not holding the heap lock while updating the VM is
> > unsatisfying. We can't hold it while doing the IO to read in the VM
> > block in XLogReadBufferForRedoExtended(). So, we could take it again
> > before calling visibilitymap_set(). But we don't always have the heap
> > buffer, though. I suspect this is partially why heap_xlog_visible()
> > unconditionally passes InvalidBuffer to visibilitymap_set() as the
> > heap buffer and has special case handling for recovery when we don't
> > have the heap buffer.
>
> You know, I wasn't thinking carefully enough about the distinction
> between the heap page and the visibility map page here. I thought you
> were saying that you were modifying a page without a lock on that
> page, but you aren't: you're saying you're modifying a page without a
> lock on another page to which it is related. The former seems
> disastrous, but the latter might be OK. However, I'm sort of confused
> about what the comment is trying to say to justify that:
>
> +        * It is only okay to set the VM bits without holding the heap page lock
> +        * because we can expect no other writers of this page.
>
> It is not exactly clear to me whether "this page" here refers to the
> heap page or the VM page. If it means the heap page, why should that
> be so if we haven't got any kind of lock? If it means the VM page,
> then why is the heap page even relevant?

I've expanded the comment in v12. In normal operation we must have the
lock on the heap page when setting the VM bits because if another
backend cleared PD_ALL_VISIBLE, we could have the forbidden scenario
where PD_ALL_VISIBLE is clear and the VM is set. This is not allowed
because then someone else may read the VM, conclude the page is
all-visible, and then an index-only scan can return wrong results. In
recovery, there are no concurrent writers, so it can't happen.

It is worth discussing how to fix it in heap_xlog_visible() so that
future scenarios like parallel recovery could not break this. However,
this patch is not a deviation from the behavior on master, and,
technically the behavior on master works.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Robert Haas

Дата:

09 сентября 2025 г., 22:26:08

On Tue, Sep 9, 2025 at 12:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> For heap_xlog_visible() the LSN interlock comment is easier to parse
> because of an earlier comment before reading the heap page:
>
>     /*
>      * Read the heap page, if it still exists. If the heap file has dropped or
>      * truncated later in recovery, we don't need to update the page, but we'd
>      * better still update the visibility map.
>      */
>
> I've gone with the direct copy-paste of the LSN interlock paragraph in
> attached v12. I think referring to the other comment is too confusing
> in context here. However, I also added a line about what could cause
> the LSN interlock -- but above it, so as to retain grep-ability of the
> other comment.

I think that reads a little strangely. I would consolidate: Note that
the heap relation may have been dropped or truncated, leading us to
skip updating the heap block due to the LSN interlock. However, even
in that case, it's still safe to update the visibility map, etc. The
rest of the comment is perhaps a tad more explicit than our usual
practice, but that might be a good thing, because sometimes we're a
little too terse about these critical details.

I just realized that I don't like this:

+ /*
+ * If we're only adding already frozen rows to a previously empty
+ * page, mark it as all-frozen and update the visibility map. We're
+ * already holding a pin on the vmbuffer.
+ */

The thing is, we rarely position a block comment just before an "else
if". There are probably instances, but it's not typical. That's why
the existing comment contains two "if blah then blah" statements of
which you deleted the second -- because it needed to cover both the
"if" and the "else if". An alternative style is to move the comment
down a nesting level and rephrase without the conditional, ie. "We're
only adding frozen rows to a previously empty page, so mark it as
all-frozen etc." But I don't know that I like doing that for one
branch of the "if" and not the other.

The rest of what's now 0001 looks OK to me now, although you might
want to wait for a review from somebody more knowledgeable about this
area.

Some very quick comments on the next few patches -- far from a full review:

0002. Looks boring, probably unobjectionable provided the payoff patch is OK.

0003. What you've done here with xl_heap_prune.flags is kind of
horrifying. The problem is that, while you've added code explaining
that VISIBILITYMAP_ALL_{VISIBLE,FROZEN} are honorary XLHP flags,
nobody who isn't looking directly at that comment is going to
understand the muddling of the two namespaces. I would suggest not
doing this, even if it means defining redundant constants and writing
technically-unnecessary code to translate between them.

0004. It is not clear to me why you need to get
log_heap_prune_and_freeze to do the work here. Why can't
log_newpage_buffer get the job done already?

0005. It looks a little curious that you delete the
identify-corruption logic from the end of the if-nest and add it to
the beginning. Ceteris paribus, you'd expect that to be worse, since
corruption is a rare case.

0006. "to me marked" -> "to be marked".

+                * If the heap page is all-visible but the VM bit is
not set, we don't
+                * need to dirty the heap page.  However, if checksums
are enabled, we
+                * do need to make sure that the heap page is dirtied
before passing
+                * it to visibilitymap_set(), because it may be logged.
                 */
-               PageSetAllVisible(page);
-               MarkBufferDirty(buf);
+               if (!PageIsAllVisible(page) || XLogHintBitIsNeeded())
+               {
+                       PageSetAllVisible(page);
+                       MarkBufferDirty(buf);
+               }

I really hate this. Maybe you're going to argue that it's not the job
of this patch to fix the awfulness here, but surely marking a buffer
dirty in case some other function decides to WAL-log it is a
ridiculous plan.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

10 сентября 2025 г., 02:07:58

Thanks for the review! I've made the changes to comments and minor
fixes you suggested in attached v13 and have limited my inline
responses to areas where further discussion is required.

On Tue, Sep 9, 2025 at 3:26 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> 0003. What you've done here with xl_heap_prune.flags is kind of
> horrifying. The problem is that, while you've added code explaining
> that VISIBILITYMAP_ALL_{VISIBLE,FROZEN} are honorary XLHP flags,
> nobody who isn't looking directly at that comment is going to
> understand the muddling of the two namespaces. I would suggest not
> doing this, even if it means defining redundant constants and writing
> technically-unnecessary code to translate between them.

Fair. I've introduced new XLHP flags in attached v13. Hopefully it
puts an end to the horror.

> 0004. It is not clear to me why you need to get
> log_heap_prune_and_freeze to do the work here. Why can't
> log_newpage_buffer get the job done already?

Well, I need something to emit the changes to the VM. I'm eliminating
all users of xl_heap_visible. Empty pages are the ones that benefit
the least from switching from xl_heap_visible -> xl_heap_prune. But,
if I don't transition them, we have to maintain all the
xl_heap_visible code (including visibilitymap_set() in its long form).

As for log_newpage_buffer(), I could keep it if you think it is too
confusing to change log_heap_prune_and_freeze()'s API (by passing
force_heap_fpi) to handle this case, I can leave log_newpage_buffer()
there and then call log_heap_prune_and_freeze().

I just thought it seemed simple to avoid emitting the new page record
and the VM update record, so why not -- but I don't have strong
feelings.

> 0005. It looks a little curious that you delete the
> identify-corruption logic from the end of the if-nest and add it to
> the beginning. Ceteris paribus, you'd expect that to be worse, since
> corruption is a rare case.

On master, the two corruption cases are sandwiched between the normal
VM set cases. And I actually think doing it this way is brittle. If
you put the cases which set the VM first, you have to have completely
bulletproof the if statements guarding them to foreclose any possible
corruption case from entering because otherwise you will overwrite the
corruption you then try to detect.

But, specifically, from a performance perspective:

I think moving up the third case doesn't matter because the check is so cheap:

    else if (presult.lpdead_items > 0 && PageIsAllVisible(page))

And as for moving up the second case (the other corruption case), the
non-cheap thing it does is call visibilitymap_get_status()

    else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
             visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)

But once you call visibilitymap_get_status() once, assuming there is
no corruption and you need to go set the VM, you've already got that
page of the VM read, so it is probably pretty cheap. Overall, I didn't
think this would add noticeable overhead or many wasted operations.

And I thought that reorganizing the code improved clarity as well as
decreased the likelihood of bugs from insufficiently guarding positive
cases against corrupt pages and overwriting corruption instead of
detecting it.

If we're really worried about it from a performance perspective, I
could add an extra test at the top of identify_and_fix_vm_corruption()
that dumps out early if (!all_visible_according_to_vm &&
presult.all_visible).

> +                * If the heap page is all-visible but the VM bit is
> not set, we don't
> +                * need to dirty the heap page.  However, if checksums
> are enabled, we
> +                * do need to make sure that the heap page is dirtied
> before passing
> +                * it to visibilitymap_set(), because it may be logged.
>                  */
> -               PageSetAllVisible(page);
> -               MarkBufferDirty(buf);
> +               if (!PageIsAllVisible(page) || XLogHintBitIsNeeded())
> +               {
> +                       PageSetAllVisible(page);
> +                       MarkBufferDirty(buf);
> +               }
>
> I really hate this. Maybe you're going to argue that it's not the job
> of this patch to fix the awfulness here, but surely marking a buffer
> dirty in case some other function decides to WAL-log it is a
> ridiculous plan.

Right, it isn't pretty. But I don't quite see what the alternative is.
We need to mark the buffer dirty before setting the LSN. We could
perhaps rewrite visibilitymap_set()'s API to return the LSN of the
xl_heap_visible record and stamp it on the heap buffer ourselves. But
1) I think visibilitymap_set() purposefully conceals its WAL logging
ways from the caller and propagating that info back up starts to make
the API messy in another way and 2) I'm a bit loath to make big
changes to visibilitymap_set() right now since my patch set eventually
resolves this by putting the changes to the VM and heap page in the
same WAL record.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Robert Haas

Дата:

10 сентября 2025 г., 23:01:25

On Tue, Sep 9, 2025 at 7:08 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Fair. I've introduced new XLHP flags in attached v13. Hopefully it
> puts an end to the horror.

I suggest not renumbering all of the existing flags and just adding
these new ones at the end. Less code churn and more likely to break in
an obvious way if you mix up the two sets of flags.

More on 0002:

+ set_heap_lsn = XLogHintBitIsNeeded() ? true : set_heap_lsn;

Maybe just if (XLogHintBitIsNeeded) set_heap_lsn = true? I don't feel
super-strongly that what you've done is bad but it looks weird to my
eyes.

+ * If we released any space or line pointers or will be setting a page in
+ * the visibility map, measure the page's freespace to later update the

"setting a page in the visibility map" seems a little muddled to me.
You set bits, not pages.

+ * all-visible (or all-frozen, depending on the vacuum mode,) which is

This comma placement is questionable.

  /*
+ * Note that the heap relation may have been dropped or truncated, leading
+ * us to skip updating the heap block due to the LSN interlock. However,
+ * even in that case, it's still safe to update the visibility map. Any
+ * WAL record that clears the visibility map bit does so before checking
+ * the page LSN, so any bits that need to be cleared will still be
+ * cleared.
+ *
+ * Note that the lock on the heap page was dropped above. In normal
+ * operation this would never be safe because a concurrent query could
+ * modify the heap page and clear PD_ALL_VISIBLE -- violating the
+ * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
+ * the VM is set.
+ *
+ * In recovery, we expect no other writers, so writing to the VM page
+ * without holding a lock on the heap page is considered safe enough. It
+ * is done this way when replaying xl_heap_visible records (see
  */

How many copies of this comment do you plan to end up with?

The comment for log_heap_prune_and_freeze seems to be anticipating future work.

> > 0004. It is not clear to me why you need to get
> > log_heap_prune_and_freeze to do the work here. Why can't
> > log_newpage_buffer get the job done already?
>
> Well, I need something to emit the changes to the VM. I'm eliminating
> all users of xl_heap_visible. Empty pages are the ones that benefit
> the least from switching from xl_heap_visible -> xl_heap_prune. But,
> if I don't transition them, we have to maintain all the
> xl_heap_visible code (including visibilitymap_set() in its long form).
>
> As for log_newpage_buffer(), I could keep it if you think it is too
> confusing to change log_heap_prune_and_freeze()'s API (by passing
> force_heap_fpi) to handle this case, I can leave log_newpage_buffer()
> there and then call log_heap_prune_and_freeze().
>
> I just thought it seemed simple to avoid emitting the new page record
> and the VM update record, so why not -- but I don't have strong
> feelings.

Yeah, I'm not sure what the right thing to do here is. I think I was
again experiencing brain fade by forgetting that there is a heap page
and a VM page and, of course, log_heap_newpage() probably isn't going
to touch the latter. So that makes sense. On the other hand, we could
only have one type of WAL record for every single operation in the
system if we gave it enough flags, and force_heap_fpi seems
suspiciously like a flag that turns this into a whole different kind
of WAL record.

> > 0005. It looks a little curious that you delete the
> > identify-corruption logic from the end of the if-nest and add it to
> > the beginning. Ceteris paribus, you'd expect that to be worse, since
> > corruption is a rare case.
>
> On master, the two corruption cases are sandwiched between the normal
> VM set cases. And I actually think doing it this way is brittle. If
> you put the cases which set the VM first, you have to have completely
> bulletproof the if statements guarding them to foreclose any possible
> corruption case from entering because otherwise you will overwrite the
> corruption you then try to detect.

Hmm. In the current code, we first test (!all_visible_according_to_vm
&& presult.all_visible), then (all_visible_according_to_vm &&
!PageIsAllVisible(page) && visibilitymap_get_status(vacrel->rel,
blkno, &vmbuffer) != 0), and then (presult.lpdead_items > 0 &&
PageIsAllVisible(page)). The first and second can never coexist,
because they require opposite values of all_visible_according_to_vm.
The second and third cannot coexist because they require opposite
values of PageIsAllVisible(page). It is not entirely obvious that the
first and third tests couldn't both pass, but you'd have to have
presult.all_visible and presult.lpdead_items > 0, and it's a bit hard
to see how heap_page_prune_and_freeze() could ever allow that.
Consider:

    if (prstate.all_visible && prstate.lpdead_items == 0)
    {
        presult->all_visible = prstate.all_visible;
        presult->all_frozen = prstate.all_frozen;
    }
    else
    {
        presult->all_visible = false;
        presult->all_frozen = false;
    }
...
    presult->lpdead_items = prstate.lpdead_items;

So I don't really think I'm persuaded that the current way is brittle.
But that having been said, I agree with you that the order of the
checks is kind of random, and I don't think it really matters that
much for performance. What does matter is clarity. I feel like what
I'd ideally like this logic to do is say: do we want the VM bit for
the page to be set to all-frozen, just all-visible, or neither? Then
push the VM bit to the correct state, dragging the page-level bit
along behind. And the current logic sort of does that. It's roughly:

1. Should we go from not-all-visible to either all-visible or
all-frozen? If yes, do so.
2. Should we go from either all-visible or all-frozen to
not-all-visible? If yes, do so.
3. Should we go from either all-visible or all-frozen to
not-all-visible for a different reason? If yes, do so.
4. Should we go from all-visible to all-frozen? If yes, do so.

But what's weird is that all the tests are written differently, and we
have two different reasons for going to not-all-visible, namely
PD_ALL_VISIBLE-not-set and dead-items-on-page, whereas there's only
one test for each of the other state-transitions, because the
decision-making for those cases is fully completed at an earlier
stage. I would kind of like to see this expressed in a way that first
decides which state transition to make (forward-to-all-frozen,
forward-to-all-visible, backward-to-all-visible,
backward-to-not-all-visible, nothing) and then does the corresponding
work. What you're doing instead is splitting half of those functions
off into a helper function while keeping the other half where they are
without cleaning up any of the logic. Now, maybe that's OK: I'm far
from having grokked the whole patch set. But it is not any more clear
than what we have now, IMHO, and perhaps even a bit less so.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

18 сентября 2025 г., 03:10:07

On Wed, Sep 10, 2025 at 4:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 7:08 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Fair. I've introduced new XLHP flags in attached v13. Hopefully it
> > puts an end to the horror.
>
> I suggest not renumbering all of the existing flags and just adding
> these new ones at the end. Less code churn and more likely to break in
> an obvious way if you mix up the two sets of flags.

Makes sense. In my attached v14, I have not renumbered them.

> More on 0002:

After an off-list discussion we had about how to make the patches in
the set progressively improve the code instead of just mechanically
refactoring it, I have made some big changes in the intermediate
patches in the set.

Before actually including the VM changes in the vacuum/prune WAL
records, I first include setting PD_ALL_VISIBLE with the other changes
to the heap page so that we can remove the heap page from the VM
setting WAL chain. This happens to fix the bug we discussed where if
you set an all-visible page all-frozen and checksums/wal_log_hints are
enabled, you may end up setting an LSN on a page that was not marked
dirty.

0001 is RFC but waiting on one other reviewer
0002 - 0007 is a bit of cleanup I had later in the patch set but moved
up because I think it made the intermediate patches better
0008 - 0012 removes the heap page from the XLOG_HEAP2_VISIBLE WAL
chain (it makes all callers of visibilitymap_set() set PD_ALL_VISIBLE
in the same WAL record as changes to the heap page)
0013 - 0018 finish the job eliminating XLOG_HEAP2_VISIBLE and set VM
bits in the same WAL record as the heap changes
0019 - 0024 set the VM on-access

>   /*
> + * Note that the heap relation may have been dropped or truncated, leading
> + * us to skip updating the heap block due to the LSN interlock. However,
> + * even in that case, it's still safe to update the visibility map. Any
> + * WAL record that clears the visibility map bit does so before checking
> + * the page LSN, so any bits that need to be cleared will still be
> + * cleared.
> + *
> + * Note that the lock on the heap page was dropped above. In normal
> + * operation this would never be safe because a concurrent query could
> + * modify the heap page and clear PD_ALL_VISIBLE -- violating the
> + * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
> + * the VM is set.
> + *
> + * In recovery, we expect no other writers, so writing to the VM page
> + * without holding a lock on the heap page is considered safe enough. It
> + * is done this way when replaying xl_heap_visible records (see
>   */
>
> How many copies of this comment do you plan to end up with?

By the end, one for copy freeze replay and one for prune/freeze/vacuum
replay. I felt two wasn't too bad and was easier than meta-explaining
what the other comment was explaining.

> > > 0004. It is not clear to me why you need to get
> > > log_heap_prune_and_freeze to do the work here. Why can't
> > > log_newpage_buffer get the job done already?
> >
> > Well, I need something to emit the changes to the VM. I'm eliminating
> > all users of xl_heap_visible. Empty pages are the ones that benefit
> > the least from switching from xl_heap_visible -> xl_heap_prune. But,
> > if I don't transition them, we have to maintain all the
> > xl_heap_visible code (including visibilitymap_set() in its long form).
> >
> > As for log_newpage_buffer(), I could keep it if you think it is too
> > confusing to change log_heap_prune_and_freeze()'s API (by passing
> > force_heap_fpi) to handle this case, I can leave log_newpage_buffer()
> > there and then call log_heap_prune_and_freeze().
> >
> > I just thought it seemed simple to avoid emitting the new page record
> > and the VM update record, so why not -- but I don't have strong
> > feelings.
>
> Yeah, I'm not sure what the right thing to do here is. I think I was
> again experiencing brain fade by forgetting that there is a heap page
> and a VM page and, of course, log_heap_newpage() probably isn't going
> to touch the latter. So that makes sense. On the other hand, we could
> only have one type of WAL record for every single operation in the
> system if we gave it enough flags, and force_heap_fpi seems
> suspiciously like a flag that turns this into a whole different kind
> of WAL record.

I've kept log_heap_newpage() and used log_heap_prune_and_freeze() for
setting PD_ALL_VISIBLE and the VM.

> > > 0005. It looks a little curious that you delete the
> > > identify-corruption logic from the end of the if-nest and add it to
> > > the beginning. Ceteris paribus, you'd expect that to be worse, since
> > > corruption is a rare case.
> >
> > On master, the two corruption cases are sandwiched between the normal
> > VM set cases. And I actually think doing it this way is brittle. If
> > you put the cases which set the VM first, you have to have completely
> > bulletproof the if statements guarding them to foreclose any possible
> > corruption case from entering because otherwise you will overwrite the
> > corruption you then try to detect.
>
> Hmm. In the current code, we first test (!all_visible_according_to_vm
> && presult.all_visible), then (all_visible_according_to_vm &&
> !PageIsAllVisible(page) && visibilitymap_get_status(vacrel->rel,
> blkno, &vmbuffer) != 0), and then (presult.lpdead_items > 0 &&
> PageIsAllVisible(page)). The first and second can never coexist,
> because they require opposite values of all_visible_according_to_vm.
> The second and third cannot coexist because they require opposite
> values of PageIsAllVisible(page). It is not entirely obvious that the
> first and third tests couldn't both pass, but you'd have to have
> presult.all_visible and presult.lpdead_items > 0, and it's a bit hard
> to see how heap_page_prune_and_freeze() could ever allow that.
> Consider:
>
>     if (prstate.all_visible && prstate.lpdead_items == 0)
>     {
>         presult->all_visible = prstate.all_visible;
>         presult->all_frozen = prstate.all_frozen;
>     }
>     else
>     {
>         presult->all_visible = false;
>         presult->all_frozen = false;
>     }
> ...
>     presult->lpdead_items = prstate.lpdead_items;
>
> So I don't really think I'm persuaded that the current way is brittle.

I meant brittle because it has to be so carefully coded for it to work
out this way. If you ever wanted to change or enhance it, it's quite
hard to know how to make sure all of them are entirely mutually
exclusive.

> But that having been said, I agree with you that the order of the
> checks is kind of random, and I don't think it really matters that
> much for performance. What does matter is clarity. I feel like what
> I'd ideally like this logic to do is say: do we want the VM bit for
> the page to be set to all-frozen, just all-visible, or neither? Then
> push the VM bit to the correct state, dragging the page-level bit
> along behind. And the current logic sort of does that. It's roughly:
>
> 1. Should we go from not-all-visible to either all-visible or
> all-frozen? If yes, do so.
> 2. Should we go from either all-visible or all-frozen to
> not-all-visible? If yes, do so.
> 3. Should we go from either all-visible or all-frozen to
> not-all-visible for a different reason? If yes, do so.
> 4. Should we go from all-visible to all-frozen? If yes, do so.

I don't necessarily agree that fixing corruption and setting the VM
should be together -- they feel like separate things to me. But, I
don't feel strongly enough about it to push it.

> But what's weird is that all the tests are written differently, and we
> have two different reasons for going to not-all-visible, namely
> PD_ALL_VISIBLE-not-set and dead-items-on-page, whereas there's only
> one test for each of the other state-transitions, because the
> decision-making for those cases is fully completed at an earlier
> stage. I would kind of like to see this expressed in a way that first
> decides which state transition to make (forward-to-all-frozen,
> forward-to-all-visible, backward-to-all-visible,
> backward-to-not-all-visible, nothing) and then does the corresponding
> work. What you're doing instead is splitting half of those functions
> off into a helper function while keeping the other half where they are
> without cleaning up any of the logic. Now, maybe that's OK: I'm far
> from having grokked the whole patch set. But it is not any more clear
> than what we have now, IMHO, and perhaps even a bit less so.

In terms of my patch set, I do have to change something about this
mixture of fixing corruption and setting the VM because I need to set
the VM bits in the same critical section as making the other changes
to the heap page (pruning, etc) and include the VM set changes in the
same WAL record (note that clearing the VM to fix corruption is not
WAL-logged).

What I've gone with is determining what to set the VM bits to and then
fixing the corruption at the same time. Then, later, when making the
changes to the heap page, I actually set the VM. This is kind of the
opposite of what you suggested above -- determining what to set the
bits to altogether -- corruption and non-corruption cases together. I
don't think we can do that though, because fixing the corruption is
non WAL-logged changes to the page and VM and setting the VM bits is a
WAL-logged change. And, you can't clear bits with visibilitymap_set()
(there's an assertion about that). So you have to call different
functions (not to mention emit distinct error messages). I don't know
that I've come up with the ideal solution, though.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Andres Freund

Дата:

18 сентября 2025 г., 19:48:45

Hi,

On 2025-09-17 20:10:07 -0400, Melanie Plageman wrote:
> 0001 is RFC but waiting on one other reviewer

> From cacff6c95e38d370b87148bc48cf6ac5f086ed07 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 17 Jun 2025 17:22:10 -0400
> Subject: [PATCH v14 01/24] Eliminate COPY FREEZE use of XLOG_HEAP2_VISIBLE
> diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
> index cf843277938..faa7c561a8a 100644
> --- a/src/backend/access/heap/heapam_xlog.c
> +++ b/src/backend/access/heap/heapam_xlog.c
> @@ -551,6 +551,7 @@ heap_xlog_multi_insert(XLogReaderState *record)
>      int            i;
>      bool        isinit = (XLogRecGetInfo(record) & XLOG_HEAP_INIT_PAGE) != 0;
>      XLogRedoAction action;
> +    Buffer        vmbuffer = InvalidBuffer;
>
>      /*
>       * Insertion doesn't overwrite MVCC data, so no conflict processing is
> @@ -571,11 +572,11 @@ heap_xlog_multi_insert(XLogReaderState *record)
>      if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
>      {
>          Relation    reln = CreateFakeRelcacheEntry(rlocator);
> -        Buffer        vmbuffer = InvalidBuffer;
>
>          visibilitymap_pin(reln, blkno, &vmbuffer);
>          visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
>          ReleaseBuffer(vmbuffer);
> +        vmbuffer = InvalidBuffer;
>          FreeFakeRelcacheEntry(reln);
>      }
>
> @@ -662,6 +663,57 @@ heap_xlog_multi_insert(XLogReaderState *record)
>      if (BufferIsValid(buffer))
>          UnlockReleaseBuffer(buffer);
>
> +    buffer = InvalidBuffer;
> +
> +    /*
> +     * Now read and update the VM block.
> +     *
> +     * Note that the heap relation may have been dropped or truncated, leading
> +     * us to skip updating the heap block due to the LSN interlock.

I don't fully understand this - how does dropping/truncating the relation lead
to skipping due to the LSN interlock?


> +     * even in that case, it's still safe to update the visibility map. Any
> +     * WAL record that clears the visibility map bit does so before checking
> +     * the page LSN, so any bits that need to be cleared will still be
> +     * cleared.
> +     *
> +     * Note that the lock on the heap page was dropped above. In normal
> +     * operation this would never be safe because a concurrent query could
> +     * modify the heap page and clear PD_ALL_VISIBLE -- violating the
> +     * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
> +     * the VM is set.
> +     *
> +     * In recovery, we expect no other writers, so writing to the VM page
> +     * without holding a lock on the heap page is considered safe enough. It
> +     * is done this way when replaying xl_heap_visible records (see
> +     * heap_xlog_visible()).
> +     */
> +    if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> +        XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> +                                      &vmbuffer) == BLK_NEEDS_REDO)
> +    {

Why are we using RBM_ZERO_ON_ERROR here? I know it's copied from
heap_xlog_visible(), but I don't immediately understand (or remember) why we
do so there either?


> +        Page        vmpage = BufferGetPage(vmbuffer);
> +        Relation    reln = CreateFakeRelcacheEntry(rlocator);

Hm. Do we really need to continue doing this ugly fake relcache stuff? I'd
really like to eventually get rid of that and given that the new "code shape"
delegates a lot more responsibility to the redo routines, they should have a
fairly easy time not needing a fake relcache?  Afaict the relation already is
not used outside of debugging paths?


> +        /* initialize the page if it was read as zeros */
> +        if (PageIsNew(vmpage))
> +            PageInit(vmpage, BLCKSZ, 0);
> +
> +        visibilitymap_set_vmbits(reln, blkno,
> +                                 vmbuffer,
> +                                 VISIBILITYMAP_ALL_VISIBLE |
> +                                 VISIBILITYMAP_ALL_FROZEN);
> +
> +        /*
> +         * It is not possible that the VM was already set for this heap page,
> +         * so the vmbuffer must have been modified and marked dirty.
> +         */

I assume that's because we a) checked the LSN interlock b) are replaying
something that needed to newly set the bit?


Except for the above comments, this looks pretty good to me.


Seems 0002 should just be applied...


Re 0003: I wonder if it's getting to the point that a struct should be used as
the argument.

Greetings,

Andres Freund

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

24 сентября 2025 г., 20:07:46

On Thu, Sep 18, 2025 at 12:48 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-09-17 20:10:07 -0400, Melanie Plageman wrote:
>
> > +     /*
> > +      * Now read and update the VM block.
> > +      *
> > +      * Note that the heap relation may have been dropped or truncated, leading
> > +      * us to skip updating the heap block due to the LSN interlock.
>
> I don't fully understand this - how does dropping/truncating the relation lead
> to skipping due to the LSN interlock?

Yes, this wasn't right. I misunderstood.

What I think it should say is that if the heap update was skipped due
to LSN interlock we still have to replay the updates to the VM because
each vm page contains bits for multiple heap blocks and if the record
included a vm page FPI, subsequent updates to the VM may rely on this
FPI to avoid torn pages. We don't condition it on the heap redo having
been an FPI, probably because it is not worth it -- but I wonder if
that is worth calling out in the comment?

Do we also need to replay it when the heap redo returns BLK_NOTFOUND?
I assume this can happen in the case of relation dropped or truncated
-- but in this case there wouldn't be subsequent records updating the
VM for other heap blocks that we need to replay because the other heap
blocks won't be found either, right?

> > +     if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> > +             XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> > +                                                                       &vmbuffer) == BLK_NEEDS_REDO)
> > +     {
>
> Why are we using RBM_ZERO_ON_ERROR here? I know it's copied from
> heap_xlog_visible(), but I don't immediately understand (or remember) why we
> do so there either?

It has been RBM_ZERO_ON_ERROR since XLogReadBufferForRedoExtended()
was introduced here in 2c03216d8311.
I think we probably do this because vm_readbuf() passes ReadBuffer()
RBM_ZERO_ON_ERROR and has this comment

     * For reading we use ZERO_ON_ERROR mode, and initialize the page if
     * necessary. It's always safe to clear bits, so it's better to clear
     * corrupt pages than error out.

Do you think I also should have a comment in heap_xlog_multi_insert()?

> > +             Page            vmpage = BufferGetPage(vmbuffer);
> > +             Relation        reln = CreateFakeRelcacheEntry(rlocator);
>
> Hm. Do we really need to continue doing this ugly fake relcache stuff? I'd
> really like to eventually get rid of that and given that the new "code shape"
> delegates a lot more responsibility to the redo routines, they should have a
> fairly easy time not needing a fake relcache?  Afaict the relation already is
> not used outside of debugging paths?

Yes, interestingly we don't have the relname in recovery anyway, so it
does all this fake relcache stuff only to convert the relfilenode to a
string and uses that.

The fake relcache stuff will still be used by visibilitymap_pin()
which callers like heap_xlog_delete() use to get the VM page. And I
don't think it is worth coming up with a version of that that doesn't
use the relcache. But you're right that the Relation is not needed for
visibilitymap_set_vmbits(). I've changed it to just take the relation
name as a string.

> > +             /* initialize the page if it was read as zeros */
> > +             if (PageIsNew(vmpage))
> > +                     PageInit(vmpage, BLCKSZ, 0);
> > +
> > +             visibilitymap_set_vmbits(reln, blkno,
> > +                                                              vmbuffer,
> > +                                                              VISIBILITYMAP_ALL_VISIBLE |
> > +                                                              VISIBILITYMAP_ALL_FROZEN);
> > +
> > +             /*
> > +              * It is not possible that the VM was already set for this heap page,
> > +              * so the vmbuffer must have been modified and marked dirty.
> > +              */
>
> I assume that's because we a) checked the LSN interlock b) are replaying
> something that needed to newly set the bit?

Yes, perhaps it is not worth having the assert since it attracts extra
attention to an invariant that is unlikely to be in danger of
regression.

> Seems 0002 should just be applied...

Done

> Re 0003: I wonder if it's getting to the point that a struct should be used as
> the argument.

I have been thinking about this. I have yet to come up with a good
idea for a struct name or multiple struct names that seem to fit here.
I could move the other output parameters into the PruneFreezeResult
and then maybe make some kind of PruneFreezeParameters struct or
something?

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Robert Haas

Дата:

24 сентября 2025 г., 23:13:43

I find this patch set quite hard to follow. 0001 altogether removes
the use of XLOG_HEAP2_VISIBLE in cases where we use
XLOG_HEAP2_MULTI_INSERT, but then 0007 (the next non-refactoring
patch) begins half-removing the dependency on XLOG_HEAP2_VISIBLE,
assisted by 0009 and 0010, and then later you come back and remove the
other half of the dependency. I know it was I who proposed (off-list)
first making the XLOG_HEAP2_VISIBLE record only deal with the VM page
and not the heap buffer, but I'm not sure that idea quite worked out
in terms of making this easier to follow. At the least, it seems weird
that COPY FREEZE is an exception that gets handled in a different way
than all the other cases, fully removing the dependency in one step.
It would also be nice if each time you repost this, or maybe in a
README that you post along beside the actual patches, you'd include
some kind of roadmap to help the reader understand the internal
structure of the patch set, like 1 does this, 2-9 get us to here,
10-whatever get us to this next place.

I don't really understand how the interlocking works. 0011 changes
visibilitymap_set so that it doesn't take the heap block as an
argument, but we'd better hold a lock on the heap page while setting
the VM bit, otherwise I think somebody could come along and modify the
heap page after we decided it was all-visible and before we actually
set the VM bit, which would be terrible. I would expect the comments
and the commit message to say something about that, but I don't see
that they do.

I find myself fearful of the way that 0007 propagates the existing
hacks around setting the VM bit into a new place:

+               /*
+                * We always emit a WAL record when setting
PD_ALL_VISIBLE, but we are
+                * careful not to emit a full page image unless
+                * checksums/wal_log_hints are enabled. We only set
the heap page LSN
+                * if full page images were an option when emitting
WAL. Otherwise,
+                * subsequent modifications of the page may
incorrectly skip emitting
+                * a full page image.
+                */
+               if (do_prune || nplans > 0 ||
+                       (xlrec.flags & XLHP_SET_PD_ALL_VIS &&
XLogHintBitIsNeeded()))
+                       PageSetLSN(page, lsn);

I suppose it's not the worst thing to duplicate this logic, because
you're later going to remove the original copy. But, it took me >10
minutes to find the text in src/backend/access/transam/README, in the
second half of the "Writing Hints" section, that explains the overall
principle here, and since the patch set doesn't seem to touch that
text, maybe you weren't even aware it was there. And, it's a little
weird to have a single WAL record that is either a hint or not a hint
depending on a complex set of conditions. (IMHO mixing & and &&
without parentheses is quite brave, and an explicit != 0 might not be
a bad idea either.)

Anyway, I kind of wonder if it's time to back out the hack that I
installed here many years ago. At the time, I thought that it would be
bad if a VACUUM swept over the visibility map setting VM bits and as a
result emitted an FPI for every page in the entire heap ... but
everyone who is running with checksums has accepted that cost already,
and with those being the default, that's probably going to be most
people. It would be even more compelling if we were going to freeze,
prune, and set all-visible on access, because then presumably the case
where we touch a page and ONLY set the VM bit would be rare, so the
cost of doing that wouldn't matter much, but I guess the patch doesn't
go that far -- we can freeze or set all-visible on access but not
prune, without which the scenario I was worrying about at the time is
still fairly plausible, I think, if checksums are turned off.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

07 октября 2025 г., 01:40:20

On Wed, Sep 24, 2025 at 4:13 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I find this patch set quite hard to follow. 0001 altogether removes
> the use of XLOG_HEAP2_VISIBLE in cases where we use
> XLOG_HEAP2_MULTI_INSERT, but then 0007 (the next non-refactoring
> patch) begins half-removing the dependency on XLOG_HEAP2_VISIBLE,
> assisted by 0009 and 0010, and then later you come back and remove the
> other half of the dependency. I know it was I who proposed (off-list)
> first making the XLOG_HEAP2_VISIBLE record only deal with the VM page
> and not the heap buffer, but I'm not sure that idea quite worked out
> in terms of making this easier to follow. At the least, it seems weird
> that COPY FREEZE is an exception that gets handled in a different way
> than all the other cases, fully removing the dependency in one step.
> It would also be nice if each time you repost this, or maybe in a
> README that you post along beside the actual patches, you'd include
> some kind of roadmap to help the reader understand the internal
> structure of the patch set, like 1 does this, 2-9 get us to here,
> 10-whatever get us to this next place.

In attached v16, I’ve reverted to removing XLOG_HEAP2_VISIBLE
entirely, rather than first removing each caller's heap page from the
VM WAL chain. I reordered changes and squashed several refactoring
patches to improve patch-by-patch readability. This should make the
set read differently from earlier versions that removed
XLOG_HEAP2_VISIBLE and had more step-by-step mechanical refactoring.

I think if we plan to go all the way with removing XLOG_HEAP2_VISIBLE,
having intermediate patches that just set PD_ALL_VISIBLE when making
other heap pages are more confusing than helpful. Also, I think having
separate flags for setting PD_ALL_VISIBLE in the WAL record
over-complicated the code.

0001:  remove XLOG_HEAP2_VISIBLE from COPY FREEZE
0002 - 0005: various refactoring in advance of removing
XLOG_HEAP2_VISIBLE in pruning
0006: Pruning and freezing by vacuum sets the VM and emits a single
WAL record with those changes
0007: Reaping (phase III) by vacuum sets the VM and sets line pointers
unused in a single WAL record
0008 - 0009: XLOG_HEAP2_VISIBLE is eliminated
0010 - 0012: preparation for setting VM on-access
0013: set VM on-access
0014: set pd_prune_xid on insert

> I find myself fearful of the way that 0007 propagates the existing
> hacks around setting the VM bit into a new place:
>
> +               /*
> +                * We always emit a WAL record when setting
> PD_ALL_VISIBLE, but we are
> +                * careful not to emit a full page image unless
> +                * checksums/wal_log_hints are enabled. We only set
> the heap page LSN
> +                * if full page images were an option when emitting
> WAL. Otherwise,
> +                * subsequent modifications of the page may
> incorrectly skip emitting
> +                * a full page image.
> +                */
> +               if (do_prune || nplans > 0 ||
> +                       (xlrec.flags & XLHP_SET_PD_ALL_VIS &&
> XLogHintBitIsNeeded()))
> +                       PageSetLSN(page, lsn);
>
> I suppose it's not the worst thing to duplicate this logic, because
> you're later going to remove the original copy. But, it took me >10
> minutes to find the text in src/backend/access/transam/README, in the
> second half of the "Writing Hints" section, that explains the overall
> principle here, and since the patch set doesn't seem to touch that
> text, maybe you weren't even aware it was there.

I don't think that src/backend/access/transam/README must change with
my patch. It is still true that if the only change we are making to
the heap page is setting PD_ALL_VISIBLE and checksums/wal_log_hints
are disabled, we explicitly avoid an FPI and thus can't stamp the page
LSN.

> And, it's a little
> weird to have a single WAL record that is either a hint or not a hint
> depending on a complex set of conditions.

PD_ALL_VISIBLE is different from tuple hints and other page hints
because setting the VM is always WAL logged and when we replay that,
it will always set PD_ALL_VISIBLE, so PD_ALL_VISIBLE is effectively
always WAL-logged. The other hints aren't wal-logged unless checksums
are enabled and we need an FPI. So PD_ALL_VISIBLE is different from
other page hints in multiple ways. We can't make it more like those
hints because of needing to preserve the invariant that the VM is
never set when the page is clear. The only thing we could do is forbid
omitting the FPI even when checksums are not enabled.

> Anyway, I kind of wonder if it's time to back out the hack that I
> installed here many years ago. At the time, I thought that it would be
> bad if a VACUUM swept over the visibility map setting VM bits and as a
> result emitted an FPI for every page in the entire heap ... but
> everyone who is running with checksums has accepted that cost already,
> and with those being the default, that's probably going to be most
> people.

I agree that PD_ALL_VISIBLE persistence is complicated, but we have
other special cases that complicate the code for a performance
benefit. I guess the question is if we are saying people shouldn't run
without checksums in production. If that's true, then it's fine to
remove this optimization. Otherwise, I'm not so sure.

I think cloud providers generally have checksums enabled, but I don't
know what is common on-prem.

> It would be even more compelling if we were going to freeze,
> prune, and set all-visible on access, because then presumably the case
> where we touch a page and ONLY set the VM bit would be rare, so the
> cost of doing that wouldn't matter much, but I guess the patch doesn't
> go that far -- we can freeze or set all-visible on access but not
> prune, without which the scenario I was worrying about at the time is
> still fairly plausible, I think, if checksums are turned off.

With the whole set applied, we can prune and set the VM on access but
not freeze. I have a patch to do that, but it introduced noticeable
CPU overhead to prepare the freeze plans. I'd have to spend much more
time studying it to avoid regressing workloads where we don't end up
freezing but prepare the freeze plans during SELECT queries.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

09 октября 2025 г., 01:54:25

On Mon, Oct 6, 2025 at 6:40 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> In attached v16, I’ve reverted to removing XLOG_HEAP2_VISIBLE
> entirely, rather than first removing each caller's heap page from the
> VM WAL chain. I reordered changes and squashed several refactoring
> patches to improve patch-by-patch readability. This should make the
> set read differently from earlier versions that removed
> XLOG_HEAP2_VISIBLE and had more step-by-step mechanical refactoring.
>
> I think if we plan to go all the way with removing XLOG_HEAP2_VISIBLE,
> having intermediate patches that just set PD_ALL_VISIBLE when making
> other heap pages are more confusing than helpful. Also, I think having
> separate flags for setting PD_ALL_VISIBLE in the WAL record
> over-complicated the code.

I decided to reorder the patches to remove XLOG_HEAP2_VISIBLE from
vacuum phase III before removing it from vacuum phase I because
removing it from phase III doesn't require preliminary refactoring
patches. I've done that in the attached v17.

I've also added an experimental patch on the end that refactors large
chunks of heap_page_prune_and_freeze() into helpers. I got some
feedback off-list that heap_page_prune_and_freeze() is too unwieldy
now. I'm not sure how I feel about them yet, so I haven't documented
them or moved them up in the patch set to before changes to
heap_page_prune_and_freeze().

0001: Eliminate XLOG_HEAP2_VISIBLE from COPY FREEZE
0002: Eliminate XLOG_HEAP2_VISIBLE from phase III of vacuum
0003 - 0006: cleanup and refactoring to prepare for 0007
0007: Eliminate XLOG_HEAP2_VISIBLE from vacuum prune/freeze
0008 - 0009: Remove XLOG_HEAP2_VISIBLE
0010 - 0012: refactoring to prepare for 0013
0013: Set VM on-access
0014: Set pd_prune_xid on insert
0015: Experimental refactoring of heap_page_prune_and_freeze into helpers

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Andres Freund

Дата:

09 октября 2025 г., 21:18:49

Hi,

On 2025-10-08 18:54:25 -0400, Melanie Plageman wrote:
> +uint8
> +visibilitymap_set_vmbits(BlockNumber heapBlk,
> +                         Buffer vmBuf, uint8 flags,
> +                         const char *heapRelname)
> +{
> +    BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
> +    uint32        mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
> +    uint8        mapOffset = HEAPBLK_TO_OFFSET(heapBlk);
> +    Page        page;
> +    uint8       *map;
> +    uint8        status;
> +
> +#ifdef TRACE_VISIBILITYMAP
> +    elog(DEBUG1, "vm_set flags 0x%02X for %s %d",
> +         flags, heapRelname, heapBlk);
> +#endif

I like it doesn't take a Relation anymore, but I'd just pass the smgrrelation
instead, then you don't need to allocate the string in the caller, when it's
approximately never used.

Otherwise this looks pretty close to me.



> @@ -71,12 +84,12 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>      }
>  
>      /*
> -     * If we have a full-page image, restore it and we're done.
> +     * If we have a full-page image of the heap block, restore it and we're
> +     * done with the heap block.
>       */
> -    action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> -                                           (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> -                                           &buffer);
> -    if (action == BLK_NEEDS_REDO)
> +    if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> +                                      (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> +                                      &buffer) == BLK_NEEDS_REDO)
>      {
>          Page        page = BufferGetPage(buffer);
>          OffsetNumber *redirected;

Why move it around this way?


> @@ -138,36 +157,104 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          /* There should be no more data */
>          Assert((char *) frz_offsets == dataptr + datalen);
>  
> +        if ((vmflags & VISIBILITYMAP_VALID_BITS))
> +            PageSetAllVisible(page);
> +
> +        MarkBufferDirty(buffer);
> +
> +        /*
> +         * Always emit a WAL record when setting PD_ALL_VISIBLE but only emit
> +         * an FPI if checksums/wal_log_hints are enabled.

This comment reads as-if we're WAL logging here, but this is a
Wendy's^Wrecovery.

> Advance the page LSN
> +         * only if the record could include an FPI, since recovery skips
> +         * records <= the stamped LSN. Otherwise it might skip an earlier FPI
> +         * needed to repair a torn page.
> +         */

This is confusing, should probably just reference the stuff we did in the
!recovery case.


> +        if (do_prune || nplans > 0 ||
> +            ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHintBitIsNeeded()))
> +            PageSetLSN(page, lsn);
> +
>          /*
>           * Note: we don't worry about updating the page's prunability hints.
>           * At worst this will cause an extra prune cycle to occur soon.
>           */

Not your fault, but that seems odd? Why aren't we just doing the right thing?

>      /*
> -     * If we released any space or line pointers, update the free space map.
> +     * If we released any space or line pointers or set PD_ALL_VISIBLE or the
> +     * VM, update the freespace map.

I'd replace the first or with a , ;)


> +     * Even when no actual space is freed (e.g., when only marking the page
> +     * all-visible or frozen), we still update the FSM. Because the FSM is
> +     * unlogged and maintained heuristically, it often becomes stale on
> +     * standbys. If such a standby is later promoted and runs VACUUM, it will
> +     * skip recalculating free space for pages that were marked all-visible
> +     * (or all-frozen, depending on the mode). FreeSpaceMapVacuum can then
> +     * propagate overly optimistic free space values upward, causing future
> +     * insertions to select pages that turn out to be unusable. In bulk, this
> +     * can lead to long stalls.
> +     *
> +     * To prevent this, always refresh the FSM’s view when a page becomes
> +     * all-visible or all-frozen.

I'd s/refresh/update/, because refresh sounds more like rereading the current
state of the FSM, rather than changing the FSM.


> +        /* We don't have relation name during recovery, so use relfilenode */
> +        relname = psprintf("%u", rlocator.relNumber);
> +        old_vmbits = visibilitymap_set_vmbits(blkno, vmbuffer, vmflags, relname);
>  
> -            XLogRecordPageWithFreeSpace(rlocator, blkno, freespace);
> +        /* Only set VM page LSN if we modified the page */
> +        if (old_vmbits != vmflags)
> +        {
> +            Assert(BufferIsDirty(vmbuffer));
> +            PageSetLSN(BufferGetPage(vmbuffer), lsn);
>          }
> -        else
> -            UnlockReleaseBuffer(buffer);
> +        pfree(relname);

Hm. When can we actually enter the old_vmbits == vmflags case?  It might also
be fine to just say that we don't expect it to change but are mirroring the
code in visibilitymap_set().


I wonder if the VM specific redo portion should be in a common helper? Might
not be enough code to worry though...


> @@ -2070,8 +2079,24 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer,
>      xlhp_prune_items dead_items;
>      xlhp_prune_items unused_items;
>      OffsetNumber frz_offsets[MaxHeapTuplesPerPage];
> +    bool        do_prune = nredirected > 0 || ndead > 0 || nunused > 0;
> +    bool        do_set_vm = vmflags & VISIBILITYMAP_VALID_BITS;
>  
>      xlrec.flags = 0;
> +    regbuf_flags = REGBUF_STANDARD;
> +
> +    Assert((vmflags & VISIBILITYMAP_VALID_BITS) == vmflags);
> +
> +    /*
> +     * We can avoid an FPI if the only modification we are making to the heap
> +     * page is to set PD_ALL_VISIBLE and checksums/wal_log_hints are disabled.

Maybe s/an FPI/an FPI for the heap pae/?


> +     * Note that if we explicitly skip an FPI, we must not set the heap page
> +     * LSN later.
> +     */
> +    if (!do_prune &&
> +        nfrozen == 0 &&
> +        (!do_set_vm || !XLogHintBitIsNeeded()))
> +        regbuf_flags |= REGBUF_NO_IMAGE;

>      /*
>       * Prepare data for the buffer.  The arrays are not actually in the
> @@ -2079,7 +2104,11 @@ log_heap_prune_and_freeze(Relation relation, Buffer buffer,
>       * page image, the arrays can be omitted.
>       */
>      XLogBeginInsert();
> -    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
> +    XLogRegisterBuffer(0, buffer, regbuf_flags);
> +
> +    if (do_set_vm)
> +        XLogRegisterBuffer(1, vmbuffer, 0);

Seems a bit confusing that it's named regbuf_flags but isn't used all the
XLogRegisterBuffer calls. Maybe make the name more specific
(regbuf_flags_heap?)...

>      }
>      recptr = XLogInsert(RM_HEAP2_ID, info);
>  
> -    PageSetLSN(BufferGetPage(buffer), recptr);
> +    if (do_set_vm)
> +    {
> +        Assert(BufferIsDirty(vmbuffer));
> +        PageSetLSN(BufferGetPage(vmbuffer), recptr);
> +    }

> +    /*
> +     * We must bump the page LSN if pruning or freezing. If we are only
> +     * updating PD_ALL_VISIBLE, though, we can skip doing this unless
> +     * wal_log_hints/checksums are enabled. Torn pages are possible if we
> +     * update PD_ALL_VISIBLE without bumping the LSN, but this is deemed okay
> +     * for page hint updates.
> +     */

Arguably it's not a torn page if we only modified something as narrow as a
hint bit, and are redoing that change after recovery. But that's extremely
nitpicky.

I wonder if the comment explaining this should be put into one place and
reference it from all the different places.

> @@ -2860,6 +2867,29 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>                               VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
>                               InvalidOffsetNumber);
>  
> +    /*
> +     * Before marking dead items unused, check whether the page will become
> +     * all-visible once that change is applied.

So the function is named _would_ but here you say will :)


> This lets us reap the tuples
> +     * and mark the page all-visible within the same critical section,
> +     * enabling both changes to be emitted in a single WAL record. Since the
> +     * visibility checks may perform I/O and allocate memory, they must be
> +     * done outside the critical section.
> +     */
> +    if (heap_page_would_be_all_visible(vacrel, buffer,
> +                                       deadoffsets, num_offsets,
> +                                       &all_frozen, &visibility_cutoff_xid))
> +    {
> +        vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> +        if (all_frozen)
> +        {
> +            vmflags |= VISIBILITYMAP_ALL_FROZEN;
> +            Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> +        }
> +
> +        /* Take the lock on the vmbuffer before entering a critical section */
> +        LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);

It sure would be nice if we had documented the lock order between the heap
page and the corresponding VM page anywhere.  This is just doing what we did
before, so it's not this patch's fault, but I did get worried about it for a
moment.


> +/*
> + * Check whether the heap page in buf is all-visible except for the dead
> + * tuples referenced in the deadoffsets array.
> + *
> + * The visibility checks may perform IO and allocate memory so they must not
> + * be done in a critical section. This function is used by vacuum to determine
> + * if the page will be all-visible once it reaps known dead tuples. That way
> + * it can do both in the same critical section and emit a single WAL record.
> + *
> + * Returns true if the page is all-visible other than the provided
> + * deadoffsets and false otherwise.
> + *
> + * Output parameters:
> + *
> + *  - *all_frozen: true if every tuple on the page is frozen
> + *  - *visibility_cutoff_xid: newest xmin; valid only if page is all-visible
> + * Callers looking to verify that the page is already all-visible can call
> + * heap_page_is_all_visible().
> + *
> + * This logic is closely related to heap_prune_record_unchanged_lp_normal().
> + * If you modify this function, ensure consistency with that code. An
> + * assertion cross-checks that both remain in agreement. Do not introduce new
> + * side-effects.
> + */
> +static bool
> +heap_page_would_be_all_visible(LVRelState *vacrel, Buffer buf,
> +                               OffsetNumber *deadoffsets,
> +                               int ndeadoffsets,
> +                               bool *all_frozen,
> +                               TransactionId *visibility_cutoff_xid)
> +{
>      Page        page = BufferGetPage(buf);
>      BlockNumber blockno = BufferGetBlockNumber(buf);
>      OffsetNumber offnum,
>                  maxoff;
>      bool        all_visible = true;
> +    int            matched_dead_count = 0;
>  
>      *visibility_cutoff_xid = InvalidTransactionId;
>      *all_frozen = true;
>  
> +    Assert(ndeadoffsets == 0 || deadoffsets);
> +
> +#ifdef USE_ASSERT_CHECKING
> +    /* Confirm input deadoffsets[] is strictly sorted */
> +    if (ndeadoffsets > 1)
> +    {
> +        for (int i = 1; i < ndeadoffsets; i++)
> +            Assert(deadoffsets[i - 1] < deadoffsets[i]);
> +    }
> +#endif
> +
>      maxoff = PageGetMaxOffsetNumber(page);
>      for (offnum = FirstOffsetNumber;
>           offnum <= maxoff && all_visible;
> @@ -3649,9 +3712,15 @@ heap_page_is_all_visible(LVRelState *vacrel, Buffer buf,
>           */
>          if (ItemIdIsDead(itemid))
>          {
> -            all_visible = false;
> -            *all_frozen = false;
> -            break;
> +            if (!deadoffsets ||
> +                matched_dead_count >= ndeadoffsets ||
> +                deadoffsets[matched_dead_count] != offnum)
> +            {
> +                *all_frozen = all_visible = false;
> +                break;
> +            }
> +            matched_dead_count++;
> +            continue;
>          }
>  
>          Assert(ItemIdIsNormal(itemid));

Hm, what about an assert checking that matched_dead_count == ndeadoffsets at
the end?


> From 6b5fc27f0d80bab1df86a2e6fb51b64fd20c3cbb Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 15 Sep 2025 12:06:19 -0400
> Subject: [PATCH v17 03/15] Assorted trivial heap_page_prune_and_freeze cleanup

Seems like a good idea, but I'm too lazy to go through this in detail.


> From c69a5219a9b792f3c9f6dc730b8810a88d088ae6 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 16 Sep 2025 14:22:10 -0400
> Subject: [PATCH v17 04/15] Add helper for freeze determination to
>  heap_page_prune_and_freeze
> 
> After scanning through the line pointers on the heap page during
> vacuum's first phase, we use several statuses and information we
> collected to determine whether or not we will use the freeze plans we
> assembled.
> 
> Do this in a helper for better readability.


> @@ -663,85 +775,11 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
>       * Decide if we want to go ahead with freezing according to the freeze
>       * plans we prepared, or not.
>       */
> -    do_freeze = false;
> - ...
> +    do_freeze = heap_page_will_freeze(params->relation, buffer,
> +                                      did_tuple_hint_fpi,
> +                                      do_prune,
> +                                      do_hint_prune,
> +                                      &prstate);
>  

Assuming this is just moving the code, I like this quite bit.


> From d4a4be3eed25853fc1ea84ebc2cbe0226afd823a Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 15 Sep 2025 16:25:44 -0400
> Subject: [PATCH v17 05/15] Update PruneState.all_[visible|frozen] earlier in
>  pruning
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> In the prune/freeze path, we currently delay clearing all_visible and
> all_frozen when dead items are present. This allows opportunistic
> freezing if the page would otherwise be fully frozen, since those dead
> items are later removed in vacuum’s third phase.
> 
> However, if no freezing will be attempted, there’s no need to delay.
> Clearing the flags promptly avoids extra bookkeeping in
> heap_prune_unchanged_lp_normal(). At present this has no runtime effect
> because all callers that consider setting the VM also attempt freezing,
> but future callers (e.g. on-access pruning) may want to set the VM
> without preparing freeze plans.

s/heap_prune_unchanged_lp_normal/heap_prune_record_unchanged_lp_normal/

I think this should make it clearer that this is about reducing overhead for
future use of this code in on-access-pruning.


> We also used to defer clearing all_visible and all_frozen until after
> computing the visibility cutoff XID. By determining the cutoff earlier,
> we can update these flags immediately after deciding whether to
> opportunistically freeze. This is necessary if we want to set the VM in
> the same WAL record that prunes and freezes tuples on the page.

I think this last sentence needs to be first. This is the only really
important thing in this patch, afaict.



> From 86193a71d2ff9649b5b1c1e6963bd610285ad369 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 3 Oct 2025 15:57:02 -0400
> Subject: [PATCH v17 06/15] Make heap_page_is_all_visible independent of
>  LVRelState
> 
> Future commits will use this function inside of pruneheap.c where we do
> not have access to the LVRelState. We only need a few parameters from
> the LVRelState, so just pass those in explicitly.
> 
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
> Reviewed-by: Robert Haas <robertmhaas@gmail.com>
> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com

Makes sense. I don't think we need to wait for other stuff to be committed to
commit this.


> From dde0dfc578137f7c93f9a0e34af38dcdb841b080 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 8 Oct 2025 15:39:01 -0400
> Subject: [PATCH v17 07/15] Eliminate XLOG_HEAP2_VISIBLE from vacuum
>  prune/freeze
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit

Seems very mildly odd that 0002 references phase III in the subject, but this
doesn't...

(I'm just very lightly skimming from this point on)


> During vacuum's first and third phases, we examine tuples' visibility
> to determine if we can set the page all-visible in the visibility map.
> 
> Previously, this check compared tuple xmins against a single XID chosen at
> the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> enables future work to set the VM during on-access pruning, since ordinary
> queries have access to GlobalVisState but not OldestXmin.
> 
> This also benefits vacuum directly: GlobalVisState may advance
> during a vacuum, allowing more pages to become considered all-visible.
> In the rare case that it moves backward, VACUUM falls back to OldestXmin
> to ensure we don’t attempt to freeze a dead tuple that wasn’t yet
> prunable according to the GlobalVisState.

It could, but it currently won't advance in vacuum, right?


> From e412f9298b0735d1091f4769ace4d2d1a7e62312 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 29 Jul 2025 09:57:13 -0400
> Subject: [PATCH v17 12/15] Inline TransactionIdFollows/Precedes()
> 
> Calling these from on-access pruning code had noticeable overhead in a
> profile. There does not seem to be a reason not to inline them.

Makes sense, just commit this ahead of the more complicated rest.



> From 54fcba140e515eba0eb1f9d48e7d5875b92e7e39 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 29 Jul 2025 14:34:30 -0400
> Subject: [PATCH v17 13/15] Allow on-access pruning to set pages all-visible

Sorry, will have to look at this another time...

Greetings,

Andres Freund

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

14 октября 2025 г., 06:31:04

On Thu, 9 Oct 2025 at 03:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Mon, Oct 6, 2025 at 6:40 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > In attached v16, I’ve reverted to removing XLOG_HEAP2_VISIBLE
> > entirely, rather than first removing each caller's heap page from the
> > VM WAL chain. I reordered changes and squashed several refactoring
> > patches to improve patch-by-patch readability. This should make the
> > set read differently from earlier versions that removed
> > XLOG_HEAP2_VISIBLE and had more step-by-step mechanical refactoring.
> >
> > I think if we plan to go all the way with removing XLOG_HEAP2_VISIBLE,
> > having intermediate patches that just set PD_ALL_VISIBLE when making
> > other heap pages are more confusing than helpful. Also, I think having
> > separate flags for setting PD_ALL_VISIBLE in the WAL record
> > over-complicated the code.
>
> I decided to reorder the patches to remove XLOG_HEAP2_VISIBLE from
> vacuum phase III before removing it from vacuum phase I because
> removing it from phase III doesn't require preliminary refactoring
> patches. I've done that in the attached v17.
>
> I've also added an experimental patch on the end that refactors large
> chunks of heap_page_prune_and_freeze() into helpers. I got some
> feedback off-list that heap_page_prune_and_freeze() is too unwieldy
> now. I'm not sure how I feel about them yet, so I haven't documented
> them or moved them up in the patch set to before changes to
> heap_page_prune_and_freeze().
>
> 0001: Eliminate XLOG_HEAP2_VISIBLE from COPY FREEZE
> 0002: Eliminate XLOG_HEAP2_VISIBLE from phase III of vacuum
> 0003 - 0006: cleanup and refactoring to prepare for 0007
> 0007: Eliminate XLOG_HEAP2_VISIBLE from vacuum prune/freeze
> 0008 - 0009: Remove XLOG_HEAP2_VISIBLE
> 0010 - 0012: refactoring to prepare for 0013
> 0013: Set VM on-access
> 0014: Set pd_prune_xid on insert
> 0015: Experimental refactoring of heap_page_prune_and_freeze into helpers
>
> - Melanie

Hi! Should we also bump XLOG_PAGE_MAGIC after d96f87332 & add323da40a
or do we wait for full set to be committed?
--
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Michael Paquier

Дата:

14 октября 2025 г., 06:42:59

On Tue, Oct 14, 2025 at 08:31:04AM +0500, Kirill Reshke wrote:
> Hi! Should we also bump XLOG_PAGE_MAGIC after d96f87332 & add323da40a
> or do we wait for full set to be committed?

I may be missing something, of course, but d96f87332 has not changed
the WAL format, VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN
existing before that.  The change in xl_heap_prune as done in
add323da40a6 should have bumped the format number.
--
Michael

Вложения

signature.asc

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

14 октября 2025 г., 17:16:24

On Mon, Oct 13, 2025 at 11:43 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Oct 14, 2025 at 08:31:04AM +0500, Kirill Reshke wrote:
> > Hi! Should we also bump XLOG_PAGE_MAGIC after d96f87332 & add323da40a
> > or do we wait for full set to be committed?
>
> I may be missing something, of course, but d96f87332 has not changed
> the WAL format, VISIBILITYMAP_ALL_VISIBLE and VISIBILITYMAP_ALL_FROZEN
> existing before that.  The change in xl_heap_prune as done in
> add323da40a6 should have bumped the format number.

Oops! Thanks for reporting.

I messed up and forgot to do this. And, if I'm not misunderstanding
the criteria, I did the same thing at the beginning of September with
4b5f206de2bb. I've committed the bump. Hopefully I learned my lesson.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

15 октября 2025 г., 02:26:57

Thanks so much for the review! I've addressed all your feedback except
what is commented on inline below.
I've gone ahead and committed the preliminary patches that you thought
were ready to commit.

Attached v18 is what remains.

0001 - 0003: refactoring
0004 - 0006: finish eliminating XLOG_HEAP2_VISIBLE
0007 - 0009: refactoring
0010: Set VM on-access
0011: Set prune xid on insert
0012: Some refactoring for discussion

For 0001, I got feedback heap_page_prune_and_freeze() has too many
arguments, so I tried to address that. I'm interested to know if folks
like this more.

0011 still needs a bit of investigation to understand fully if
anything else in the index-killtuples test needs to be changed to make
sure we have the same coverage.

0012 is sort of WIP. I got feedback heap_page_prune_and_freeze() was
too long and should be split up into helpers. I want to know if this
split makes sense. I can pull it down the patch stack if so.

Only 0001 and 0012 are optional amongst the refactoring patches. The
others are required to make on-access VM-setting possible or viable.

On Thu, Oct 9, 2025 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
>
> > @@ -71,12 +84,12 @@ heap_xlog_prune_freeze(XLogReaderState *record)
> >       }
> > -     action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > -                                                                                (xlrec.flags & XLHP_CLEANUP_LOCK)
!=0, 
> > -                                                                                &buffer);
> > -     if (action == BLK_NEEDS_REDO)
> > +     if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > +                                                                       (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> > +                                                                       &buffer) == BLK_NEEDS_REDO)
> >       {
> >               Page            page = BufferGetPage(buffer);
> >               OffsetNumber *redirected;
>
> Why move it around this way?

Because there will be an action for the visibility map
XLogReadBufferForRedoExtended(). I could have renamed it heap_action,
but it is being used only in one place, so I preferred to just cut it
to avoid any confusion.

> > Advance the page LSN
> > +              * only if the record could include an FPI, since recovery skips
> > +              * records <= the stamped LSN. Otherwise it might skip an earlier FPI
> > +              * needed to repair a torn page.
> > +              */
>
> This is confusing, should probably just reference the stuff we did in the
> !recovery case.

I fixed this and addressed all your feedback related to this before committing.

> > +             if (do_prune || nplans > 0 ||
> > +                     ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHintBitIsNeeded()))
> > +                     PageSetLSN(page, lsn);
> > +
> >               /*
> >                * Note: we don't worry about updating the page's prunability hints.
> >                * At worst this will cause an extra prune cycle to occur soon.
> >                */
>
> Not your fault, but that seems odd? Why aren't we just doing the right thing?

The comment dates back to 6f10eb2. I imagine no one ever bothered to
fuss with extracting the XID. You could change
heap_page_prune_execute() to return the right value -- though that's a
bit ugly since it is used in normal operation as well as recovery.

> I wonder if the VM specific redo portion should be in a common helper? Might
> not be enough code to worry though...

I think it might be more code as a helper at this point.

> > @@ -2860,6 +2867,29 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
> >                                                        VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
> >                                                        InvalidOffsetNumber);
> >
> > +     /*
> > +      * Before marking dead items unused, check whether the page will become
> > +      * all-visible once that change is applied.
>
> So the function is named _would_ but here you say will :)

I thought about it more and still feel that this function name should
contain "would". From vacuum's perspective it is "will" -- because it
knows it will remove those dead items, but from the function's
perspective it is hypothetical. I changed the comment though.

> > +     if (heap_page_would_be_all_visible(vacrel, buffer,
> > +                                                                        deadoffsets, num_offsets,
> > +                                                                        &all_frozen, &visibility_cutoff_xid))
> > +     {
> > +             vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> > +             if (all_frozen)
> > +             {
> > +                     vmflags |= VISIBILITYMAP_ALL_FROZEN;
> > +                     Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> > +             }
> > +
> > +             /* Take the lock on the vmbuffer before entering a critical section */
> > +             LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
>
> It sure would be nice if we had documented the lock order between the heap
> page and the corresponding VM page anywhere.  This is just doing what we did
> before, so it's not this patch's fault, but I did get worried about it for a
> moment.

Well, the comment above the visibilitymap_set* functions says what
expectations they have for the heap page being locked.

> > +static bool
> > +heap_page_would_be_all_visible(LVRelState *vacrel, Buffer buf,
> > +                                                        OffsetNumber *deadoffsets,
> > +                                                        int ndeadoffsets,
> > +                                                        bool *all_frozen,
> > +                                                        TransactionId *visibility_cutoff_xid)
> > +{
> >       Page            page = BufferGetPage(buf);

> Hm, what about an assert checking that matched_dead_count == ndeadoffsets at
> the end?

I was going to put an Assert(ndeadoffsets <= matched_dead_count), but
then I started wondering if there is a way we could end up with fewer
dead items than we collected during phase I.

I had thought about if we dropped an index and then did on-access
pruning -- but we don't allow setting LP_DEAD items LP_UNUSED in
on-access pruning. So, maybe this is safe... I can do a follow-on
commit to add the assert. But I'm just not 100% sure I've thought of
all the cases where we might end up with fewer dead items.

> > During vacuum's first and third phases, we examine tuples' visibility
> > to determine if we can set the page all-visible in the visibility map.
> >
> > Previously, this check compared tuple xmins against a single XID chosen at
> > the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> > enables future work to set the VM during on-access pruning, since ordinary
> > queries have access to GlobalVisState but not OldestXmin.
> >
> > This also benefits vacuum directly: GlobalVisState may advance
> > during a vacuum, allowing more pages to become considered all-visible.
> > In the rare case that it moves backward, VACUUM falls back to OldestXmin
> > to ensure we don’t attempt to freeze a dead tuple that wasn’t yet
> > prunable according to the GlobalVisState.
>
> It could, but it currently won't advance in vacuum, right?

I thought it was possible for it to advance when calling
heap_prune_satisfies_vacuum() ->
GlobalVisTestIsRemovableXid()->...GlobalVisUpdate(). This case isn't
going to be common, but some things can cause us to update it.

We have talked about explicitly updating GlobalVisState more often
during vacuums of large tables. But I was under the impression that it
was at least possible for it to advance during vacuum now.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

29 октября 2025 г., 14:03:14

On Wed, 15 Oct 2025 at 04:27, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Thanks so much for the review! I've addressed all your feedback except
> what is commented on inline below.
> I've gone ahead and committed the preliminary patches that you thought
> were ready to commit.
>
> Attached v18 is what remains.
>
> 0001 - 0003: refactoring
> 0004 - 0006: finish eliminating XLOG_HEAP2_VISIBLE
> 0007 - 0009: refactoring
> 0010: Set VM on-access
> 0011: Set prune xid on insert
> 0012: Some refactoring for discussion
>
> For 0001, I got feedback heap_page_prune_and_freeze() has too many
> arguments, so I tried to address that. I'm interested to know if folks
> like this more.
>
> 0011 still needs a bit of investigation to understand fully if
> anything else in the index-killtuples test needs to be changed to make
> sure we have the same coverage.
>
> 0012 is sort of WIP. I got feedback heap_page_prune_and_freeze() was
> too long and should be split up into helpers. I want to know if this
> split makes sense. I can pull it down the patch stack if so.
>
> Only 0001 and 0012 are optional amongst the refactoring patches. The
> others are required to make on-access VM-setting possible or viable.
>
> On Thu, Oct 9, 2025 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > > @@ -71,12 +84,12 @@ heap_xlog_prune_freeze(XLogReaderState *record)
> > >       }
> > > -     action = XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > > -                                                                                (xlrec.flags &
XLHP_CLEANUP_LOCK)!= 0, 
> > > -                                                                                &buffer);
> > > -     if (action == BLK_NEEDS_REDO)
> > > +     if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
> > > +                                                                       (xlrec.flags & XLHP_CLEANUP_LOCK) != 0,
> > > +                                                                       &buffer) == BLK_NEEDS_REDO)
> > >       {
> > >               Page            page = BufferGetPage(buffer);
> > >               OffsetNumber *redirected;
> >
> > Why move it around this way?
>
> Because there will be an action for the visibility map
> XLogReadBufferForRedoExtended(). I could have renamed it heap_action,
> but it is being used only in one place, so I preferred to just cut it
> to avoid any confusion.
>
> > > Advance the page LSN
> > > +              * only if the record could include an FPI, since recovery skips
> > > +              * records <= the stamped LSN. Otherwise it might skip an earlier FPI
> > > +              * needed to repair a torn page.
> > > +              */
> >
> > This is confusing, should probably just reference the stuff we did in the
> > !recovery case.
>
> I fixed this and addressed all your feedback related to this before committing.
>
> > > +             if (do_prune || nplans > 0 ||
> > > +                     ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHintBitIsNeeded()))
> > > +                     PageSetLSN(page, lsn);
> > > +
> > >               /*
> > >                * Note: we don't worry about updating the page's prunability hints.
> > >                * At worst this will cause an extra prune cycle to occur soon.
> > >                */
> >
> > Not your fault, but that seems odd? Why aren't we just doing the right thing?
>
> The comment dates back to 6f10eb2. I imagine no one ever bothered to
> fuss with extracting the XID. You could change
> heap_page_prune_execute() to return the right value -- though that's a
> bit ugly since it is used in normal operation as well as recovery.
>
> > I wonder if the VM specific redo portion should be in a common helper? Might
> > not be enough code to worry though...
>
> I think it might be more code as a helper at this point.
>
> > > @@ -2860,6 +2867,29 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
> > >                                                        VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
> > >                                                        InvalidOffsetNumber);
> > >
> > > +     /*
> > > +      * Before marking dead items unused, check whether the page will become
> > > +      * all-visible once that change is applied.
> >
> > So the function is named _would_ but here you say will :)
>
> I thought about it more and still feel that this function name should
> contain "would". From vacuum's perspective it is "will" -- because it
> knows it will remove those dead items, but from the function's
> perspective it is hypothetical. I changed the comment though.
>
> > > +     if (heap_page_would_be_all_visible(vacrel, buffer,
> > > +                                                                        deadoffsets, num_offsets,
> > > +                                                                        &all_frozen, &visibility_cutoff_xid))
> > > +     {
> > > +             vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> > > +             if (all_frozen)
> > > +             {
> > > +                     vmflags |= VISIBILITYMAP_ALL_FROZEN;
> > > +                     Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> > > +             }
> > > +
> > > +             /* Take the lock on the vmbuffer before entering a critical section */
> > > +             LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> >
> > It sure would be nice if we had documented the lock order between the heap
> > page and the corresponding VM page anywhere.  This is just doing what we did
> > before, so it's not this patch's fault, but I did get worried about it for a
> > moment.
>
> Well, the comment above the visibilitymap_set* functions says what
> expectations they have for the heap page being locked.
>
> > > +static bool
> > > +heap_page_would_be_all_visible(LVRelState *vacrel, Buffer buf,
> > > +                                                        OffsetNumber *deadoffsets,
> > > +                                                        int ndeadoffsets,
> > > +                                                        bool *all_frozen,
> > > +                                                        TransactionId *visibility_cutoff_xid)
> > > +{
> > >       Page            page = BufferGetPage(buf);
>
> > Hm, what about an assert checking that matched_dead_count == ndeadoffsets at
> > the end?
>
> I was going to put an Assert(ndeadoffsets <= matched_dead_count), but
> then I started wondering if there is a way we could end up with fewer
> dead items than we collected during phase I.
>
> I had thought about if we dropped an index and then did on-access
> pruning -- but we don't allow setting LP_DEAD items LP_UNUSED in
> on-access pruning. So, maybe this is safe... I can do a follow-on
> commit to add the assert. But I'm just not 100% sure I've thought of
> all the cases where we might end up with fewer dead items.
>
> > > During vacuum's first and third phases, we examine tuples' visibility
> > > to determine if we can set the page all-visible in the visibility map.
> > >
> > > Previously, this check compared tuple xmins against a single XID chosen at
> > > the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> > > enables future work to set the VM during on-access pruning, since ordinary
> > > queries have access to GlobalVisState but not OldestXmin.
> > >
> > > This also benefits vacuum directly: GlobalVisState may advance
> > > during a vacuum, allowing more pages to become considered all-visible.
> > > In the rare case that it moves backward, VACUUM falls back to OldestXmin
> > > to ensure we don’t attempt to freeze a dead tuple that wasn’t yet
> > > prunable according to the GlobalVisState.
> >
> > It could, but it currently won't advance in vacuum, right?
>
> I thought it was possible for it to advance when calling
> heap_prune_satisfies_vacuum() ->
> GlobalVisTestIsRemovableXid()->...GlobalVisUpdate(). This case isn't
> going to be common, but some things can cause us to update it.
>
> We have talked about explicitly updating GlobalVisState more often
> during vacuums of large tables. But I was under the impression that it
> was at least possible for it to advance during vacuum now.
>
> - Melanie


Hi!

First of all, I rechecked v18 patches, they still cause WAL bytes
reduction. In a no-index vacuum case my result is a 39% reduction in
WAL bytes.
Almost like in your first message.

Here are my comments about code, I may be very nitpicky in minor
details, sorry for that

In 0003:

get_conflict_xid function logic is bit strange for me, it assigns
conflict_xid to some value,  but in the very end we have

> + /*
>+ * We can omit the snapshot conflict horizon if we are not pruning or
>+ * freezing any tuples and are setting an already all-visible page
>+ * all-frozen in the VM. In this case, all of the tuples on the page must
>+ * already be visible to all MVCC snapshots on the standby.
>+ */
>+ if (!do_prune && !do_freeze &&
>+ do_set_vm && blk_already_av && set_blk_all_frozen)
> + conflict_xid = InvalidTransactionId;

I feel like we should move this check to the beginning of the
function, and just  return InvalidTransactionId in that if cond.

in 0004:

> + if (old_vmbits == new_vmbits)
> + {
> + LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> + /* Unset so we don't emit WAL since no change occurred */
> + do_set_vm = false;
> + }

and then

>  END_CRIT_SECTION();
> + if (do_set_vm)
> + LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> +

So, in the heap_page_prune_and_freeze function we release buffer lock
both inside and outside the crit section. As I understand, this is
actually safe. I also looked in other xlog coding practices for other
access methods (GiST, GIN, ....), and I can see that some of them
release buffers before leaving crit sections and some of them after.
But I still suggest to be in sync with 'Write-Ahead Log Coding'
section of
src/backend/access/transam/README, which says:

6. END_CRIT_SECTION()

7. Unlock and unpin the buffer(s).

Let's be consistent in this at least in this single function context.


In 0010:

I'm not terribly convenient that adding SO_ALLOW_VM_SET to TAM
ScanOptions is the right thing to do. Looks like VM bits are something
that make sense for HEAP AM for not for any TAM. So, don't we break
some layer of abstraction here? Would it be better for HEAP AM to set
some flags in heap_beginscan?


Overall 0001-0003 are mostly fine for me, 0004-0006 are the right
thing to do IMHO, but maybe they need some more review from hackers.
Other patches i did not review in a great detail, will return to this
later



--
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

04 ноября 2025 г., 19:48:15

Thanks for the review!

On Wed, Oct 29, 2025 at 7:03 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> get_conflict_xid function logic is bit strange for me, it assigns
> conflict_xid to some value,  but in the very end we have
>
> > + /*
> >+ * We can omit the snapshot conflict horizon if we are not pruning or
> >+ * freezing any tuples and are setting an already all-visible page
> >+ * all-frozen in the VM. In this case, all of the tuples on the page must
> >+ * already be visible to all MVCC snapshots on the standby.
> >+ */
> >+ if (!do_prune && !do_freeze &&
> >+ do_set_vm && blk_already_av && set_blk_all_frozen)
> > + conflict_xid = InvalidTransactionId;
>
> I feel like we should move this check to the beginning of the
> function, and just  return InvalidTransactionId in that if cond.

You're right. I've changed it as you suggest in attached v19.

> > + if (old_vmbits == new_vmbits)
> > + {
> > + LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> > + /* Unset so we don't emit WAL since no change occurred */
> > + do_set_vm = false;
> > + }
>
> and then
>
> >  END_CRIT_SECTION();
> > + if (do_set_vm)
> > + LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> > +
>
> So, in the heap_page_prune_and_freeze function we release buffer lock
> both inside and outside the crit section. As I understand, this is
> actually safe. I also looked in other xlog coding practices for other
> access methods (GiST, GIN, ....), and I can see that some of them
> release buffers before leaving crit sections and some of them after.
> But I still suggest to be in sync with 'Write-Ahead Log Coding'
> section of
> src/backend/access/transam/README, which says:
>
> 6. END_CRIT_SECTION()
>
> 7. Unlock and unpin the buffer(s).
>
> Let's be consistent in this at least in this single function context.

I see what you are saying. However, I don't see a good way to
determine whether or not we need to unlock the VM without introducing
another local variable in the outermost scope -- like "unlock_vm".
This function already has a lot of local variables, so I'm loath to do
that. And we want do_set_vm to reflect whether or not we actually set
it in case it gets used in the future.

This function doesn't lock or unlock the heap buffer so it doesn't
seem as urgent to me to follow the letter of the law in this case.

Attached patch doesn't have this change, but this is what it would look like:

    /* Lock vmbuffer before entering a critical section */
+   unlock_vm = do_set_vm;
    if (do_set_vm)
        LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);

@@ -1112,12 +1114,9 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
            old_vmbits = visibilitymap_set(blockno,
                                           vmbuffer, new_vmbits,
                                           params->relation->rd_locator);
-           if (old_vmbits == new_vmbits)
-           {
-               LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
-               /* Unset so we don't emit WAL since no change occurred */
-               do_set_vm = false;
-           }
+
+           /* Unset so we don't emit WAL since no change occurred */
+           do_set_vm = old_vmbits != new_vmbits;
        }

        /*
@@ -1145,7 +1144,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,

    END_CRIT_SECTION();

-   if (do_set_vm)
+   if (unlock_vm)
        LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);

> In 0010:
>
> I'm not terribly convenient that adding SO_ALLOW_VM_SET to TAM
> ScanOptions is the right thing to do. Looks like VM bits are something
> that make sense for HEAP AM for not for any TAM. So, don't we break
> some layer of abstraction here? Would it be better for HEAP AM to set
> some flags in heap_beginscan?

I don't see another good way of doing it.

The information about whether or not the relation is modified in the
query is gathered during planning and saved in the plan. We need to
get that information to the scan descriptor, which is all we have when
we call heap_page_prune_opt() during the scan. The scan descriptor is
created by the table AM implementations of scan_begin(). The table AM
callbacks don't pass down the plan -- which makes sense; the scan
shouldn't know about the plan. They do pass down flags, so I thought
it made the most sense to add a flag. Note that I was able to avoid
modifying the actual table and index AM callbacks (scan_begin() and
ambeginscan()). I only made new wrappers that took "modifies_rel".

Now, it is true that referring to the VM is somewhat of a layering
violation. Though, other table AMs may use the information about if
the query modifies the relation -- which is really what this flag
represents. The ScanOptions are usually either a type or a call to
action. Which is why I felt a bit uncomfortable calling it something
like SO_MODIFIES_REL -- which is less of an option and more a piece of
information. And it makes it sound like the scan modifies the rel,
which is not the case. I wonder if there is another solution. Or maybe
we call it SO_QUERY_MODIFIES_REL?

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

18 ноября 2025 г., 02:07:27

Attached v20 has general cleanup, changes to the table/index AM
callbacks detailed below, and it moves the
heap_page_prune_and_freeze() refactoring commit down the stack to
0004.

0001 - 0003 are fairly trivial cleanup patches. I think they are ready
to commit, so if I don't hear any objections in the next few days,
I'll go ahead and commit them.

On Tue, Nov 4, 2025 at 11:48 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 7:03 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > In 0010:
> >
> > I'm not terribly convenient that adding SO_ALLOW_VM_SET to TAM
> > ScanOptions is the right thing to do. Looks like VM bits are something
> > that make sense for HEAP AM for not for any TAM. So, don't we break
> > some layer of abstraction here? Would it be better for HEAP AM to set
> > some flags in heap_beginscan?
>
> I don't see another good way of doing it.
>
> The information about whether or not the relation is modified in the
> query is gathered during planning and saved in the plan. We need to
> get that information to the scan descriptor, which is all we have when
> we call heap_page_prune_opt() during the scan. The scan descriptor is
> created by the table AM implementations of scan_begin(). The table AM
> callbacks don't pass down the plan -- which makes sense; the scan
> shouldn't know about the plan. They do pass down flags, so I thought
> it made the most sense to add a flag. Note that I was able to avoid
> modifying the actual table and index AM callbacks (scan_begin() and
> ambeginscan()). I only made new wrappers that took "modifies_rel".
>
> Now, it is true that referring to the VM is somewhat of a layering
> violation. Though, other table AMs may use the information about if
> the query modifies the relation -- which is really what this flag
> represents. The ScanOptions are usually either a type or a call to
> action. Which is why I felt a bit uncomfortable calling it something
> like SO_MODIFIES_REL -- which is less of an option and more a piece of
> information. And it makes it sound like the scan modifies the rel,
> which is not the case. I wonder if there is another solution. Or maybe
> we call it SO_QUERY_MODIFIES_REL?

Attached v20 changes the ScanOption name to SO_HINT_REL_READ_ONLY and
removes the new helper functions which took modifies_rel as a
parameter. Instead it modifies the existing
table_beginscan()/index_beginscan() helpers and the relevant callbacks
they invoke to have a new flags parameter. These are additional caller
provider flags.

In master, the IndexScan structures and helpers don't use ScanOptions,
but since I'm using them for properties of the base relation, I think
it is fine. I'm not sure if I should name the parameter base_rel_flags
instead of flags for the index-related callbacks and helpers or if
leaving it more generic is better, though.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

19 ноября 2025 г., 12:35:39

On Tue, 18 Nov 2025 at 04:07, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Attached v20 has general cleanup, changes to the table/index AM
> callbacks detailed below, and it moves the
> heap_page_prune_and_freeze() refactoring commit down the stack to
> 0004.
>
> 0001 - 0003 are fairly trivial cleanup patches. I think they are ready
> to commit, so if I don't hear any objections in the next few days,
> I'll go ahead and commit them.
>

Hi! I looked up these 0002-0003 patches once again, LGTM. In
particular, I think 0002 & 0003 makes VM bits management more simple.
My only review comment is about 0003:
Should we make frz_conflict_horizon not a heap_page_will_freeze's
argument but rather just another field of  PruneState struct? If i'm
not mistaken, 'frz_conflict_horizon' fits good to be a part of pruning
state

-- 
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

20 ноября 2025 г., 02:13:30

On Wed, Nov 19, 2025 at 4:35 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> Hi! I looked up these 0002-0003 patches once again, LGTM. In
> particular, I think 0002 & 0003 makes VM bits management more simple.

Thanks for the review!

> My only review comment is about 0003:
> Should we make frz_conflict_horizon not a heap_page_will_freeze's
> argument but rather just another field of  PruneState struct? If i'm
> not mistaken, 'frz_conflict_horizon' fits good to be a part of pruning
> state

Since it is passed into one of the helpers, I think I agree. Attached
v21 has this change.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

20 ноября 2025 г., 20:19:58

On Wed, Nov 19, 2025 at 6:13 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Since it is passed into one of the helpers, I think I agree. Attached
> v21 has this change.

I've committed the first three patches. Attached v22 is the remaining
patches which set the VM in heap_page_prune_and_freeze() for vacuum
and then allow on-access pruning to also set the VM.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Dagfinn Ilmari Mannsåker

Дата:

20 ноября 2025 г., 20:55:05

Melanie Plageman <melanieplageman@gmail.com> writes:

> +            PruneFreezeParams params = {.relation = relation,.buffer = buffer,
> +                .reason = PRUNE_ON_ACCESS,.options = 0,
> +                .vistest = vistest,.cutoffs = NULL
> +            };

I didn't pay much attention to this thread, so I didn't notice this
until it got committed, but I'd like to lodge an objection to this
formatting, especially the lack of spaces before the field names. This
would be much more readable with one struct field per line, i.e.

    PruneFreezeParams params = {
        .relation = rel,
                .buffer = buf,
        .reason = PRUNE_VACUUM_SCAN,
        .options = HEAP_PAGE_PRUNE_FREEZE,
        .vistest = vacrel->vistest,
        .cutoffs = &vacrel->cutoffs,
    };

or at a pinch, if we're really being stingy with the vertical space:

    PruneFreezeParams params = {
        .relation = rel, .buffer = buf,
                .reason = PRUNE_VACUUM_SCAN, .options = HEAP_PAGE_PRUNE_FREEZE,
        .vistest = vacrel->vistest, .cutoffs = &vacrel->cutoffs,
    };

I had a quick grep, and every other designated struct initialiser I
could find uses the one-field-per-line form, but they're not consistent
about the comma after the last field.  I personally prefer having it, so
that one can add more fields later without having to modify the
unrelated line.

- ilmari

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Dagfinn Ilmari Mannsåker

Дата:

20 ноября 2025 г., 21:02:20

Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> writes:

> Melanie Plageman <melanieplageman@gmail.com> writes:
>
>> +            PruneFreezeParams params = {.relation = relation,.buffer = buffer,
>> +                .reason = PRUNE_ON_ACCESS,.options = 0,
>> +                .vistest = vistest,.cutoffs = NULL
>> +            };
>
> I didn't pay much attention to this thread, so I didn't notice this
> until it got committed, but I'd like to lodge an objection to this
> formatting, especially the lack of spaces before the field names. This
> would be much more readable with one struct field per line, i.e.
>
>     PruneFreezeParams params = {
>         .relation = rel,
>                 .buffer = buf,
>         .reason = PRUNE_VACUUM_SCAN,
>         .options = HEAP_PAGE_PRUNE_FREEZE,
>         .vistest = vacrel->vistest,
>         .cutoffs = &vacrel->cutoffs,
>     };

D'oh, my mail client untabified the .buffer line while I was editing it,
that should of course be:

    PruneFreezeParams params = {
        .relation = rel,
        .buffer = buf,
        .reason = PRUNE_VACUUM_SCAN,
        .options = HEAP_PAGE_PRUNE_FREEZE,
        .vistest = vacrel->vistest,
        .cutoffs = &vacrel->cutoffs,
    };

- ilmari

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

21 ноября 2025 г., 01:23:05

On Thu, Nov 20, 2025 at 12:55 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:
>
> I didn't pay much attention to this thread, so I didn't notice this
> until it got committed, but I'd like to lodge an objection to this
> formatting, especially the lack of spaces before the field names. This
> would be much more readable with one struct field per line, i.e.
>
>         PruneFreezeParams params = {
>                 .relation = rel,
>                 .buffer = buf,
>                 .reason = PRUNE_VACUUM_SCAN,
>                 .options = HEAP_PAGE_PRUNE_FREEZE,
>                 .vistest = vacrel->vistest,
>                 .cutoffs = &vacrel->cutoffs,
>         };
>
> or at a pinch, if we're really being stingy with the vertical space:
>
>         PruneFreezeParams params = {
>                 .relation = rel, .buffer = buf,
>                 .reason = PRUNE_VACUUM_SCAN, .options = HEAP_PAGE_PRUNE_FREEZE,
>                 .vistest = vacrel->vistest, .cutoffs = &vacrel->cutoffs,
>         };
>
> I had a quick grep, and every other designated struct initialiser I
> could find uses the one-field-per-line form, but they're not consistent
> about the comma after the last field.  I personally prefer having it, so
> that one can add more fields later without having to modify the
> unrelated line.

pgindent doesn't allow for a space after the comma before the period.
One reason I used struct initialization was to save space, so I'm a
bit loath to put every member on its own line. However, I don't want
to make the code less readable to others. So, I will commit an update
as you request.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Chao Li

Дата:

21 ноября 2025 г., 04:09:21


> On Nov 21, 2025, at 01:19, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Wed, Nov 19, 2025 at 6:13 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>>
>> Since it is passed into one of the helpers, I think I agree. Attached
>> v21 has this change.
>
> I've committed the first three patches. Attached v22 is the remaining
> patches which set the VM in heap_page_prune_and_freeze() for vacuum
> and then allow on-access pruning to also set the VM.
>

I just started reviewing 0001 yesterday and got a few comments. However, it was late, I didn’t have enough time to wrap
up,so I decided to review a few more today and send the comments together. As you have pushed 0001-0003, I’d still
raisemy comment for them now, and I will review the rest of commits next week. 

1 - pushed 0001
```
             /*
              * Report the number of tuples reclaimed to pgstats.  This is
@@ -419,60 +425,44 @@ heap_page_will_freeze(Relation relation, Buffer buffer,
  * also need to account for a reduction in the length of the line pointer
  * array following array truncation by us.
  *
- * If the HEAP_PRUNE_FREEZE option is set, we will also freeze tuples if it's
- * required in order to advance relfrozenxid / relminmxid, or if it's
- * considered advantageous for overall system performance to do so now.  The
- * 'cutoffs', 'presult', 'new_relfrozen_xid' and 'new_relmin_mxid' arguments
- * are required when freezing.  When HEAP_PRUNE_FREEZE option is set, we also
- * set presult->all_visible and presult->all_frozen on exit, to indicate if
- * the VM bits can be set.  They are always set to false when the
- * HEAP_PRUNE_FREEZE option is not set, because at the moment only callers
- * that also freeze need that information.
- *
- * vistest is used to distinguish whether tuples are DEAD or RECENTLY_DEAD
- * (see heap_prune_satisfies_vacuum).
- *
- * options:
- *   MARK_UNUSED_NOW indicates that dead items can be set LP_UNUSED during
- *   pruning.
+ * params contains the input parameters used to control freezing and pruning
+ * behavior. See the definition of PruneFreezeParams for more on what each
+ * parameter does.
  *
- *   FREEZE indicates that we will also freeze tuples, and will return
- *   'all_visible', 'all_frozen' flags to the caller.
- *
- * cutoffs contains the freeze cutoffs, established by VACUUM at the beginning
- * of vacuuming the relation.  Required if HEAP_PRUNE_FREEZE option is set.
- * cutoffs->OldestXmin is also used to determine if dead tuples are
- * HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD.
+ * If the HEAP_PAGE_PRUNE_FREEZE option is set in params, we will freeze
+ * tuples if it's required in order to advance relfrozenxid / relminmxid, or
+ * if it's considered advantageous for overall system performance to do so
+ * now.  The 'params.cutoffs', 'presult', 'new_relfrozen_xid' and
+ * 'new_relmin_mxid' arguments are required when freezing.  When
+ * HEAP_PAGE_PRUNE_FREEZE option is passed, we also set presult->all_visible
+ * and presult->all_frozen on exit, to indicate if the VM bits can be set.
+ * They are always set to false when the HEAP_PAGE_PRUNE_FREEZE option is not
+ * passed, because at the moment only callers that also freeze need that
+ * information.
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
  * heap_page_prune_and_freeze() is responsible for initializing it.  Required
  * by all callers.
  *
- * reason indicates why the pruning is performed.  It is included in the WAL
- * record for debugging and analysis purposes, but otherwise has no effect.
- *
  * off_loc is the offset location required by the caller to use in error
  * callback.
  *
  * new_relfrozen_xid and new_relmin_mxid must provided by the caller if the
- * HEAP_PRUNE_FREEZE option is set.  On entry, they contain the oldest XID and
- * multi-XID seen on the relation so far.  They will be updated with oldest
- * values present on the page after pruning.  After processing the whole
- * relation, VACUUM can use these values as the new relfrozenxid/relminmxid
- * for the relation.
+ * HEAP_PAGE_PRUNE_FREEZE option is set in params.  On entry, they contain the
+ * oldest XID and multi-XID seen on the relation so far.  They will be updated
+ * with oldest values present on the page after pruning.  After processing the
+ * whole relation, VACUUM can use these values as the new
+ * relfrozenxid/relminmxid for the relation.
  */
 void
-heap_page_prune_and_freeze(Relation relation, Buffer buffer,
-                           GlobalVisState *vistest,
-                           int options,
-                           struct VacuumCutoffs *cutoffs,
+heap_page_prune_and_freeze(PruneFreezeParams *params,
                            PruneFreezeResult *presult,
-                           PruneReason reason,
                            OffsetNumber *off_loc,
                            TransactionId *new_relfrozen_xid,
                            MultiXactId *new_relmin_mxid)
 {
```

For this function interface change, I got a concern. The old function comment says "cutoffs contains the freeze cutoffs
….Required if HEAP_PRUNE_FREEZE option is set.”, meaning that cutoffs is only useful and must be set when
HEAP_PRUNE_FREEZEis set. But the new comment seems to have lost this indication. 

And in the old function interface, cutoffs sat right next to options, readers are easy to notice:

* when options is 0, cutoffs is null
```
            heap_page_prune_and_freeze(relation, buffer, vistest, 0,
                                       NULL, &presult, PRUNE_ON_ACCESS, &dummy_off_loc, NULL, NULL);
```

* when options has HEAP_PAGE_PRUNE_FREEZE, cutoffs is passed in
```
    prune_options = HEAP_PAGE_PRUNE_FREEZE;
    if (vacrel->nindexes == 0)
        prune_options |= HEAP_PAGE_PRUNE_MARK_UNUSED_NOW;

    heap_page_prune_and_freeze(rel, buf, vacrel->vistest, prune_options,
                               &vacrel->cutoffs, &presult, PRUNE_VACUUM_SCAN,
                               &vacrel->offnum,
                               &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
```

So, the change doesn’t break anything, but makes code a little bit harder to read. So, my suggestion is to add an
assertin heap_page_prune_and_freeze, something like: 

```
Assert(!(params->options & HEAP_PAGE_PRUNE_FREEZE) || params->cutoffs != NULL);
```

2 - pushed 0001
```
+    PruneFreezeParams params = {.relation = rel,.buffer = buf,
+        .reason = PRUNE_VACUUM_SCAN,.options = HEAP_PAGE_PRUNE_FREEZE,
+        .cutoffs = &vacrel->cutoffs,.vistest = vacrel->vistest
+    };
```

Using a designated initializer is not wrong, but makes future maintenance harder, because when a new field is added,
thisinitializer will leave the new field uninitiated. From my impression, I don’t remember I ever see a designated
initializerin PG code. I only remember 3 ways I have seen: 

* use an initialize function to set every fields individually
* palloc0 to set all 0, then set non-zero fields individually
* {0} to set all 0, then set non-zero fields individually

3 - pushed 0002
```
                     prstate->all_visible = false;
+                    prstate->all_frozen = false;
```

Nit: Now setting the both fields to false repeat in 6 places. Maybe add a static inline function, say
PruneClearVisibilityFlags(),may improve maintainability. 

4 - pushed 0003
```
+ * opporunistically freeze, to indicate if the VM bits can be set.  They are
```

Typo: opporunistically, missed a “t”.

I’d stop here today, and continue reviewing rest commits in next week.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Chao Li

Дата:

24 ноября 2025 г., 11:07:59

> On Nov 21, 2025, at 09:09, Chao Li <li.evan.chao@gmail.com> wrote:
>
> I’d stop here today, and continue reviewing rest commits in next week.

I continue reviewing today.

0004 This a pure refactoring. It splits heap_page_prune_and_freeze to multiple small functions. LGTM, no comment.

0005 overall good, a few nit comments as below.

0006, 0007 look good, no comment.

5 - 0005 - heapam.h
```
+    /*
+     *
+     * vmbuffer is the buffer that must already contain contain the required
+     * block of the visibility map if we are to update it. blk_known_av is the
```

Nit:

* an unnecessary empty comment line.
* “contain contain” => “contain"

6 - 0005 heapam_xlog.c
```
+         * The critical integrity requirement here is that we must never end
+         * up with with the visibility map bit set and the page-level
```

Nit: “with with” => “with”

I will continue reviewing 0008 and rest tomorrow.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Chao Li

Дата:

24 ноября 2025 г., 12:31:31


> On Nov 24, 2025, at 16:07, Chao Li <li.evan.chao@gmail.com> wrote:
>
> 0006, 0007 look good, no comment.

I missed a nit comment in 0007:

7 - 0007
```
+ * To handle recovery conflict during logical decoding on standby, we must know
+ * if the table is a catalog table. Note that in visibilitymapdefs.h
+ * VISIBLITYMAP_XLOG_CATALOG_REL is also defined as (1 << 2). xl_heap_prune
+ * records should use XLHP_IS_CATALOG_REL, not VISIBILIYTMAP_XLOG_CATALOG_REL --
+ * even if they only contain updates to the VM.
```

VISIBLITYMAP_XLOG_CATALOG_REL missed “I” after “B”.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Andres Freund

Дата:

25 ноября 2025 г., 01:24:46

Hi,

On 2025-11-20 12:19:58 -0500, Melanie Plageman wrote:
> From 363f0e4ac9ac7699a6d9c2a267a2ad60825407c8 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Mon, 17 Nov 2025 15:11:27 -0500
> Subject: [PATCH v22 1/9] Split heap_page_prune_and_freeze() into helpers
>
> Refactor the setup and planning phases of pruning and freezing into
> helpers. This streamlines heap_page_prune_and_freeze() and makes it more
> clear when the examination of tuples ends and page modifications begin.

I think this is a considerable improvement.

I didn't review this with a lot of detail, given that it's mostly moving
code.

One minor thing: It's slightly odd that prune_freeze_plan() gets an oid
argument, prune_freeze_setup() gets the entire prstate,
heap_page_will_freeze() gets the Relation. It's what they need, but still a
bit odd.


FWIW, I found the diff generated by
  git show --diff-algorithm=minimal --color-moved-ws=allow-indentation-change

useful for viewing this diff, showed much more clearly how little the code
actually changed.



> From 8ebaf434af5afaebcf71550116c59355b3bf15c1 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 8 Oct 2025 15:39:01 -0400
> Subject: [PATCH v22 2/9] Eliminate XLOG_HEAP2_VISIBLE from vacuum phase I
>  prune/freeze
>
> Vacuum no longer emits a separate WAL record for each page set
> all-visible or all-frozen during phase I. Instead, visibility map
> updates are now included in the XLOG_HEAP2_PRUNE_VACUUM_SCAN record that
> is already emitted for pruning and freezing.
>
> Previously, heap_page_prune_and_freeze() determined whether a page was
> all-visible, but the corresponding VM bits were only set later in
> lazy_scan_prune(). Now the VM is updated immediately in
> heap_page_prune_and_freeze(), at the same time as the heap
> modifications.
>
> This change applies only to vacuum phase I, not to pruning performed
> during normal page access.

Hm. This change makes sense, but unfortunately I find it somewhat hard to
review. There are a lot of changes that don't obviously work towards one
goal in this commit.

>@@ -238,6 +239,16 @@ typedef struct PruneFreezeParams
>     Relation    relation;       /* relation containing buffer to be pruned */
>     Buffer      buffer;         /* buffer to be pruned */
> 
>+    /*
>+     *
>+     * vmbuffer is the buffer that must already contain contain the required
>+     * block of the visibility map if we are to update it. blk_known_av is the
>+     * visibility status of the heap block as of the last call to
>+     * find_next_unskippable_block().
>+     */
>+    Buffer      vmbuffer;
>+    bool        blk_known_av;
>+
>     /*
>      * The reason pruning was performed.  It is used to set the WAL record
>      * opcode which is used for debugging and analysis purposes.

What is blk_known_av set to if the block is known to not be all visible?
Compared to the case where we did not yet determine the visibility status of
the block?


>@@ -250,8 +261,10 @@ typedef struct PruneFreezeParams
>      * HEAP_PAGE_PRUNE_MARK_UNUSED_NOW indicates that dead items can be set
>      * LP_UNUSED during pruning.
>      *
>-     * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples, and
>-     * will return 'all_visible', 'all_frozen' flags to the caller.
>+     * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples
>+     *
>+     * HEAP_PAGE_PRUNE_UPDATE_VIS indicates that we will set the page's status
>+     * in the VM.
>      */
>     int         options;

nit^2: The previous version and the other paragraphs end in a .


> @@ -157,17 +159,36 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          /* There should be no more data */
>          Assert((char *) frz_offsets == dataptr + datalen);
>
> -        if (vmflags & VISIBILITYMAP_VALID_BITS)
> -            PageSetAllVisible(page);
> -
> -        MarkBufferDirty(buffer);
> +        if (do_prune || nplans > 0)
> +            mark_buffer_dirty = set_lsn = true;
>
>          /*
> -         * See log_heap_prune_and_freeze() for commentary on when we set the
> -         * heap page LSN.
> +         * The critical integrity requirement here is that we must never end
> +         * up with with the visibility map bit set and the page-level
> +         * PD_ALL_VISIBLE bit clear.  If that were to occur, a subsequent page

s/clear/unset/ would be a tad clearer.


> +         * modification would fail to clear the visibility map bit.
> +         *
> +         * vmflags may be nonzero with PD_ALL_VISIBLE already set (e.g. when
> +         * marking an all-visible page all-frozen). If only the VM is updated,
> +         * the heap page need not be dirtied.
>           */
> -        if (do_prune || nplans > 0 ||
> -            ((vmflags & VISIBILITYMAP_VALID_BITS) && XLogHintBitIsNeeded()))
> +        if ((vmflags & VISIBILITYMAP_VALID_BITS) && !PageIsAllVisible(page))
> +        {
> +            PageSetAllVisible(page);
> +            mark_buffer_dirty = true;
> +
> +            /*
> +             * See log_heap_prune_and_freeze() for commentary on when we set
> +             * the heap page LSN.
> +             */
> +            if (XLogHintBitIsNeeded())
> +                set_lsn = true;
> +        }

Maybe worth adding something like Assert(!set_lsn || mark_buffer_dirty)?


> +/*
> + * Decide whether to set the visibility map bits for heap_blk, using
> + * information from PruneState and blk_known_av. Some callers may already
> + * have examined this page’s VM bits (e.g., VACUUM in the previous
> + * heap_vac_scan_next_block() call) and can pass that along.

That's not entirely trivial to follow, tbh. As mentioned above, it's not clear
to me how the state of a block where did determine that the block is *not*
all-visible is represented.


> + * Returns true if one or both VM bits should be set, along with the desired
> + * flags in *vmflags. Also indicates via do_set_pd_vis whether PD_ALL_VISIBLE
> + * should be set on the heap page.
> + */
> +static bool
> +heap_page_will_set_vis(Relation relation,
> +                       BlockNumber heap_blk,
> +                       Buffer heap_buf,
> +                       Buffer vmbuffer,
> +                       bool blk_known_av,
> +                       const PruneState *prstate,
> +                       uint8 *vmflags,
> +                       bool *do_set_pd_vis)
> +{
> +    Page        heap_page = BufferGetPage(heap_buf);
> +    bool        do_set_vm = false;
> +
> +    *do_set_pd_vis = false;
> +
> +
> +    /*
> +     * Now handle two potential corruption cases:
> +     *
> +     * These do not need to happen in a critical section and are not
> +     * WAL-logged.
> +     *
> +     * As of PostgreSQL 9.2, the visibility map bit should never be set if the
> +     * page-level bit is clear.  However, it's possible that in vacuum the bit
> +     * got cleared after heap_vac_scan_next_block() was called, so we must
> +     * recheck with buffer lock before concluding that the VM is corrupt.
> +     */
> +    else if (blk_known_av && !PageIsAllVisible(heap_page) &&
> +             visibilitymap_get_status(relation, heap_blk, &vmbuffer) != 0)
> +    {
> +        ereport(WARNING,
> +                (errcode(ERRCODE_DATA_CORRUPTED),
> +                 errmsg("page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
> +                        RelationGetRelationName(relation), heap_blk)));
> +
> +        visibilitymap_clear(relation, heap_blk, vmbuffer,
> +                            VISIBILITYMAP_VALID_BITS);

Wait, why is it ok to perform this check iff blk_known_av is set?


> +            old_vmbits = visibilitymap_set_vmbits(blockno,
> +                                                  vmbuffer, new_vmbits,
> +                                                  params->relation->rd_locator);
> +            if (old_vmbits == new_vmbits)
> +            {
> +                LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> +                /* Unset so we don't emit WAL since no change occurred */
> +                do_set_vm = false;
> +            }
> +        }

What can lead to this path being reached? Doesn't this mean that something
changed the state of the VM while we were holding an exclusive lock on the
heap buffer?


> +        /*
> +         * Emit a WAL XLOG_HEAP2_PRUNE* record showing what we did. If we were
> +         * only updating the VM and it turns out it was already set, we will
> +         * have unset do_set_vm earlier. As such, check it again before
> +         * emitting the record.
> +         */
> +        if (RelationNeedsWAL(params->relation) &&
> +            (do_prune || do_freeze || do_set_vm))
> +        {
>              log_heap_prune_and_freeze(params->relation, buffer,
> -                                      InvalidBuffer,    /* vmbuffer */
> -                                      0,    /* vmflags */
> +                                      do_set_vm ? vmbuffer : InvalidBuffer,
> +                                      do_set_vm ? new_vmbits : 0,
>                                        conflict_xid,
> -                                      true, params->reason,
> +                                      true, /* cleanup lock */
> +                                      do_set_pd_vis,
> +                                      params->reason,
>                                        prstate.frozen, prstate.nfrozen,
>                                        prstate.redirected, prstate.nredirected,
>                                        prstate.nowdead, prstate.ndead,

This function is now taking 16 parameters :/


> @@ -959,28 +1148,47 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
>
>      END_CRIT_SECTION();
>
> +    if (do_set_vm)
> +        LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> +
> +    /*
> +     * During its second pass over the heap, VACUUM calls
> +     * heap_page_would_be_all_visible() to determine whether a page is
> +     * all-visible and all-frozen. The logic here is similar. After completing
> +     * pruning and freezing, use an assertion to verify that our results
> +     * remain consistent with heap_page_would_be_all_visible().
> +     */
> +#ifdef USE_ASSERT_CHECKING
> +    if (prstate.all_visible)
> +    {
> +        TransactionId debug_cutoff;
> +        bool        debug_all_frozen;
> +
> +        Assert(prstate.lpdead_items == 0);
> +        Assert(prstate.cutoffs);
> +
> +        if (!heap_page_is_all_visible(params->relation, buffer,
> +                                      prstate.cutoffs->OldestXmin,
> +                                      &debug_all_frozen,
> +                                      &debug_cutoff, off_loc))
> +            Assert(false);

I don't love Assert(false), because the message for the assert failure is
pretty much meaningless. Sometimes it's hard to avoid, but here you have an if
() that has no body other than Assert(false)? Just Assert the expression
directly.


> From 34f0009570e117d7d48b560cd097ee25c6cdcc7c Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Sat, 27 Sep 2025 11:55:21 -0400
> Subject: [PATCH v22 3/9] Eliminate XLOG_HEAP2_VISIBLE from empty-page vacuum
>
> As part of removing XLOG_HEAP2_VISIBLE records, phase I of VACUUM now
> marks empty pages all-visible in a XLOG_HEAP2_PRUNE_VACUUM_SCAN record.

This whole business of treating empty pages as all-visible continues to not
make any sense to me. Particularly in combination with a not crashsafe FSM it
just seems ... unhelpful. It also means that there there's a decent chance of
extra WAL when bulk extending. But that's not the fault of this change.


> From 0d6a06d4533cfe153440d301c3d20915ba07892f Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Sat, 27 Sep 2025 11:55:36 -0400
> Subject: [PATCH v22 4/9] Remove XLOG_HEAP2_VISIBLE entirely
>
> As no remaining users emit XLOG_HEAP2_VISIBLE records.
> This includes deleting the xl_heap_visible struct and all functions
> responsible for emitting or replaying XLOG_HEAP2_VISIBLE records.

Probably worth mentioning that this changes the VM API.


> @@ -2396,14 +2396,18 @@ get_conflict_xid(bool do_prune, bool do_freeze, bool do_set_vm,
>   *
>   * This is used for several different page maintenance operations:
>   *
> - * - Page pruning, in VACUUM's 1st pass or on access: Some items are
> + * - Page pruning, in vacuum phase I or on-access: Some items are
>   *   redirected, some marked dead, and some removed altogether.
>   *
> - * - Freezing: Items are marked as 'frozen'.
> + * - Freezing: During vacuum phase I, items are marked as 'frozen'
>   *
> - * - Vacuum, 2nd pass: Items that are already LP_DEAD are marked as unused.
> + * - Reaping: During vacuum phase III, items that are already LP_DEAD are
> + *   marked as unused.
>   *
> - * They have enough commonalities that we use a single WAL record for them
> + * - VM updates: After vacuum phases I and III, the heap page may be marked
> + *   all-visible and all-frozen.
> + *
> + * These changes all happen together, so we use a single WAL record for them
>   * all.
>   *
>   * If replaying the record requires a cleanup lock, pass cleanup_lock =
>   true.

How's that related to the commit's subject?


> From fd0455230968fd919999a5c035f3830d310f0e49 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 18 Jul 2025 16:30:04 -0400
> Subject: [PATCH v22 5/9] Rename GlobalVisTestIsRemovableXid() to
>  GlobalVisXidVisibleToAll()
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> The function is currently only used to check whether a tuple’s xmax is
> visible to all transactions (and thus removable). Upcoming changes will
> also use it to test whether a tuple’s xmin is visible to all to
> decide if a page can be marked all-visible in the visibility map.
>
> The new name, GlobalVisXidVisibleToAll(), better reflects this broader
> purpose.

If we want this - and I'm not convinced we do - I think it needs to go further
and change the other uses of removable in
procarray.c. ComputeXidHorizonsResult has a lot of related fields.

There's also GetOldestNonRemovableTransactionId(),
GlobalVisCheckRemovableXid(), GlobalVisCheckRemovableFullXid() that weren't
included in the renaming.


> From 565014e31aa117fb43993ee2e64da38ffb74f372 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 29 Jul 2025 14:38:24 -0400
> Subject: [PATCH v22 6/9] Use GlobalVisState in vacuum to determine page level
>  visibility
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> During vacuum's first and third phases, we examine tuples' visibility
> to determine if we can set the page all-visible in the visibility map.
>
> Previously, this check compared tuple xmins against a single XID chosen at
> the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> enables future work to set the VM during on-access pruning, since ordinary
> queries have access to GlobalVisState but not OldestXmin.
>
> This also benefits vacuum directly: in some cases, GlobalVisState may
> advance during a vacuum, allowing more pages to become considered
> all-visible. And, in the future, we could easily add a heuristic to
> update GlobalVisState more frequently during vacuums of large tables. In
> the rare case that the GlobalVisState moves backward, vacuum falls back
> to OldestXmin to ensure we don’t attempt to freeze a dead tuple that
> wasn’t yet prunable according to the GlobalVisState.

I think it may be better to make sure that the GlobalVisState can't move
backward.


> From bced81f6df3d303679fac2a1414d42f0db401232 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 29 Jul 2025 14:34:30 -0400
> Subject: [PATCH v22 8/9] Allow on-access pruning to set pages all-visible
>
> Many queries do not modify the underlying relation. For such queries, if
> on-access pruning occurs during the scan, we can check whether the page
> has become all-visible and update the visibility map accordingly.
> Previously, only vacuum and COPY FREEZE marked pages as all-visible or
> all-frozen.

> Supporting this requires passing information about whether the relation
> is modified from the executor down to the scan descriptor.

I think it'd be good to split this part into a separate commit. The set of
folks to review that are distinct (and broader) from the ones looking at
heapam internals.


Greetings,

Andres Freund

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

26 ноября 2025 г., 00:43:58

Thanks for the review!

On Thu, Nov 20, 2025 at 8:10 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
>   * new_relfrozen_xid and new_relmin_mxid must provided by the caller if the
> - * HEAP_PRUNE_FREEZE option is set.  On entry, they contain the oldest XID and
> - * multi-XID seen on the relation so far.  They will be updated with oldest
> - * values present on the page after pruning.  After processing the whole
> - * relation, VACUUM can use these values as the new relfrozenxid/relminmxid
> - * for the relation.
> + * HEAP_PAGE_PRUNE_FREEZE option is set in params.  On entry, they contain the
> + * oldest XID and multi-XID seen on the relation so far.  They will be updated
> + * with oldest values present on the page after pruning.  After processing the
> + * whole relation, VACUUM can use these values as the new
> + * relfrozenxid/relminmxid for the relation.
>   */
>  void
> -heap_page_prune_and_freeze(Relation relation, Buffer buffer,
> -                                                  GlobalVisState *vistest,
> -                                                  int options,
> -                                                  struct VacuumCutoffs *cutoffs,
> +heap_page_prune_and_freeze(PruneFreezeParams *params,
>                                                    PruneFreezeResult *presult,
> -                                                  PruneReason reason,
>                                                    OffsetNumber *off_loc,
>                                                    TransactionId *new_relfrozen_xid,
>                                                    MultiXactId *new_relmin_mxid)
>  {
> ```
>
> For this function interface change, I got a concern. The old function comment says "cutoffs contains the freeze
cutoffs…. Required if HEAP_PRUNE_FREEZE option is set.”, meaning that cutoffs is only useful and must be set when
HEAP_PRUNE_FREEZEis set. But the new comment seems to have lost this indication. 

I did move that comment into the PruneFreezeParams struct definition.

> And in the old function interface, cutoffs sat right next to options, readers are easy to notice:
>
> * when options is 0, cutoffs is null
> ```
>                         heap_page_prune_and_freeze(relation, buffer, vistest, 0,
>                                                                            NULL, &presult, PRUNE_ON_ACCESS,
&dummy_off_loc,NULL, NULL); 
> ```
>
> * when options has HEAP_PAGE_PRUNE_FREEZE, cutoffs is passed in
> ```
>         prune_options = HEAP_PAGE_PRUNE_FREEZE;
>         if (vacrel->nindexes == 0)
>                 prune_options |= HEAP_PAGE_PRUNE_MARK_UNUSED_NOW;
>
>         heap_page_prune_and_freeze(rel, buf, vacrel->vistest, prune_options,
>                                                            &vacrel->cutoffs, &presult, PRUNE_VACUUM_SCAN,
>                                                            &vacrel->offnum,
>                                                            &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
> ```
>
> So, the change doesn’t break anything, but makes code a little bit harder to read. So, my suggestion is to add an
assertin heap_page_prune_and_freeze, something like: 
>
> Assert(!(params->options & HEAP_PAGE_PRUNE_FREEZE) || params->cutoffs != NULL);

That's fair. I've gone ahead and pushed a commit with your suggested assert.

> 2 - pushed 0001
> ```
> +       PruneFreezeParams params = {.relation = rel,.buffer = buf,
> +               .reason = PRUNE_VACUUM_SCAN,.options = HEAP_PAGE_PRUNE_FREEZE,
> +               .cutoffs = &vacrel->cutoffs,.vistest = vacrel->vistest
> +       };
> ```
>
> Using a designated initializer is not wrong, but makes future maintenance harder, because when a new field is added,
thisinitializer will leave the new field uninitiated. From my impression, I don’t remember I ever see a designated
initializerin PG code. I only remember 3 ways I have seen: 
>
> * use an initialize function to set every fields individually
> * palloc0 to set all 0, then set non-zero fields individually
> * {0} to set all 0, then set non-zero fields individually

Well, the main reason you don't see them much in the code is that a
lot of the code is old and we didn't require a c99-compliant compiler
until fairly recently (okay like 2018/2019) -- and thus couldn't use
designated initializers.

I agree that they are rare for structs (they are quite commonly used
with arrays), but they are there -- for example these bufmgr init
macros

#define BMR_REL(p_rel) \
    ((BufferManagerRelation){.rel = p_rel})
#define BMR_SMGR(p_smgr, p_relpersistence) \
    ((BufferManagerRelation){.smgr = p_smgr, .relpersistence =
p_relpersistence})
#define BMR_GET_SMGR(bmr) \
    (RelationIsValid((bmr).rel) ? RelationGetSmgr((bmr).rel) : (bmr).smgr)

I don't see how it would be harder to remember to initialize a field
with a designated initializer vs if you have to just remember to add a
line initializing that field in the code. And using a designated
initializer ensures all unspecified fields will be zeroed out.

In general, I have seen threads [1] encouraging the use of designated
initializers, so I'm inclined to leave it as is since it is committed,
and I haven't heard other pushback.

> 3 - pushed 0002
> ```
>                                         prstate->all_visible = false;
> +                                       prstate->all_frozen = false;
> ```
>
> Nit: Now setting the both fields to false repeat in 6 places. Maybe add a static inline function, say
PruneClearVisibilityFlags(),may improve maintainability. 

I see your point. However, I don't think it would necessarily be an
improvement. This function already has a lot of helpers you have to
jump to to understand what's going on. And in the place where they are
cleared most often, heap_prune_record_unchanged_lp_normal(), we set
other fields of the prstate directly, so it is nice visual symmetry in
my opinion to set them inline.

I did want to use chained assignment (all_visible = all_frozen =
false), but I have had people complain about that in my code before
more than once, so I refrained.

> 4 - pushed 0003
> ```
> + * opporunistically freeze, to indicate if the VM bits can be set.  They are
> ```
>
> Typo: opporunistically, missed a “t”.

Fixed in same commit that added the assert.

- Melanie

[1] https://www.postgresql.org/message-id/flat/5B873BED.9080501%40anastigmatix.net#4a067c1314783f0e171b4e1be76f7574

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

04 декабря 2025 г., 02:07:38

Thanks for the review! All the small changes you suggested I made in
attached v23 unless otherwise noted below.

On Mon, Nov 24, 2025 at 5:24 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-11-20 12:19:58 -0500, Melanie Plageman wrote:
> > Subject: [PATCH v22 1/9] Split heap_page_prune_and_freeze() into helpers
>
> One minor thing: It's slightly odd that prune_freeze_plan() gets an oid
> argument, prune_freeze_setup() gets the entire prstate,
> heap_page_will_freeze() gets the Relation. It's what they need, but still a
> bit odd.

They all get the PruneState actually.

I've committed this patch (but actually have to do a follow-on commit
to silence coverity. Will do that next.)

> > Subject: [PATCH v22 2/9] Eliminate XLOG_HEAP2_VISIBLE from vacuum phase I
> >  prune/freeze
>
>
> Hm. This change makes sense, but unfortunately I find it somewhat hard to
> review. There are a lot of changes that don't obviously work towards one
> goal in this commit.

I've split up the first commit into 4 patches in attached v23
(0002-0005). They are not meant to be committed separately but are
separate only for ease of review. They comprise the logical steps for
getting to the final code state. I originally had it split up but got
feedback it was more work to review them each. So, let's see how this
goes.

> >@@ -238,6 +239,16 @@ typedef struct PruneFreezeParams
>
> >+     * vmbuffer is the buffer that must already contain contain the required
> >+     * block of the visibility map if we are to update it. blk_known_av is the
> >+     * visibility status of the heap block as of the last call to
> >+     * find_next_unskippable_block().
> >+     */
> >+    Buffer      vmbuffer;
> >+    bool        blk_known_av;
>
> What is blk_known_av set to if the block is known to not be all visible?
> Compared to the case where we did not yet determine the visibility status of
> the block?

blk_known_av should always be set to false if the caller doesn't know.
It is used as an optimization. I've added to the comment in this
struct to clarify that. More on this further down in my mail.

> > + * Decide whether to set the visibility map bits for heap_blk, using
> > + * information from PruneState and blk_known_av. Some callers may already
> > + * have examined this page’s VM bits (e.g., VACUUM in the previous
> > + * heap_vac_scan_next_block() call) and can pass that along.
>
> That's not entirely trivial to follow, tbh. As mentioned above, it's not clear
> to me how the state of a block where did determine that the block is *not*
> all-visible is represented.

There is no need to distinguish between knowing it is not all-visible
and not knowing if it is all-visible. That is, "not known" and "known
not" are the same for our purposes. This is only an optimization and
not needed for correctness. I've tried to add comments to this effect
in various places where blk_known_av is used.

> > +     else if (blk_known_av && !PageIsAllVisible(heap_page) &&
> > +                      visibilitymap_get_status(relation, heap_blk, &vmbuffer) != 0)
> > +     {
> > +             ereport(WARNING,
> > +                             (errcode(ERRCODE_DATA_CORRUPTED),
> > +                              errmsg("page is not marked all-visible but visibility map bit is set in relation
\"%s\"page %u", 
> > +                                             RelationGetRelationName(relation), heap_blk)));
> > +
> > +             visibilitymap_clear(relation, heap_blk, vmbuffer,
> > +                                                     VISIBILITYMAP_VALID_BITS);
>
> Wait, why is it ok to perform this check iff blk_known_av is set?

This is existing logic in vacuum. It would be okay to perform the
check even if blk_known_av is false but might be too expensive for the
common case where the page is not all-visible (especially on-access).
The next vacuum should be able to enter this code path and fix it. Or
do you think it will be cheap enough because the caller will have read
in and pinned the VM page?

> > +                     old_vmbits = visibilitymap_set_vmbits(blockno,
> > +                                                                                               vmbuffer,
new_vmbits,
> > +
params->relation->rd_locator);
> > +                     if (old_vmbits == new_vmbits)
> > +                     {
> > +                             LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);
> > +                             /* Unset so we don't emit WAL since no change occurred */
> > +                             do_set_vm = false;
> > +                     }
> > +             }
>
> What can lead to this path being reached? Doesn't this mean that something
> changed the state of the VM while we were holding an exclusive lock on the
> heap buffer?

This shouldn't be in this commit (I've fixed that). However, it is
needed once we have on-access VM setting because we could have set the
page all-visible in the VM on-access in between when
find_next_unskippable_block() first checks the VM and sets
all_visible_according_to_vm/blk_known_av and when we take the lock and
prune/freeze the page.

> >                       log_heap_prune_and_freeze(params->relation, buffer,
> > -                                                                       InvalidBuffer,        /* vmbuffer */
> > -                                                                       0,    /* vmflags */
> > +                                                                       do_set_vm ? vmbuffer : InvalidBuffer,
> > +                                                                       do_set_vm ? new_vmbits : 0,
> >                                                                         conflict_xid,
> > -                                                                       true, params->reason,
> > +                                                                       true, /* cleanup lock */
> > +                                                                       do_set_pd_vis,
> > +                                                                       params->reason,
> >                                                                         prstate.frozen, prstate.nfrozen,
> >                                                                         prstate.redirected, prstate.nredirected,
> >                                                                         prstate.nowdead, prstate.ndead,
>
> This function is now taking 16 parameters :/

Is this complaint about readability or performance of parameter
passing? Because if it's the latter, I can't imagine that will be
noticeable when compared to the overhead of emitting a WAL record.

I could add a struct just for passing the parameters to the
log_heap_prune_and_freeze(). Something like:

typedef struct PruneFreezeChanges
{
    int            nredirected;
    int            ndead;
    int            nunused;
    int            nfrozen;
    OffsetNumber *redirected;
    OffsetNumber *nowdead;
    OffsetNumber *nowunused;
    HeapTupleFreeze *frozen;
} PruneFreezeChanges;

PruneFreezeChanges c = {
        .redirected = prstate.redirected,
        .nredirected = prstate.nredirected,
        .ndead = prstate.ndead,
        .nowdead = prstate.nowdead,
        .nunused = prstate.nunused,
        .nowunused = prstate.nowunused,
        .nfrozen = prstate.nfrozen,
        .frozen = prstate.frozen,
};

log_heap_prune_and_freeze(params->relation, buffer,
                                                        InvalidBuffer,
   /* vmbuffer */
                                                        0,    /* vmflags */
                                                        conflict_xid,
                                                        true, params->reason,
                                                        c);

However, I fear it is a bit confusing to have this struct just to pass
the parameters to the log_heap_prune_and_freeze(). We can't use that
struct inline in the PruneState because then we would need all the
arrays to be inline in the PruneFreezeChanges struct which would cause
4*MaxHeapTuplesPerPage stack allocated OffsetNumbers in vacuum phase
III than it currently has and needs.

The only other related parameters I see that could be stuck into a
struct are vmflags and set_pd_all_vis -- maybe called VisiChanges or
HeapPageVisiChanges. But again, I'm not sure if it is worth adding a
new struct for this.

> > +#ifdef USE_ASSERT_CHECKING
> > +     if (prstate.all_visible)
> > +     {
> > +             TransactionId debug_cutoff;
> > +             bool            debug_all_frozen;
> > +
> > +             Assert(prstate.lpdead_items == 0);
> > +             Assert(prstate.cutoffs);
> > +
> > +             if (!heap_page_is_all_visible(params->relation, buffer,
> > +                                                                       prstate.cutoffs->OldestXmin,
> > +                                                                       &debug_all_frozen,
> > +                                                                       &debug_cutoff, off_loc))
> > +                     Assert(false);
>
> I don't love Assert(false), because the message for the assert failure is
> pretty much meaningless. Sometimes it's hard to avoid, but here you have an if
> () that has no body other than Assert(false)? Just Assert the expression
> directly.

This is existing code. I agree it's weird, but I remember Peter saying
something about why he did it this way that I no longer remember.
Anyway, 0001 changes the assert as you suggest.

> > Subject: [PATCH v22 3/9] Eliminate XLOG_HEAP2_VISIBLE from empty-page vacuum
> >
> > As part of removing XLOG_HEAP2_VISIBLE records, phase I of VACUUM now
> > marks empty pages all-visible in a XLOG_HEAP2_PRUNE_VACUUM_SCAN record.
>
> This whole business of treating empty pages as all-visible continues to not
> make any sense to me. Particularly in combination with a not crashsafe FSM it
> just seems ... unhelpful. It also means that there there's a decent chance of
> extra WAL when bulk extending. But that's not the fault of this change.

Is the argument for setting them av/af that we can skip them more
easily in future vacuums (i.e. not have to read in the page and take a
lock etc)?

> > Subject: [PATCH v22 4/9] Remove XLOG_HEAP2_VISIBLE entirely
> >
> > As no remaining users emit XLOG_HEAP2_VISIBLE records.
> > This includes deleting the xl_heap_visible struct and all functions
> > responsible for emitting or replaying XLOG_HEAP2_VISIBLE records.
>
> Probably worth mentioning that this changes the VM API.

I've added a mention about this in the commit.
Are you imagining I have any comments anywhere about how
XLOG_HEAP2_VISIBLE used to exist?

I realized I need to bump XLOG_PAGE_MAGIC in this commit because the
code to replay XLOG_HEAP2_VISIBLE records is gone now.

What I'm not sure is if I have to bump it in some of the other commits
that change which WAL records are emitted by a particular operation
(e.g. not emitting a separate VM record from phase I of vacuum).

> > - * - Page pruning, in VACUUM's 1st pass or on access: Some items are
> > + * - Page pruning, in vacuum phase I or on-access: Some items are
> >   *   redirected, some marked dead, and some removed altogether.
> >   *
> > - * - Freezing: Items are marked as 'frozen'.
> > + * - Freezing: During vacuum phase I, items are marked as 'frozen'
> >   *
> > - * - Vacuum, 2nd pass: Items that are already LP_DEAD are marked as unused.
> > + * - Reaping: During vacuum phase III, items that are already LP_DEAD are
> > + *   marked as unused.
> >   *
> > - * They have enough commonalities that we use a single WAL record for them
> > + * - VM updates: After vacuum phases I and III, the heap page may be marked
> > + *   all-visible and all-frozen.
> > + *
> > + * These changes all happen together, so we use a single WAL record for them
> >   * all.
> >   *
> >   * If replaying the record requires a cleanup lock, pass cleanup_lock =
> >   true.
>
> How's that related to the commit's subject?

Oops, I moved it to the relevant commit.

> > Subject: [PATCH v22 5/9] Rename GlobalVisTestIsRemovableXid() to
> >  GlobalVisXidVisibleToAll()
> >
> > The function is currently only used to check whether a tuple’s xmax is
> > visible to all transactions (and thus removable). Upcoming changes will
> > also use it to test whether a tuple’s xmin is visible to all to
> > decide if a page can be marked all-visible in the visibility map.
> >
> > The new name, GlobalVisXidVisibleToAll(), better reflects this broader
> > purpose.
>
> If we want this - and I'm not convinced we do - I think it needs to go further
> and change the other uses of removable in
> procarray.c. ComputeXidHorizonsResult has a lot of related fields.
>
> There's also GetOldestNonRemovableTransactionId(),
> GlobalVisCheckRemovableXid(), GlobalVisCheckRemovableFullXid() that weren't
> included in the renaming.

Okay, I see what you are saying. When you say you're not sure if we
want "this" -- do you mean using GlobalVisState for determining if
xmins are visible to all (which is required to set the VM on-access)
or do you mean renaming those functions?

If we're just talking about the renaming, looking at procarray.c, it
is full of the word "removable" because its functions were largely
used to examine and determine if everyone can see an xmax as committed
and thus if that tuple is removable from their perspective. But
nothing about the code that I can see means it has to be an xmax. We
could just as well use the functions to determine if everyone can see
an xmin as committed.

I don't see how we can leave the names as is and use it on xmins
because that tuple is _not_ removable based on testing if everyone can
see the xmin. So the function basically returns an incorrect result.

That being said, the problem with replacing "removable" with "visible
to all" -- which isn't _terrible_ -- means we have to replace
"nonremovable" with "not visible to all" -- which is terrible.

I think getting rid of "removable" from procarray.c would be an
improvement because that file feels tightly coupled to vacuum and
removing tuples because of the names of variables and functions when
actually its functionality isn't. So, the issue is coming up with
something palatable.

One alternative idea (that requires no renaming) is to add a wrapper
function somewhere outside procarray.c which invokes
GlobalVisTestIsRemovableXid() but is called something like
XidVisibleToAll() and is documented for use with xmins/etc. It would
avoid the messy work of coming up with a good name. What do you think?

> > Subject: [PATCH v22 6/9] Use GlobalVisState in vacuum to determine page level
> >  visibility
> >
> > This also benefits vacuum directly: in some cases, GlobalVisState may
> > advance during a vacuum, allowing more pages to become considered
> > all-visible. And, in the future, we could easily add a heuristic to
> > update GlobalVisState more frequently during vacuums of large tables. In
> > the rare case that the GlobalVisState moves backward, vacuum falls back
> > to OldestXmin to ensure we don’t attempt to freeze a dead tuple that
> > wasn’t yet prunable according to the GlobalVisState.
>
> I think it may be better to make sure that the GlobalVisState can't move
> backward.

Do you mean that I shouldn't use the GlobalVisState to determine
visibility until I make sure it can't move backwards?

There is actually no functional difference in my patch set with the
code this commit message refers to (in heap_prune_satisfies_vacuum()).
I only mentioned it to make sure folks knew that even though I was
widening usage of GlobalVisState, we wouldn't encounter
synchronization issues with freezing horizons.

> > Subject: [PATCH v22 8/9] Allow on-access pruning to set pages all-visible
> >
> > Many queries do not modify the underlying relation. For such queries, if
> > on-access pruning occurs during the scan, we can check whether the page
> > has become all-visible and update the visibility map accordingly.
> > Previously, only vacuum and COPY FREEZE marked pages as all-visible or
> > all-frozen.
>
> > Supporting this requires passing information about whether the relation
> > is modified from the executor down to the scan descriptor.
>
> I think it'd be good to split this part into a separate commit. The set of
> folks to review that are distinct (and broader) from the ones looking at
> heapam internals.

Good point. I've split it into 3 commits in this patch set (0011-0013)

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

04 декабря 2025 г., 02:08:53

On Mon, Nov 24, 2025 at 3:08 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> > On Nov 21, 2025, at 09:09, Chao Li <li.evan.chao@gmail.com> wrote:
> >
> > I’d stop here today, and continue reviewing rest commits in next week.
>
> I continue reviewing today.

I incorporated all your feedback in my recently posted v23. Thanks for
the review!

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Chao Li

Дата:

04 декабря 2025 г., 08:10:33

Hi Melanie,

I resisted this patch again today. I reviewed 0001-0004, and got a few more comments:

> On Dec 4, 2025, at 07:07, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
>
<v23-0001-Simplify-vacuum-visibility-assertion.patch><v23-0002-Refactor-lazy_scan_prune-VM-set-logic-into-helpe.patch><v23-0003-Set-the-VM-in-prune-code.patch><v23-0004-Move-VM-assert-into-prune-freeze-code.patch><v23-0005-Eliminate-XLOG_HEAP2_VISIBLE-from-vacuum-phase-I.patch><v23-0006-Eliminate-XLOG_HEAP2_VISIBLE-from-empty-page-vac.patch><v23-0007-Remove-XLOG_HEAP2_VISIBLE-entirely.patch><v23-0008-Rename-GlobalVisTestIsRemovableXid-to-GlobalVisX.patch><v23-0009-Use-GlobalVisState-in-vacuum-to-determine-page-l.patch><v23-0010-Unset-all_visible-sooner-if-not-freezing.patch><v23-0011-Track-which-relations-are-modified-by-a-query.patch><v23-0012-Pass-down-information-on-table-modification-to-s.patch><v23-0013-Allow-on-access-pruning-to-set-pages-all-visible.patch><v23-0014-Set-pd_prune_xid-on-insert.patch>

1 - 0002
```
+static bool
+heap_page_will_set_vis(Relation relation,
+                       BlockNumber heap_blk,
+                       Buffer heap_buf,
+                       Buffer vmbuffer,
+                       bool all_visible_according_to_vm,
+                       const PruneFreezeResult *presult,
+                       uint8 *new_vmbits,
+                       bool *do_set_pd_vis)
```

Actually, I wanted to comment on the new function name in last round of review, but I guess I missed that.

I was very confused what “set_vis” means, and finally figured out “vis” should stand for “visibility”. Here “vis”
actuallymeans “visibility map bits”. There is the other “vis” in the last parameter’s name “do_set_pd_vis” where the
“vis”should be mean “PD_ALL_VISIBLE” bit. So the two “vis” feels making things confusing. 

How about rename the function to “heap_page_will_set_vm_bits”, and rename the last parameter to “do_set_all_visible”?

2 - 0002
```
+ * Decide whether to set the visibility map bits for heap_blk, using
+ * information from PruneFreezeResult and all_visible_according_to_vm. This
+ * function does not actually set the VM bit or page-level hint,
+ * PD_ALL_VISIBLE.
+ *
+ * If it finds that the page-level visibility hint or VM is corrupted, it will
+ * fix them by clearing the VM bit and page hint. This does not need to be
+ * done in a critical section.
+ *
+ * Returns true if one or both VM bits should be set, along with the desired
+ * flags in *new_vmbits. Also indicates via do_set_pd_vis whether
+ * PD_ALL_VISIBLE should be set on the heap page.
+ */
```

This function comment mentions PD_ALL_VISIBLE twice, but never mentions ALL_FROZEN. So “Returns true if one or both VM
bitsshould be set” fells unclear. How about rephrase like "Returns true if the all-visible and/or all-frozen VM bits
shouldbe set.” 

3 - 0002
```
+    /*
+     * Now handle two potential corruption cases:
+     *
+     * These do not need to happen in a critical section and are not
+     * WAL-logged.
+     *
+     * As of PostgreSQL 9.2, the visibility map bit should never be set if the
+     * page-level bit is clear.  However, it's possible that the bit got
+     * cleared after heap_vac_scan_next_block() was called, so we must recheck
+     * with buffer lock before concluding that the VM is corrupt.
+     */
+    else if (all_visible_according_to_vm && !PageIsAllVisible(heap_page) &&
+             visibilitymap_get_status(relation, heap_blk, &vmbuffer) != 0)
+    {
+        ereport(WARNING,
+                (errcode(ERRCODE_DATA_CORRUPTED),
+                 errmsg("page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
+                        RelationGetRelationName(relation), heap_blk)));
+
+        visibilitymap_clear(relation, heap_blk, vmbuffer,
+                            VISIBILITYMAP_VALID_BITS);
+    }
```

Here in the comment and error message, I guess “visibility map bit” refers to “all visible bit”, can we be explicit?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

09 декабря 2025 г., 20:48:45

On Thu, Dec 4, 2025 at 12:11 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> I resisted this patch again today. I reviewed 0001-0004, and got a few more comments:

Thanks for the review! v24 attached with updates you suggested as well
as the bug fix described below.

I realized my code didn't mark the heap buffer dirty if we were not
modifying it (i.e. only setting the VM). This trips an assert in
XLogRegisterBuffer() which requires that all buffers registered with
the WAL machinery are marked dirty unless REGBUF_NO_CHANGE is passed.

It wasn't possible to hit it in master because we unconditionally
dirtied the buffer if we found the VM not set in
find_next_unskippable_block() -- even if we made no changes to the
heap buffer. But my refactoring only dirtied the heap buffer if we
modified it (either pruning/freezing or setting PD_ALL_VISIBLE).

In attached v24, I once again always dirty the heap buffer before
registering it. We can't skip adding the heap buffer to the WAL chain
even if we didn't modify it, because we use it to update the freespace
map during recovery. We could pass REGBUF_NO_CHANGE when the heap
buffer is completely unmodified. But the delicate special case logic
doesn't seem worth the effort to maintain, as the only time the heap
buffer should be unmodified is when the VM has been truncated or
removed. I added a test to the commit doing this refactoring that
would have caught my mistake (0003).

I also split the refactoring of the VM setting logic into more commits
to help make it clearer (0003-0004). We could technically commit the
refactoring commits to master. I had not originally intended to do so
since they do not have independent value beyond clarity for the
reviewer.

In this set 0001 and 0002 are independent. 0003-0007 are all small
steps toward the single change in 0007 which combines the VM updates
into the same WAL record as pruning and freezing. 0008 and 0009 are
removing the rest of XLOG_HEAP2_VISIBLE. 0010 - 0012 are refactoring
needed to set the VM during on-access pruning. 0013 - 0015 are small
steps toward setting the VM on-access. And 0016 sets the prune xid on
insert so we may set the VM on-access for pages that have only new
data.

> +static bool
> +heap_page_will_set_vis(Relation relation,
>
> Actually, I wanted to comment on the new function name in last round of review, but I guess I missed that.
>
> I was very confused what “set_vis” means, and finally figured out “vis” should stand for “visibility”. Here “vis”
actuallymeans “visibility map bits”. There is the other “vis” in the last parameter’s name “do_set_pd_vis” where the
“vis”should be mean “PD_ALL_VISIBLE” bit. So the two “vis” feels making things confusing. 
>
> How about rename the function to “heap_page_will_set_vm_bits”, and rename the last parameter to “do_set_all_visible”?

I named it that way because it was responsible for telling us what we
should set the VM to _and_ if we should set PD_ALL_VISIBLE. However,
once I corrected the bug mentioned above, we always set PD_ALL_VISIBLE
if setting the VM, so I was able to remove this ambiguity. As such
I've renamed the function heap_page_will_set_vm() (and removed the
last parameter).

> + * Decide whether to set the visibility map bits for heap_blk, using
> + * information from PruneFreezeResult and all_visible_according_to_vm. This
> + * function does not actually set the VM bit or page-level hint,
> + * PD_ALL_VISIBLE.
> + *
> + * If it finds that the page-level visibility hint or VM is corrupted, it will
> + * fix them by clearing the VM bit and page hint. This does not need to be
> + * done in a critical section.
> + *
> + * Returns true if one or both VM bits should be set, along with the desired
> + * flags in *new_vmbits. Also indicates via do_set_pd_vis whether
> + * PD_ALL_VISIBLE should be set on the heap page.
> + */
> ```
>
> This function comment mentions PD_ALL_VISIBLE twice, but never mentions ALL_FROZEN. So “Returns true if one or both
VMbits should be set” fells unclear. How about rephrase like "Returns true if the all-visible and/or all-frozen VM bits
shouldbe set.” 

PD_ALL_VISIBLE is the page-level visibility hint (not the VM bit) and
there is no page level frozen hint. It doesn't mention that the VM
bits are all-visible and all-frozen, though, so I have modified the
comment a bit to make sure the all-frozen bit of the VM is mentioned.

> +        * Now handle two potential corruption cases:
> +        *
> +        * These do not need to happen in a critical section and are not
> +        * WAL-logged.
> +        *
> +        * As of PostgreSQL 9.2, the visibility map bit should never be set if the
> +        * page-level bit is clear.  However, it's possible that the bit got
> +        * cleared after heap_vac_scan_next_block() was called, so we must recheck
> +        * with buffer lock before concluding that the VM is corrupt.
> +        */
> +       else if (all_visible_according_to_vm && !PageIsAllVisible(heap_page) &&
> +                        visibilitymap_get_status(relation, heap_blk, &vmbuffer) != 0)
> +       {
> +               ereport(WARNING,
> +                               (errcode(ERRCODE_DATA_CORRUPTED),
> +                                errmsg("page is not marked all-visible but visibility map bit is set in relation
\"%s\"page %u", 
> +                                               RelationGetRelationName(relation), heap_blk)));
> +
> +               visibilitymap_clear(relation, heap_blk, vmbuffer,
> +                                                       VISIBILITYMAP_VALID_BITS);
> +       }
> ```
>
> Here in the comment and error message, I guess “visibility map bit” refers to “all visible bit”, can we be explicit?

This is an existing comment in lazy_scan_prune() that I simply moved.
It isn't valid for the all-frozen bit to be set unless the all-visible
bit is set. I'm not sure whether specifying which bits were set in the
warning will help users debug the corruption they are seeing. But I
think it is a reasonable suggestion to make. Perhaps it is worth
suggesting this (adding the specific vmbits to the warning message) in
a separate thread since it is an independent improvement on master?

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

11 декабря 2025 г., 02:35:47

On Tue, Dec 9, 2025 at 12:48 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> In this set 0001 and 0002 are independent. 0003-0007 are all small
> steps toward the single change in 0007 which combines the VM updates
> into the same WAL record as pruning and freezing. 0008 and 0009 are
> removing the rest of XLOG_HEAP2_VISIBLE. 0010 - 0012 are refactoring
> needed to set the VM during on-access pruning. 0013 - 0015 are small
> steps toward setting the VM on-access. And 0016 sets the prune xid on
> insert so we may set the VM on-access for pages that have only new
> data.

I committed 0001 and 0002. attached v25 reflects that.
0001-0004 refactoring steps for eliminate visible record from phase I
(not probably independent commits in the end)
0005 eliminate XLOG_HEAP2_VISIBLE from phase I vac
0006-0007 removing the rest of XLOG_HEAP2_VISIBLE
0008-0010 refactoring for setting VM on-access
0011-0013 setting the VM on-access
0014 - setting pd_prune_xid on insert

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Chao Li

Дата:

11 декабря 2025 г., 07:06:57


> On Dec 11, 2025, at 07:35, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Tue, Dec 9, 2025 at 12:48 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>>
>> In this set 0001 and 0002 are independent. 0003-0007 are all small
>> steps toward the single change in 0007 which combines the VM updates
>> into the same WAL record as pruning and freezing. 0008 and 0009 are
>> removing the rest of XLOG_HEAP2_VISIBLE. 0010 - 0012 are refactoring
>> needed to set the VM during on-access pruning. 0013 - 0015 are small
>> steps toward setting the VM on-access. And 0016 sets the prune xid on
>> insert so we may set the VM on-access for pages that have only new
>> data.
>
> I committed 0001 and 0002. attached v25 reflects that.
> 0001-0004 refactoring steps for eliminate visible record from phase I
> (not probably independent commits in the end)
> 0005 eliminate XLOG_HEAP2_VISIBLE from phase I vac
> 0006-0007 removing the rest of XLOG_HEAP2_VISIBLE
> 0008-0010 refactoring for setting VM on-access
> 0011-0013 setting the VM on-access
> 0014 - setting pd_prune_xid on insert
>
> - Melanie
>
<v25-0001-Combine-visibilitymap_set-cases-in-lazy_scan_pru.patch><v25-0002-Refactor-lazy_scan_prune-VM-set-logic-into-helpe.patch><v25-0003-Set-the-VM-in-heap_page_prune_and_freeze.patch><v25-0004-Move-VM-assert-into-prune-freeze-code.patch><v25-0005-Eliminate-XLOG_HEAP2_VISIBLE-from-vacuum-phase-I.patch><v25-0006-Eliminate-XLOG_HEAP2_VISIBLE-from-empty-page-vac.patch><v25-0007-Remove-XLOG_HEAP2_VISIBLE-entirely.patch><v25-0008-Rename-GlobalVisTestIsRemovableXid-to-GlobalVisX.patch><v25-0009-Use-GlobalVisState-in-vacuum-to-determine-page-l.patch><v25-0011-Track-which-relations-are-modified-by-a-query.patch><v25-0012-Pass-down-information-on-table-modification-to-s.patch><v25-0013-Allow-on-access-pruning-to-set-pages-all-visible.patch><v25-0014-Set-pd_prune_xid-on-insert.patch>

A few more small comments. Sorry for keeping come out new comments. Actually I learned a lot about vacuum from
reviewingthis patch. 

1 - 0001
```
+-- the checkpoint cleans the buffer dirtied by freezing the sole tuple
+checkpoint;
+-- truncating the VM ensures that the next vacuum will need to set it
+select pg_truncate_visibility_map('test_vac_unmodified_heap');
+-- vacuum sets the VM but does not need to set PD_ALL_VISIBLE so no heap page
+-- modification
+vacuum test_vac_unmodified_heap;
```

The last vacuum is expected to set vm bits, but the test doesn’t verify that. Should we verify that like:
```
evantest=# SELECT blkno, all_visible, all_frozen FROM pg_visibility_map('test_vac_unmodified_heap');
 blkno | all_visible | all_frozen
-------+-------------+------------
     0 | t           | t
(1 row)
```

As you have been using the extension pg_visibility, adding the verification with pg_visibility_map() should not be a
burden.

2 - 0001
```
         if (presult.all_frozen)
         {
+            /*
+             * We can pass InvalidTransactionId as our cutoff_xid, since a
+             * snapshotConflictHorizon sufficient to make everything safe for
+             * REDO was logged when the page's tuples were frozen.
+             */
             Assert(!TransactionIdIsValid(presult.vm_conflict_horizon));
-            flags |= VISIBILITYMAP_ALL_FROZEN;
+            new_vmbits |= VISIBILITYMAP_ALL_FROZEN;
         }
```

The comment here is a little confusing. In the old code, the Assert() as immediately above the call
visibilitymap_set(),and cutoff_xid is a parameter to the call. But the new code moves the Assert() as well as the
commentfar away from the call visibilitymap_set(), so I think the comment should stay together with the call of
visibilitymap_set().

3 - 0002
```
 * If it finds that the page-level visibility hint or VM is corrupted, it will
* fix them by clearing the VM bits and visibility page hint. This does not
```

In the second line, “visibility page hint” is understandable but feels not quite good. I know it’s actually “page-level
visibilityhint”, so how about just “visibility hint”. 

4 - 0002
```
     /*
-     * As of PostgreSQL 9.2, the visibility map bit should never be set if the
-     * page-level bit is clear.  However, it's possible that the bit got
-     * cleared after heap_vac_scan_next_block() was called, so we must recheck
-     * with buffer lock before concluding that the VM is corrupt.
+     * For the purposes of logging, count whether or not the page was newly
+     * set all-visible and, potentially, all-frozen.
      */
-    else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
-             visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
+    if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0 &&
+        (new_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
     {
```

Without do_set_vm==true, old_vmbits will only be 0, thus this “if-elseif” that uses old_vmbits should be moved into “if
(do_set_vm)”.From this perspective, if not do_set_vm, this function can return early, like: 

```
Do_set_vm = heap_page_will_set_vm(&new_vmbits)
If (!do_set_vm)
   Return presult.ndeleted;

PageSetAllVisible(page);
MarkBufferDirty(buf);
old_vmbits = visibilitymap_set(new_vmbits);
If (old_vmbits..)
{
..
}
Else if (old_vmbits…)
{
…
}

Return presult.ndeleted;
```

5 - 0003
```
 /*
  *    lazy_scan_prune() -- lazy_scan_heap() pruning and freezing.
  *
@@ -2076,15 +1979,14 @@ lazy_scan_prune(LVRelState *vacrel,
                 bool *vm_page_frozen)
 {
     Relation    rel = vacrel->rel;
-    bool        do_set_vm = false;
-    uint8        new_vmbits = 0;
-    uint8        old_vmbits = 0;
     PruneFreezeResult presult;
     PruneFreezeParams params = {
         .relation = rel,
         .buffer = buf,
+        .vmbuffer = vmbuffer,
+        .blk_known_av = all_visible_according_to_vm,
         .reason = PRUNE_VACUUM_SCAN,
-        .options = HEAP_PAGE_PRUNE_FREEZE,
+        .options = HEAP_PAGE_PRUNE_FREEZE | HEAP_PAGE_PRUNE_UPDATE_VM,
         .vistest = vacrel->vistest,
         .cutoffs = &vacrel->cutoffs,
     };
```

This maybe a legacy bug. Here presult is not initialized, and it is immediately passed to heap_page_prune_and_freeze():

```
    heap_page_prune_and_freeze(¶ms,
                               &presult, <=== here
                               &vacrel->offnum,
                               &vacrel->NewRelfrozenXid, &vacrel->NewRelminMxid);
```

Then heap_page_prune_and_freeze() immediately calls prune_freeze_setup():
```
    /* Initialize prstate */
    prune_freeze_setup(params,
                       new_relfrozen_xid, new_relmin_mxid,
                       presult, &prstate);
```

And prune_freeze_setup() takes presult as a const pointer:
```
static void
prune_freeze_setup(PruneFreezeParams *params,
                   TransactionId *new_relfrozen_xid,
                   MultiXactId *new_relmin_mxid,
                   const PruneFreezeResult *presult, <=== here
                   PruneState *prstate)
{
    prstate->deadoffsets = (OffsetNumber *) presult->deadoffsets; <== here, presult->deadoffsets could be a random
value
}
```

As this is a separate issue off the current patch, I just filed a new patch to fix it. Please take a look at:
https://www.postgresql.org/message-id/CAEoWx2%3DjiD1nqch4JQN%2BodAxZSD7mRvdoHUGJYN2r6tQG_66yQ%40mail.gmail.com

6 - 0003
```
+ * Returns true if one or both VM bits should be set, along with returning the
+ * desired what bits should be set in the VM in *new_vmbits.
```

Looks like a typo: “returning the desired what bits should be set”, maybe change to “returning the desired bits to be
set”.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Peter Eisentraut

Дата:

13 декабря 2025 г., 16:59:29

On 20.11.25 18:19, Melanie Plageman wrote:
> +    prstate->deadoffsets = (OffsetNumber *) presult->deadoffsets;

In your patch 
v22-0001-Split-heap_page_prune_and_freeze-into-helpers.patch, the 
assignment above casts away the const qualification of the function 
argument presult:

+static void
+prune_freeze_setup(PruneFreezeParams *params,
+                   TransactionId new_relfrozen_xid,
+                   MultiXactId new_relmin_mxid,
+                   const PruneFreezeResult *presult,
+                   PruneState *prstate)

(The cast is otherwise unnecessary, since the underlying type is the 
same on both sides.)

Since prstate->deadoffsets is in fact later modified, this makes the 
original const qualification invalid.

I suggest the attached patch to remove the faulty const qualification 
and the then-unnecessary cast.

Вложения

0001-Fix-const-qualification-in-prune_freeze_setup.patch.nocfbot

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

16 декабря 2025 г., 00:05:19

On Sat, Dec 13, 2025 at 8:59 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 20.11.25 18:19, Melanie Plageman wrote:
> > +     prstate->deadoffsets = (OffsetNumber *) presult->deadoffsets;
>
> In your patch
> v22-0001-Split-heap_page_prune_and_freeze-into-helpers.patch, the
> assignment above casts away the const qualification of the function
> argument presult:

Yea, this code (prune_freeze_setup() with a const-qualified
PruneFreezeResult parameter) is actually already in master -- not just
in this patchset.

> +static void
> +prune_freeze_setup(PruneFreezeParams *params,
> +                                  TransactionId new_relfrozen_xid,
> +                                  MultiXactId new_relmin_mxid,
> +                                  const PruneFreezeResult *presult,
> +                                  PruneState *prstate)
>
> (The cast is otherwise unnecessary, since the underlying type is the
> same on both sides.)
>
> Since prstate->deadoffsets is in fact later modified, this makes the
> original const qualification invalid.

I didn't realize I was misusing const here. What I meant to indicate
by defining the prune_freeze_setup() parameter, as const, is that the
PruneFreezeResult wouldn't be modified by prune_freeze_setup(). I did
not mean to indicate that no members of PruneFreezeResult would ever
be modified. deadoffsets is not modified in prune_freeze_setup(). So,
are you saying that I can't define a parameter as const if even the
caller modifies it?

I'm fine with committing a change, I just want to understand.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

16 декабря 2025 г., 00:29:03

Hi,

Attached v26 includes a new patch, 0002, which gets rid of
all_visible_according_to_vm in lazy_scan_prune(). We've kept this
cached copy of the all-visible bit since the VM was added way back in
608195a3a365. Back then, the VM wasn't pinned unless
all_visible_according_to_vm was false. Now that we unconditionally
have the VM page pinned, there isn't much performance benefit to using
that cached value. I did some testing of the worst possible case and
saw no difference in timing. By removing that, we simplify heap vacuum
code now.  And we improve clarity once the VM update is combined into
the prune/freeze WAL record and when the VM is set on-access.

I think 0001 and 0002 (and maybe 0003) are worthwhile clarity
improvements on their own.

On Wed, Dec 10, 2025 at 11:07 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> A few more small comments. Sorry for keeping come out new comments. Actually I learned a lot about vacuum from
reviewingthis patch. 

Thanks for the continued review. Your feedback is improving the patchset.

> The last vacuum is expected to set vm bits, but the test doesn’t verify that. Should we verify that like:
> ```
> evantest=# SELECT blkno, all_visible, all_frozen FROM pg_visibility_map('test_vac_unmodified_heap');
>  blkno | all_visible | all_frozen
> -------+-------------+------------
>      0 | t           | t
> (1 row)

I've done this. I've actually added three such verifications -- one
after each step where the VM is expected to change. It shouldn't be
very expensive, so I think it is okay. The way the test would fail if
the buffer wasn't correctly dirtied is that it would assert out -- so
the visibility map test wouldn't even have a chance to fail. But, I
think it is also okay to confirm that the expected things are
happening with the VM -- it just gives us extra coverage.

>                 if (presult.all_frozen)
>                 {
> +                       /*
> +                        * We can pass InvalidTransactionId as our cutoff_xid, since a
> +                        * snapshotConflictHorizon sufficient to make everything safe for
> +                        * REDO was logged when the page's tuples were frozen.
> +                        */
>                         Assert(!TransactionIdIsValid(presult.vm_conflict_horizon));
> -                       flags |= VISIBILITYMAP_ALL_FROZEN;
> +                       new_vmbits |= VISIBILITYMAP_ALL_FROZEN;
>                 }
>
> The comment here is a little confusing. In the old code, the Assert() as immediately above the call
visibilitymap_set(),and cutoff_xid is a parameter to the call. But the new code moves the Assert() as well as the
commentfar away from the call visibilitymap_set(), so I think the comment should stay together with the call of
visibilitymap_set().

Good point. I've moved it closer to visibilitymap_set() and modified
and moved the assert so that it is together with the comment. I think
the comment makes little sense without the assertion.

>  * If it finds that the page-level visibility hint or VM is corrupted, it will
> * fix them by clearing the VM bits and visibility page hint. This does not
>
> In the second line, “visibility page hint” is understandable but feels not quite good. I know it’s actually
“page-levelvisibility hint”, so how about just “visibility hint”. 

I've changed this.

>         /*
> -        * As of PostgreSQL 9.2, the visibility map bit should never be set if the
> -        * page-level bit is clear.  However, it's possible that the bit got
> -        * cleared after heap_vac_scan_next_block() was called, so we must recheck
> -        * with buffer lock before concluding that the VM is corrupt.
> +        * For the purposes of logging, count whether or not the page was newly
> +        * set all-visible and, potentially, all-frozen.
>          */
> -       else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
> -                        visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)
> +       if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0 &&
> +               (new_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
>         {
> ```
>
> Without do_set_vm==true, old_vmbits will only be 0, thus this “if-elseif” that uses old_vmbits should be moved into
“if(do_set_vm)”. From this perspective, if not do_set_vm, this function can return early, like: 

Good point. I've actually gone ahead in 0002 and refactored this whole
section a bit (I got rid of all_visible_according_to_vm). 0002 is a
new patch in this attached v26, and it needs review. I think this
refactoring makes the code quite a bit clearer -- especially once we
start setting the VM on-access. It does, amongst other things, return
early if all_visible is false, like you suggested.

> + * Returns true if one or both VM bits should be set, along with returning the
> + * desired what bits should be set in the VM in *new_vmbits.
> ```
>
> Looks like a typo: “returning the desired what bits should be set”, maybe change to “returning the desired bits to be
set”.

Fixed.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Peter Eisentraut

Дата:

16 декабря 2025 г., 15:18:25

On 15.12.25 22:05, Melanie Plageman wrote:
> On Sat, Dec 13, 2025 at 8:59 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>>
>> On 20.11.25 18:19, Melanie Plageman wrote:
>>> +     prstate->deadoffsets = (OffsetNumber *) presult->deadoffsets;
>>
>> In your patch
>> v22-0001-Split-heap_page_prune_and_freeze-into-helpers.patch, the
>> assignment above casts away the const qualification of the function
>> argument presult:
> 
> Yea, this code (prune_freeze_setup() with a const-qualified
> PruneFreezeResult parameter) is actually already in master -- not just
> in this patchset.
> 
>> +static void
>> +prune_freeze_setup(PruneFreezeParams *params,
>> +                                  TransactionId new_relfrozen_xid,
>> +                                  MultiXactId new_relmin_mxid,
>> +                                  const PruneFreezeResult *presult,
>> +                                  PruneState *prstate)
>>
>> (The cast is otherwise unnecessary, since the underlying type is the
>> same on both sides.)
>>
>> Since prstate->deadoffsets is in fact later modified, this makes the
>> original const qualification invalid.
> 
> I didn't realize I was misusing const here. What I meant to indicate
> by defining the prune_freeze_setup() parameter, as const, is that the
> PruneFreezeResult wouldn't be modified by prune_freeze_setup(). I did
> not mean to indicate that no members of PruneFreezeResult would ever
> be modified.

I'm not sure there is a difference between these two statements.  The 
struct won't be modified is the same as none of its fields will be modified.

> deadoffsets is not modified in prune_freeze_setup(). So,
> are you saying that I can't define a parameter as const if even the
> caller modifies it?

You are not modifying deadoffsets in prune_freeze_setup(), but you are 
assigning its address to a pointer variable that is not const-qualified, 
and so it could be used to modify it later on.

A caller to prune_freeze_setup() that sees the signature const 
PruneFreezeResult *presult could pass a pointer to a PruneFreezeResult 
object that is notionally in read-only memory.  But through the 
non-const-qualified pointer you could later modify the pointed-to 
memory, which would be invalid.  The point of propagating the qualifiers 
is to prevent that at compile time.

If what you want is something like, "prune_freeze_setup() does not 
change any of the fields of what presult points to, but it does record a 
pointer to one of its fields with the intention of modifying it later 
after prune_freeze_setup() is finished", then I think C cannot represent 
that with this API.

Here is a simplified example:

#include <stdlib.h>

// corresponds to PruneFreezeResult
struct foo
{
    int offsets[5];
};

// corresponds to PruneState
struct bar
{
    int *offsets;
};

static void setup(const struct foo *f)
{
    struct bar *b = malloc(sizeof(struct bar));

    b->offsets = f->offsets;  // warning
}

This produces a warning:

test.c:20:20: warning: assignment discards 'const' qualifier from 
pointer target type

The reason is that what "f" points to is const, which means that all its 
fields are const.  The fix is to remove the const from the function 
argument declaration.

One of the possible sources of confusion here is that one struct uses an 
array and the other a pointer, and these sometimes behave similarly and 
sometimes not.

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

16 декабря 2025 г., 19:07:17

On Tue, Dec 16, 2025 at 7:18 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>
> You are not modifying deadoffsets in prune_freeze_setup(), but you are
> assigning its address to a pointer variable that is not const-qualified,
> and so it could be used to modify it later on.
>
> A caller to prune_freeze_setup() that sees the signature const
> PruneFreezeResult *presult could pass a pointer to a PruneFreezeResult
> object that is notionally in read-only memory.  But through the
> non-const-qualified pointer you could later modify the pointed-to
> memory, which would be invalid.  The point of propagating the qualifiers
> is to prevent that at compile time.

Thanks for the explanation. I've committed your proposed fix.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

16 декабря 2025 г., 19:58:50

On Wed, Dec 3, 2025 at 6:07 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> If we're just talking about the renaming, looking at procarray.c, it
> is full of the word "removable" because its functions were largely
> used to examine and determine if everyone can see an xmax as committed
> and thus if that tuple is removable from their perspective. But
> nothing about the code that I can see means it has to be an xmax. We
> could just as well use the functions to determine if everyone can see
> an xmin as committed.

In the attached v27, I've removed the commit that renamed functions in
procarray.c. I've added a single wrapper GlobalVisTestXidNotRunning()
that is used in my code where I am testing live tuples. I think you'll
find that I've addressed all of your review comments now -- as I've
also gotten rid of the confusing blk_known_av logic through a series
of refactors.

The one outstanding point is which commits should bump
XLOG_PAGE_MAGIC. (also review of the reworked patches).

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

17 декабря 2025 г., 21:27:05

Hi!

in v27-0001:
> Melanie Plageman <melanieplageman(at)gmail(dot)com> wrote:
> > The last vacuum is expected to set vm bits, but the test doesn’t verify that. Should we verify that like:
> > ```
> > evantest=# SELECT blkno, all_visible, all_frozen FROM pg_visibility_map('test_vac_unmodified_heap');
> > blkno | all_visible | all_frozen
> > -------+-------------+------------
> > 0 | t | t
> > (1 row)

> I've done this. I've actually added three such verifications -- one
> after each step where the VM is expected to change. It shouldn't be
> very expensive, so I think it is okay. The way the test would fail if
> the buffer wasn't correctly dirtied is that it would assert out -- so
> the visibility map test wouldn't even have a chance to fail. But, I
> think it is also okay to confirm that the expected things are
> happening with the VM -- it just gives us extra coverage.

+1 on extra coverage. Should we also do sql-level check that the VM
indeed does not need to set PD_ALL_VISIBLE (check header bytes using
pageinspect?).

v27-0003 & v27-0004: I did not get the exact reason we introduced
`identify_and_fix_vm_corruption` in 0003 and moved code in 0004 to
another place. I can see we have this starting v25 of patch set. Well,
maybe this is not an issue at all...

in v27-0005. This patch changes code which is not exercised in
tests[0]. I spent some time understanding the conditions when we
entered this. There is a comment about non-finished relation
extension, but I got no success trying to reproduce this. I ended up
modifying code to lose PageSetAllVisible in proper places and running
vacuum. Looks like everything works as expected. I will spend some
more time on this, maybe I will be successful in writing an
injection-point-based TAP test which hits this...

[0] https://coverage.postgresql.org/src/backend/access/heap/vacuumlazy.c.gcov.html#1902
--
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

18 декабря 2025 г., 03:30:01

Thanks for the review!

In addition to addressing your feedback, attached v28 includes a
number of small fixes to comments, commit messages, and other things.
Notably, I've added one new refactoring patch 0009, which reduces the
diff of 0010 -- using the GlobalVisState instead of OldestXmin for
page visibility -- even further.

On Wed, Dec 17, 2025 at 1:27 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> > I've done this. I've actually added three such verifications -- one
> > after each step where the VM is expected to change. It shouldn't be
> > very expensive, so I think it is okay. The way the test would fail if
> > the buffer wasn't correctly dirtied is that it would assert out -- so
> > the visibility map test wouldn't even have a chance to fail. But, I
> > think it is also okay to confirm that the expected things are
> > happening with the VM -- it just gives us extra coverage.
>
> +1 on extra coverage. Should we also do sql-level check that the VM
> indeed does not need to set PD_ALL_VISIBLE (check header bytes using
> pageinspect?).

That's an interesting idea. I checked and, AFAICT, there are no tests
currently directly comparing the flags column returned by the
pageinspect page_header() function to one of the flag values. I've
added the following to attached v28.

SELECT (flags & x'0004'::int) <> 0
        FROM page_header(get_raw_page('test_vac_unmodified_heap', 0));

But I'm not sure if it is weird/confusing to be comparing the flag
directly to the number 4 like this. I don't really want to bother with
adding another function to pageinspect returning the status of
PD_ALL_VISIBLE (like page_visible() or something).

> v27-0003 & v27-0004: I did not get the exact reason we introduced
> `identify_and_fix_vm_corruption` in 0003 and moved code in 0004 to
> another place. I can see we have this starting v25 of patch set. Well,
> maybe this is not an issue at all...

It's mostly for ease of review. This is a pretty sensitive area of
code, so I thought it would be easier for the reviewer to confirm
correctness if I split it up. Andres had mentioned that the commit was
hard to review because so many different things were happening.

In v27, 0003 moves the VM clear code into a helper. 0004 and 0005
moves all the VM setting/clearing code to
heap_page_prune_and_freeze(). And 0006 actually sets the VM in the
same critical section as pruning/freezing and emits a single WAL
record.

I'm not really sure which commits should stay independent in the final
version I push to master.

> in v27-0005. This patch changes code which is not exercised in
> tests[0]. I spent some time understanding the conditions when we
> entered this. There is a comment about non-finished relation
> extension, but I got no success trying to reproduce this. I ended up
> modifying code to lose PageSetAllVisible in proper places and running
> vacuum. Looks like everything works as expected. I will spend some
> more time on this, maybe I will be successful in writing an
> injection-point-based TAP test which hits this...

Based on the coverage report link you provided, that code is changed
by v27 0007, not 0005. 0005 is about moving an assertion out of
lazy_scan_prune(). 0007 changes lazy_scan_new_or_empty() (the code in
question).

Regarding 0007, it looks like what is uncovered (the orange bits in
the coverage report are uncovered, I assume) is empty pages _without_
PD_ALL_VISIBLE set. I don't see anywhere where PageSetAllVisible() is
called except vacuum and COPY FREEZE.

If I was trying to guess how empty pages with PD_ALL_VISIBLE set are
getting vacuumed, I would think it is due to SKIP_PAGES_THRESHOLD
causing us to vacuum an all-frozen empty page.

Then the question is, why wouldn't we have coverage of the empty page
first being set all-visible/all-frozen? It can't be COPY FREEZE
because the page is empty. And it can't be vacuum, because then we
would have coverage. It's very mysterious.

It would be good to have coverage for this case. I don't think you'll
need an injection point for the main case of "empty page not yet set
all-visible is vacuumed for the first time" (unless I'm
misunderstanding something).

I'm not sure how you'll test the "vacuuming an empty, previously
uninitialized page" case described in this comment, though.

             * It's possible that another backend has extended the heap,
             * initialized the page, and then failed to WAL-log the page due
             * to an ERROR.  Since heap extension is not WAL-logged, recovery
             * might try to replay our record setting the page all-visible and
             * find that the page isn't initialized, which will cause a PANIC.
             * To prevent that, check whether the page has been previously
             * WAL-logged, and if not, do that now.

You'd want to force an error during relation extension and then vacuum
the page. I don't know if you need an injection point to force the
error -- depends on what kind of error, I think.

So that I know for attribution, did you review 0003-0005?

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

18 декабря 2025 г., 11:55:46

On Thu, 18 Dec 2025 at 05:30, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> > in v27-0005. This patch changes code which is not exercised in
> > tests[0]. I spent some time understanding the conditions when we
> > entered this. There is a comment about non-finished relation
> > extension, but I got no success trying to reproduce this. I ended up
> > modifying code to lose PageSetAllVisible in proper places and running
> > vacuum. Looks like everything works as expected. I will spend some
> > more time on this, maybe I will be successful in writing an
> > injection-point-based TAP test which hits this...
>
> Based on the coverage report link you provided, that code is changed
> by v27 0007, not 0005. 0005 is about moving an assertion out of
> lazy_scan_prune(). 0007 changes lazy_scan_new_or_empty() (the code in
> question).
>
> Regarding 0007, it looks like what is uncovered (the orange bits in
> the coverage report are uncovered, I assume) is empty pages _without_
> PD_ALL_VISIBLE set. I don't see anywhere where PageSetAllVisible() is
> called except vacuum and COPY FREEZE.

Sure, I meant 0007.

> If I was trying to guess how empty pages with PD_ALL_VISIBLE set are
> getting vacuumed, I would think it is due to SKIP_PAGES_THRESHOLD
> causing us to vacuum an all-frozen empty page.

Yes, vacuum (disable_page_skipping);

> Then the question is, why wouldn't we have coverage of the empty page
> first being set all-visible/all-frozen? It can't be COPY FREEZE
> because the page is empty. And it can't be vacuum, because then we
> would have coverage. It's very mysterious.
>
> It would be good to have coverage for this case. I don't think you'll
> need an injection point for the main case of "empty page not yet set
> all-visible is vacuumed for the first time" (unless I'm
> misunderstanding something).
>
> I'm not sure how you'll test the "vacuuming an empty, previously
> uninitialized page" case described in this comment, though.
>
>              * It's possible that another backend has extended the heap,
>              * initialized the page, and then failed to WAL-log the page due
>              * to an ERROR.  Since heap extension is not WAL-logged, recovery
>              * might try to replay our record setting the page all-visible and
>              * find that the page isn't initialized, which will cause a PANIC.
>              * To prevent that, check whether the page has been previously
>              * WAL-logged, and if not, do that now.
>
> You'd want to force an error during relation extension and then vacuum
> the page. I don't know if you need an injection point to force the
> error -- depends on what kind of error, I think.

I did small archeology and this "if (PageIsEmpty(page)) {   if
(!PageIsAllVisible(page)) { .... }}" code  originates back to
608195a3a365. Comment about not WAL-logged relation extension is from
a6370fd9ed3d, and I don't think we need to think about this case.

I am currently inclined to think that we cannot see an empty page that
has PD_ALL_VISIBLE not-set. This is because when we make a page empty,
we are in a critical section, and we WAL-log everything we do, so our
changes should not be half-made. Maybe as of 608195a3a365, there was a
case with empry-page-without-PD_ALL_VISIBLE, but I dont think this
happens on HEAD.

> So that I know for attribution, did you review 0003-0005?

yes, but I did not have any valuable review points for them.


Also, after the whole set is committed, we should then never
experience discrepancy between  PD_ALL_VISIBLE and VM bits? Because
they will be set in a single WAL record. The only cases when heap and
VM disagrees on all-visibility then are corruption,
pg_visibilitymap_truncate and old data (data before v19+ upgrade?)
If my understanding is correct, should we add document this?

-- 
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

18 декабря 2025 г., 18:18:09

On Thu, Dec 18, 2025 at 3:55 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> On Thu, 18 Dec 2025 at 05:30, Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>
> > If I was trying to guess how empty pages with PD_ALL_VISIBLE set are
> > getting vacuumed, I would think it is due to SKIP_PAGES_THRESHOLD
> > causing us to vacuum an all-frozen empty page.
>
> Yes, vacuum (disable_page_skipping);

Ah, right, that would be a reliable way for it to happen.

> > Then the question is, why wouldn't we have coverage of the empty page
> > first being set all-visible/all-frozen? It can't be COPY FREEZE
> > because the page is empty. And it can't be vacuum, because then we
> > would have coverage. It's very mysterious.
<--snip-->
> I am currently inclined to think that we cannot see an empty page that
> has PD_ALL_VISIBLE not-set. This is because when we make a page empty,
> we are in a critical section, and we WAL-log everything we do, so our
> changes should not be half-made. Maybe as of 608195a3a365, there was a
> case with empry-page-without-PD_ALL_VISIBLE, but I dont think this
> happens on HEAD.

Right, so the way that empty pages get set PD_ALL_VISIBLE is when a
page has all its tuples deleted, the next time it is vacuumed it will
be set all-visible and all-frozen and have PD_ALL_VISIBLE set. (if
it's a trailing page it will be truncated, but any non-trailing page
will be like this).

But you are right, I don't see any non-error code path where a heap
page would become empty (all line pointers set unused) and then not be
set all-visible. Only vacuum sets line pointers unused and if all the
line pointers are unused it will always set the page all-visible.

I think, though, that if we error out in lazy_scan_prune() after
returning from heap_page_prune_and_freeze() such that we don't set the
empty page all-visible, we can end up with an empty page without
PD_ALL_VISIBLE set. You can see how this might work by patching the VM
set code in lazy_scan_prune() to skip empty pages.

> I did small archeology and this "if (PageIsEmpty(page)) {   if
> (!PageIsAllVisible(page)) { .... }}" code  originates back to
> 608195a3a365. Comment about not WAL-logged relation extension is from
> a6370fd9ed3d, and I don't think we need to think about this case.

Thanks for looking into this. Even if this code was added to handle
the error codepath I mentioned above, it seems like it would have been
good enough to just let lazy_scan_prune() handle setting the empty
page all-visible the next time the page was vacuumed. Since there is
no non-error code path where this can happen, it doesn't seem like it
would merit its own special case.

It is possible it was more common as of 608195a3a365, as you say.

I don't understand how the bug fixed by a6370fd9ed3d can happen. When
a new page is initialized, flags are set to 0, so regardless of WAL
logging of the extension not happening, how would the new page have
been set PD_ALL_VISIBLE?  We'll have to ask Andres or Robert about how
this was hit.

> Also, after the whole set is committed, we should then never
> experience discrepancy between  PD_ALL_VISIBLE and VM bits? Because
> they will be set in a single WAL record. The only cases when heap and
> VM disagrees on all-visibility then are corruption,
> pg_visibilitymap_truncate and old data (data before v19+ upgrade?)
> If my understanding is correct, should we add document this?

Even on current master, I don't see a scenario other than VM
corruption or truncation where PD_ALL_VISIBLE can be set but not the
VM (or vice versa). The only way would be if you error out after
setting PD_ALL_VISIBLE before setting the VM. Setting PD_ALL_VISIBLE
is not in a critical section in lazy_scan_prune(), so it won't panic
and dump shared memory, so the buffer with PD_ALL_VISIBLE set may
later get written out. But the only obvious way I see to error out of
MarkBufferDirty() is if the buffer is not valid -- which would have
kept us from doing previous operations on the buffer, I would think.

It's true this will no longer happen after my patches, as
PageSetAllVisible() will happen in a critical section. We could add a
comment about this particular scenario in the code somewhere. But I
don't think we should document it in any user-facing documentation
since you could still truncate the VM and have the two out of sync.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

18 декабря 2025 г., 18:45:58

On Thu, 18 Dec 2025 at 20:18, Melanie Plageman
<melanieplageman@gmail.com> wrote:
> > Also, after the whole set is committed, we should then never
> > experience discrepancy between  PD_ALL_VISIBLE and VM bits? Because
> > they will be set in a single WAL record. The only cases when heap and
> > VM disagrees on all-visibility then are corruption,
> > pg_visibilitymap_truncate and old data (data before v19+ upgrade?)
> > If my understanding is correct, should we add document this?
>
> Even on current master, I don't see a scenario other than VM
> corruption or truncation where PD_ALL_VISIBLE can be set but not the
> VM (or vice versa). The only way would be if you error out after
> setting PD_ALL_VISIBLE before setting the VM. Setting PD_ALL_VISIBLE
> is not in a critical section in lazy_scan_prune(), so it won't panic
> and dump shared memory, so the buffer with PD_ALL_VISIBLE set may
> later get written out. But the only obvious way I see to error out of
> MarkBufferDirty() is if the buffer is not valid -- which would have
> kept us from doing previous operations on the buffer, I would think.
>

Well... I may be missing something, but on current HEAD,
XLOG_HEAP2_PRUNE_VACUUM_SCAN and XLOG_HEAP2_VISIBLE are two different
record, XLOG_HEAP2_PRUNE_VACUUM_SCAN being always emitted first. So,
WAL writer may end up kill-9-ed just after
XLOG_HEAP2_PRUNE_VACUUM_SCAN makes it to the disk, and
XLOG_HEAP2_VISIBLE never. Crash recovery then, and we have
discrepancy. This does not happen with a single WAL record.
Another simple reproducer here: standby streaming, receiving
XLOG_HEAP2_PRUNE_VACUUM_SCAN from primary, Then network becomes bad,
and we never get XLOG_HEAP2_VISIBLE from primary. Then we promoted by
the admin. And again, VM bit vs PD_ALL_VISIBLE discrepancy. Am I
missing something?

-- 
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

18 декабря 2025 г., 21:07:20

On Thu, 18 Dec 2025 at 20:18, Melanie Plageman
<melanieplageman@gmail.com> wrote:

> But you are right, I don't see any non-error code path where a heap
> page would become empty (all line pointers set unused) and then not be
> set all-visible. Only vacuum sets line pointers unused and if all the
> line pointers are unused it will always set the page all-visible.
>
> I think, though, that if we error out in lazy_scan_prune() after
> returning from heap_page_prune_and_freeze() such that we don't set the
> empty page all-visible, we can end up with an empty page without
> PD_ALL_VISIBLE set. You can see how this might work by patching the VM
> set code in lazy_scan_prune() to skip empty pages.
>

Thank you for your explanation!  I completely forgot that PD_ALL_VIS
is a non-persistent change (hint bit). so its update can be trivially
lost.
The simplest real-life example is being killed just after returning
from heap_page_prune_and_freeze, yes.
PFA tap test covering lazy_scan_new_or_empty code path for
empty-but-not-all-visible page

-- 
Best regards,
Kirill Reshke

Вложения

v1-0001-Add-TAP-test-for-empty-page-vacuum.patch

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

18 декабря 2025 г., 22:57:57

On Thu, Dec 18, 2025 at 1:07 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> On Thu, 18 Dec 2025 at 20:18, Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>
> > But you are right, I don't see any non-error code path where a heap
> > page would become empty (all line pointers set unused) and then not be
> > set all-visible. Only vacuum sets line pointers unused and if all the
> > line pointers are unused it will always set the page all-visible.
> >
> > I think, though, that if we error out in lazy_scan_prune() after
> > returning from heap_page_prune_and_freeze() such that we don't set the
> > empty page all-visible, we can end up with an empty page without
> > PD_ALL_VISIBLE set. You can see how this might work by patching the VM
> > set code in lazy_scan_prune() to skip empty pages.
>
> Thank you for your explanation!  I completely forgot that PD_ALL_VIS
> is a non-persistent change (hint bit). so its update can be trivially
> lost.
> The simplest real-life example is being killed just after returning
> from heap_page_prune_and_freeze, yes.
> PFA tap test covering lazy_scan_new_or_empty code path for
> empty-but-not-all-visible page

Cool test! I'm going to have to think more about whether or not it is
worth adding a whole new TAP test for this codepath. Is there an
existing TAP test we could add it to so we don't need to make a new
cluster, etc? How long does the test take to run? Obviously it will be
quite short, but every bit we add to the test suite counts. I don't
actually know how much overhead there is with injection points.

I was chatting with Andres and he mentioned there is one other case
where you can end up in this code path (empty page without
PD_ALL_VISIBLE set) and this case does actually trigger this code:

            if (RelationNeedsWAL(vacrel->rel) &&
                !XLogRecPtrIsValid(PageGetLSN(page)))
                log_newpage_buffer(buf, true);

If you are inserting to a new page and you successfully call
PageInit() (making the page no longer considered new by PageIsNew()
because pd_upper will be set) but you error out before actually
inserting the tuple, then you will have an empty page without
PD_ALL_VISIBLE set. And assuming you error out before emitting WAL,
the page will not have a valid LSN set. So you will hit that code
which calls log_newpage_buffer().

I would say this case is so narrow (the log_newpage_buffer() codepath
in lazy_scan_new_or_empty()), it's not worth the added test overhead,
but I just wanted to share what I learned about when this code could
be hit.

Previously it was more common in the bulk extension case to have empty
pages not set PD_ALL_VISIBLE because bulk extension would call
PageInit() on all of the pages it extended so all the pages except the
target page were empty (today they are not initialized so they go into
the PageIsNew() branch).

So, in both cases, it seems like the empty page not set PD_ALL_VISIBLE
mostly only hit if we previously errored out.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

18 декабря 2025 г., 23:04:34

On Thu, Dec 18, 2025 at 10:46 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> On Thu, 18 Dec 2025 at 20:18, Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > > Also, after the whole set is committed, we should then never
> > > experience discrepancy between  PD_ALL_VISIBLE and VM bits? Because
> > > they will be set in a single WAL record. The only cases when heap and
> > > VM disagrees on all-visibility then are corruption,
> > > pg_visibilitymap_truncate and old data (data before v19+ upgrade?)
> > > If my understanding is correct, should we add document this?
> >
> > Even on current master, I don't see a scenario other than VM
> > corruption or truncation where PD_ALL_VISIBLE can be set but not the
> > VM (or vice versa). The only way would be if you error out after
> > setting PD_ALL_VISIBLE before setting the VM. Setting PD_ALL_VISIBLE
> > is not in a critical section in lazy_scan_prune(), so it won't panic
> > and dump shared memory, so the buffer with PD_ALL_VISIBLE set may
> > later get written out. But the only obvious way I see to error out of
> > MarkBufferDirty() is if the buffer is not valid -- which would have
> > kept us from doing previous operations on the buffer, I would think.
>
> Well... I may be missing something, but on current HEAD,
> XLOG_HEAP2_PRUNE_VACUUM_SCAN and XLOG_HEAP2_VISIBLE are two different
> record, XLOG_HEAP2_PRUNE_VACUUM_SCAN being always emitted first. So,
> WAL writer may end up kill-9-ed just after
> XLOG_HEAP2_PRUNE_VACUUM_SCAN makes it to the disk, and
> XLOG_HEAP2_VISIBLE never. Crash recovery then, and we have
> discrepancy. This does not happen with a single WAL record.
> Another simple reproducer here: standby streaming, receiving
> XLOG_HEAP2_PRUNE_VACUUM_SCAN from primary, Then network becomes bad,
> and we never get XLOG_HEAP2_VISIBLE from primary. Then we promoted by
> the admin. And again, VM bit vs PD_ALL_VISIBLE discrepancy. Am I
> missing something?

Well, currently XLOG_HEAP2_PRUNE_VACUUM_SCAN doesn't set
PD_ALL_VISIBLE. PD_ALL_VISIBLE is WAL-logged in the XLOG_HEAP2_VISIBLE
record because in lazy_scan_prune() we call PageSetAllVisible() and
then visibilitymap_set() -> log_heap_visible() adds the heap buffer to
the WAL chain (with XLogRegisterBuffer()).

And if you notice when XLOG_HEAP2_VISIBLE is replayed in
heap_xlog_visible(), that is where we do PageSetAllVisible() on the
heap page.

So I think you can end up with PD_ALL_VISIBLE set if you error out
precisely between setting it and WAL logging it because we don't set
it in a critical section. But you can't end up with a WAL record that
sets PD_ALL_VISIBLE and another one that sets the VM.

Once we have my code changes, you can never end up with PD_ALL_VISIBLE
set and the VM not set because they are in the same critical section
and if we error out, it will cause a panic which will purge shared
memory.

- Melanie

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

18 декабря 2025 г., 23:31:27

On Fri, 19 Dec 2025 at 00:58, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Dec 18, 2025 at 1:07 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > On Thu, 18 Dec 2025 at 20:18, Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> >
> > > But you are right, I don't see any non-error code path where a heap
> > > page would become empty (all line pointers set unused) and then not be
> > > set all-visible. Only vacuum sets line pointers unused and if all the
> > > line pointers are unused it will always set the page all-visible.
> > >
> > > I think, though, that if we error out in lazy_scan_prune() after
> > > returning from heap_page_prune_and_freeze() such that we don't set the
> > > empty page all-visible, we can end up with an empty page without
> > > PD_ALL_VISIBLE set. You can see how this might work by patching the VM
> > > set code in lazy_scan_prune() to skip empty pages.
> >
> > Thank you for your explanation!  I completely forgot that PD_ALL_VIS
> > is a non-persistent change (hint bit). so its update can be trivially
> > lost.
> > The simplest real-life example is being killed just after returning
> > from heap_page_prune_and_freeze, yes.
> > PFA tap test covering lazy_scan_new_or_empty code path for
> > empty-but-not-all-visible page
>
> Cool test! I'm going to have to think more about whether or not it is
> worth adding a whole new TAP test for this codepath. Is there an
> existing TAP test we could add it to so we don't need to make a new
> cluster, etc? How long does the test take to run? Obviously it will be
> quite short, but every bit we add to the test suite counts. I don't
> actually know how much overhead there is with injection points.
>

Well, on my pc this test runs in ~1.5 sec. I did not find any other
TAP test to place this, so created a new.
Actually, I only check for specific patterns in the log file of the
cluster in this test, so this test can instead be a regression test.

```
reshke=# VACUUM (DISABLE_PAGE_SKIPPING) vac_empty_test;
NOTICE:  notice triggered for injection point vacuum-empty-page-non-all-vis
VACUUM
reshke=#
```
We will just check in the .out file that the code hits
'vacuum-empty-page-non-all-vis' after an error.
injection points overhead should not be that awful, just from my
experience. Maybe buildfarm members can say something here, I dunno.

Also, we already have a bunch of regression+inj point tests for some
rare cases, exempli gratia
src/test/modules/nbtree/sql/nbtree_half_dead_pages.sql.

--
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Xuneng Zhou

Дата:

19 декабря 2025 г., 06:38:24

He Melanie,

Thanks for working on this.

On Wed, Dec 17, 2025 at 12:59 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Wed, Dec 3, 2025 at 6:07 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > If we're just talking about the renaming, looking at procarray.c, it
> > is full of the word "removable" because its functions were largely
> > used to examine and determine if everyone can see an xmax as committed
> > and thus if that tuple is removable from their perspective. But
> > nothing about the code that I can see means it has to be an xmax. We
> > could just as well use the functions to determine if everyone can see
> > an xmin as committed.
>
> In the attached v27, I've removed the commit that renamed functions in
> procarray.c. I've added a single wrapper GlobalVisTestXidNotRunning()
> that is used in my code where I am testing live tuples. I think you'll
> find that I've addressed all of your review comments now -- as I've
> also gotten rid of the confusing blk_known_av logic through a series
> of refactors.
>
> The one outstanding point is which commits should bump
> XLOG_PAGE_MAGIC. (also review of the reworked patches).
>
> - Melanie

I’ve done a basic review of patches 1 and 2. Here are some comments
which may be somewhat immature, as this is a fairly large change set
and I’m new to some parts of the code.

1) Potential stale old_vmbits after VM repair n v2

// Corruption check 1
if (!PageIsAllVisible(page) &&
(old_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
{
visibilitymap_clear(...); // VM now cleared to 0
// but old_vmbits still holds ALL_VISIBLE
}

// ... later ...

if (!presult.all_visible)
return presult.ndeleted; // Not taken if presult.all_visible=true

new_vmbits = VISIBILITYMAP_ALL_VISIBLE; // Want to set this

if (old_vmbits == new_vmbits) // Stale old_vmbits=ALL_VISIBLE,
new_vmbits=ALL_VISIBLE
  return presult.ndeleted; // issue: early return

After corruption repair clears the VM, old_vmbits is stale. The early
return can fire unexpectedly, leaving the VM cleared when it should be
re-set. Should we reset old_vmbits = 0 after the visibilitymap_clear?

2) Add Assert(BufferIsDirty(buf))

Since the patch's core claim is "buffer must be dirty before WAL
registration", an assertion encodes this invariant. Should we add:

Assert(BufferIsValid(buf));
Assert(BufferIsDirty(buf));

right before the visibilitymap_set() call?

3) Comment about "only scenario"

The comment at lines:
> "The only scenario where it is not already dirty is if the VM was removed…"

This phrasing could become misleading after future refactors. Can we
make it more direct like:

> "We must mark the heap buffer dirty before calling visibilitymap_set(), because it may WAL-log the buffer and
XLogRegisterBuffer()requires it." 

4) Comment clarity

Current comment:

> "Even if PD_ALL_VISIBLE is already set, we don't need to worry about unnecessarily dirtying the heap buffer, as it
mustbe marked dirty before adding it to the WAL chain. The only scenario where it is not already dirty is if the VM was
removed..."

In this test we now call MarkBufferDirty() on the heap page even when
only setting the VM, so the comments claiming “does not need to modify
the heap buffer”/“no heap page modification” might be misleading. It
might be better to say the test doesn’t need to modify heap
tuples/page contents or doesn’t need to prune/freeze.

--
Best,
Xuneng

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

20 декабря 2025 г., 00:09:47

Attached v29 addresses some feedback and also corrects a small error
with the assertion I had added in the previous version's 0009.

On Thu, Dec 18, 2025 at 10:38 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> I’ve done a basic review of patches 1 and 2. Here are some comments
> which may be somewhat immature, as this is a fairly large change set
> and I’m new to some parts of the code.
>
> 1) Potential stale old_vmbits after VM repair n v2

Good catch! I've fixed this in attached v29.

> 2) Add Assert(BufferIsDirty(buf))
>
> Since the patch's core claim is "buffer must be dirty before WAL
> registration", an assertion encodes this invariant. Should we add:
>
> Assert(BufferIsValid(buf));
> Assert(BufferIsDirty(buf));
>
> right before the visibilitymap_set() call?

There are already assertions that will trip in various places -- most
importantly in XLogRegisterBuffer(), which is the one that inspired
this refactor.

> The comment at lines:
> > "The only scenario where it is not already dirty is if the VM was removed…"
>
> This phrasing could become misleading after future refactors. Can we
> make it more direct like:
>
> > "We must mark the heap buffer dirty before calling visibilitymap_set(), because it may WAL-log the buffer and
XLogRegisterBuffer()requires it." 

I see your point about future refactors missing updating comments like
this. But, I don't think we are going to refactor the code such that
we can have PD_ALL_VISIBLE set without the VM bits set more often.
Also, it is common practice in Postgres to describe very specific edge
cases or odd scenarios in order to explain code that may seem
confusing without the comment. It does risk that comment later
becoming stale, but it is better that future developers understand why
the code is there.

That being said, I take your point that the comment is confusing. I
have updated it in a different way.

> > "Even if PD_ALL_VISIBLE is already set, we don't need to worry about unnecessarily dirtying the heap buffer, as it
mustbe marked dirty before adding it to the WAL chain. The only scenario where it is not already dirty is if the VM was
removed..."
>
> In this test we now call MarkBufferDirty() on the heap page even when
> only setting the VM, so the comments claiming “does not need to modify
> the heap buffer”/“no heap page modification” might be misleading. It
> might be better to say the test doesn’t need to modify heap
> tuples/page contents or doesn’t need to prune/freeze.

The point I'm trying to make is that we have to dirty the buffer even
if we don't modify the page because of the XLOG sub-system
requirements. And, it may seem like a waste to do that if not
modifying the page, but the page will rarely be clean anyway. I've
tried to make this more clear in attached v29.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Kirill Reshke

Дата:

20 декабря 2025 г., 15:32:38

On Sat, 20 Dec 2025 at 02:10, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Attached v29 addresses some feedback and also corrects a small error
> with the assertion I had added in the previous version's 0009.
>
> On Thu, Dec 18, 2025 at 10:38 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > I’ve done a basic review of patches 1 and 2. Here are some comments
> > which may be somewhat immature, as this is a fairly large change set
> > and I’m new to some parts of the code.
> >
> > 1) Potential stale old_vmbits after VM repair n v2
>
> Good catch! I've fixed this in attached v29.
>
> > 2) Add Assert(BufferIsDirty(buf))
> >
> > Since the patch's core claim is "buffer must be dirty before WAL
> > registration", an assertion encodes this invariant. Should we add:
> >
> > Assert(BufferIsValid(buf));
> > Assert(BufferIsDirty(buf));
> >
> > right before the visibilitymap_set() call?
>
> There are already assertions that will trip in various places -- most
> importantly in XLogRegisterBuffer(), which is the one that inspired
> this refactor.
>
> > The comment at lines:
> > > "The only scenario where it is not already dirty is if the VM was removed…"
> >
> > This phrasing could become misleading after future refactors. Can we
> > make it more direct like:
> >
> > > "We must mark the heap buffer dirty before calling visibilitymap_set(), because it may WAL-log the buffer and
XLogRegisterBuffer()requires it." 
>
> I see your point about future refactors missing updating comments like
> this. But, I don't think we are going to refactor the code such that
> we can have PD_ALL_VISIBLE set without the VM bits set more often.
> Also, it is common practice in Postgres to describe very specific edge
> cases or odd scenarios in order to explain code that may seem
> confusing without the comment. It does risk that comment later
> becoming stale, but it is better that future developers understand why
> the code is there.
>
> That being said, I take your point that the comment is confusing. I
> have updated it in a different way.
>
> > > "Even if PD_ALL_VISIBLE is already set, we don't need to worry about unnecessarily dirtying the heap buffer, as
itmust be marked dirty before adding it to the WAL chain. The only scenario where it is not already dirty is if the VM
wasremoved..." 
> >
> > In this test we now call MarkBufferDirty() on the heap page even when
> > only setting the VM, so the comments claiming “does not need to modify
> > the heap buffer”/“no heap page modification” might be misleading. It
> > might be better to say the test doesn’t need to modify heap
> > tuples/page contents or doesn’t need to prune/freeze.
>
> The point I'm trying to make is that we have to dirty the buffer even
> if we don't modify the page because of the XLOG sub-system
> requirements. And, it may seem like a waste to do that if not
> modifying the page, but the page will rarely be clean anyway. I've
> tried to make this more clear in attached v29.
>
> - Melanie


Hi! I checked v29-0009, about HeapTupleSatisfiesVacuumHorizon. Origins
of this code track down to fdf9e21196a6 which was committed as part of
[0], at which point
there was no HeapTupleSatisfiesVacuumHorizon function. I guess this is
the reason this optimization was not performed earlier.

I also think this patch is correct, because we do similar things for
HEAPTUPLE_DEAD & HEAPTUPLE_RECENTLY_DEAD, and
HeapTupleSatisfiesVacuumHorizon is just a proxy to
HeapTupleSatisfiesVacuumHorizon with only difference in DEAD VS
RECENTLY_DEAD handling.


Similar change could be done at heapam_scan_analyze_next_tuple

...
case HEAPTUPLE_DEAD:
case HEAPTUPLE_RECENTLY_DEAD:
/* Count dead and recently-dead rows */
*deadrows += 1;
break;
...



[0] https://www.postgresql.org/message-id/CABOikdP0meGuXPPWuYrP%3DvDvoqUdshF2xJAzZHWSKg03Rz_%2B9Q%40mail.gmail.com


--
Best regards,
Kirill Reshke

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Chao Li

Дата:

22 декабря 2025 г., 10:19:39


> On Dec 20, 2025, at 05:09, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> Attached v29 addresses some feedback and also corrects a small error
> with the assertion I had added in the previous version's 0009.
>
> On Thu, Dec 18, 2025 at 10:38 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>>
>> I’ve done a basic review of patches 1 and 2. Here are some comments
>> which may be somewhat immature, as this is a fairly large change set
>> and I’m new to some parts of the code.
>>
>> 1) Potential stale old_vmbits after VM repair n v2
>
> Good catch! I've fixed this in attached v29.
>
>> 2) Add Assert(BufferIsDirty(buf))
>>
>> Since the patch's core claim is "buffer must be dirty before WAL
>> registration", an assertion encodes this invariant. Should we add:
>>
>> Assert(BufferIsValid(buf));
>> Assert(BufferIsDirty(buf));
>>
>> right before the visibilitymap_set() call?
>
> There are already assertions that will trip in various places -- most
> importantly in XLogRegisterBuffer(), which is the one that inspired
> this refactor.
>
>> The comment at lines:
>>> "The only scenario where it is not already dirty is if the VM was removed…"
>>
>> This phrasing could become misleading after future refactors. Can we
>> make it more direct like:
>>
>>> "We must mark the heap buffer dirty before calling visibilitymap_set(), because it may WAL-log the buffer and
XLogRegisterBuffer()requires it." 
>
> I see your point about future refactors missing updating comments like
> this. But, I don't think we are going to refactor the code such that
> we can have PD_ALL_VISIBLE set without the VM bits set more often.
> Also, it is common practice in Postgres to describe very specific edge
> cases or odd scenarios in order to explain code that may seem
> confusing without the comment. It does risk that comment later
> becoming stale, but it is better that future developers understand why
> the code is there.
>
> That being said, I take your point that the comment is confusing. I
> have updated it in a different way.
>
>>> "Even if PD_ALL_VISIBLE is already set, we don't need to worry about unnecessarily dirtying the heap buffer, as it
mustbe marked dirty before adding it to the WAL chain. The only scenario where it is not already dirty is if the VM was
removed..."
>>
>> In this test we now call MarkBufferDirty() on the heap page even when
>> only setting the VM, so the comments claiming “does not need to modify
>> the heap buffer”/“no heap page modification” might be misleading. It
>> might be better to say the test doesn’t need to modify heap
>> tuples/page contents or doesn’t need to prune/freeze.
>
> The point I'm trying to make is that we have to dirty the buffer even
> if we don't modify the page because of the XLOG sub-system
> requirements. And, it may seem like a waste to do that if not
> modifying the page, but the page will rarely be clean anyway. I've
> tried to make this more clear in attached v29.
>
> - Melanie
>
<v29-0001-Combine-visibilitymap_set-cases-in-lazy_scan_pru.patch><v29-0002-Eliminate-use-of-cached-VM-value-in-lazy_scan_pr.patch><v29-0003-Refactor-lazy_scan_prune-VM-clear-logic-into-hel.patch><v29-0004-Set-the-VM-in-heap_page_prune_and_freeze.patch><v29-0005-Move-VM-assert-into-prune-freeze-code.patch><v29-0006-Eliminate-XLOG_HEAP2_VISIBLE-from-vacuum-phase-I.patch><v29-0007-Eliminate-XLOG_HEAP2_VISIBLE-from-empty-page-vac.patch><v29-0008-Remove-XLOG_HEAP2_VISIBLE-entirely.patch><v29-0009-Simplify-heap_page_would_be_all_visible-visibili.patch><v29-0010-Use-GlobalVisState-in-vacuum-to-determine-page-l.patch><v29-0011-Unset-all_visible-sooner-if-not-freezing.patch><v29-0012-Track-which-relations-are-modified-by-a-query.patch><v29-0013-Pass-down-information-on-table-modification-to-s.patch><v29-0014-Allow-on-access-pruning-to-set-pages-all-visible.patch><v29-0015-Set-pd_prune_xid-on-insert.patch>

A few more comments on v29:

1 - 0002 - Looks like since 0002, visibilitymap_set()’s return value is no longer used, so do we need to update the
functionand change return type to void? I remember in some patches, to address Coverity alerts, people had to do
“(void)function_with_a_return_value()”. 

2 - 0003
```
+ * Helper to correct any corruption detected on an heap page and its
```

Nit: “an” -> “a”

3 - 0003
```
+static bool
+identify_and_fix_vm_corruption(Relation rel, Buffer heap_buffer,
+                               BlockNumber heap_blk, Page heap_page,
+                               int nlpdead_items,
+                               Buffer vmbuffer,
+                               uint8 vmbits)
+{
+    Assert(visibilitymap_get_status(rel, heap_blk, &vmbuffer) == vmbits);
```

Right before this function is called:
```
     old_vmbits = visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer);
+    if (identify_and_fix_vm_corruption(vacrel->rel, buf, blkno, page,
+                                       presult.lpdead_items, vmbuffer,
+                                       old_vmbits))
```

So, the Assert() is checking if old_vmbits is newly returned from visibilitymap_get_status(), in that case,
identify_and_fix_vm_corruption()can take vmbits as a pointer , and it calls visibilitymap_get_status() to get vmbits
itselfand returns vmbits via the pointer, so that we don’t need to call visibilitymap_get_status() twice. 

4 - 0004
```
+     * These are only set if the HEAP_PAGE_PRUNE_UPDATE_VM option is set and
+     * we have attempted to update the VM.
+     */
+    uint8        new_vmbits;
+    uint8        old_vmbits;
```

The comment feels a little confusing to me. "HEAP_PAGE_PRUNE_UPDATE_VM option is set” is a clear indication, but how to
decide"we have attempted to update the VM”? By reading the code: 
```
+    prstate->attempt_update_vm =
+        (params->options & HEAP_PAGE_PRUNE_UPDATE_VM) != 0;
```

It’s just the result of HEAP_PAGE_PRUNE_UPDATE_VM being set. So, maybe we don’t the “and” part.

5 - 0004
```
+ * Returns true if one or both VM bits should be set, along with returning the
+ * current value of the VM bits in *old_vmbits and the desired new value of
+ * the VM bits in *new_vmbits.
+ */
+static bool
+heap_page_will_set_vm(PruneState *prstate,
+                      Relation relation,
+                      BlockNumber heap_blk, Buffer heap_buffer, Page heap_page,
+                      Buffer vmbuffer,
+                      int nlpdead_items,
+                      uint8 *old_vmbits,
+                      uint8 *new_vmbits)
+{
+    if (!prstate->attempt_update_vm)
+        return false;
```

old_vmbits and new_vmbits are purely output parameters. So, maybe we should set them to 0 inside this function instead
ofrelying on callers to initialize them. 

I think this is a similar case where I raised a comment earlier about initializing presult to {0} in the callers, and
youonly wanted to set presult in heap_page_prune_and_freeze(). 

6 - 0004
```
@@ -823,13 +975,19 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
                            MultiXactId *new_relmin_mxid)
 {
     Buffer        buffer = params->buffer;
+    Buffer        vmbuffer = params->vmbuffer;
     Page        page = BufferGetPage(buffer);
+    BlockNumber blockno = BufferGetBlockNumber(buffer);
     PruneState    prstate;
     bool        do_freeze;
     bool        do_prune;
     bool        do_hint_prune;
+    bool        do_set_vm;
     bool        did_tuple_hint_fpi;
     int64        fpi_before = pgWalUsage.wal_fpi;
+    uint8        new_vmbits = 0;
+    uint8        old_vmbits = 0;
+

     /* Initialize prstate */
```

Nit: an extra empty line is added.

7 - 0005
```
-     * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples, and
-     * will return 'all_visible', 'all_frozen' flags to the caller.
+     * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples
```

Nit: a tailing dot is needed in the end of the comment line.

8 - 0005
```
@@ -978,6 +1003,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
     Buffer        vmbuffer = params->vmbuffer;
     Page        page = BufferGetPage(buffer);
     BlockNumber blockno = BufferGetBlockNumber(buffer);
+    TransactionId vm_conflict_horizon = InvalidTransactionId;
```

I guess the variable name “vm_conflict_horizon” comes from the old "presult->vm_conflict_horizon”. But in the new
logic,this variable is used more generic, for example Assert(debug_cutoff == vm_conflict_horizon). I see 0006 has
renamedto “conflict_xid”, so it’s up to you if or not rename it. But to make the commit self-contained, I’d suggest
renamingit. 

9 - 0006
```
@@ -3537,6 +3537,7 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
     {
         ItemId        itemid;
         HeapTupleData tuple;
+        TransactionId dead_after = InvalidTransactionId;
```

This initialization seems to not needed, as HeapTupleSatisfiesVacuumHorizon() will always set a value to it.

10 - 0010
```
+                 * there is any snapshot that still consider the newest xid on
```

Nit: consider -> considers

11 - 0011
```
+     * page. If we won't attempt freezing, just unset all-visible now, though.
      */
+    if (!prstate->attempt_freeze)
+    {
+        prstate->all_visible = false;
+        prstate->all_frozen = false;
+    }
```

The comment says “just unset all-visible”, but the code actually also unset all_frozen.

12 - 0012
```
+    /*
+     * RT indexes of relations modified by the query either through
+     * UPDATE/DELETE/INSERT/MERGE or SELECT FOR UPDATE
+     */
+    Bitmapset  *es_modified_relids;
```

As we intentionally only want indexes, does it make sense to just name the field es_modified_rtindexes to make it more
explicit.

13 - 0012
```
+            /* If it has a rowmark, the relation is modified */
+            estate->es_modified_relids = bms_add_member(estate->es_modified_relids,
+                                                        rc->rti);
```

I think this comment is a little misleading, because SELECT FOR UPDATE/SHARE doesn’t always modify tuples of the
relation.If a reader not associating this code with this patch, he may consider the comment is wrong. So, I think we
shouldmake the comment more explicit. Maybe rephrase like “If it has a rowmark, the relation may modify or lock heap
pages”.

14 - 0015 - commit message
```
Setting pd_prune_xid on insert can cause a page to be dirtied and
written out when it previously would not have been, affetcting the
```

Typo: affetcting -> affecting

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

22 декабря 2025 г., 20:57:16

On Mon, Dec 22, 2025 at 2:20 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> A few more comments on v29:

Thanks for the continued review! I've attached v30.

> 1 - 0002 - Looks like since 0002, visibilitymap_set()’s return value is no longer used, so do we need to update the
functionand change return type to void? I remember in some patches, to address Coverity alerts, people had to do
“(void)function_with_a_return_value()”. 

I was torn about whether or not to change the return value. Coverity
doesn't always warn about unused return values. Usually it warns if it
perceives the return value as needed for error checking or if it
thinks not using the return value is incorrect. It may still warn in
this case, but it's not obvious to me which way it would go.

I have changed the function signature as you suggested in v30.

My hesitation is that visibilitymap_set() is in a header file and
could be used by extensions/forks, etc. Adding more information by
changing a return value from void to non-void doesn't have any
negative effect on those potential callers. But taking away a return
value is more likely to affect them in a potentially negative way.

However, I'm significantly changing the signature in this release, so
everybody that used it will have to change their code completely
anyway. Also, I just added a return value for visibilitymap_set() in
the previous release (18). Historically, it returned void. So, I've
gone with your suggestion.

> +static bool
> +identify_and_fix_vm_corruption(Relation rel, Buffer heap_buffer,
> +                                                          BlockNumber heap_blk, Page heap_page,
> +                                                          int nlpdead_items,
> +                                                          Buffer vmbuffer,
> +                                                          uint8 vmbits)
> +{
> +       Assert(visibilitymap_get_status(rel, heap_blk, &vmbuffer) == vmbits);
> ```
>
> Right before this function is called:
> ```
>         old_vmbits = visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer);
> +       if (identify_and_fix_vm_corruption(vacrel->rel, buf, blkno, page,
> +                                                                          presult.lpdead_items, vmbuffer,
> +                                                                          old_vmbits))
> ```
>
> So, the Assert() is checking if old_vmbits is newly returned from visibilitymap_get_status(), in that case,
identify_and_fix_vm_corruption()can take vmbits as a pointer , and it calls visibilitymap_get_status() to get vmbits
itselfand returns vmbits via the pointer, so that we don’t need to call visibilitymap_get_status() twice. 

I see what you are saying, and I did consider this.
visibilitymap_get_status() is only called the second time in assert
builds, and it isn't so expensive to do it that it is worth worrying
about.  I added the assertion to prevent other callers from calling
identify_and_fix_vm_corruption() with random VM bits unassociated with
the vmbuffer passed in.

The reason I don't think identify_and_fix_vm_corruption() should be
the one to call visibilitymap_get_status() and initialize old_vmbits
is that it shouldn't be a required step to setting the VM.
identify_and_fix_vm_corruption()'s job is to identify and fix
corruption -- not get the VM bits for when we set them. In fact, it
may make sense someday to check that the VM and PD_ALL_VISIBLE are in
sync before pruning and freezing is even started. (Of course, we can't
check the number of lpdead items until after).

Regarding having *old_vmbits as a return value. I thought about
directly returning the result of visibilitymap_clear() from
identify_and_fix_vm_corruption(). The reason I didn't is that if
PD_ALL_VISIBLE is set and nlpdead_items > 0 but the VM is clear,
visibilitymap_clear() will return false -- because it didn't need to
clear the VM bits. And I think we want
identify_and_fix_vm_corruption() to return true if it cleared
corruption at all.

I don't think we should have identify_and_fix_vm_corruption() reset
old_vmbits to 0 (and pass it by reference), because the caller may
want to know the value of old_vmbits before we cleared corruption.

> +        * These are only set if the HEAP_PAGE_PRUNE_UPDATE_VM option is set and
> +        * we have attempted to update the VM.
> +        */
> +       uint8           new_vmbits;
> +       uint8           old_vmbits;
> ```
>
> The comment feels a little confusing to me. "HEAP_PAGE_PRUNE_UPDATE_VM option is set” is a clear indication, but how
todecide "we have attempted to update the VM”? By reading the code: 
> ```
> +       prstate->attempt_update_vm =
> +               (params->options & HEAP_PAGE_PRUNE_UPDATE_VM) != 0;
>
> It’s just the result of HEAP_PAGE_PRUNE_UPDATE_VM being set. So, maybe we don’t the “and” part.

Good point. Fixed.

> +static bool
> +heap_page_will_set_vm(PruneState *prstate,
> +                                         Relation relation,
> +                                         BlockNumber heap_blk, Buffer heap_buffer, Page heap_page,
> +                                         Buffer vmbuffer,
> +                                         int nlpdead_items,
> +                                         uint8 *old_vmbits,
> +                                         uint8 *new_vmbits)
> +{
> +       if (!prstate->attempt_update_vm)
> +               return false;
> ```
>
> old_vmbits and new_vmbits are purely output parameters. So, maybe we should set them to 0 inside this function
insteadof relying on callers to initialize them. 
>
> I think this is a similar case where I raised a comment earlier about initializing presult to {0} in the callers, and
youonly wanted to set presult in heap_page_prune_and_freeze(). 

I see your point. It does feel a little bit different to me since they
are local variables and coverity may not actually be able to tell they
are being unconditionally initialized by heap_page_will_set_vm(). The
other local variables that are not initialized at the top are all
unconditionally set by helper return values. But my decision to
initialize them was more instinct than rationality. I've changed it as
you suggested.

> -        * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples, and
> -        * will return 'all_visible', 'all_frozen' flags to the caller.
> +        * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples
>
> Nit: a tailing dot is needed in the end of the comment line.

I've changed it. One interesting thing is that our "policy" for
periods in comments is that we don't put periods at the end of
one-line comments and we do put them at the end of mult-line comment
sentences. This is a one-line comment inside a comment block, so I
wasn't sure what to do. If you noticed it, and it bothered you, it's
easy enough to change, though.

> @@ -978,6 +1003,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
>         Buffer          vmbuffer = params->vmbuffer;
>         Page            page = BufferGetPage(buffer);
>         BlockNumber blockno = BufferGetBlockNumber(buffer);
> +       TransactionId vm_conflict_horizon = InvalidTransactionId;
> ```
>
> I guess the variable name “vm_conflict_horizon” comes from the old "presult->vm_conflict_horizon”. But in the new
logic,this variable is used more generic, for example Assert(debug_cutoff == vm_conflict_horizon). I see 0006 has
renamedto “conflict_xid”, so it’s up to you if or not rename it. But to make the commit self-contained, I’d suggest
renamingit. 

As of this patch, it is still being exclusively used as the conflict
XID for setting the visibility map. And it still is the visibility
horizon. I rename it to conflict xid once it includes more than just
the visibility horizon for an all-visible page. In that assertion, it
is also the visibility horizon for an all-visible page.

> 9 - 0006
>
> @@ -3537,6 +3537,7 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
>         {
>                 ItemId          itemid;
>                 HeapTupleData tuple;
> +               TransactionId dead_after = InvalidTransactionId;
> ```
>
> This initialization seems to not needed, as HeapTupleSatisfiesVacuumHorizon() will always set a value to it.

I think this is a comment for a later patch in the set (you originally
said it was from 0006), but I've changed dead_after to not be
initialized like this.

> +       /*
> +        * RT indexes of relations modified by the query either through
> +        * UPDATE/DELETE/INSERT/MERGE or SELECT FOR UPDATE
> +        */
> +       Bitmapset  *es_modified_relids;
> ```
>
> As we intentionally only want indexes, does it make sense to just name the field es_modified_rtindexes to make it
moreexplicit. 

I'm torn about this. I named it like this partially because the struct
member two above it in the estate, es_unpruned_relids, is also a
bitmapset of range table indexes and yet is called x_relids. Though
the bitmapset is one of indexes into the range table, they are the
indexes of relation IDs in that range table. I think this could go
either way, so I've left it as is for now and will think more about it
once this patch is closer to being committed.

> +                       /* If it has a rowmark, the relation is modified */
> +                       estate->es_modified_relids = bms_add_member(estate->es_modified_relids,
> +
rc->rti);
> ```
>
> I think this comment is a little misleading, because SELECT FOR UPDATE/SHARE doesn’t always modify tuples of the
relation.If a reader not associating this code with this patch, he may consider the comment is wrong. So, I think we
shouldmake the comment more explicit. Maybe rephrase like “If it has a rowmark, the relation may modify or lock heap
pages”.

I see what you are saying. It's a good point. However, the reason we
don't want to set the VM for SELECT FOR UPDATE is not because the
SELECT FOR UPDATE will lock the relation but because it is usually
indicating that we intend to modify the relation (when we do the
update). As such, I've updated the comment to say "If it has a
rowmark, the relation may be modified" -- which leaves it more open.

- Melanie

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

22 декабря 2025 г., 21:20:59

On Sat, Dec 20, 2025 at 7:32 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> Hi! I checked v29-0009, about HeapTupleSatisfiesVacuumHorizon. Origins
> of this code track down to fdf9e21196a6 which was committed as part of
> [0], at which point
> there was no HeapTupleSatisfiesVacuumHorizon function. I guess this is
> the reason this optimization was not performed earlier.

Thanks for taking a look into this!

> I also think this patch is correct, because we do similar things for
> HEAPTUPLE_DEAD & HEAPTUPLE_RECENTLY_DEAD, and
> HeapTupleSatisfiesVacuumHorizon is just a proxy to
> HeapTupleSatisfiesVacuumHorizon with only difference in DEAD VS
> RECENTLY_DEAD handling.
>
> Similar change could be done at heapam_scan_analyze_next_tuple
>
> ...
> case HEAPTUPLE_DEAD:
> case HEAPTUPLE_RECENTLY_DEAD:
> /* Count dead and recently-dead rows */
> *deadrows += 1;
> break;

In v30 sent here [1], I did end up making this change in 0010. I just
realized that I should have also changed
table_scan_analyze_next_tuple() and removed the call to
GetOldestRemovableTransactionId(). I've done that in attached v31.

I'm not sure we should change the table AM API (by removing
OldestXmin), though. I looked for table AMs implementing
scan_analyze_next_tuple() to see if they use OldestXmin. I found two:
OrioleDB [2] and Citus columnar [3], which both implement
scan_analyze_next_tuple() and neither of them use OldestXmin. I
couldn't easily find other table AMs implementing
scan_analyze_next_tuple(). I don't have a strong sense of whether or
not I should make this change. Changing it is churn to a public API
and doesn't specifically enable us to do something.

I could also just leave it unused by heapam's implementation. I
haven't checked what, if any, other table AMs callbacks have
parameters completely unused by their heap implementation.

So, I'm on the fence about whether or not to make the change at all,
and, if I do, whether or not to change the table AM callback. That is
done in v31, though, so we can discuss.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_ZCjHoRPfQ8AbMrFY8TOMCPAvZ0_m9SX7yg0edfTk45-g%40mail.gmail.com
[2]
https://github.com/orioledb/orioledb/blob/acff65984d106dabf708a179e2c6694297e08c02/src/tableam/handler.c#L978C68-L978C78
[3]
https://github.com/citusdata/citus/blob/ee3812d267db3ab007efb6f5f432c82c1f448695/src/backend/columnar/columnar_tableam.c#L1418

Вложения

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Chao Li

Дата:

23 декабря 2025 г., 03:00:57


> On Dec 23, 2025, at 01:57, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Mon, Dec 22, 2025 at 2:20 AM Chao Li <li.evan.chao@gmail.com> wrote:
>>
>> A few more comments on v29:
>
> Thanks for the continued review! I've attached v30.
>
>> 1 - 0002 - Looks like since 0002, visibilitymap_set()’s return value is no longer used, so do we need to update the
functionand change return type to void? I remember in some patches, to address Coverity alerts, people had to do
“(void)function_with_a_return_value()”. 
>
> I was torn about whether or not to change the return value. Coverity
> doesn't always warn about unused return values. Usually it warns if it
> perceives the return value as needed for error checking or if it
> thinks not using the return value is incorrect. It may still warn in
> this case, but it's not obvious to me which way it would go.
>
> I have changed the function signature as you suggested in v30.
>
> My hesitation is that visibilitymap_set() is in a header file and
> could be used by extensions/forks, etc. Adding more information by
> changing a return value from void to non-void doesn't have any
> negative effect on those potential callers. But taking away a return
> value is more likely to affect them in a potentially negative way.
>
> However, I'm significantly changing the signature in this release, so
> everybody that used it will have to change their code completely
> anyway. Also, I just added a return value for visibilitymap_set() in
> the previous release (18). Historically, it returned void. So, I've
> gone with your suggestion.

From a previous patch, I learned from Peter Eisentraut that “We don't care about ABI changes in major releases.”, see:

https://www.postgresql.org/message-id/70913dbd-dadf-4560-9f81-c0df72bf6578%40eisentraut.org

>> -        * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples, and
>> -        * will return 'all_visible', 'all_frozen' flags to the caller.
>> +        * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples
>>
>> Nit: a tailing dot is needed in the end of the comment line.
>
> I've changed it. One interesting thing is that our "policy" for
> periods in comments is that we don't put periods at the end of
> one-line comments and we do put them at the end of mult-line comment
> sentences. This is a one-line comment inside a comment block, so I
> wasn't sure what to do. If you noticed it, and it bothered you, it's
> easy enough to change, though.

If this is a one-line comment, I would have not been caring about the tailing period.

The problem is this is a paragraph of a block comment, and the above and below paragraphs all have tailing periods. So,
forconsistency, I raised the comment. 
```
      * HEAP_PAGE_PRUNE_MARK_UNUSED_NOW indicates that dead items can be set
      * LP_UNUSED during pruning.   <=== Has a tailing period
      *
-     * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples, and
-     * will return 'all_visible', 'all_frozen' flags to the caller.
+     * HEAP_PAGE_PRUNE_FREEZE indicates that we will also freeze tuples <=== Not a tailing period
      *
      * HEAP_PAGE_PRUNE_UPDATE_VM indicates that we will set the page's status
      * in the VM.                                 <=== Has a tailing period
```

>
>> 9 - 0006
>>
>> @@ -3537,6 +3537,7 @@ heap_page_would_be_all_visible(Relation rel, Buffer buf,
>>        {
>>                ItemId          itemid;
>>                HeapTupleData tuple;
>> +               TransactionId dead_after = InvalidTransactionId;
>> ```
>>
>> This initialization seems to not needed, as HeapTupleSatisfiesVacuumHorizon() will always set a value to it.
>
> I think this is a comment for a later patch in the set (you originally
> said it was from 0006), but I've changed dead_after to not be
> initialized like this.

My bad. This comment was actually for 0009. In v31, I see you have removed the initialization to dead_after.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

От

Melanie Plageman

Дата:

23 декабря 2025 г., 04:18:05

On Mon, Dec 22, 2025 at 7:01 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> > On Dec 23, 2025, at 01:57, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > My hesitation is that visibilitymap_set() is in a header file and
> > could be used by extensions/forks, etc. Adding more information by
> > changing a return value from void to non-void doesn't have any
> > negative effect on those potential callers. But taking away a return
> > value is more likely to affect them in a potentially negative way.
> >
> > However, I'm significantly changing the signature in this release, so
> > everybody that used it will have to change their code completely
> > anyway. Also, I just added a return value for visibilitymap_set() in
> > the previous release (18). Historically, it returned void. So, I've
> > gone with your suggestion.
>
> From a previous patch, I learned from Peter Eisentraut that “We don't care about ABI changes in major releases.”,
see:

Right, it is totally okay to change function APIs in a major release.
My point was not that it wasn't allowed but that if people are getting
useful information returned from that function, or if we think we
might want that information again in the future, we should think twice
before changing it. But, in this case, I think we don't need to worry
about it.

- Melanie

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения