Thread: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Hi,

The attached patch set eliminates xl_heap_visible, the WAL record
emitted when a block of the heap is set all-visible/frozen in the
visibility map. Instead, it includes the information needed to update
the VM in the WAL record already emitted by the operation modifying
the heap page.

Currently COPY FREEZE and vacuum are the only operations that set the
VM. So, this patch modifies the xl_heap_multi_insert and xl_heap_prune
records.
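
The shape of the change on the WAL-emit side is roughly the following. This is a simplified sketch, not the literal patch code: XLHP_HAS_VMFLAGS is the flag the prune patch uses for this, and xlrec, vmflags, heap_buffer, and vmbuffer are assumed to exist in the caller.

/*
 * Illustrative sketch only: the WAL record already emitted for the heap
 * modification additionally carries the new VM bits and registers the VM
 * buffer, so no separate xl_heap_visible record is needed.
 */
XLogBeginInsert();

if (vmflags != 0)
    xlrec.flags |= XLHP_HAS_VMFLAGS;

XLogRegisterData((char *) &xlrec, SizeOfHeapPrune);
if (vmflags != 0)
    XLogRegisterData((char *) &vmflags, sizeof(uint8));

XLogRegisterBuffer(0, heap_buffer, REGBUF_STANDARD);
if (vmflags != 0)
    XLogRegisterBuffer(1, vmbuffer, 0); /* VM page needs no full-page image */

recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_PRUNE_VACUUM_SCAN);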

The result is a dramatic reduction in WAL volume for these operations.
I've included numbers below.

I also think that it makes more sense to include changes to the VM in
the same WAL record as the changes that rendered the page all-visible.
In some cases, we will only set the page all-visible, but that is in
the context of the operation on the heap page which discovered that it
was all-visible. Therefore, I find this to be a clarity as well as a
performance improvement.

This project is also the first step toward setting the VM on-access
for queries which do not modify the page. There are a few design
issues that must be sorted out for that project which I will detail
separately. Note that this patch set currently does not implement
setting the VM on-access.

The attached patch set isn't 100% polished. I think some of the
variable names and comments could use work, but I'd like to validate
the idea of doing this before doing a full polish. This is a summary
of what is in the set:

Patches:
0001 - 0002: cleanup
0003 - 0004: refactoring
0005: COPY FREEZE changes
0006: refactoring
0007: vacuum phase III changes
0008: vacuum phase I empty page changes
0009 - 0012: refactoring
0013: vacuum phase I normal page changes
0014: cleanup

Performance benefits of eliminating xl_heap_visible:

vacuum of table with index (DDL at bottom of email)
--
master -> patch
WAL bytes: 405346 -> 303088 = 25% reduction
WAL records: 6682 -> 4459 = 33% reduction

vacuum of table without index
--
master -> patch
WAL records: 4452 -> 2231 = 50% reduction
WAL bytes: 289016 -> 177978 = 38% reduction

COPY FREEZE of table without index
--
master -> patch
WAL records: 3672777 -> 1854589 = 50% reduction
WAL bytes: 841340339 -> 748545732  = 11% reduction (new pages need a
copy of the whole page)

table for vacuum example:
--
create table foo(a int, b numeric, c numeric) with (autovacuum_enabled= false);
insert into foo select i % 18, repeat('1', 400)::numeric, repeat('2',
400)::numeric from generate_series(1,40000)i;
-- don't make index for no-index case
create index on foo(a);
delete from foo where a = 1;
vacuum (verbose, process_toast false) foo;


copy freeze example:
--
-- create a data file
create table large(a int, b int) with (autovacuum_enabled = false,
fillfactor = 10);
insert into large SELECT generate_series(1,40000000)i, 1;
copy large to 'large.data';

-- example
BEGIN;
create table large(a int, b int) with (autovacuum_enabled = false,
fillfactor = 10);
COPY large FROM 'large.data' WITH (FREEZE);
COMMIT;

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
On Mon, Jun 23, 2025 at 4:25 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> The attached patch set eliminates xl_heap_visible, the WAL record
> emitted when a block of the heap is set all-visible/frozen in the
> visibility map. Instead, it includes the information needed to update
> the VM in the WAL record already emitted by the operation modifying
> the heap page.

Rebased in light of recent changes on master:

0001: cleanup
0002: preparatory work
0003: eliminate xl_heap_visible for COPY FREEZE
0004 - 0005: eliminate xl_heap_visible for vacuum's phase III
0006: eliminate xl_heap_visible for vacuum phase I empty pages
0007 - 0010: preparatory refactoring
0011: eliminate xl_heap_visible from vacuum phase I prune/freeze
0012: remove xl_heap_visible

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
On Thu, Jun 26, 2025 at 6:04 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Rebased in light of recent changes on master:

This needed another rebase, and, in light of the discussion in [1],
I've also removed the patch to add heap wrappers for setting pages
all-visible.

More notably, the final patch (0012) in attached v3 allows on-access
pruning to set the VM.

To do this, it plumbs some information down from the executor to the
table scan about whether or not the table is modified by the query. We
don't want to set the VM only to clear it while scanning pages for an
UPDATE or while locking rows in a SELECT FOR UPDATE.

Because we only do on-access pruning when pd_prune_xid is valid, we
shouldn't need much of a heuristic for deciding when to set the VM
on-access -- but I've included one anyway: we only do it if we are
actually pruning or if the page is already dirty and no FPI would be
emitted.
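
The shape of that check is roughly the following -- a simplified sketch with a made-up helper name (vm_set_on_access_worthwhile), not the exact code in the patch:

#include "postgres.h"
#include "access/xloginsert.h"  /* XLogCheckBufferNeedsBackup() */
#include "storage/bufmgr.h"     /* BufferIsDirty() */

/*
 * Sketch of the on-access heuristic: only try to set the VM if we are
 * pruning anyway, or if the page is already dirty and WAL-logging the
 * change would not force a full-page image.
 */
static bool
vm_set_on_access_worthwhile(Buffer buf, bool did_prune)
{
    if (did_prune)
        return true;

    return BufferIsDirty(buf) && !XLogCheckBufferNeedsBackup(buf);
}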

You can see it in action with the following:

create extension pg_visibility;
create table foo (a int, b int) with (autovacuum_enabled=false, fillfactor=90);
insert into foo select generate_series(1,300), generate_series(1,300);
create index on foo (a);
update foo set b = 51 where b = 50;
select * from foo where a = 50;
select * from pg_visibility_map_summary('foo');

The SELECT will set a page all-visible in the VM.
In this patch set, on-access pruning is enabled for sequential scans
and the underlying heap relation in index scans and bitmap heap scans.
This example can exercise any of the three if you toggle
enable_indexscan and enable_bitmapscan appropriately.

From a performance perspective, if you run a trivial pgbench, you can
see far more all-visible pages set in the pgbench_[x] relations with
no noticeable overhead. But, I'm planning to do some performance
experiments to show how this affects our ability to choose index only
scan plans in realistic workloads.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_Yj%3DyrL%2BgGGsqfYVQcYn7rDp6hDeoF1vN453JDp8dEY%2Bw%40mail.gmail.com

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
On Wed, Jul 9, 2025 at 5:59 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Jun 26, 2025 at 6:04 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > Rebased in light of recent changes on master:
>
> This needed another rebase, and, in light of the discussion in [1],
> I've also removed the patch to add heap wrappers for setting pages
> all-visible.

Andrey Borodin made the excellent point off-list that I forgot to
remove the xl_heap_visible struct itself -- which is rather important
to a patch set purporting to eliminate xl_heap_visible! New version
attached.


- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Andrey Borodin

> On 12 Jul 2025, at 03:19, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> remove the xl_heap_visible struct

Same goes for VISIBILITYMAP_XLOG_CATALOG_REL and XLOG_HEAP2_VISIBLE. But please do not rush to remove it, perhaps I
will have a more exhaustive list later. Currently the patch set is expected to be unpolished.
I just need to absorb all effects to have a high-level evaluation of the patch set effect.

I'm still trying to grasp the connection of the first patch (with Assert(prstate->cutoffs)) to the other patches.

Also, I'd prefer "page is not marked all-visible but visibility map bit is set in relation" to emit XX001 for
monitoring reasons, but again, this is a small note, while I need a broader picture.

So far I do not see any general problems in delegating redo work from xl_heap_visible to other records. FWIW I observed
several cases of VM corruption that might be connected to the fact that we log VM changes independently of the data
changes that caused the VM to change. But I have no real evidence or understanding of what happened.


Best regards, Andrey Borodin.


Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
On Sun, Jul 13, 2025 at 2:34 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> > On 12 Jul 2025, at 03:19, Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > remove the xl_heap_visible struct
>
> Same goes for VISIBILITYMAP_XLOG_CATALOG_REL and XLOG_HEAP2_VISIBLE. But please do not rush to remove it, perhaps I
> will have a more exhaustive list later. Currently the patch set is expected to be unpolished.
> I just need to absorb all effects to have a high-level evaluation of the patch set effect.

I actually did remove those if you check the last version posted. I
did notice there is one remaining comment referring to
XLOG_HEAP2_VISIBLE I missed somehow, but the actual enums/macros were
removed already.

> I'm still trying to grasp the connection of the first patch (with Assert(prstate->cutoffs)) to the other patches.

I added this because I noticed that it was used without validating it
was provided in that location. The last patch in the set which sets
the VM on access changes where cutoffs are used, so I noticed what I
felt was a missing assert in master while developing that patch.

> Also, I'd prefer "page is not marked all-visible but visibility map bit is set in relation" to emit XX001 for
> monitoring reasons, but again, this is a small note, while I need a broader picture.

Could you clarify what you mean by this? Are you talking about the
string representation of the visibility map bits in the WAL record
representations in heapdesc.c?

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Andrey Borodin

> On 14 Jul 2025, at 00:15, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
>>
>> Also, I'd prefer "page is not marked all-visible but visibility map bit is set in relation" to emit XX001 for
>> monitoring reasons, but again, this is a small note, while I need a broader picture.
>
> Could you clarify what you mean by this? Are you talking about the
> string representation of the visibility map bits in the WAL record
> representations in heapdesc.c?

This might be a bit off-topic for this thread, but as long as the patch touches that code we can look into this too.

If the VM all-visible bit is set while the page is not all-visible, IndexOnlyScan will show incorrect results. I
observed this inconsistency a few times in production.

Two persistent subsystems (VM and heap) contradict each other, which is why I think this is data corruption. Yes, we
can repair the VM by assuming the heap to be the source of truth in this case. But we must also emit the
ERRCODE_DATA_CORRUPTED XX001 code into the logs. In many cases this will alert the on-call SRE.

To do so I propose replacing elog(WARNING, ...) with ereport(WARNING, (errcode(ERRCODE_DATA_CORRUPTED), ...)).


Best regards, Andrey Borodin.


Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Thanks for continuing to take a look, Andrey.

On Mon, Jul 14, 2025 at 2:37 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> This might be a bit off-topic for this thread, but as long as the patch touches that code we can look into this too.
>
> If the VM all-visible bit is set while the page is not all-visible, IndexOnlyScan will show incorrect results. I
> observed this inconsistency a few times in production.

That's very unfortunate. I wonder what could be causing this. Do you
suspect a bug in Postgres? Or something wrong with the disk, etc?

> Two persistent subsystems (VM and heap) contradict each other, which is why I think this is data corruption. Yes, we
> can repair the VM by assuming the heap to be the source of truth in this case. But we must also emit the
> ERRCODE_DATA_CORRUPTED XX001 code into the logs. In many cases this will alert the on-call SRE.
>
> To do so I propose replacing elog(WARNING, ...) with ereport(WARNING, (errcode(ERRCODE_DATA_CORRUPTED), ...)).

Ah, you mean the warnings currently in lazy_scan_prune(). To me this
suggestion makes sense. I see at least one other example with
ERRCODE_DATA_CORRUPTED at an error level below ERROR.

I have attached a cleaned up and updated version of the patch set (it
doesn't yet include your suggested error message change).


What's new in this version
-----
In addition to general code, comment, and commit message improvements,
notable changes are as follows:

- I have used the GlobalVisState for determining if the whole page is
visible in a more natural way.

- I micro-benchmarked and identified some sources of regression in the
additional work SELECT queries would do to set the VM. So, there are
several new commits addressing these (for example, inlining several
functions and unsetting all-visible when we see a dead tuple if we
won't attempt freezing).

- Because heap_page_prune_and_freeze() was getting long, I added some
helper functions.


Performance impact of setting the VM on-access
-------
I found that with the patch set applied, we set many pages all-visible
in the VM on access, resulting in a higher overall number of pages set
all-visible, reducing load for vacuum, and dramatically decreasing
heap fetches by index-only scans.

I devised a simple benchmark -- with 8 workers inserting 20 rows at a
time into a table with a few columns and updating a single row that
they just inserted. Another worker queries the table once per second
using an index.

After running the benchmark for a few minutes, though the table was
autovacuumed several times in both cases, with the patchset applied,
15% more blocks were all-visible at the end of the benchmark.

And with my patch applied, index-only scans did far fewer heap
fetches. A SELECT count(*) of the table at the same point in the
benchmark did 10,000 heap fetches on master and 500 with the patch
applied (I used auto_explain to determine this).

With my patch applied, autovacuum workers write half as much WAL as on
master. Some of this is courtesy of other patches in the set which
eliminate separate WAL records for setting the page all-visible. But,
vacuum is also scanning fewer pages and dirtying fewer buffers because
they are being set all-visible on-access.

There are more details about the benchmark at the end of the email.


Setting pd_prune_xid on insert
------
The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
patch in the set. It sets pd_prune_xid on insert (so pages filled by
COPY or insert can also be set all-visible in the VM before they are
vacuumed). I gave it a .txt extension because it currently fails
035_standby_logical_decoding due to a recovery conflict. I need to
investigate more to see if this is a bug in my patch set or elsewhere
in Postgres.
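
In essence, the change is to hint the page with the inserting XID once the tuple has been placed, so that on-access pruning will revisit the page later. A simplified sketch (eliding the multi-insert path and WAL details; page and xid are as in heap_insert()):

/*
 * Simplified sketch: after placing the tuple, record the inserting XID in
 * pd_prune_xid so that on-access code reconsiders the page later, e.g. to
 * set it all-visible in the VM once the inserting transaction is visible
 * to everyone.
 */
PageSetPrunable(page, xid);     /* xid = inserting transaction's XID */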

Besides the failing test, I have a feeling that my current heuristic
for whether or not to set the VM on-access is not quite right for
pages that have only been inserted to -- and if we get it wrong, we've
wasted those CPU cycles because we didn't otherwise need to prune the
page.


- Melanie


Benchmark
-------
psql -c "
DROP TABLE IF EXISTS simple_table;

CREATE TABLE simple_table (
    id SERIAL PRIMARY KEY,
    group_id INT NOT NULL,
    data TEXT,
    created_at TIMESTAMPTZ DEFAULT now()
);

create index on simple_table(group_id);
"

pgbench \
  --no-vacuum \
  --random-seed=0 \
  -c 8 \
  -j 8 \
  -M prepared \
  -T 200 \
  > "pgbench_run_summary_update_${version}" \
-f- <<EOF &
\set gid random(1,1000)

INSERT INTO simple_table (group_id, data)
  SELECT :gid, 'inserted'
  RETURNING id \gset

update simple_table set data = 'updated' where id = :id;

insert into simple_table (group_id, data)
  select :gid, 'inserted'
  from generate_series(1,20);
EOF
insert_pid=$!

pgbench \
  --no-vacuum \
  --random-seed=0 \
  -c 1 \
  -j 1 \
  --rate=1 \
  -M prepared \
  -T 200 \
  > "pgbench_run_summary_select_${version}" \
-f- <<EOF
\set gid random(1, 1000)
select max(created_at) from simple_table where group_id = :gid;
select count(*) from simple_table where group_id = :gid;
EOF

wait $insert_pid

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> patch in the set. It sets pd_prune_xid on insert (so pages filled by
> COPY or insert can also be set all-visible in the VM before they are
> vacuumed). I gave it a .txt extension because it currently fails
> 035_standby_logical_decoding due to a recovery conflict. I need to
> investigate more to see if this is a bug in my patch set or elsewhere
> in Postgres.

I figured out that if we set the VM on-access, we need to enable
hot_standby_feedback in more places in 035_standby_logical_decoding.pl
to avoid recovery conflicts. I've done that in the attached updated
version 6. There are a few other issues in
035_standby_logical_decoding.pl that I reported here [1]. With these
changes, setting pd_prune_xid on insert passes tests. Whether or not
we want to do it (and what the heuristic should be for deciding when
to do it) is another question.

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
On Sat, 2 Aug 2025 at 02:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> > patch in the set. It sets pd_prune_xid on insert (so pages filled by
> > COPY or insert can also be set all-visible in the VM before they are
> > vacuumed). I gave it a .txt extension because it currently fails
> > 035_standby_logical_decoding due to a recovery conflict. I need to
> > investigate more to see if this is a bug in my patch set or elsewhere
> > in Postgres.
>
> I figured out that if we set the VM on-access, we need to enable
> hot_standby_feedback in more places in 035_standby_logical_decoding.pl
> to avoid recovery conflicts. I've done that in the attached updated
> version 6. There are a few other issues in
> 035_standby_logical_decoding.pl that I reported here [1]. With these
> changes, setting pd_prune_xid on insert passes tests. Whether or not
> we want to do it (and what the heuristic should be for deciding when
> to do it) is another question.
>
> - Melanie
>
> [1]
https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com

Hi!

Andrey told me off-list about this thread and I decided to take a look.

I tried to play with each patch in this patchset and find a
corruption, but I was unsuccessful. I will conduct further tests
later. I am not implying that I suspect this patchset causes any
corruption; I am merely attempting to verify it.

I also have a few comments and questions. Here is my (very limited)
review of 0001, because I believe that removing xl_heap_visible from
COPY FREEZE is a pure win, so this patch can be very beneficial by
itself.

visibilitymap_set_vmbyte is introduced in 0001 and removed in 0012.
This is strange to me; maybe we can avoid visibilitymap_set_vmbyte in
the first place?

In 0001:

1)
should we add "Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
LW_EXCLUSIVE));" in visibilitymap_set_vmbyte?

Also here  `Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer),
vmbuffer));` can be beneficial:

>/*
>+ * If we're only adding already frozen rows to a previously empty
>+ * page, mark it as all-frozen and update the visibility map. We're
>+ * already holding a pin on the vmbuffer.
>+ */
>   else if (all_frozen_set)
>+ {
>    PageSetAllVisible(page);
>+ LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
>+ visibilitymap_set_vmbyte(relation,
>+ BufferGetBlockNumber(buffer),
>+ vmbuffer,
>+ VISIBILITYMAP_ALL_VISIBLE |
>+ VISIBILITYMAP_ALL_FROZEN);
>+ }

2)
in heap_xlog_multi_insert:

+
+ visibilitymap_pin(reln, blkno, &vmbuffer);
+ visibilitymap_set_vmbyte(....)

Do we need to pin vmbuffer here? Looks like
XLogReadBufferForRedoExtended already pins vmbuffer. I verified this
with CheckBufferIsPinnedOnce(vmbuffer) just before visibilitymap_pin
and COPY ... WITH (FREEZE true) test.

3)
>+
> +#ifdef TRACE_VISIBILITYMAP
> + elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
> +#endif

I can see this merely copy-pasted from visibilitymap_set, but maybe
display "flags" also?

4) visibilitymap_set receives an XLogRecPtr recptr parameter, which is
set to the WAL record LSN during recovery and to InvalidXLogRecPtr
otherwise. visibilitymap_set manages the VM page LSN based on this
recptr value (inside the function logic). visibilitymap_set_vmbyte
behaves vice-versa and makes its caller responsible for page LSN
management. Maybe we should keep these two functions akin to each other?


--
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
On Sat, 2 Aug 2025 at 02:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> > patch in the set. It sets pd_prune_xid on insert (so pages filled by
> > COPY or insert can also be set all-visible in the VM before they are
> > vacuumed). I gave it a .txt extension because it currently fails
> > 035_standby_logical_decoding due to a recovery conflict. I need to
> > investigate more to see if this is a bug in my patch set or elsewhere
> > in Postgres.
>
> I figured out that if we set the VM on-access, we need to enable
> hot_standby_feedback in more places in 035_standby_logical_decoding.pl
> to avoid recovery conflicts. I've done that in the attached updated
> version 6. There are a few other issues in
> 035_standby_logical_decoding.pl that I reported here [1]. With these
> changes, setting pd_prune_xid on insert passes tests. Whether or not
> we want to do it (and what the heuristic should be for deciding when
> to do it) is another question.
>
> - Melanie
>
> [1]
https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com


0002 No comments from me. Looks straightforward.

A few comments on 0003.

1) This patch introduces XLHP_HAS_VMFLAGS. However it lacks some
helpful comments about this new status bit.

a) In heapam_xlog.h, in xl_heap_prune struct definition:


/*
* If XLHP_HAS_CONFLICT_HORIZON is set, the conflict horizon XID follows,
* unaligned
*/
+ /* If XLHP_HAS_VMFLAGS is set, newly set visibility map bits comes,
unaligned */

b)

We can add a comment here explaining why we use memcpy assignment, akin to
/* memcpy() because snapshot_conflict_horizon is stored unaligned */

+ /* Next are the optionally included vmflags. Copy them out for later use. */
+ if ((xlrec.flags & XLHP_HAS_VMFLAGS) != 0)
+ {
+ memcpy(&vmflags, maindataptr, sizeof(uint8));
+ maindataptr += sizeof(uint8);

2) Should we move conflict_xid = visibility_cutoff_xid; assignment
just after heap_page_is_all_visible_except_lpdead call in
lazy_vacuum_heap_page?

3) Looking at this diff, I do not comprehend one bit: how are we
protected from passing an all-visible page to lazy_vacuum_heap_page()? I
did not manage to reproduce such behaviour, though.

+ if ((vmflags & VISIBILITYMAP_VALID_BITS) != 0)
+ {
+ Assert(!PageIsAllVisible(page));
+ set_pd_all_vis = true;
+ LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
+ PageSetAllVisible(page);
+ visibilitymap_set_vmbyte(vacrel->rel,
+ blkno,
+



--
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
On Sat, 2 Aug 2025 at 02:36, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Jul 31, 2025 at 6:58 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > The patch "Set-pd_prune_xid-on-insert.txt" can be applied as the last
> > patch in the set. It sets pd_prune_xid on insert (so pages filled by
> > COPY or insert can also be set all-visible in the VM before they are
> > vacuumed). I gave it a .txt extension because it currently fails
> > 035_standby_logical_decoding due to a recovery conflict. I need to
> > investigate more to see if this is a bug in my patch set or elsewhere
> > in Postgres.
>
> I figured out that if we set the VM on-access, we need to enable
> hot_standby_feedback in more places in 035_standby_logical_decoding.pl
> to avoid recovery conflicts. I've done that in the attached updated
> version 6. There are a few other issues in
> 035_standby_logical_decoding.pl that I reported here [1]. With these
> changes, setting pd_prune_xid on insert passes tests. Whether or not
> we want to do it (and what the heuristic should be for deciding when
> to do it) is another question.
>
> - Melanie
>
> [1]
https://www.postgresql.org/message-id/flat/CAAKRu_YO2mEm%3DZWZKPjTMU%3DgW5Y83_KMi_1cr51JwavH0ctd7w%40mail.gmail.com

v6-0015:
I chose to verify whether this single modification would be beneficial
on the HEAD.

Benchmark I did:

```

\timing
CREATE TABLE zz(i int);
alter table zz set (autovacuum_enabled = false);
TRUNCATE zz;
copy zz from program 'yes 2 | head -n 180000000';
copy zz from program 'yes 2 | head -n 180000000';

delete from zz where (REPLACE(REPLACE(ctid::text, '(', '{'), ')',
'}')::int[])[2] = 7 ;

VACUUM FREEZE zz;
```

And I checked perf top footprint for last statement (vacuum).  My
detailed results are attached. It is a HEAD vs HEAD+v6-0015 benchmark.

TLDR: function inlining is indeed beneficial; the TransactionIdPrecedes
function disappears from the perf top footprint, though query runtime is
not changed much. So, while not resulting in a query speedup, this can
save CPU.
Maybe we can derive an artificial benchmark which will show a query
speedup, but for now I don't have one.

--
Best regards,
Kirill Reshke

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Thanks for all the reviews. I'm working on responding to your previous
mails with a new version.

On Wed, Aug 27, 2025 at 8:55 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> v6-0015:
> I chose to verify whether this single modification would be beneficial
> on the HEAD.
>
> Benchmark I did:
>
> ```
>
> \timing
> CREATE TABLE zz(i int);
> alter table zz set (autovacuum_enabled = false);
> TRUNCATE zz;
> copy zz from program 'yes 2 | head -n 180000000';
> copy zz from program 'yes 2 | head -n 180000000';
>
> delete from zz where (REPLACE(REPLACE(ctid::text, '(', '{'), ')',
> '}')::int[])[2] = 7 ;
>
> VACUUM FREEZE zz;
> ```
>
> And I checked perf top footprint for last statement (vacuum).  My
> detailed results are attached. It is a HEAD vs HEAD+v6-0015 benchmark.
>
> TLDR: function inlining is indeed beneficial; the TransactionIdPrecedes
> function disappears from the perf top footprint, though query runtime is
> not changed much. So, while not resulting in a query speedup, this can
> save CPU.
> Maybe we can derive an artificial benchmark which will show a query
> speedup, but for now I don't have one.

I'm not surprised that vacuum freeze does not show a speed up from the
function inlining.

This patch was key for avoiding a regression in the most contrived
worst-case example of setting the VM on-access. That is, if you are
pruning only a single tuple on the page as part of a SELECT query that
returns no tuples (think SELECT * FROM foo OFFSET N, where N is greater
than the number of rows in the table), and I add code to determine
whether the page is all-visible, then the overhead of these extra
function calls in heap_prune_record_unchanged_lp_normal() is
noticeable.

We might be able to come up with a similar example in vacuum without
freeze since it will try to determine if the page is all-visible. Your
example is still running on my machine, though, so I haven't verified
this yet :)

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Thanks for the review! Updates are in attached v7.

One note on 0022 in the set, which sets pd_prune_xid on insert: the
recently added index-killtuples isolation test was failing with this
patch applied. With the patch, the "access" step reports more heap
page hits than before. After some analysis, it seems one of the heap
pages in kill_prior_tuples table is now being pruned in an earlier
step. Somehow this leads to 4 hits counted instead of 3 (even though
there are only 4 blocks in the relation). I recall Bertrand mentioning
something in some other thread about hits being double counted with
AIO reads, so I'm going to try and go dig that up. But, for now, I've
modified the test -- I believe the patch is only revealing an issue
with instrumentation, not causing a bug.

On Tue, Aug 26, 2025 at 5:58 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> visibilitymap_set_vmbyte is introduced in 0001 and removed in 0012.
> This is strange to me; maybe we can avoid visibilitymap_set_vmbyte in
> the first place?

The reason for this is that in the earlier patch I introduce
visibilitymap_set_vmbyte() for one user while other users still use
visibilitymap_set(). But, by the end of the set, all users use
visibilitymap_set_vmbyte(). So I think it makes most sense for it to
be named visibilitymap_set() at that point. Until all users are
committed, the two functions both have to exist and need different
names.

> In 0001:
> should we add "Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
> LW_EXCLUSIVE));" in visibilitymap_set_vmbyte?

I don't want any operations on the heap buffer (including asserts) in
visibilitymap_set_vmbyte(). The heap block is only provided to look up
the VM bits.

I think your idea is a good one for the existing visibilitymap_set(),
though, so I've added a new patch to the set (0002) which does this. I
also added a similar assertion for the vmbuffer to
visibilitymap_set_vmbyte().

> Also here  `Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer),
> vmbuffer));` can be beneficial:

I had omitted this because the same logic is checked inside of
visibilitymap_set_vmbyte() with an error occurring if conditions are
not met. However, since the same is true in visibilitymap_set() and
heap_multi_insert() still asserted visibilitymap_pin_ok(), I've added
it back to my patch set.

> in heap_xlog_multi_insert:
> +
> + visibilitymap_pin(reln, blkno, &vmbuffer);
> + visibilitymap_set_vmbyte(....)
>
> Do we need to pin vmbuffer here? Looks like
> XLogReadBufferForRedoExtended already pins vmbuffer. I verified this
> with CheckBufferIsPinnedOnce(vmbuffer) just before visibilitymap_pin
> and COPY ... WITH (FREEZE true) test.

I thought the reason visibilitymap_set() did it was that it was
possible for the block of the VM corresponding to the block of the
heap to be different during recovery than it was when emitting the
record, and thus we needed the part of visibilitymap_pin() that
released the old vmbuffer and got the new one corresponding to the
heap block.

I can't quite think of how this could happen though.

Assuming it can't happen, we can get rid of visibilitymap_pin()
(and add visibilitymap_pin_ok()) in both visibilitymap_set_vmbyte() and
visibilitymap_set(). I've done this to visibilitymap_set() in a
separate patch, 0001. I would like other opinions/confirmation that the
block of the VM corresponding to the heap block cannot differ during
recovery from what it was when the record was emitted during
normal operation, though.

> > +#ifdef TRACE_VISIBILITYMAP
> > + elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
> > +#endif
>
> I can see this merely copy-pasted from visibilitymap_set, but maybe
> display "flags" also?

Done in attached.

> 4) visibilitymap_set receives an XLogRecPtr recptr parameter, which is
> set to the WAL record LSN during recovery and to InvalidXLogRecPtr
> otherwise. visibilitymap_set manages the VM page LSN based on this
> recptr value (inside the function logic). visibilitymap_set_vmbyte
> behaves vice-versa and makes its caller responsible for page LSN
> management. Maybe we should keep these two functions akin to each other?

So, with visibilitymap_set_vmbyte(), the whole idea is to just update
the VM and then leave the WAL logging and other changes to the caller
(like marking the buffer dirty, setting the page LSN, etc). The series
of operations needed to make a persistent change is up to the caller.
visibilitymap_set() is meant to just make sure that the correct bits
in the VM are set for the given heap block.
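
So a caller ends up looking roughly like this (simplified sketch; log_heap_and_vm_changes() is just a stand-in for whatever WAL record the caller actually emits, and rel/heapbuf/vmbuffer come from the caller's context):

/*
 * Sketch of a visibilitymap_set_vmbyte() caller: the function only flips
 * the bits; dirtying, WAL logging, and LSN maintenance stay with the caller.
 */
LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);

START_CRIT_SECTION();

/* heap-side change that renders the page all-visible */
PageSetAllVisible(BufferGetPage(heapbuf));
MarkBufferDirty(heapbuf);

/* VM-side change */
visibilitymap_set_vmbyte(rel, BufferGetBlockNumber(heapbuf), vmbuffer,
                         VISIBILITYMAP_ALL_VISIBLE);
MarkBufferDirty(vmbuffer);

if (RelationNeedsWAL(rel))
{
    /* one record covering both the heap page and the VM page */
    XLogRecPtr  recptr = log_heap_and_vm_changes(rel, heapbuf, vmbuffer);

    PageSetLSN(BufferGetPage(heapbuf), recptr);
    PageSetLSN(BufferGetPage(vmbuffer), recptr);
}

END_CRIT_SECTION();

LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);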

I looked at ways of making the current visibilitymap_set() API cleaner
-- like setting the heap page LSN with the VM recptr in the caller of
visibilitymap_set() instead. There wasn't a way of doing it that
seemed like enough of an improvement to merit the change.

Not to mention, the goal of the patchset is to remove the current
visibilitymap_set(), so I'm not too worried about parity between the
two functions. They may coexist for a while, but hopefully today's
visibilitymap_set() will eventually be removed.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
On Tue, Aug 26, 2025 at 4:01 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> Few comments on 0003.
>
> 1) This patch introduces XLHP_HAS_VMFLAGS. However it lacks some
> helpful comments about this new status bit.

I added the ones you suggested in my v7 posted here [1].

> 2) Should we move conflict_xid = visibility_cutoff_xid; assignment
> just after heap_page_is_all_visible_except_lpdead call in
> lazy_vacuum_heap_page?

Why would we want to do that? We only want to set it if the page is
all visible, so we would have to guard it similarly.

> 3) Looking at this diff, I do not comprehend one bit: how are we
> protected from passing an all-visible page to lazy_vacuum_heap_page()? I
> did not manage to reproduce such behaviour, though.
>
> + if ((vmflags & VISIBILITYMAP_VALID_BITS) != 0)
> + {
> + Assert(!PageIsAllVisible(page));
> + set_pd_all_vis = true;
> + LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> + PageSetAllVisible(page);
> + visibilitymap_set_vmbyte(vacrel->rel,
> + blkno,

So, for one, there is an assert just above this code in
lazy_vacuum_heap_page() that nunused > 0 -- so we know that the page
couldn't have been all-visible already because it had unused line
pointers.

Otherwise, if it was possible for an already all-visible page to get
here, the same thing would happen that happens on master --
heap_page_is_all_visible[_except_lpdead()] would return true and we
would try to set the VM which would end up being a no-op.

- Melanie

[1] https://www.postgresql.org/message-id/CAAKRu_YD0ecXeAh%2BDmJpzQOJwcRzmMyGdcc5W_0pEF78rYSJkQ%40mail.gmail.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
On Thu, 28 Aug 2025 at 00:02, Melanie Plageman
<melanieplageman@gmail.com> wrote:

> > Do we need to pin vmbuffer here? Looks like
> > XLogReadBufferForRedoExtended already pins vmbuffer. I verified this
> > with CheckBufferIsPinnedOnce(vmbuffer) just before visibilitymap_pin
> > and COPY ... WITH (FREEZE true) test.
>
> I thought the reason visibilitymap_set() did it was that it was
> possible for the block of the VM corresponding to the block of the
> heap to be different during recovery than it was when emitting the
> record, and thus we needed the part of visiblitymap_pin() that
> released the old vmbuffer and got the new one corresponding to the
> heap block.
>
> I can't quite think of how this could happen though.
>
> Assuming it can't happen, then we can get rid of visiblitymap_pin()
> (and add visibilitymap_pin_ok()) in both visiblitymap_set_vmbyte() and
> visibilitymap_set(). I've done this to visibilitymap_set() in a
> separate patch 0001. I would like other opinions/confirmation that the
> block of the VM corresponding to the heap block cannot differ during
> recovery from that what it was when the record was emitted during
> normal operation, though.

I did micro git-blame research here. I spotted only one related change
[0]. Looks like before this change pin was indeed needed.
But not after this change, so this visibilitymap_pin is just an oversight?
Related thread is [1]. I quickly checked the discussion in this
thread, and it looks like no one was bothered about these lines or VM
logging changes (in this exact pin buffer aspect). The discussion was
of other aspects of this commit.

[0] https://github.com/postgres/postgres/commit/2c03216d8311
[1] https://www.postgresql.org/message-id/533D6CBF.6080203%40vmware.com


-- 
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
On Thu, Aug 28, 2025 at 5:12 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> I did micro git-blame research here. I spotted only one related change
> [0]. Looks like before this change pin was indeed needed.
> But not after this change, so this visibilitymap_pin is just an oversight?
> Related thread is [1]. I quickly checked the discussion in this
> thread, and it looks like no one was bothered about these lines or VM
> logging changes (in this exact pin buffer aspect). The discussion was
> of other aspects of this commit.

Wow, thanks so much for doing that research. Looking at it myself, it
does indeed seem like just an oversight. It isn't harmful since it
won't take another pin, but it is confusing, so I think we should at
least remove it in master. I'm not as sure about back branches.

I would like someone to confirm that there is no way we could end up
with a different block of the VM containing the vm bits for a heap
block during recovery than during normal operation.

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
On Tue, Sep 2, 2025 at 5:52 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Aug 28, 2025 at 5:12 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > I did micro git-blame research here. I spotted only one related change
> > [0]. Looks like before this change pin was indeed needed.
> > But not after this change, so this visibilitymap_pin is just an oversight?
> > Related thread is [1]. I quickly checked the discussion in this
> > thread, and it looks like no one was bothered about these lines or VM
> > logging changes (in this exact pin buffer aspect). The discussion was
> > of other aspects of this commit.
>
> Wow, thanks so much for doing that research. Looking at it myself, it
> does indeed seem like just an oversight. It isn't harmful since it
> won't take another pin, but it is confusing, so I think we should at
> least remove it in master. I'm not as sure about back branches.

I've updated the commit message in the patch set to reflect the
research you did in attached v8.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Andres Freund
Hi,

On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 27 Aug 2025 08:50:15 -0400
> Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay

LGTM.


> From 7c5cb3edf89735eaa8bee9ca46111bd6c554720b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 27 Aug 2025 10:07:29 -0400
> Subject: [PATCH v8 02/22] Add assert and log message to visibilitymap_set

LGTM.


> From 07f31099754636ec9dabf6cca06c33c4b19c230c Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 17 Jun 2025 17:22:10 -0400
> Subject: [PATCH v8 03/22] Eliminate xl_heap_visible in COPY FREEZE
>
> Instead of emitting a separate WAL record for setting the VM bits in
> xl_heap_visible, specify the changes to make to the VM block in the
> xl_heap_multi_insert record instead.
>
> Author: Melanie Plageman <melanieplageman@gmail.com>
> Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com


> +        /*
> +         * If we're only adding already frozen rows to a previously empty
> +         * page, mark it as all-frozen and update the visibility map. We're
> +         * already holding a pin on the vmbuffer.
> +         */
>          else if (all_frozen_set)
> +        {
> +            Assert(visibilitymap_pin_ok(BufferGetBlockNumber(buffer), vmbuffer));
>              PageSetAllVisible(page);
> +            LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
> +            visibilitymap_set_vmbyte(relation,
> +                                     BufferGetBlockNumber(buffer),
> +                                     vmbuffer,
> +                                     VISIBILITYMAP_ALL_VISIBLE |
> +                                     VISIBILITYMAP_ALL_FROZEN);
> +        }

From an abstraction POV I don't love that heapam now is responsible for
acquiring and releasing the lock. But that ship already kind of has sailed, as
heapam.c is already responsible for releasing the vm buffer etc...

I've wondered about splitting the responsibilities up into multiple
visibilitymap_set_* functions, so that heapam.c wouldn't need to acquire the
lock and set the LSN. But it's probably not worth it.


> +    /*
> +     * Now read and update the VM block. Even if we skipped updating the heap
> +     * page due to the file being dropped or truncated later in recovery, it's
> +     * still safe to update the visibility map.  Any WAL record that clears
> +     * the visibility map bit does so before checking the page LSN, so any
> +     * bits that need to be cleared will still be cleared.
> +     *
> +     * It is only okay to set the VM bits without holding the heap page lock
> +     * because we can expect no other writers of this page.
> +     */
> +    if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> +        XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> +                                      &vmbuffer) == BLK_NEEDS_REDO)
> +    {
> +        Relation    reln = CreateFakeRelcacheEntry(rlocator);
> +
> +        Assert(visibilitymap_pin_ok(blkno, vmbuffer));
> +        visibilitymap_set_vmbyte(reln, blkno,
> +                                 vmbuffer,
> +                                 VISIBILITYMAP_ALL_VISIBLE |
> +                                 VISIBILITYMAP_ALL_FROZEN);
> +
> +        /*
> +         * It is not possible that the VM was already set for this heap page,
> +         * so the vmbuffer must have been modified and marked dirty.
> +         */
> +        Assert(BufferIsDirty(vmbuffer));

How about making visibilitymap_set_vmbyte() return whether it needed to do
something? This seems somewhat indirect...

I think it might be good to encapsulate this code into a helper in
visibilitymap.c, there will be more callers in the subsequent patches.


> +/*
> + * Set flags in the VM block contained in the passed in vmBuf.
> + *
> + * This function is for callers which include the VM changes in the same WAL
> + * record as the modifications of the heap page which rendered it all-visible.
> + * Callers separately logging the VM changes should invoke visibilitymap_set()
> + * instead.
> + *
> + * Caller must have pinned and exclusive locked the correct block of the VM in
> + * vmBuf. This block should contain the VM bits for the given heapBlk.
> + *
> + * During normal operation (i.e. not recovery), this should be called in a
> + * critical section which also makes any necessary changes to the heap page
> + * and, if relevant, emits WAL.
> + *
> + * Caller is responsible for WAL logging the changes to the VM buffer and for
> + * making any changes needed to the associated heap page. This includes
> + * maintaining any invariants such as ensuring the buffer containing heapBlk
> + * is pinned and exclusive locked.
> + */
> +uint8
> +visibilitymap_set_vmbyte(Relation rel, BlockNumber heapBlk,
> +                         Buffer vmBuf, uint8 flags)

Why is it named vmbyte? This actually just sets the two bits corresponding to
the buffer, not the entire byte. So it seems somewhat misleading to reference
byte.




> From dc318358572f61efbd0e05aae2b9a077b422bcf5 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Wed, 18 Jun 2025 12:42:13 -0400
> Subject: [PATCH v8 05/22] Eliminate xl_heap_visible from vacuum phase III
>
> Instead of emitting a separate xl_heap_visible record for each page that
> is rendered all-visible by vacuum's third phase, include the updates to
> the VM in the already emitted xl_heap_prune record.

Reading through the change I didn't particularly like that there's another
optional field in xl_heap_prune, as it seemed like something that should be
encoded in flags.  Of course there aren't enough flag bits available.  But
that made me look at the rest of the record: Uh, what do we use the reason
field for?  As far as I can tell f83d709760d8 added it without introducing any
users? It doesn't even seem to be set.


> @@ -51,10 +52,15 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>             (xlrec.flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
>
>      /*
> -     * We are about to remove and/or freeze tuples.  In Hot Standby mode,
> -     * ensure that there are no queries running for which the removed tuples
> -     * are still visible or which still consider the frozen xids as running.
> -     * The conflict horizon XID comes after xl_heap_prune.
> +     * After xl_heap_prune is the optional snapshot conflict horizon.
> +     *
> +     * In Hot Standby mode, we must ensure that there are no running queries
> +     * which would conflict with the changes in this record. If pruning, that
> +     * means we cannot remove tuples still visible to transactions on the
> +     * standby. If freezing, that means we cannot freeze tuples with xids that
> +     * are still considered running on the standby. And for setting the VM, we
> +     * cannot do so if the page isn't all-visible to all transactions on the
> +     * standby.
>       */

I'm a bit confused by this new comment - it sounds like we're deciding whether
to remove tuple versions, but that decision has long been made, no?



> @@ -2846,8 +2848,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>      OffsetNumber unused[MaxHeapTuplesPerPage];
>      int            nunused = 0;
>      TransactionId visibility_cutoff_xid;
> +    TransactionId conflict_xid = InvalidTransactionId;
>      bool        all_frozen;
>      LVSavedErrInfo saved_err_info;
> +    uint8        vmflags = 0;
> +    bool        set_pd_all_vis = false;
>
>      Assert(vacrel->do_index_vacuuming);
>
> @@ -2858,6 +2863,20 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
>                               VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno,
>                               InvalidOffsetNumber);
>
> +    if (heap_page_is_all_visible_except_lpdead(vacrel->rel, buffer,
> +                                               vacrel->cutoffs.OldestXmin,
> +                                               deadoffsets, num_offsets,
> +                                               &all_frozen, &visibility_cutoff_xid,
> +                                               &vacrel->offnum))
> +    {
> +        vmflags |= VISIBILITYMAP_ALL_VISIBLE;
> +        if (all_frozen)
> +        {
> +            vmflags |= VISIBILITYMAP_ALL_FROZEN;
> +            Assert(!TransactionIdIsValid(visibility_cutoff_xid));
> +        }
> +    }
> +
>      START_CRIT_SECTION();

I am rather confused - we never can set all-visible if there are any LP_DEAD
items left. If the idea is that we are removing the LP_DEAD items in
lazy_vacuum_heap_page() - what guarantees that all LP_DEAD items are being
removed? Couldn't some tuples get marked LP_DEAD by on-access pruning, after
vacuum visited the page and collected dead items?

Ugh, I see - it works because we pass in the set of dead items.  I think that
makes the name *really* misleading, it's not except LP_DEAD, it's except the
offsets passed in, no?

But then you actually check that the set of dead items didn't change - what
guarantees that?


I didn't look at the later patches, except that I did notice this:

> @@ -268,7 +264,7 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          Relation    reln = CreateFakeRelcacheEntry(rlocator);
>
>          visibilitymap_pin(reln, blkno, &vmbuffer);
> -        old_vmbits = visibilitymap_set_vmbyte(reln, blkno, vmbuffer, vmflags);
> +        old_vmbits = visibilitymap_set(reln, blkno, vmbuffer, vmflags);
>          /* Only set VM page LSN if we modified the page */
>          if (old_vmbits != vmflags)
>              PageSetLSN(BufferGetPage(vmbuffer), lsn);
> @@ -279,143 +275,6 @@ heap_xlog_prune_freeze(XLogReaderState *record)
>          UnlockReleaseBuffer(vmbuffer);
>  }

Why are we manually pinning the vm buffer here? Shouldn't the xlog machinery
have done so, as you noticed in one of the early on patches?

Greetings,

Andres Freund



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Kirill Reshke
On Wed, 3 Sept 2025 at 04:11, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Tue, Sep 2, 2025 at 5:52 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Thu, Aug 28, 2025 at 5:12 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > >
> > > I did micro git-blame research here. I spotted only one related change
> > > [0]. Looks like before this change pin was indeed needed.
> > > But not after this change, so this visibilitymap_pin is just an oversight?
> > > Related thread is [1]. I quickly checked the discussion in this
> > > thread, and it looks like no one was bothered about these lines or VM
> > > logging changes (in this exact pin buffer aspect). The discussion was
> > > of other aspects of this commit.
> >
> > Wow, thanks so much for doing that research. Looking at it myself, it
> > does indeed seem like just an oversight. It isn't harmful since it
> > won't take another pin, but it is confusing, so I think we should at
> > least remove it in master. I'm not as sure about back branches.
>
> I've updated the commit message in the patch set to reflect the
> research you did in attached v8.
>
> - Melanie



Hi!

Small comments regarding the new series:

0001, 0002, 0017 LGTM


In 0015:

```
reshke@yezzey-cbdb-bench:~/postgres$ git diff src/backend/access/heap/pruneheap.c
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 05b51bd8d25..0794af9ae89 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -1398,7 +1398,7 @@ heap_prune_record_unchanged_lp_normal(Page page, PruneState *prstate, OffsetNumb
                                /*
                                 * For now always use prstate->cutoffs for this test, because
                                 * we only update 'all_visible' when freezing is requested. We
-                                * could use GlobalVisTestIsRemovableXid instead, if a
+                                * could use GlobalVisXidVisibleToAll instead, if a
                                 * non-freezing caller wanted to set the VM bit.
                                 */
                                Assert(prstate->cutoffs);
```

Also, maybe GlobalVisXidTestAllVisible is a slightly better name? (The
term 'all-visible' is one that we occasionally utilize)


--
Best regards,
Kirill Reshke



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From: Melanie Plageman
Thanks for the review!

On Tue, Sep 2, 2025 at 7:54 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> > From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 27 Aug 2025 08:50:15 -0400
> > Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay

I didn't push it yet because I did a new version that actually
eliminates the asserts in heap_multi_insert() before calling
visibilitymap_set() -- since they are redundant with checks inside
visibilitymap_set(). 0001 of attached v9 is what I plan to push,
barring any objections.

> > From 7c5cb3edf89735eaa8bee9ca46111bd6c554720b Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Wed, 27 Aug 2025 10:07:29 -0400
> > Subject: [PATCH v8 02/22] Add assert and log message to visibilitymap_set

I pushed this.

> From an abstraction POV I don't love that heapam now is responsible for
> acquiring and releasing the lock. But that ship already kind of has sailed, as
> heapam.c is already responsible for releasing the vm buffer etc...
>
> I've wondered about splitting the responsibilities up into multiple
> visibilitymap_set_* functions, so that heapam.c wouldn't need to acquire the
> lock and set the LSN. But it's probably not worth it.

Yea, I explored heap wrappers coupling heap operations related to
setting the VM along with the VM updates [1], but the results weren't
appealing. Setting the heap LSN and marking the heap buffer dirty and
such happens in a different place in different callers because it is
happening as part of the operations that actually end up rendering the
page all-visible.

And a VM-only helper would literally just acquire and release the lock
and set the LSN on the vm page -- which I don't think is worth it.

> > +     /*
> > +      * Now read and update the VM block. Even if we skipped updating the heap
> > +      * page due to the file being dropped or truncated later in recovery, it's
> > +      * still safe to update the visibility map.  Any WAL record that clears
> > +      * the visibility map bit does so before checking the page LSN, so any
> > +      * bits that need to be cleared will still be cleared.
> > +      *
> > +      * It is only okay to set the VM bits without holding the heap page lock
> > +      * because we can expect no other writers of this page.
> > +      */
> > +     if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> > +             XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> > +                                                                       &vmbuffer) == BLK_NEEDS_REDO)
> > +     {
> > +             Relation        reln = CreateFakeRelcacheEntry(rlocator);
> > +
> > +             Assert(visibilitymap_pin_ok(blkno, vmbuffer));
> > +             visibilitymap_set_vmbyte(reln, blkno,
> > +                                                              vmbuffer,
> > +                                                              VISIBILITYMAP_ALL_VISIBLE |
> > +                                                              VISIBILITYMAP_ALL_FROZEN);
> > +
> > +             /*
> > +              * It is not possible that the VM was already set for this heap page,
> > +              * so the vmbuffer must have been modified and marked dirty.
> > +              */
> > +             Assert(BufferIsDirty(vmbuffer));
>
> How about making visibilitymap_set_vmbyte() return whether it needed to do
> something? This seems somewhat indirect...

It does return the state of the previous bits. But, I am specifically
asserting that the buffer is dirty because I am about to set the page
LSN. So I don't just care that changes were made, I care that we
remembered to mark the buffer dirty.

> I think it might be good to encapsulate this code into a helper in
> visibilitymap.c, there will be more callers in the subsequent patches.

By the end of the set, the different callers have different
expectations (some don't expect the buffer to have been dirtied
necessarily) and where they do the various related operations is
spread out depending on the caller. I just couldn't come up with a
helper solution I liked.

That being said, I definitely don't think it's needed for this patch
(logging setting the VM in xl_heap_multi_insert()).

> > +uint8
> > +visibilitymap_set_vmbyte(Relation rel, BlockNumber heapBlk,
> > +                                              Buffer vmBuf, uint8 flags)
>
> Why is it named vmbyte? This actually just sets the two bits corresponding to
> the buffer, not the entire byte. So it seems somewhat misleading to reference
> byte.

Renamed it to visibilitymap_set_vmbits.

> > Instead of emitting a separate xl_heap_visible record for each page that
> > is rendered all-visible by vacuum's third phase, include the updates to
> > the VM in the already emitted xl_heap_prune record.
>
> Reading through the change I didn't particularly like that there's another
> optional field in xl_heap_prune, as it seemed liked something that should be
> encoded in flags.  Of course there aren't enough flag bits available.  But
> that made me look at the rest of the record: Uh, what do we use the reason
> field for?  As far as I can tell f83d709760d8 added it without introducing any
> users? It doesn't even seem to be set.

yikes, you are right about the "reason" member. Attached 0002 removes
it, and I'll go ahead and fix it in the back branches too. I can't
fathom how that slipped through the cracks. We do pass the PruneReason
for setting the rmgr info about what type of record it is (i.e. if it
is one emitted by vacuum phase I, phase III, or on-access pruning).
But we don't need or use a separate member. I went back and tried to
figure out what the rationale was, but I couldn't find anything.

As for the VM flags being an optional unaligned member -- in v9, I've
expanded the flags member to a uint16 to make room for the extra
flags. Seems we've been surviving with using up 2 bytes this long.

> > @@ -51,10 +52,15 @@ heap_xlog_prune_freeze(XLogReaderState *record)
> >                  (xlrec.flags & (XLHP_HAS_REDIRECTIONS | XLHP_HAS_DEAD_ITEMS)) == 0);
> >
> >       /*
> > -      * We are about to remove and/or freeze tuples.  In Hot Standby mode,
> > -      * ensure that there are no queries running for which the removed tuples
> > -      * are still visible or which still consider the frozen xids as running.
> > -      * The conflict horizon XID comes after xl_heap_prune.
> > +      * After xl_heap_prune is the optional snapshot conflict horizon.
> > +      *
> > +      * In Hot Standby mode, we must ensure that there are no running queries
> > +      * which would conflict with the changes in this record. If pruning, that
> > +      * means we cannot remove tuples still visible to transactions on the
> > +      * standby. If freezing, that means we cannot freeze tuples with xids that
> > +      * are still considered running on the standby. And for setting the VM, we
> > +      * cannot do so if the page isn't all-visible to all transactions on the
> > +      * standby.
> >        */
>
> I'm a bit confused by this new comment - it sounds like we're deciding whether
> to remove tuple versions, but that decision has long been made, no?

Well, the comment is a revision of a comment that was already there on
essentially why replaying this record could cause recovery conflicts.
It mentioned pruning and freezing, so I expanded it to mention setting
the VM. Taking into account your confusion, I tried rewording it in
attached v9.

> > +     if (heap_page_is_all_visible_except_lpdead(vacrel->rel, buffer,
> > +                                                vacrel->cutoffs.OldestXmin,
> > +                                                deadoffsets, num_offsets,
> > +                                                &all_frozen, &visibility_cutoff_xid,
> > +                                                &vacrel->offnum))
>
> I am rather confused - we never can set all-visible if there are any LP_DEAD
> items left. If the idea is that we are removing the LP_DEAD items in
> lazy_vacuum_heap_page() - what guarantees that all LP_DEAD items are being
> removed? Couldn't some tuples get marked LP_DEAD by on-access pruning, after
> vacuum visited the page and collected dead items?
>
> Ugh, I see - it works because we pass in the set of dead items.  I think that
> makes the name *really* misleading, it's not except LP_DEAD, it's except the
> offsets passed in, no?
>
> But then you actually check that the set of dead items didn't change - what
> guarantees that?

So, I pass in the deadoffsets we got from the TIDStore. If the only
dead items on the page are exactly those dead items, then the page
will be all-visible as soon as we set those LP_UNUSED -- which we do
unconditionally. And we have the lock on the page, so no one can
on-access prune and make new dead items while we are in
lazy_vacuum_heap_page().

Given your confusion, I've refactored this and used a different
approach -- I explicitly check the passed-in deadoffsets array when I
encounter a dead item and see if it is there. That should hopefully
make it more clear.
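
For illustration, the membership test amounts to nothing more than
something like this (a sketch only; the helper name is made up, and
deadoffsets/num_offsets are the arrays handed in from the TIDStore):

    static bool
    offset_is_collected_dead(OffsetNumber offnum,
                             const OffsetNumber *deadoffsets, int num_offsets)
    {
        /* a linear scan is enough for this sketch */
        for (int i = 0; i < num_offsets; i++)
        {
            if (deadoffsets[i] == offnum)
                return true;
        }
        return false;
    }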

> I didn't look at the later patches, except that I did notice this:
<--snip-->
> Why are we manually pinning the vm buffer here? Shouldn't the xlog machinery
> have done so, as you noticed in one of the early on patches?

Fixed. Thanks!

- Melanie

[1] https://www.postgresql.org/message-id/flat/CAAKRu_Yj%3DyrL%2BgGGsqfYVQcYn7rDp6hDeoF1vN453JDp8dEY%2Bw%40mail.gmail.com#94602c599abdc8dfc5e438bd24bd8d50

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Wed, Sep 3, 2025 at 5:06 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> small comments regarding new series
>
> 0001, 0002, 0017 LGTM

Thanks for continuing to review!

> In 0015:
>
> Also, maybe GlobalVisXidTestAllVisible is a slightly better name? (The
> term 'all-visible' is one that we occasionally utilize)

Actually, I was trying to distinguish it from all-visible because I
interpret that to mean everything is visible -- as in, every tuple on
a page is visible to everyone. And here we are referring to one xid
and want to know if it is visible to everyone as no longer running. I
don't think my name ("visible-to-all") is good, but I'm hesitant to
co-opt "all-visible" here.

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Fri, Sep 5, 2025 at 6:20 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> > On 2025-09-02 19:11:01 -0400, Melanie Plageman wrote:
> > > From dd98177294011ee93cac122405516abd89f4e393 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Wed, 27 Aug 2025 08:50:15 -0400
> > > Subject: [PATCH v8 01/22] Remove unneeded VM pin from VM replay
>
> I didn't push it yet because I did a new version that actually
> eliminates the asserts in heap_multi_insert() before calling
> visibilitymap_set() -- since they are redundant with checks inside
> visibilitymap_set(). 0001 of attached v9 is what I plan to push,
> barring any objections.

I pushed this, so rebased v10 is attached. I've added one new patch:
0002 adds ERRCODE_DATA_CORRUPTED to the existing log messages about
VM/data corruption in vacuum. Andrey Borodin earlier suggested this,
and I had neglected to include it.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Fri, Sep 5, 2025 at 6:20 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> yikes, you are right about the "reason" member. Attached 0002 removes
> it, and I'll go ahead and fix it in the back branches too.

I think changing this in the back-branches is a super-bad idea. If you
want, you can add a comment in the back-branches saying "oops, we
shipped a field that isn't used for anything", but changing the struct
definition is very likely to make 0 people happy and >0 people
unhappy. On the other hand, changing this in master is a good idea and
you should go ahead and do that before this creates any more
confusion.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Mon, Sep 8, 2025 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Sep 5, 2025 at 6:20 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > yikes, you are right about the "reason" member. Attached 0002 removes
> > it, and I'll go ahead and fix it in the back branches too.
>
> I think changing this in the back-branches is a super-bad idea. If you
> want, you can add a comment in the back-branches saying "oops, we
> shipped a field that isn't used for anything", but changing the struct
> definition is very likely to make 0 people happy and >0 people
> unhappy. On the other hand, changing this in master is a good idea and
> you should go ahead and do that before this creates any more
> confusion.

Yes, that makes 100% sense. It should have occurred to me. I've pushed
the commit to master. I didn't put an updated set of patches here in
case someone was already reviewing them, as nothing else has changed.

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Mon, Sep 8, 2025 at 11:44 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I pushed this, so rebased v10 is  attached. I've added one new patch:
> 0002 adds ERRCODE_DATA_CORRUPTED to the existing log messages about
> VM/data corruption in vacuum. Andrey Borodin earlier suggested this,
> and I had neglected to include it.

Writing "ereport(WARNING, (errcode(ERRCODE_DATA_CORRUPTED)" is very
much a minority position. Generally the call to errcode() is on the
following line. I think the commit message could use a bit of work,
too. The first sentence heavily duplicates the second and the fourth,
and the third sentence isn't sufficiently well-connected to the rest
to make it clear why you're restating this general principle in this
commit message.

Perhaps something like:

Add error codes when VACUUM discovers VM corruption

Commit fd6ec93bf890314ac694dc8a7f3c45702ecc1bbd and other previous
work has established the principle that when an error is potentially
reachable in case of on-disk corruption, but is not expected to be
reached otherwise, ERRCODE_DATA_CORRUPTED should be used. This allows
log monitoring software to search for evidence of corruption by
filtering on the error code.

That kibitzing aside, I think this is pretty clearly the right thing to do.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Mon, Sep 8, 2025 at 2:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Commit fd6ec93bf890314ac694dc8a7f3c45702ecc1bbd and other previous
> work has established the principle that when an error is potentially
> reachable in case of on-disk corruption, but is not expected to be
> reached otherwise, ERRCODE_DATA_CORRUPTED should be used. This allows
> log monitoring software to search for evidence of corruption by
> filtering on the error code.
>
> That kibitzing aside, I think this is pretty clearly the right thing to do.

Thanks for the suggested wording and the pointer to that thread.

I noticed that in that thread they decided to use errmsg_internal()
instead of errmsg() for a few different reasons -- one of which was
that the situation is not supposed to happen/cannot happen -- which I
don't really understand. It is a reachable code path. Another is that
it is extra work for translators, which I'm also not sure how to apply
to my situation. Are the VM corruption cases worth extra work to the
translators?

I think the most compelling reason is that people will want to search
for the error message in English online. So, for that reason, my
instinct is to use errmsg_internal() in my case as well.

- Melanie



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Mon, Sep 8, 2025 at 3:14 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I noticed that in that thread they decided to use errmsg_internal()
> instead of errmsg() for a few different reasons -- one of which was
> that the situation is not supposed to happen/cannot happen -- which I
> don't really understand. It is a reachable code path. Another is that
> it is extra work for translators, which I'm also not sure how to apply
> to my situation. Are the VM corruption cases worth extra work to the
> translators?
>
> I think the most compelling reason is that people will want to search
> for the error message in English online. So, for that reason, my
> instinct is to use errmsg_internal() in my case as well.

I don't find that reason particularly compelling -- people could want
to search for any error message, or they could equally want to be able
to read it without Google translate. Guessing which messages are
obscure enough that we need not translate them exceeds my powers. If I
were doing it, I'd make it errmsg() rather than errmsg_internal() and
let the translations team change it if they don't think it's worth
bothering with, because if you make it errmsg_internal() then they
won't see it until somebody complains about it not getting translated.
However, I suspect different committers would pursue different
strategies here.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
Reviewing 0003:

+               /*
+                * If we're only adding already frozen rows to a previously empty
+                * page, mark it as all-frozen and update the visibility map. We're
+                * already holding a pin on the vmbuffer.
+                */
                else if (all_frozen_set)
+               {
                        PageSetAllVisible(page);
+                       LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);
+                       visibilitymap_set_vmbits(relation,
+                                                BufferGetBlockNumber(buffer),
+                                                vmbuffer,
+                                                VISIBILITYMAP_ALL_VISIBLE |
+                                                VISIBILITYMAP_ALL_FROZEN);

Locking a buffer in a critical section violates the order of
operations proposed in the 'Write-Ahead Log Coding' section of
src/backend/access/transam/README.

+        * Now read and update the VM block. Even if we skipped updating the heap
+        * page due to the file being dropped or truncated later in recovery, it's
+        * still safe to update the visibility map.  Any WAL record that clears
+        * the visibility map bit does so before checking the page LSN, so any
+        * bits that need to be cleared will still be cleared.
+        *
+        * It is only okay to set the VM bits without holding the heap page lock
+        * because we can expect no other writers of this page.

The first paragraph of this paraphrases similar content in
xlog_heap_visible(), but I don't see the variation in phrasing as an
improvement.

The second paragraph does not convince me at all. I see no reason to
believe that this is safe, or that it is a good idea. The code in
xlog_heap_visible() thinks its OK to unlock and relock the page to
make visibilitymap_set() happy, which is cringy but probably safe for
lack of concurrent writers, but skipping locking altogether seems
deeply unwise.

- *             visibilitymap_set        - set a bit in a previously pinned page
+ *             visibilitymap_set        - set bit(s) in a previously pinned page and log
+ *      visibilitymap_set_vmbits - set bit(s) in a pinned page

I suspect the indentation was done with a different mix of spaces and
tabs here, because this doesn't align for me.

In general, this idea makes some sense to me -- there doesn't seem to
be any particularly good reason why the visibility-map update should
be handled by a different WAL record than the all-visible flag on the
page itself. It's a little hard for me to make that statement too
conclusively without studying more of the patches than I've had time
to do today, but off the top of my head it seems to make sense.
However, I'm not sure you've taken enough care with the details here.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Mon, Sep 8, 2025 at 4:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Reviewing 0003:
>
> Locking a buffer in a critical section violates the order of
> operations proposed in the 'Write-Ahead Log Coding' section of
> src/backend/access/transam/README.

Right, I noticed some other callers of visibilitymap_set() (like
lazy_scan_new_or_empty()) did call it in a critical section (and it
exclusive locks the VM page), so I thought perhaps it was better to
keep this operation as close as possible to where we update the VM
(similar to how it is in master in visibilitymap_set()).

But, I think you're right that maintaining the order of operations
proposed in transam/README is more important. As such, in attached
v11, I've modified this patch and the other patches where I replace
visibilitymap_set() with visibilitymap_set_vmbits() to exclusively
lock the vmbuffer before the critical section.
visibilitymap_set_vmbits() asserts that we have the vmbuffer
exclusively locked, so we should be good.
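
In other words, the shape is now roughly the following (a sketch of
the ordering only, not the exact patch; the XLogBeginInsert() /
XLogRegisterBuffer() setup and the record payload are omitted):

    LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE);    /* before the critical section */

    START_CRIT_SECTION();

    /* heap page changes */
    PageSetAllVisible(BufferGetPage(buffer));
    MarkBufferDirty(buffer);

    /* VM changes in the same critical section; asserts the VM lock is held */
    visibilitymap_set_vmbits(relation, BufferGetBlockNumber(buffer), vmbuffer,
                             VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);

    /* one WAL record covers both pages; stamp both LSNs with it */
    recptr = XLogInsert(RM_HEAP2_ID, info);
    PageSetLSN(BufferGetPage(buffer), recptr);
    PageSetLSN(BufferGetPage(vmbuffer), recptr);

    END_CRIT_SECTION();

    LockBuffer(vmbuffer, BUFFER_LOCK_UNLOCK);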

> +        * Now read and update the VM block. Even if we skipped updating the heap
> +        * page due to the file being dropped or truncated later in recovery, it's
> +        * still safe to update the visibility map.  Any WAL record that clears
> +        * the visibility map bit does so before checking the page LSN, so any
> +        * bits that need to be cleared will still be cleared.
> +        *
> +        * It is only okay to set the VM bits without holding the heap page lock
> +        * because we can expect no other writers of this page.
>
> The first paragraph of this paraphrases similar content in
> xlog_heap_visible(), but I don't see the variation in phrasing as an
> improvement.

The only difference is I replaced the phrase "LSN interlock" with
"being dropped or truncated later in recovery" -- which is more
specific and, I thought, more clear. Without this comment, it took me
some time to understand the scenarios that might lead us to skip
updating the heap block. heap_xlog_visible() has cause to describe
this situation in an earlier comment -- which is why I think the LSN
interlock comment is less confusing there.

Anyway, I'm open to changing the comment. I could:
1) copy-paste the same comment as heap_xlog_visible()
2) refer to the comment in heap_xlog_visible() (comment seemed a bit
short for that)
3) diverge the comments further by improving the new comment in
heap_xlog_multi_insert() in some way
4) something else?

> The second paragraph does not convince me at all. I see no reason to
> believe that this is safe, or that it is a good idea. The code in
> xlog_heap_visible() thinks its OK to unlock and relock the page to
> make visibilitymap_set() happy, which is cringy but probably safe for
> lack of concurrent writers, but skipping locking altogether seems
> deeply unwise.

Actually in master, heap_xlog_visible() has no lock on the heap page
when it calls visibilitymap_set(). It releases that lock before
recording the freespace in the FSM and doesn't take it again.

It does unlock and relock the VM page -- because visibilitymap_set()
expects to take the lock on the VM.

I agree that not holding the heap lock while updating the VM is
unsatisfying. We can't hold it while doing the IO to read in the VM
block in XLogReadBufferForRedoExtended(). So, we could take it again
before calling visibilitymap_set(). But we don't always have the heap
buffer, though. I suspect this is partially why heap_xlog_visible()
unconditionally passes InvalidBuffer to visibilitymap_set() as the
heap buffer and has special case handling for recovery when we don't
have the heap buffer.

In any case, it isn't an active bug, and I don't think future-proofing
VM replay (i.e. against parallel recovery) is a prerequisite for
committing this patch since it is also that way on master.

> - *             visibilitymap_set        - set a bit in a previously pinned page
> + *             visibilitymap_set        - set bit(s) in a previously
> pinned page and log
> + *      visibilitymap_set_vmbits - set bit(s) in a pinned page
>
> I suspect the indentation was done with a different mix of spaces and
> tabs here, because this doesn't align for me.

oops, fixed.

I pushed the ERRCODE_DATA_CORRUPTED patch, so attached v11 is rebased
and also has the changes mentioned above.

Since you've started reviewing the set, I'll note that patches
0005-0011 are split up for ease of review and it may not necessarily
make sense to keep that separation for eventual commit. They are a
series of steps to move VM updates from lazy_scan_prune() into
pruneheap.c.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Mon, Sep 8, 2025 at 6:29 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> But, I think you're right that maintaining the order of operations
> proposed in transam/README is more important. As such, in attached
> v11, I've modified this patch and the other patches where I replace
> visibilitymap_set() with visibilitymap_set_vmbits() to exclusively
> lock the vmbuffer before the critical section.
> visibilitymap_set_vmbits() asserts that we have the vmbuffer
> exclusively locked, so we should be good.

That sounds good. I think it is OK to keep some of the odd things that
we're currently doing if they're hard to eliminate, but if they're not
really needed then I'd rather see us standardize the code. I feel (and
I think you may agree, based on other conversations that we've had)
that the visibility map code is somewhat oddly structured, and I'd
like to see us push the amount of oddness down rather than up, if we
can reasonably do so without breaking everything.

> The only difference is I replaced the phrase "LSN interlock" with
> "being dropped or truncated later in recovery" -- which is more
> specific and, I thought, more clear. Without this comment, it took me
> some time to understand the scenarios that might lead us to skip
> updating the heap block. heap_xlog_visible() has cause to describe
> this situation in an earlier comment -- which is why I think the LSN
> interlock comment is less confusing there.
>
> Anyway, I'm open to changing the comment. I could:
> 1) copy-paste the same comment as heap_xlog_visible()
> 2) refer to the comment in heap_xlog_visible() (comment seemed a bit
> short for that)
> 3) diverge the comments further by improving the new comment in
> heap_xlog_multi_insert() in some way
> 4) something else?

IMHO, copying and pasting comments is not great, and comments with
identical intent and divergent wording are also not great. The former
is not great because having a whole bunch of copies of the same
comment, especially if it's a block comment rather than a 1-liner,
uses up a bunch of space and creates a maintenance hazard in the sense
that future updates might not get propagated to all copies. The latter
is not great because it makes it hard to grep for other instances that
should be adjusted when you adjust one, and also because if one
version really is better than the other then ideally we'd like to have
the good version everywhere. Of course, there's some tension between
these two goals. In this particular case, thinking a little harder
about your proposed change, it seems to me that "LSN interlock" is
more clear about what the immediate test is that would cause us to
skip updating the heap page, and "being dropped or truncated later in
recovery" is more clear about what the larger state of the world that
would lead to that situation is. But whatever preference anyone might
have about which way to go with that choice, it is hard to see why the
preference should go one way in one case and the other way in another
case. Therefore, I favor an approach that leads either to an identical
comment in both places, or to one comment referring to the other.

> > The second paragraph does not convince me at all. I see no reason to
> > believe that this is safe, or that it is a good idea. The code in
> > xlog_heap_visible() thinks it's OK to unlock and relock the page to
> > make visibilitymap_set() happy, which is cringy but probably safe for
> > lack of concurrent writers, but skipping locking altogether seems
> > deeply unwise.
>
> Actually in master, heap_xlog_visible() has no lock on the heap page
> when it calls visibilitymap_set(). It releases that lock before
> recording the freespace in the FSM and doesn't take it again.
>
> It does unlock and relock the VM page -- because visibilitymap_set()
> expects to take the lock on the VM.
>
> I agree that not holding the heap lock while updating the VM is
> unsatisfying. We can't hold it while doing the IO to read in the VM
> block in XLogReadBufferForRedoExtended(). So, we could take it again
> before calling visibilitymap_set(). But we don't always have the heap
> buffer, though. I suspect this is partially why heap_xlog_visible()
> unconditionally passes InvalidBuffer to visibilitymap_set() as the
> heap buffer and has special case handling for recovery when we don't
> have the heap buffer.

You know, I wasn't thinking carefully enough about the distinction
between the heap page and the visibility map page here. I thought you
were saying that you were modifying a page without a lock on that
page, but you aren't: you're saying you're modifying a page without a
lock on another page to which it is related. The former seems
disastrous, but the latter might be OK. However, I'm sort of confused
about what the comment is trying to say to justify that:

+        * It is only okay to set the VM bits without holding the heap page lock
+        * because we can expect no other writers of this page.

It is not exactly clear to me whether "this page" here refers to the
heap page or the VM page. If it means the heap page, why should that
be so if we haven't got any kind of lock? If it means the VM page,
then why is the heap page even relevant?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Tue, Sep 9, 2025 at 10:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 6:29 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>
> > The only difference is I replaced the phrase "LSN interlock" with
> > "being dropped or truncated later in recovery" -- which is more
> > specific and, I thought, more clear. Without this comment, it took me
> > some time to understand the scenarios that might lead us to skip
> > updating the heap block. heap_xlog_visible() has cause to describe
> > this situation in an earlier comment -- which is why I think the LSN
> > interlock comment is less confusing there.
> >
> > Anyway, I'm open to changing the comment. I could:
> > 1) copy-paste the same comment as heap_xlog_visible()
> > 2) refer to the comment in heap_xlog_visible() (comment seemed a bit
> > short for that)
> > 3) diverge the comments further by improving the new comment in
> > heap_xlog_multi_insert() in some way
> > 4) something else?
>
> IMHO, copying and pasting comments is not great, and comments with
> identical intent and divergent wording are also not great. The former
> is not great because having a whole bunch of copies of the same
> comment, especially if it's a block comment rather than a 1-liner,
> uses up a bunch of space and creates a maintenance hazard in the sense
> that future updates might not get propagated to all copies. The latter
> is not great because it makes it hard to grep for other instances that
> should be adjusted when you adjust one, and also because if one
> version really is better than the other then ideally we'd like to have
> the good version everywhere. Of course, there's some tension between
> these two goals. In this particular case, thinking a little harder
> about your proposed change, it seems to me that "LSN interlock" is
> more clear about what the immediate test is that would cause us to
> skip updating the heap page, and "being dropped or truncated later in
> recovery" is more clear about what the larger state of the world that
> would lead to that situation is. But whatever preference anyone might
> have about which way to go with that choice, it is hard to see why the
> preference should go one way in one case and the other way in another
> case. Therefore, I favor an approach that leads either to an identical
> comment in both places, or to one comment referring to the other.

I see what you are saying.

For heap_xlog_visible() the LSN interlock comment is easier to parse
because of an earlier comment before reading the heap page:

    /*
     * Read the heap page, if it still exists. If the heap file has dropped or
     * truncated later in recovery, we don't need to update the page, but we'd
     * better still update the visibility map.
     */

I've gone with the direct copy-paste of the LSN interlock paragraph in
attached v12. I think referring to the other comment is too confusing
in context here. However, I also added a line about what could cause
the LSN interlock -- but above it, so as to retain grep-ability of the
other comment.

> > > The second paragraph does not convince me at all. I see no reason to
> > > believe that this is safe, or that it is a good idea. The code in
> > > xlog_heap_visible() thinks it's OK to unlock and relock the page to
> > > make visibilitymap_set() happy, which is cringy but probably safe for
> > > lack of concurrent writers, but skipping locking altogether seems
> > > deeply unwise.
> >
> > Actually in master, heap_xlog_visible() has no lock on the heap page
> > when it calls visibilitymap_set(). It releases that lock before
> > recording the freespace in the FSM and doesn't take it again.
> >
> > It does unlock and relock the VM page -- because visibilitymap_set()
> > expects to take the lock on the VM.
> >
> > I agree that not holding the heap lock while updating the VM is
> > unsatisfying. We can't hold it while doing the IO to read in the VM
> > block in XLogReadBufferForRedoExtended(). So, we could take it again
> > before calling visibilitymap_set(). But we don't always have the heap
> > buffer, though. I suspect this is partially why heap_xlog_visible()
> > unconditionally passes InvalidBuffer to visibilitymap_set() as the
> > heap buffer and has special case handling for recovery when we don't
> > have the heap buffer.
>
> You know, I wasn't thinking carefully enough about the distinction
> between the heap page and the visibility map page here. I thought you
> were saying that you were modifying a page without a lock on that
> page, but you aren't: you're saying you're modifying a page without a
> lock on another page to which it is related. The former seems
> disastrous, but the latter might be OK. However, I'm sort of confused
> about what the comment is trying to say to justify that:
>
> +        * It is only okay to set the VM bits without holding the heap page lock
> +        * because we can expect no other writers of this page.
>
> It is not exactly clear to me whether "this page" here refers to the
> heap page or the VM page. If it means the heap page, why should that
> be so if we haven't got any kind of lock? If it means the VM page,
> then why is the heap page even relevant?

I've expanded the comment in v12. In normal operation we must have the
lock on the heap page when setting the VM bits because if another
backend cleared PD_ALL_VISIBLE, we could have the forbidden scenario
where PD_ALL_VISIBLE is clear and the VM is set. This is not allowed
because then someone else may read the VM, conclude the page is
all-visible, and then an index-only scan can return wrong results. In
recovery, there are no concurrent writers, so it can't happen.
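
Stated as an assertion (hypothetical, with vm_status coming from
visibilitymap_get_status()), the invariant is:

    /* forbidden state: VM bit set while PD_ALL_VISIBLE is clear on the heap page */
    Assert(!(vm_status & VISIBILITYMAP_ALL_VISIBLE) || PageIsAllVisible(page));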

It is worth discussing how to fix it in heap_xlog_visible() so that
future scenarios like parallel recovery could not break this. However,
this patch is not a deviation from the behavior on master, and,
technically the behavior on master works.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Tue, Sep 9, 2025 at 12:24 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> For heap_xlog_visible() the LSN interlock comment is easier to parse
> because of an earlier comment before reading the heap page:
>
>     /*
>      * Read the heap page, if it still exists. If the heap file has dropped or
>      * truncated later in recovery, we don't need to update the page, but we'd
>      * better still update the visibility map.
>      */
>
> I've gone with the direct copy-paste of the LSN interlock paragraph in
> attached v12. I think referring to the other comment is too confusing
> in context here. However, I also added a line about what could cause
> the LSN interlock -- but above it, so as to retain grep-ability of the
> other comment.

I think that reads a little strangely. I would consolidate: Note that
the heap relation may have been dropped or truncated, leading us to
skip updating the heap block due to the LSN interlock. However, even
in that case, it's still safe to update the visibility map, etc. The
rest of the comment is perhaps a tad more explicit than our usual
practice, but that might be a good thing, because sometimes we're a
little too terse about these critical details.

I just realized that I don't like this:

+ /*
+ * If we're only adding already frozen rows to a previously empty
+ * page, mark it as all-frozen and update the visibility map. We're
+ * already holding a pin on the vmbuffer.
+ */

The thing is, we rarely position a block comment just before an "else
if". There are probably instances, but it's not typical. That's why
the existing comment contains two "if blah then blah" statements of
which you deleted the second -- because it needed to cover both the
"if" and the "else if". An alternative style is to move the comment
down a nesting level and rephrase without the conditional, ie. "We're
only adding frozen rows to a previously empty page, so mark it as
all-frozen etc." But I don't know that I like doing that for one
branch of the "if" and not the other.

The rest of what's now 0001 looks OK to me now, although you might
want to wait for a review from somebody more knowledgeable about this
area.

Some very quick comments on the next few patches -- far from a full review:

0002. Looks boring, probably unobjectionable provided the payoff patch is OK.

0003. What you've done here with xl_heap_prune.flags is kind of
horrifying. The problem is that, while you've added code explaining
that VISIBILITYMAP_ALL_{VISIBLE,FROZEN} are honorary XLHP flags,
nobody who isn't looking directly at that comment is going to
understand the muddling of the two namespaces. I would suggest not
doing this, even if it means defining redundant constants and writing
technically-unnecessary code to translate between them.

0004. It is not clear to me why you need to get
log_heap_prune_and_freeze to do the work here. Why can't
log_newpage_buffer get the job done already?

0005. It looks a little curious that you delete the
identify-corruption logic from the end of the if-nest and add it to
the beginning. Ceteris paribus, you'd expect that to be worse, since
corruption is a rare case.

0006. "to me marked" -> "to be marked".

+                * If the heap page is all-visible but the VM bit is not set, we don't
+                * need to dirty the heap page.  However, if checksums are enabled, we
+                * do need to make sure that the heap page is dirtied before passing
+                * it to visibilitymap_set(), because it may be logged.
                 */
-               PageSetAllVisible(page);
-               MarkBufferDirty(buf);
+               if (!PageIsAllVisible(page) || XLogHintBitIsNeeded())
+               {
+                       PageSetAllVisible(page);
+                       MarkBufferDirty(buf);
+               }

I really hate this. Maybe you're going to argue that it's not the job
of this patch to fix the awfulness here, but surely marking a buffer
dirty in case some other function decides to WAL-log it is a
ridiculous plan.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
Thanks for the review! I've made the changes to comments and minor
fixes you suggested in attached v13 and have limited my inline
responses to areas where further discussion is required.

On Tue, Sep 9, 2025 at 3:26 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> 0003. What you've done here with xl_heap_prune.flags is kind of
> horrifying. The problem is that, while you've added code explaining
> that VISIBILITYMAP_ALL_{VISIBLE,FROZEN} are honorary XLHP flags,
> nobody who isn't looking directly at that comment is going to
> understand the muddling of the two namespaces. I would suggest not
> doing this, even if it means defining redundant constants and writing
> technically-unnecessary code to translate between them.

Fair. I've introduced new XLHP flags in attached v13. Hopefully it
puts an end to the horror.

> 0004. It is not clear to me why you need to get
> log_heap_prune_and_freeze to do the work here. Why can't
> log_newpage_buffer get the job done already?

Well, I need something to emit the changes to the VM. I'm eliminating
all users of xl_heap_visible. Empty pages are the ones that benefit
the least from switching from xl_heap_visible -> xl_heap_prune. But,
if I don't transition them, we have to maintain all the
xl_heap_visible code (including visibilitymap_set() in its long form).

As for log_newpage_buffer(): if you think it is too confusing to
change log_heap_prune_and_freeze()'s API (by passing force_heap_fpi)
to handle this case, I can keep log_newpage_buffer() there and then
call log_heap_prune_and_freeze().

I just thought it seemed simple to avoid emitting the new page record
and the VM update record, so why not -- but I don't have strong
feelings.

> 0005. It looks a little curious that you delete the
> identify-corruption logic from the end of the if-nest and add it to
> the beginning. Ceteris paribus, you'd expect that to be worse, since
> corruption is a rare case.

On master, the two corruption cases are sandwiched between the normal
VM set cases. And I actually think doing it this way is brittle. If
you put the cases which set the VM first, you have to completely
bulletproof the if statements guarding them to foreclose any possible
corruption case from entering, because otherwise you will overwrite
the corruption you would then try to detect.

But, specifically, from a performance perspective:

I think moving up the third case doesn't matter because the check is so cheap:

    else if (presult.lpdead_items > 0 && PageIsAllVisible(page))

And as for moving up the second case (the other corruption case), the
non-cheap thing it does is call visibilitymap_get_status()

    else if (all_visible_according_to_vm && !PageIsAllVisible(page) &&
             visibilitymap_get_status(vacrel->rel, blkno, &vmbuffer) != 0)

But once you call visibilitymap_get_status() once, assuming there is
no corruption and you need to go set the VM, you've already got that
page of the VM read, so it is probably pretty cheap. Overall, I didn't
think this would add noticeable overhead or many wasted operations.

And I thought that reorganizing the code improved clarity as well as
decreased the likelihood of bugs from insufficiently guarding positive
cases against corrupt pages and overwriting corruption instead of
detecting it.

If we're really worried about it from a performance perspective, I
could add an extra test at the top of identify_and_fix_vm_corruption()
that dumps out early if (!all_visible_according_to_vm &&
presult.all_visible).

> +                * If the heap page is all-visible but the VM bit is not set, we don't
> +                * need to dirty the heap page.  However, if checksums are enabled, we
> +                * do need to make sure that the heap page is dirtied before passing
> +                * it to visibilitymap_set(), because it may be logged.
>                  */
> -               PageSetAllVisible(page);
> -               MarkBufferDirty(buf);
> +               if (!PageIsAllVisible(page) || XLogHintBitIsNeeded())
> +               {
> +                       PageSetAllVisible(page);
> +                       MarkBufferDirty(buf);
> +               }
>
> I really hate this. Maybe you're going to argue that it's not the job
> of this patch to fix the awfulness here, but surely marking a buffer
> dirty in case some other function decides to WAL-log it is a
> ridiculous plan.

Right, it isn't pretty. But I don't quite see what the alternative is.
We need to mark the buffer dirty before setting the LSN. We could
perhaps rewrite visibilitymap_set()'s API to return the LSN of the
xl_heap_visible record and stamp it on the heap buffer ourselves. But
1) I think visibilitymap_set() purposefully conceals its WAL logging
ways from the caller and propagating that info back up starts to make
the API messy in another way and 2) I'm a bit loath to make big
changes to visibilitymap_set() right now since my patch set eventually
resolves this by putting the changes to the VM and heap page in the
same WAL record.

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
On Tue, Sep 9, 2025 at 7:08 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Fair. I've introduced new XLHP flags in attached v13. Hopefully it
> puts an end to the horror.

I suggest not renumbering all of the existing flags and just adding
these new ones at the end. Less code churn and more likely to break in
an obvious way if you mix up the two sets of flags.

More on 0002:

+ set_heap_lsn = XLogHintBitIsNeeded() ? true : set_heap_lsn;

Maybe just if (XLogHintBitIsNeeded) set_heap_lsn = true? I don't feel
super-strongly that what you've done is bad but it looks weird to my
eyes.

+ * If we released any space or line pointers or will be setting a page in
+ * the visibility map, measure the page's freespace to later update the

"setting a page in the visibility map" seems a little muddled to me.
You set bits, not pages.

+ * all-visible (or all-frozen, depending on the vacuum mode,) which is

This comma placement is questionable.

  /*
+ * Note that the heap relation may have been dropped or truncated, leading
+ * us to skip updating the heap block due to the LSN interlock. However,
+ * even in that case, it's still safe to update the visibility map. Any
+ * WAL record that clears the visibility map bit does so before checking
+ * the page LSN, so any bits that need to be cleared will still be
+ * cleared.
+ *
+ * Note that the lock on the heap page was dropped above. In normal
+ * operation this would never be safe because a concurrent query could
+ * modify the heap page and clear PD_ALL_VISIBLE -- violating the
+ * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
+ * the VM is set.
+ *
+ * In recovery, we expect no other writers, so writing to the VM page
+ * without holding a lock on the heap page is considered safe enough. It
+ * is done this way when replaying xl_heap_visible records (see
  */

How many copies of this comment do you plan to end up with?

The comment for log_heap_prune_and_freeze seems to be anticipating future work.

> > 0004. It is not clear to me why you need to get
> > log_heap_prune_and_freeze to do the work here. Why can't
> > log_newpage_buffer get the job done already?
>
> Well, I need something to emit the changes to the VM. I'm eliminating
> all users of xl_heap_visible. Empty pages are the ones that benefit
> the least from switching from xl_heap_visible -> xl_heap_prune. But,
> if I don't transition them, we have to maintain all the
> xl_heap_visible code (including visibilitymap_set() in its long form).
>
> As for log_newpage_buffer(), I could keep it if you think it is too
> confusing to change log_heap_prune_and_freeze()'s API (by passing
> force_heap_fpi) to handle this case, I can leave log_newpage_buffer()
> there and then call log_heap_prune_and_freeze().
>
> I just thought it seemed simple to avoid emitting the new page record
> and the VM update record, so why not -- but I don't have strong
> feelings.

Yeah, I'm not sure what the right thing to do here is. I think I was
again experiencing brain fade by forgetting that there is a heap page
and a VM page and, of course, log_heap_newpage() probably isn't going
to touch the latter. So that makes sense. On the other hand, we could
only have one type of WAL record for every single operation in the
system if we gave it enough flags, and force_heap_fpi seems
suspiciously like a flag that turns this into a whole different kind
of WAL record.

> > 0005. It looks a little curious that you delete the
> > identify-corruption logic from the end of the if-nest and add it to
> > the beginning. Ceteris paribus, you'd expect that to be worse, since
> > corruption is a rare case.
>
> On master, the two corruption cases are sandwiched between the normal
> VM set cases. And I actually think doing it this way is brittle. If
> you put the cases which set the VM first, you have to have completely
> bulletproof the if statements guarding them to foreclose any possible
> corruption case from entering because otherwise you will overwrite the
> corruption you then try to detect.

Hmm. In the current code, we first test (!all_visible_according_to_vm
&& presult.all_visible), then (all_visible_according_to_vm &&
!PageIsAllVisible(page) && visibilitymap_get_status(vacrel->rel,
blkno, &vmbuffer) != 0), and then (presult.lpdead_items > 0 &&
PageIsAllVisible(page)). The first and second can never coexist,
because they require opposite values of all_visible_according_to_vm.
The second and third cannot coexist because they require opposite
values of PageIsAllVisible(page). It is not entirely obvious that the
first and third tests couldn't both pass, but you'd have to have
presult.all_visible and presult.lpdead_items > 0, and it's a bit hard
to see how heap_page_prune_and_freeze() could ever allow that.
Consider:

    if (prstate.all_visible && prstate.lpdead_items == 0)
    {
        presult->all_visible = prstate.all_visible;
        presult->all_frozen = prstate.all_frozen;
    }
    else
    {
        presult->all_visible = false;
        presult->all_frozen = false;
    }
...
    presult->lpdead_items = prstate.lpdead_items;

So I don't really think I'm persuaded that the current way is brittle.
But that having been said, I agree with you that the order of the
checks is kind of random, and I don't think it really matters that
much for performance. What does matter is clarity. I feel like what
I'd ideally like this logic to do is say: do we want the VM bit for
the page to be set to all-frozen, just all-visible, or neither? Then
push the VM bit to the correct state, dragging the page-level bit
along behind. And the current logic sort of does that. It's roughly:

1. Should we go from not-all-visible to either all-visible or
all-frozen? If yes, do so.
2. Should we go from either all-visible or all-frozen to
not-all-visible? If yes, do so.
3. Should we go from either all-visible or all-frozen to
not-all-visible for a different reason? If yes, do so.
4. Should we go from all-visible to all-frozen? If yes, do so.

But what's weird is that all the tests are written differently, and we
have two different reasons for going to not-all-visible, namely
PD_ALL_VISIBLE-not-set and dead-items-on-page, whereas there's only
one test for each of the other state-transitions, because the
decision-making for those cases is fully completed at an earlier
stage. I would kind of like to see this expressed in a way that first
decides which state transition to make (forward-to-all-frozen,
forward-to-all-visible, backward-to-all-visible,
backward-to-not-all-visible, nothing) and then does the corresponding
work. What you're doing instead is splitting half of those functions
off into a helper function while keeping the other half where they are
without cleaning up any of the logic. Now, maybe that's OK: I'm far
from having grokked the whole patch set. But it is not any more clear
than what we have now, IMHO, and perhaps even a bit less so.
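
Very roughly, I'm picturing something like the following shape (the
enum and helper are invented purely for illustration, not a concrete
proposal):

    typedef enum VMStateChange
    {
        VM_CHANGE_NONE,
        VM_CHANGE_SET_ALL_VISIBLE,  /* forward: not-all-visible -> all-visible */
        VM_CHANGE_SET_ALL_FROZEN,   /* forward: to all-visible and all-frozen */
        VM_CHANGE_CLEAR             /* backward: repair a wrongly set bit */
    } VMStateChange;

    /* decide which transition to make ... */
    VMStateChange change = decide_vm_state_change(all_visible_according_to_vm,
                                                  PageIsAllVisible(page),
                                                  presult.all_visible,
                                                  presult.all_frozen,
                                                  presult.lpdead_items);

    /* ... then do the corresponding work in exactly one place per transition */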

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Wed, Sep 10, 2025 at 4:01 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 7:08 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> > Fair. I've introduced new XLHP flags in attached v13. Hopefully it
> > puts an end to the horror.
>
> I suggest not renumbering all of the existing flags and just adding
> these new ones at the end. Less code churn and more likely to break in
> an obvious way if you mix up the two sets of flags.

Makes sense. In my attached v14, I have not renumbered them.

> More on 0002:

After an off-list discussion we had about how to make the patches in
the set progressively improve the code instead of just mechanically
refactoring it, I have made some big changes in the intermediate
patches in the set.

Before actually including the VM changes in the vacuum/prune WAL
records, I first include setting PD_ALL_VISIBLE with the other changes
to the heap page so that we can remove the heap page from the VM
setting WAL chain. This happens to fix the bug we discussed where if
you set an all-visible page all-frozen and checksums/wal_log_hints are
enabled, you may end up setting an LSN on a page that was not marked
dirty.

0001 is RFC but waiting on one other reviewer
0002 - 0007 is a bit of cleanup I had later in the patch set but moved
up because I think it made the intermediate patches better
0008 - 0012 removes the heap page from the XLOG_HEAP2_VISIBLE WAL
chain (it makes all callers of visibilitymap_set() set PD_ALL_VISIBLE
in the same WAL record as changes to the heap page)
0013 - 0018 finish the job eliminating XLOG_HEAP2_VISIBLE and set VM
bits in the same WAL record as the heap changes
0019 - 0024 set the VM on-access

>   /*
> + * Note that the heap relation may have been dropped or truncated, leading
> + * us to skip updating the heap block due to the LSN interlock. However,
> + * even in that case, it's still safe to update the visibility map. Any
> + * WAL record that clears the visibility map bit does so before checking
> + * the page LSN, so any bits that need to be cleared will still be
> + * cleared.
> + *
> + * Note that the lock on the heap page was dropped above. In normal
> + * operation this would never be safe because a concurrent query could
> + * modify the heap page and clear PD_ALL_VISIBLE -- violating the
> + * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
> + * the VM is set.
> + *
> + * In recovery, we expect no other writers, so writing to the VM page
> + * without holding a lock on the heap page is considered safe enough. It
> + * is done this way when replaying xl_heap_visible records (see
>   */
>
> How many copies of this comment do you plan to end up with?

By the end, one for copy freeze replay and one for prune/freeze/vacuum
replay. I felt two wasn't too bad and was easier than meta-explaining
what the other comment was explaining.

> > > 0004. It is not clear to me why you need to get
> > > log_heap_prune_and_freeze to do the work here. Why can't
> > > log_newpage_buffer get the job done already?
> >
> > Well, I need something to emit the changes to the VM. I'm eliminating
> > all users of xl_heap_visible. Empty pages are the ones that benefit
> > the least from switching from xl_heap_visible -> xl_heap_prune. But,
> > if I don't transition them, we have to maintain all the
> > xl_heap_visible code (including visibilitymap_set() in its long form).
> >
> > As for log_newpage_buffer(), I could keep it if you think it is too
> > confusing to change log_heap_prune_and_freeze()'s API (by passing
> > force_heap_fpi) to handle this case, I can leave log_newpage_buffer()
> > there and then call log_heap_prune_and_freeze().
> >
> > I just thought it seemed simple to avoid emitting the new page record
> > and the VM update record, so why not -- but I don't have strong
> > feelings.
>
> Yeah, I'm not sure what the right thing to do here is. I think I was
> again experiencing brain fade by forgetting that there is a heap page
> and a VM page and, of course, log_heap_newpage() probably isn't going
> to touch the latter. So that makes sense. On the other hand, we could
> only have one type of WAL record for every single operation in the
> system if we gave it enough flags, and force_heap_fpi seems
> suspiciously like a flag that turns this into a whole different kind
> of WAL record.

I've kept log_heap_newpage() and used log_heap_prune_and_freeze() for
setting PD_ALL_VISIBLE and the VM.

> > > 0005. It looks a little curious that you delete the
> > > identify-corruption logic from the end of the if-nest and add it to
> > > the beginning. Ceteris paribus, you'd expect that to be worse, since
> > > corruption is a rare case.
> >
> > On master, the two corruption cases are sandwiched between the normal
> > VM set cases. And I actually think doing it this way is brittle. If
> > you put the cases which set the VM first, you have to have completely
> > bulletproof the if statements guarding them to foreclose any possible
> > corruption case from entering because otherwise you will overwrite the
> > corruption you then try to detect.
>
> Hmm. In the current code, we first test (!all_visible_according_to_vm
> && presult.all_visible), then (all_visible_according_to_vm &&
> !PageIsAllVisible(page) && visibilitymap_get_status(vacrel->rel,
> blkno, &vmbuffer) != 0), and then (presult.lpdead_items > 0 &&
> PageIsAllVisible(page)). The first and second can never coexist,
> because they require opposite values of all_visible_according_to_vm.
> The second and third cannot coexist because they require opposite
> values of PageIsAllVisible(page). It is not entirely obvious that the
> first and third tests couldn't both pass, but you'd have to have
> presult.all_visible and presult.lpdead_items > 0, and it's a bit hard
> to see how heap_page_prune_and_freeze() could ever allow that.
> Consider:
>
>     if (prstate.all_visible && prstate.lpdead_items == 0)
>     {
>         presult->all_visible = prstate.all_visible;
>         presult->all_frozen = prstate.all_frozen;
>     }
>     else
>     {
>         presult->all_visible = false;
>         presult->all_frozen = false;
>     }
> ...
>     presult->lpdead_items = prstate.lpdead_items;
>
> So I don't really think I'm persuaded that the current way is brittle.

I meant brittle because it has to be so carefully coded for it to work
out this way. If you ever wanted to change or enhance it, it's quite
hard to know how to make sure all of them are entirely mutually
exclusive.

> But that having been said, I agree with you that the order of the
> checks is kind of random, and I don't think it really matters that
> much for performance. What does matter is clarity. I feel like what
> I'd ideally like this logic to do is say: do we want the VM bit for
> the page to be set to all-frozen, just all-visible, or neither? Then
> push the VM bit to the correct state, dragging the page-level bit
> along behind. And the current logic sort of does that. It's roughly:
>
> 1. Should we go from not-all-visible to either all-visible or
> all-frozen? If yes, do so.
> 2. Should we go from either all-visible or all-frozen to
> not-all-visible? If yes, do so.
> 3. Should we go from either all-visible or all-frozen to
> not-all-visible for a different reason? If yes, do so.
> 4. Should we go from all-visible to all-frozen? If yes, do so.

I don't necessarily agree that fixing corruption and setting the VM
should be together -- they feel like separate things to me. But, I
don't feel strongly enough about it to push it.

> But what's weird is that all the tests are written differently, and we
> have two different reasons for going to not-all-visible, namely
> PD_ALL_VISIBLE-not-set and dead-items-on-page, whereas there's only
> one test for each of the other state-transitions, because the
> decision-making for those cases is fully completed at an earlier
> stage. I would kind of like to see this expressed in a way that first
> decides which state transition to make (forward-to-all-frozen,
> forward-to-all-visible, backward-to-all-visible,
> backward-to-not-all-visible, nothing) and then does the corresponding
> work. What you're doing instead is splitting half of those functions
> off into a helper function while keeping the other half where they are
> without cleaning up any of the logic. Now, maybe that's OK: I'm far
> from having grokked the whole patch set. But it is not any more clear
> than what we have now, IMHO, and perhaps even a bit less so.

In terms of my patch set, I do have to change something about this
mixture of fixing corruption and setting the VM, because I need to set
the VM bits in the same critical section as the other changes to the
heap page (pruning, etc.) and include the VM changes in the same WAL
record. (Note that clearing the VM to fix corruption is not
WAL-logged.)

What I've gone with is determining what to set the VM bits to and
fixing the corruption at the same time. Then, later, when making the
changes to the heap page, I actually set the VM. This is kind of the
opposite of what you suggested above -- deciding what to set the bits
to altogether, corruption and non-corruption cases together. I don't
think we can do that, though, because fixing the corruption is a
non-WAL-logged change to the page and the VM, while setting the VM
bits is a WAL-logged change. And you can't clear bits with
visibilitymap_set() (there's an assertion about that), so you have to
call different functions (not to mention emit distinct error
messages). I don't know that I've come up with the ideal solution,
though.
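
Very roughly, the shape I've ended up with looks something like this
(illustrative only -- not the patch's actual code):

    /*
     * Phase 1, before the critical section: decide what the VM should
     * say and repair any inconsistencies we find. The repairs are
     * deliberately not WAL-logged, so they can't be folded into the
     * WAL-logged changes below.
     */
    uint8       new_vmbits = 0;

    if (presult.all_visible)
        new_vmbits = VISIBILITYMAP_ALL_VISIBLE |
            (presult.all_frozen ? VISIBILITYMAP_ALL_FROZEN : 0);
    else if (PageIsAllVisible(page) && presult.lpdead_items > 0)
    {
        /* corruption: clear the page-level bit and the VM bits, no WAL */
        PageClearAllVisible(page);
        MarkBufferDirty(buffer);
        visibilitymap_clear(rel, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
    }

    /*
     * Phase 2, inside the critical section that prunes/freezes the heap
     * page: if new_vmbits is nonzero, set PD_ALL_VISIBLE and the VM bits
     * and include them in the same WAL record as the heap changes.
     */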

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Andres Freund
Date:
Hi,

On 2025-09-17 20:10:07 -0400, Melanie Plageman wrote:
> 0001 is RFC but waiting on one other reviewer

> From cacff6c95e38d370b87148bc48cf6ac5f086ed07 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Tue, 17 Jun 2025 17:22:10 -0400
> Subject: [PATCH v14 01/24] Eliminate COPY FREEZE use of XLOG_HEAP2_VISIBLE
> diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
> index cf843277938..faa7c561a8a 100644
> --- a/src/backend/access/heap/heapam_xlog.c
> +++ b/src/backend/access/heap/heapam_xlog.c
> @@ -551,6 +551,7 @@ heap_xlog_multi_insert(XLogReaderState *record)
>      int            i;
>      bool        isinit = (XLogRecGetInfo(record) & XLOG_HEAP_INIT_PAGE) != 0;
>      XLogRedoAction action;
> +    Buffer        vmbuffer = InvalidBuffer;
>
>      /*
>       * Insertion doesn't overwrite MVCC data, so no conflict processing is
> @@ -571,11 +572,11 @@ heap_xlog_multi_insert(XLogReaderState *record)
>      if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
>      {
>          Relation    reln = CreateFakeRelcacheEntry(rlocator);
> -        Buffer        vmbuffer = InvalidBuffer;
>
>          visibilitymap_pin(reln, blkno, &vmbuffer);
>          visibilitymap_clear(reln, blkno, vmbuffer, VISIBILITYMAP_VALID_BITS);
>          ReleaseBuffer(vmbuffer);
> +        vmbuffer = InvalidBuffer;
>          FreeFakeRelcacheEntry(reln);
>      }
>
> @@ -662,6 +663,57 @@ heap_xlog_multi_insert(XLogReaderState *record)
>      if (BufferIsValid(buffer))
>          UnlockReleaseBuffer(buffer);
>
> +    buffer = InvalidBuffer;
> +
> +    /*
> +     * Now read and update the VM block.
> +     *
> +     * Note that the heap relation may have been dropped or truncated, leading
> +     * us to skip updating the heap block due to the LSN interlock.

I don't fully understand this - how does dropping/truncating the relation lead
to skipping due to the LSN interlock?


> +     * even in that case, it's still safe to update the visibility map. Any
> +     * WAL record that clears the visibility map bit does so before checking
> +     * the page LSN, so any bits that need to be cleared will still be
> +     * cleared.
> +     *
> +     * Note that the lock on the heap page was dropped above. In normal
> +     * operation this would never be safe because a concurrent query could
> +     * modify the heap page and clear PD_ALL_VISIBLE -- violating the
> +     * invariant that PD_ALL_VISIBLE must be set if the corresponding bit in
> +     * the VM is set.
> +     *
> +     * In recovery, we expect no other writers, so writing to the VM page
> +     * without holding a lock on the heap page is considered safe enough. It
> +     * is done this way when replaying xl_heap_visible records (see
> +     * heap_xlog_visible()).
> +     */
> +    if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> +        XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> +                                      &vmbuffer) == BLK_NEEDS_REDO)
> +    {

Why are we using RBM_ZERO_ON_ERROR here? I know it's copied from
heap_xlog_visible(), but I don't immediately understand (or remember) why we
do so there either?


> +        Page        vmpage = BufferGetPage(vmbuffer);
> +        Relation    reln = CreateFakeRelcacheEntry(rlocator);

Hm. Do we really need to continue doing this ugly fake relcache stuff?
I'd really like to eventually get rid of that, and given that the new
"code shape" delegates a lot more responsibility to the redo routines,
they should have a fairly easy time not needing a fake relcache.
Afaict the relation is already not used outside of debugging paths?


> +        /* initialize the page if it was read as zeros */
> +        if (PageIsNew(vmpage))
> +            PageInit(vmpage, BLCKSZ, 0);
> +
> +        visibilitymap_set_vmbits(reln, blkno,
> +                                 vmbuffer,
> +                                 VISIBILITYMAP_ALL_VISIBLE |
> +                                 VISIBILITYMAP_ALL_FROZEN);
> +
> +        /*
> +         * It is not possible that the VM was already set for this heap page,
> +         * so the vmbuffer must have been modified and marked dirty.
> +         */

I assume that's because we a) checked the LSN interlock b) are replaying
something that needed to newly set the bit?


Except for the above comments, this looks pretty good to me.


Seems 0002 should just be applied...


Re 0003: I wonder if it's getting to the point that a struct should be used as
the argument.

Greetings,

Andres Freund



Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Melanie Plageman
Date:
On Thu, Sep 18, 2025 at 12:48 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2025-09-17 20:10:07 -0400, Melanie Plageman wrote:
>
> > +     /*
> > +      * Now read and update the VM block.
> > +      *
> > +      * Note that the heap relation may have been dropped or truncated, leading
> > +      * us to skip updating the heap block due to the LSN interlock.
>
> I don't fully understand this - how does dropping/truncating the relation lead
> to skipping due to the LSN interlock?

Yes, this wasn't right. I misunderstood.

What I think it should say is that, even if the heap update was
skipped due to the LSN interlock, we still have to replay the updates
to the VM: each VM page contains bits for multiple heap blocks, and if
the record included a VM page FPI, subsequent updates to the VM may
rely on that FPI to avoid torn pages. We don't condition it on the
heap redo having been an FPI, probably because it is not worth it --
but I wonder if that is worth calling out in the comment?

Do we also need to replay it when the heap redo returns BLK_NOTFOUND?
I assume that can happen when the relation was dropped or truncated --
but in that case there wouldn't be subsequent records updating the VM
for other heap blocks that we need to replay, because the other heap
blocks won't be found either, right?
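
(For reference, the redo pattern in question looks roughly like this --
sketched from memory, not the patch's code:)

    action = XLogReadBufferForRedo(record, block_id, &buffer);
    if (action == BLK_NEEDS_REDO)
    {
        /* the record's LSN is newer than the page's LSN: apply the change */
        /* ... modify the page here ... */
        PageSetLSN(BufferGetPage(buffer), lsn);
        MarkBufferDirty(buffer);
    }

    /*
     * BLK_DONE: the page LSN already covers this record, so it is skipped.
     * BLK_NOTFOUND: the block (or the whole relation) no longer exists.
     */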

> > +     if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET &&
> > +             XLogReadBufferForRedoExtended(record, 1, RBM_ZERO_ON_ERROR, false,
> > +                                                                       &vmbuffer) == BLK_NEEDS_REDO)
> > +     {
>
> Why are we using RBM_ZERO_ON_ERROR here? I know it's copied from
> heap_xlog_visible(), but I don't immediately understand (or remember) why we
> do so there either?

It has been RBM_ZERO_ON_ERROR since XLogReadBufferForRedoExtended()
was introduced in commit 2c03216d8311. I think we probably do this
because vm_readbuf() passes RBM_ZERO_ON_ERROR to ReadBuffer() and has
this comment:

     * For reading we use ZERO_ON_ERROR mode, and initialize the page if
     * necessary. It's always safe to clear bits, so it's better to clear
     * corrupt pages than error out.

Do you think I also should have a comment in heap_xlog_multi_insert()?

> > +             Page            vmpage = BufferGetPage(vmbuffer);
> > +             Relation        reln = CreateFakeRelcacheEntry(rlocator);
>
> Hm. Do we really need to continue doing this ugly fake relcache stuff? I'd
> really like to eventually get rid of that and given that the new "code shape"
> delegates a lot more responsibility to the redo routines, they should have a
> fairly easy time not needing a fake relcache?  Afaict the relation already is
> not used outside of debugging paths?

Yes, interestingly, we don't have the relname in recovery anyway, so
all this fake relcache stuff is done only to convert the relfilenode
to a string, and that string is what gets used.

The fake relcache stuff will still be used by visibilitymap_pin()
which callers like heap_xlog_delete() use to get the VM page. And I
don't think it is worth coming up with a version of that that doesn't
use the relcache. But you're right that the Relation is not needed for
visibilitymap_set_vmbits(). I've changed it to just take the relation
name as a string.


> > +             /* initialize the page if it was read as zeros */
> > +             if (PageIsNew(vmpage))
> > +                     PageInit(vmpage, BLCKSZ, 0);
> > +
> > +             visibilitymap_set_vmbits(reln, blkno,
> > +                                                              vmbuffer,
> > +                                                              VISIBILITYMAP_ALL_VISIBLE |
> > +                                                              VISIBILITYMAP_ALL_FROZEN);
> > +
> > +             /*
> > +              * It is not possible that the VM was already set for this heap page,
> > +              * so the vmbuffer must have been modified and marked dirty.
> > +              */
>
> I assume that's because we a) checked the LSN interlock b) are replaying
> something that needed to newly set the bit?

Yes, perhaps it is not worth having the assert since it attracts extra
attention to an invariant that is unlikely to be in danger of
regression.

> Seems 0002 should just be applied...

Done

> Re 0003: I wonder if it's getting to the point that a struct should be used as
> the argument.

I have been thinking about this. I have yet to come up with a good
idea for a struct name or multiple struct names that seem to fit here.
I could move the other output parameters into the PruneFreezeResult
and then maybe make some kind of PruneFreezeParameters struct or
something?
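
Purely as a sketch, and with an entirely illustrative field list, I
was imagining something along these lines:

    typedef struct PruneFreezeParams
    {
        Relation    relation;       /* relation containing the page */
        Buffer      buffer;         /* exclusively-locked heap buffer */
        int         options;        /* HEAP_PAGE_PRUNE_* flags */
        GlobalVisState *vistest;    /* visibility horizon to prune against */
        struct VacuumCutoffs *cutoffs;  /* freeze cutoffs, when freezing */
    } PruneFreezeParams;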

- Melanie

Attachments

Re: eliminate xl_heap_visible to reduce WAL (and eventually set VM on-access)

From
Robert Haas
Date:
I find this patch set quite hard to follow. 0001 altogether removes
the use of XLOG_HEAP2_VISIBLE in cases where we use
XLOG_HEAP2_MULTI_INSERT, but then 0007 (the next non-refactoring
patch) begins half-removing the dependency on XLOG_HEAP2_VISIBLE,
assisted by 0009 and 0010, and then later you come back and remove the
other half of the dependency. I know it was I who proposed (off-list)
first making the XLOG_HEAP2_VISIBLE record only deal with the VM page
and not the heap buffer, but I'm not sure that idea quite worked out
in terms of making this easier to follow. At the least, it seems weird
that COPY FREEZE is an exception that gets handled in a different way
than all the other cases, fully removing the dependency in one step.
It would also be nice if, each time you repost this (or maybe in a
README posted alongside the actual patches), you included some kind of
roadmap to help the reader understand the internal structure of the
patch set: 1 does this, 2-9 get us to here, 10-whatever gets us to
this next place.

I don't really understand how the interlocking works. 0011 changes
visibilitymap_set so that it doesn't take the heap block as an
argument, but we'd better hold a lock on the heap page while setting
the VM bit, otherwise I think somebody could come along and modify the
heap page after we decided it was all-visible and before we actually
set the VM bit, which would be terrible. I would expect the comments
and the commit message to say something about that, but I don't see
that they do.
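
To spell out the ordering I'd expect, as a rough sketch using the
usual primitives (not what the patch does):

    /* pin the VM page first; it may require I/O */
    visibilitymap_pin(rel, blkno, &vmbuffer);

    LockBuffer(heapbuf, BUFFER_LOCK_EXCLUSIVE);

    START_CRIT_SECTION();
    PageSetAllVisible(BufferGetPage(heapbuf));
    MarkBufferDirty(heapbuf);
    /*
     * ... emit WAL for the heap change and set the VM bit here, while
     * the heap page is still exclusively locked ...
     */
    END_CRIT_SECTION();

    LockBuffer(heapbuf, BUFFER_LOCK_UNLOCK);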

I find myself fearful of the way that 0007 propagates the existing
hacks around setting the VM bit into a new place:

+               /*
+                * We always emit a WAL record when setting PD_ALL_VISIBLE, but
+                * we are careful not to emit a full page image unless
+                * checksums/wal_log_hints are enabled. We only set the heap
+                * page LSN if full page images were an option when emitting
+                * WAL. Otherwise, subsequent modifications of the page may
+                * incorrectly skip emitting a full page image.
+                */
+               if (do_prune || nplans > 0 ||
+                       (xlrec.flags & XLHP_SET_PD_ALL_VIS && XLogHintBitIsNeeded()))
+                       PageSetLSN(page, lsn);

I suppose it's not the worst thing to duplicate this logic, because
you're later going to remove the original copy. But, it took me >10
minutes to find the text in src/backend/access/transam/README, in the
second half of the "Writing Hints" section, that explains the overall
principle here, and since the patch set doesn't seem to touch that
text, maybe you weren't even aware it was there. And, it's a little
weird to have a single WAL record that is either a hint or not a hint
depending on a complex set of conditions. (IMHO mixing & and &&
without parentheses is quite brave, and an explicit != 0 might not be
a bad idea either.)
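
For illustration, the same test with explicit parentheses and an
explicit comparison (not the patch's actual text) would read:

    if (do_prune || nplans > 0 ||
        ((xlrec.flags & XLHP_SET_PD_ALL_VIS) != 0 && XLogHintBitIsNeeded()))
        PageSetLSN(page, lsn);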

Anyway, I kind of wonder if it's time to back out the hack that I
installed here many years ago. At the time, I thought that it would be
bad if a VACUUM swept over the visibility map setting VM bits and as a
result emitted an FPI for every page in the entire heap. But everyone
who is running with checksums has accepted that cost already, and with
checksums being the default, that's probably going to be most people.
It would be even more compelling if we were going to freeze, prune,
and set all-visible on access, because then presumably the case where
we touch a page and ONLY set the VM bit would be rare, so the cost of
doing that wouldn't matter much. But I guess the patch doesn't go that
far -- we can freeze or set all-visible on access but not prune, and
without that, the scenario I was worrying about at the time is still
fairly plausible, I think, if checksums are turned off.
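
(For anyone following along: if I recall its definition correctly,
XLogHintBitIsNeeded() is just

    #define XLogHintBitIsNeeded() (DataChecksumsEnabled() || wal_log_hints)

so the FPI-avoidance only ever kicks in when both checksums and
wal_log_hints are off.)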

-- 
Robert Haas
EDB: http://www.enterprisedb.com