On Wed, Apr 14, 2021 at 7:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> If we create a table with vacuum_index_cleanup = off or execute VACUUM
> with INDEX_CLEANUP = off, vacuum updates pg_stat_all_tables.n_dead_tup
> to the number of HEAPTUPLE_RECENTLY_DEAD tuples, whereas analyze
> updates it to the sum of the number of HEAPTUPLE_DEAD/RECENTLY_DEAD
> tuples and LP_DEAD line pointers. So if the table has many LP_DEAD
> line pointers due to skipped index cleanup, autovacuum is triggered
> every time after analyze/autoanalyze. This issue seems to exist on the
> back branches as well, probably back to 12, where the INDEX_CLEANUP
> option was introduced.
Hmm.
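To spell out the mechanism: autovacuum's scheduling test boils down to
the comparison below. This is a simplified, standalone restatement of
the test in relation_needs_vacanalyze(), in
src/backend/postmaster/autovacuum.c -- the function name and signature
here are mine, not the real ones. With INDEX_CLEANUP = off, VACUUM can
never remove the LP_DEAD line pointers, but ANALYZE keeps counting them
into n_dead_tup, so the test passes again after every
analyze/autoanalyze.

#include <stdbool.h>

/*
 * Simplified sketch of the autovacuum scheduling test in
 * relation_needs_vacanalyze() -- not the real function.
 */
static bool
relation_needs_vacuum(double reltuples, double n_dead_tup,
                      int vac_base_thresh,     /* autovacuum_vacuum_threshold */
                      double vac_scale_factor) /* autovacuum_vacuum_scale_factor */
{
    double      vacthresh = vac_base_thresh + vac_scale_factor * reltuples;

    /*
     * LP_DEAD stubs that ANALYZE counts, but that VACUUM with
     * INDEX_CLEANUP = off never removes, keep n_dead_tup above the
     * threshold indefinitely.
     */
    return n_dead_tup > vacthresh;
}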
> I think we can have heapam_scan_analyze_next_tuple() not count LP_DEAD
> line pointers, as lazy_scan_prune() does. Attached is a patch for that.
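(To restate the proposal in code form -- a hypothetical sketch based on
the description above, not the attached patch itself:)

/*
 * In heapam_scan_analyze_next_tuple(), in heapam_handler.c.  LP_UNUSED
 * and LP_REDIRECT line pointers were always skipped here; the described
 * change is to stop bumping *deadrows for LP_DEAD line pointers too, so
 * that ANALYZE agrees with what lazy_scan_prune() reports.
 */
if (!ItemIdIsNormal(itemid))
    continue;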
lazy_scan_prune() is concerned about what the state of the table *will
be* when VACUUM finishes, based on its working assumption that index
vacuuming and heap vacuuming always go ahead. This is exactly why
lazy_scan_prune() will leave LVPagePruneState.hastup set to 'false' in
the presence of an LP_DEAD item -- which is not how
count_nondeletable_pages() decides whether the same page 'hastup' much
later on, right at the end of the VACUUM (it will only consider the
page safe to truncate away if it contains nothing but LP_UNUSED items
-- LP_DEAD items make heap/table truncation unsafe).
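To make the contrast concrete, here are the two checks side by side,
abridged from vacuumlazy.c (the comments are mine):

/* "Becoming": lazy_scan_prune() assumes LP_DEAD items will soon be gone */
for (offnum = FirstOffsetNumber;
     offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    ItemId      itemid = PageGetItemId(page, offnum);

    if (!ItemIdIsUsed(itemid))
        continue;
    if (ItemIdIsDead(itemid))
    {
        /* Assumed to become LP_UNUSED later on -- does not set hastup */
        deadoffsets[lpdead_items++] = offnum;
        continue;
    }
    /* Only LP_REDIRECT and LP_NORMAL items make the page "hastup" */
    prunestate->hastup = true;
    /* ... HeapTupleSatisfiesVacuum() accounting omitted ... */
}

/* "Being": count_nondeletable_pages() judges the page as it is right now */
for (offnum = FirstOffsetNumber;
     offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    ItemId      itemid = PageGetItemId(page, offnum);

    /*
     * Any non-LP_UNUSED item -- including LP_DEAD -- blocks truncation,
     * since its index entries may not have been deleted yet.
     */
    if (ItemIdIsUsed(itemid))
    {
        hastup = true;
        break;
    }
}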
In general, accounting rules like this may need to work slightly
differently across near-identical functions because of "being versus
becoming" issues. It is necessary to distinguish between "being" code
(e.g., this ANALYZE code, count_nondeletable_pages() and its hastup
issue) and "becoming" code (e.g., lazy_scan_prune() and its approach
to counting "remaining" dead tuples, as well as hastup-ness). I tend to
doubt that your patch is the right approach because the two code paths
already "agree" once you assume that the LP_DEAD items that
lazy_scan_prune() sees will be gone at the end of the VACUUM. I do
agree that this is a problem, though.
Generally speaking, the "becoming" code in lazy_scan_prune() cannot be
100% sure that its assumptions will hold in each case, for a large
variety of reasons. But I think that we should expect it to be mostly
correct. We definitely cannot allow it to be badly wrong all of the
time under certain workloads. And so I agree that this is a problem
for the INDEX_CLEANUP
= off case, though it's equally an issue for the recently added
failsafe mechanism. I do not believe that it is a problem for the
bypass-indexes optimization, though, because that is designed to only
be applied when there are practically zero LP_DEAD items. The
optimization can make VACUUM record that there are zero dead tuples
after the VACUUM finishes, even though there were in fact a very small
non-zero number of dead tuples -- but that's not appreciably different
from any of the other ways that the count of dead tuples could be
inaccurate (e.g. concurrent opportunistic pruning). The specific tests
that we apply inside lazy_vacuum() should make sure that autovacuum
scheduling is never affected. The autovacuum scheduling code can
safely "believe" that the indexes were vacuumed, because it really is
the same as if there were precisely zero LP_DEAD items (or the same
for all practical purposes).
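(Those tests, abridged from lazy_vacuum() in vacuumlazy.c --
BYPASS_THRESHOLD_PAGES is 0.02, i.e. 2% of rel_pages:)

/*
 * Bypass index vacuuming (and heap vacuuming) only when LP_DEAD items
 * appear on fewer than 2% of heap pages, and when their absolute
 * number is small -- so whatever the optimization "forgets" about can
 * never amount to much.
 */
bypass = false;
if (vacrel->consider_bypass_optimization && vacrel->rel_pages > 0)
{
    BlockNumber threshold;

    threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
    bypass = (vacrel->lpdead_item_pages < threshold &&
              vacrel->lpdead_items < MAXDEADTUPLES(32L * 1024L * 1024L));
}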
I'm not sure what to do, though. Both the INDEX_CLEANUP = off case and
the failsafe case are only intended for emergencies. And it's hard to
know what to do with a code path that is designed to be used rarely or
never.
--
Peter Geoghegan