Discussion: Why mark empty pages all visible?


Why mark empty pages all visible?

From:
Andres Freund
Date:
Hi,

vacuumlazy.c currently marks empty pages as all visible, including WAL-logging
them:

    if (PageIsEmpty(page))
...
            /*
             * Empty pages are always all-visible and all-frozen (note that
             * the same is currently not true for new pages, see above).
             */
            if (!PageIsAllVisible(page))
...
                visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                                  vmbuffer, InvalidTransactionId,
                                  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);


It seems odd that we enter the page into the VM at this point. That means that
use of that page will now require a bit more work (including
RelationGetBufferForTuple() pinning it).
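For illustration, a hedged sketch of that extra work - simplified from the
real heap insertion path, not the exact hio.c/heapam.c code:

    /*
     * Sketch only: once VACUUM has set the VM bit for a page, a backend
     * inserting into that page must pin the corresponding VM page (which
     * may require I/O) before taking the content lock, and then clear
     * the bit under the lock.
     */
    Buffer      vmbuffer = InvalidBuffer;

    visibilitymap_pin(relation, targetBlock, &vmbuffer);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    if (PageIsAllVisible(BufferGetPage(buffer)))
    {
        PageClearAllVisible(BufferGetPage(buffer));
        visibilitymap_clear(relation, targetBlock, vmbuffer,
                            VISIBILITYMAP_VALID_BITS);
    }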

Note that we do *not* do so for new pages:

    if (PageIsNew(page))
    {
        /*
         * All-zeroes pages can be left over if either a backend extends the
         * relation by a single page, but crashes before the newly initialized
         * page has been written out, or when bulk-extending the relation
         * (which creates a number of empty pages at the tail end of the
         * relation), and then enters them into the FSM.
         *
         * Note we do not enter the page into the visibilitymap. That has the
         * downside that we repeatedly visit this page in subsequent vacuums,
         * but otherwise we'll never discover the space on a promoted standby.
         * The harm of repeated checking ought to normally not be too bad. The
         * space usually should be used at some point, otherwise there
         * wouldn't be any regular vacuums.
...
        return true;
    }


The standby reasoning seems to hold just as well for empty pages? In fact,
there might very well be many more empty pages than new pages.

Which of course is also the only argument for putting empty pages into
the VM - there could be many of them, so we might not want to rescan them on a
regular basis. But there's actually also no real bound on the number of new
pages, so I'm not sure that argument goes all that far?

The standby argument also doesn't just seem to apply to the standby, but also
to crashes on the primary, as the FSM is not crashsafe...


I traced that through the versions - the behaviour originates in the original
commit adding the visibilitymap (608195a3a365). There are no comments from
back then explaining why this behaviour was chosen.


This got a bit stranger with 44fa84881fff, because now we add the page into
the VM even if it currently is pinned:

        if (!ConditionalLockBufferForCleanup(buf))
...
            /* Check for new or empty pages before lazy_scan_noprune call */
            if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
                                       vmbuffer))
...


It seems quite odd to set a page to all visible that we could not currently
get a cleanup lock on - obviously evidence of another backend trying to do
something with the page.

The main way to encounter this situation, afaict, is when
RelationGetBufferForTuple() briefly releases the lock on a newly
extended page, to acquire the lock on the source page. The buffer is pinned,
but not locked in that situation.
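Schematically (a hedged sketch of that hio.c dance, with details elided):

    /*
     * The newly extended page stays pinned throughout, but its content
     * lock is briefly dropped so that otherBuffer can be locked first
     * (locks are taken in block order to avoid deadlocks).
     */
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);         /* pin still held */
    LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
    LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

    /*
     * During that window VACUUM can lock the page, but our pin makes its
     * ConditionalLockBufferForCleanup() fail - which is how we end up in
     * the lazy_scan_new_or_empty() path discussed here.
     */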


I started to look into this in the context of
https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de and
https://postgr.es/m/20221029025420.eplyow6k7tgu6he3%40awork3.anarazel.de

which both would ever so slightly extend the window in which we don't hold a
lock on the page (to do a visibilitymap_pin() and RecordPageWithFreeSpace()
respectively).


It seems pretty clear that we shouldn't enter a currently-in-use page into the
VM or freespacemap. All that's going to do is to "disturb" the backend trying
to use that page (by directing other backends to it) and to make its job more
expensive.


It's less clear, but imo worth discussing, whether we should continue to set
empty pages to all-visible.


Greetings,

Andres Freund



Re: Why mark empty pages all visible?

From:
Peter Geoghegan
Date:
On Mon, Mar 27, 2023 at 6:48 PM Andres Freund <andres@anarazel.de> wrote:
> It seems odd that we enter the page into the VM at this point. That means that
> use of that page will now require a bit more work (including
> RelationGetBufferForTuple() pinning it).

I think that it's fairly obvious that it's *not* odd at all. If it
didn't do this then the pages would have to be scanned by VACUUM.

You haven't said anything about the leading cause of marking empty
pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
marking empty pages all-frozen?

Actually it isn't technically an empty page according to
PageIsEmpty(), since I wrote PageTruncateLinePointerArray() in a way
that made it leave a heap page with at least a single LP_UNUSED item.
But it'll essentially leave behind an empty page in many cases. The
regression tests mark pages all-frozen in this path quite a bit more
often than any other path according to gcov.
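For reference, PageIsEmpty() only checks whether the line pointer array is
empty - roughly the bufpage.h definition:

    static inline bool
    PageIsEmpty(Page page)
    {
        /* no line pointers => pd_lower still points at the page header */
        return ((PageHeader) page)->pd_lower <= SizeOfPageHeaderData;
    }

So a page that PageTruncateLinePointerArray() leaves with a single LP_UNUSED
item has pd_lower beyond the header and doesn't count as empty, even though it
contains no tuples.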

> This got a bit stranger with 44fa84881fff, because now we add the page into
> the VM even if it currently is pinned:

> It seems quite odd to set a page to all visible that we could not currently
> get a cleanup lock on - obviously evidence of another backend trying to do
> something with the page.

You can say the same thing about lazy_vacuum_heap_page(), too.
Including the part about cleanup locking.

> It seems pretty clear that we shouldn't enter a currently-in-use page into the
> VM or freespacemap. All that's going to do is to "disturb" the backend trying
> to use that page (by directing other backends to it) and to make its job more
> expensive.

I don't think that it's clear. What about the case where there is only
one tuple, on a page that we cannot cleanup lock? Where do you draw
the line?

--
Peter Geoghegan



Re: Why mark empty pages all visible?

From:
Andres Freund
Date:
Hi,

On 2023-03-27 20:12:11 -0700, Peter Geoghegan wrote:
> On Mon, Mar 27, 2023 at 6:48 PM Andres Freund <andres@anarazel.de> wrote:
> > It seems odd that we enter the page into the VM at this point. That means that
> > use of that page will now require a bit more work (including
> > RelationGetBufferForTuple() pinning it).
> 
> I think that it's fairly obvious that it's *not* odd at all. If it
> didn't do this then the pages would have to be scanned by VACUUM.

Yes - just like in the case of new pages.


> You haven't said anything about the leading cause of marking empty
> pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> marking empty pages all-frozen?

It's not obvious that it should - but it's not as clear a case as the
ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
latter, we know
a) that we don't have to do any work to be able to advance the horizon
b) that somebody else has the page pinned

What's the point in marking it all-visible at that point? The pin quite
likely comes from RelationGetBufferForTuple() having extended the relation and
then briefly needing to release the lock (to acquire the lock on otherBuffer
or in GetVisibilityMapPins()).


> > This got a bit stranger with 44fa84881fff, because now we add the page into
> > the VM even if it currently is pinned:
> 
> > It seems quite odd to set a page to all visible that we could not currently
> > get a cleanup lock on - obviously evidence of another backend trying to do
> > something with the page.
> 
> You can say the same thing about lazy_vacuum_heap_page(), too.
> Including the part about cleanup locking.

I don't follow. In the ConditionalLockBufferForCleanup() ->
lazy_scan_new_or_empty() case we are dealing with a new or empty
page. Whereas lazy_vacuum_heap_page() deals with a page that definitely has
dead tuples on it.  How are those two cases comparable?


> > It seems pretty clear that we shouldn't enter a currently-in-use page into the
> > VM or freespacemap. All that's going to do is to "disturb" the backend trying
> > to use that page (by directing other backends to it) and to make its job more
> > expensive.
> 
> I don't think that it's clear. What about the case where there is only
> one tuple, on a page that we cannot cleanup lock? Where do you draw
> the line?

I don't see how that's comparable? For one, we might need to clean up that
tuple for vacuum to be able to advance the horizon. And as far as the
non-cleanup lock path goes, it actually can perform work there. And it doesn't
even need to acquire an exclusive lock.

Greetings,

Andres Freund



Re: Why mark empty pages all visible?

From:
Peter Geoghegan
Date:
On Mon, Mar 27, 2023 at 9:32 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-03-27 20:12:11 -0700, Peter Geoghegan wrote:
> > On Mon, Mar 27, 2023 at 6:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > It seems odd that we enter the page into the VM at this point. That means that
> > > use of that page will now require a bit more work (including
> > > RelationGetBufferForTuple() pinning it).
> >
> > I think that it's fairly obvious that it's *not* odd at all. If it
> > didn't do this then the pages would have to be scanned by VACUUM.
>
> Yes - just like in the case of new pages.

I'm not saying that the status quo is free of contradictions. Only
that there seem to be contradictions in what you're saying now.

> > You haven't said anything about the leading cause of marking empty
> > pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> > marking empty pages all-frozen?
>
> It's not obvious that it should - but it's not as clear a case as the
> ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
> latter, we know
> a) that we don't have to do any work to be able to advance the horizon
> b) that somebody else has the page pinned
>
> What's the point in marking it all-visible at that point? The pin quite
> likely comes from RelationGetBufferForTuple() having extended the relation and
> then briefly needing to release the lock (to acquire the lock on otherBuffer
> or in GetVisibilityMapPins()).

I think that there is significant value in avoiding special cases, on
general principle. If we stopped doing this in
lazy_scan_new_or_empty() we'd be inventing a new thing that cleanup
locks are supposed to protect against. Maybe something like that would
make sense, but if so then make that argument, and make it explicitly
represented in the code.

> I don't follow. In the ConditionalLockBufferForCleanup() ->
> lazy_scan_new_or_empty() case we are dealing with a new or empty
> page. Whereas lazy_vacuum_heap_page() deals with a page that definitely has
> dead tuples on it.  How are those two cases comparable?

It doesn't have dead tuples anymore, though.

ISTM that there is an issue here with the definition of an empty page.
You're concerned about PageIsEmpty() pages. Which actually aren't
quite the same thing as an empty page left behind by
lazy_vacuum_heap_page(). It's just that this distinction isn't quite
acknowledged anywhere, and probably didn't exist at all at some point.
Maybe that should be addressed.

> > > It seems pretty clear that we shouldn't enter a currently-in-use page into the
> > > VM or freespacemap. All that's going to do is to "disturb" the backend trying
> > > to use that page (by directing other backends to it) and to make its job more
> > > expensive.
> >
> > I don't think that it's clear. What about the case where there is only
> > one tuple, on a page that we cannot cleanup lock? Where do you draw
> > the line?
>
> I don't see how that's comparable? For one, we might need to clean up that
> tuple for vacuum to be able to advance the horizon. And as far as the
> non-cleanup lock path goes, it actually can perform work there. And it doesn't
> even need to acquire an exclusive lock.

So we should put space in the FSM if it has one tuple, but not if it
has zero tuples? Though not if it has zero tuples following processing
by lazy_vacuum_heap_page()?

--
Peter Geoghegan



Re: Why mark empty pages all visible?

From:
Andres Freund
Date:
Hi,

On 2023-03-27 21:51:09 -0700, Peter Geoghegan wrote:
> On Mon, Mar 27, 2023 at 9:32 PM Andres Freund <andres@anarazel.de> wrote:
> > > You haven't said anything about the leading cause of marking empty
> > > pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> > > marking empty pages all-frozen?
> >
> > It's not obvious that it should - but it's not as clear a case as the
> > ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
> > latter, we know
> > a) that we don't have to do any work to be able to advance the horizon
> > b) that somebody else has the page pinned
> >
> > What's the point in marking it all-visible at that point? The pin quite
> > likely comes from RelationGetBufferForTuple() having extended the relation and
> > then briefly needing to release the lock (to acquire the lock on otherBuffer
> > or in GetVisibilityMapPins()).
> 
> I think that there is significant value in avoiding special cases, on
> general principle. If we stopped doing this in
> lazy_scan_new_or_empty() we'd be inventing a new thing that cleanup
> locks are supposed to protect against. Maybe something like that would
> make sense, but if so then make that argument, and make it explicitly
> represented in the code.

I will probably make that argument - so far I was just trying to understand
the intent of the current code. There aren't really comments explaining why we
want to mark currently-pinned empty/new pages all-visible and enter them into
the FSM.

Historically we did *not* enter currently pinned empty/new pages into the FSM
/ VM. Afaics that's new as of 44fa84881fff.


The reason I'm looking at this is that there's a lot of complexity at the
bottom of RelationGetBufferForTuple(), related to needing to release the lock
on the newly extended page and then needing to recheck whether there still is
free space on the page. And apparently that's not complicated enough yet
(cf. INSERT_FROZEN doing visibilitymap_pin() with the page locked).

As far as I can tell, if we went back to not entering new/empty pages into
either VM or FSM, we could rely on the page not getting filled while just
pinning, not locking it.


> > I don't follow. In the ConditionalLockBufferForCleanup() ->
> > lazy_scan_new_or_empty() case we are dealing with a new or empty
> > page. Whereas lazy_vacuum_heap_page() deals with a page that definitely
> > has dead tuples on it.  How are those two cases comparable?
> 
> It doesn't have dead tuples anymore, though.
> 
> ISTM that there is an issue here with the definition of an empty page.
> You're concerned about PageIsEmpty() pages.

And not just any PageIsEmpty() page - ones that are currently pinned.

I also do wonder whether the different behaviour of PageIsEmpty() and
PageIsNew() pages makes sense. The justification for not marking PageIsNew()
pages as all-visible holds just as true for empty pages. There's just as much
free space there.

Greetings,

Andres Freund



Re: Why mark empty pages all visible?

From:
Peter Geoghegan
Date:
On Tue, Mar 28, 2023 at 12:01 PM Andres Freund <andres@anarazel.de> wrote:
> I will probably make that argument - so far I was just trying to understand
> the intent of the current code. There aren't really comments explaining why we
> want to mark currently-pinned empty/new pages all-visible and enter them into
> the FSM.

I don't think that not being able to immediately get a cleanup lock on
a page signifies much of anything. I never have.

> Historically we did *not* enter currently pinned empty/new pages into the FSM
> / VM. Afaics that's new as of 44fa84881fff.

Of course that's true, but I don't know why that history is
particularly important. Either way, holding a pin was never supposed
to work as an interlock against a page being concurrently set
all-visible, or having its space recorded in the FSM. It's true that
my work on VACUUM has shaken out a couple of bugs where we
accidentally relied on that being true. But those were all due to the
change in lazy_vacuum_heap_page() made in Postgres 14 -- not the
addition of lazy_scan_new_or_empty() in Postgres 15.

I actually think that I might agree with the substance of much of what
you're saying, but at the same time I don't think that you're defining
the problem in a way that's particularly helpful. I gather that you
*don't* want to do anything about the lazy_vacuum_heap_page behavior
with setting empty pages all-visible (just the lazy_scan_new_or_empty
behavior). So clearly this isn't really about marking empty pages
all-visible, with or without a cleanup lock. It's actually about
something rather more specific: the interactions with
RelationGetBufferForTuple.

I actually agree that VACUUM is way too unconcerned about interfering
with concurrent activity in terms of how it manages free space in the
FSM. But this seems like just about the least important example of
that (outside the context of your RelationGetBufferForTuple work). The
really important cases (that VACUUM gets wrong) all involve recently
dead tuples. But I don't think that you want to talk about that right
now. You want to talk about RelationGetBufferForTuple.

> The reason I'm looking at this is that there's a lot of complexity at the
> bottom of RelationGetBufferForTuple(), related to needing to release the lock
> on the newly extended page and then needing to recheck whether there still is
> free space on the page. And apparently that's not complicated enough yet
> (cf. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
>
> As far as I can tell, if we went back to not entering new/empty pages into
> either VM or FSM, we could rely on the page not getting filled while just
> pinning, not locking it.

What you're essentially arguing for is inventing a new rule that makes
the early lifetime of a page (what we currently call a PageIsEmpty()
page, and new pages) special, to avoid interference from VACUUM. I
have made similar arguments myself on quite a few occasions, so I'm
actually sympathetic. I just think that you should own it. And no, I'm
not just reflexively defending my work in 44fa84881fff; I actually
think that framing the problem as a case of restoring a previous
behavior is confusing and ahistorical. If there was a useful behavior
that was lost, then it was quite an accidental behavior all along. The
difference matters because now you have to reconcile what you're
saying with the lazy_vacuum_heap_page no-cleanup-lock behavior added
in 14.

I think that you must be arguing for making the early lifetime of a
heap page special to VACUUM, since AFAICT you want to change VACUUM's
behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not*
with pages that have one remaining LP_UNUSED item, but are otherwise
empty (which could be set all-visible/all-frozen in either the first
or second heap pass, even if we disabled the lazy_scan_new_or_empty()
behavior you're complaining about). You seem to want to distinguish
between very new pages (that also happen to be empty), and old pages
that happen to be empty. Right?

> I also do wonder whether the different behaviour of PageIsEmpty() and
> PageIsNew() pages makes sense. The justification for not marking PageIsNew()
> pages as all-visible holds just as true for empty pages. There's just as much
> free space there.

What you say here makes a lot of sense to me. I'm just not sure what
I'd prefer to do about it.

--
Peter Geoghegan



Re: Why mark empty pages all visible?

From:
Andres Freund
Date:
Hi,

On 2023-03-28 13:05:19 -0700, Peter Geoghegan wrote:
> On Tue, Mar 28, 2023 at 12:01 PM Andres Freund <andres@anarazel.de> wrote:
> > I will probably make that argument - so far I was just trying to understand
> > the intent of the current code. There aren't really comments explaining why we
> > want to mark currently-pinned empty/new pages all-visible and enter them into
> > the FSM.
>
> I don't think that not being able to immediately get a cleanup lock on
> a page signifies much of anything. I never have.

Why is that? It's as clear a signal of concurrent activity on the buffer as
you're going to get.


> > Historically we did *not* enter currently pinned empty/new pages into the FSM
> > / VM. Afaics that's new as of 44fa84881fff.
>
> Of course that's true, but I don't know why that history is
> particularly important.

It's interesting to understand *why* we are doing what we are. I think it'd
make sense to propose changing how things work around this, but I just don't
feel like I have a good enough grasp of why we do some of the things we
do. And given there aren't a lot of comments around it, and some of the comments
that do exist are inconsistent with themselves, looking at the history seems
like the next best thing?


> I actually think that I might agree with the substance of much of what
> you're saying, but at the same time I don't think that you're defining
> the problem in a way that's particularly helpful.

Likely because the goals of the existing code aren't clear to me. So I don't
feel like I have a firm grasp...


> I gather that you *don't* want to do anything about the
> lazy_vacuum_heap_page behavior with setting empty pages all-visible (just
> the lazy_scan_new_or_empty behavior).

Not in the short-medium term, at least. In the long term I do suspect we might
want to do something about it.  We have a *crapton* of contention in, and
caused by, the FSM in bulk workloads. With my relation extension patch,
disabling the FSM nearly doubles concurrent load speed.

At the same time, the fact that we might lose knowledge about all the
existing free space in case of promotion or crash and never rediscover that
space (because the pages are frozen), seems decidedly not great.

I don't know what the path forward is, but it seems somewhat clear that we
ought to do something. I suspect having a not crash-safe FSM isn't really
acceptable anymore - it probably is fine to not persist a *reduction* in free
space, but we can't permanently lose track of free space, no matter how many
crashes.

I know that you strongly dislike the way the FSM works, although I forgot some
of the details.

One thing I am fairly certain about is that using the FSM to tell other
backends about newly bulk extended pages is not a good solution, even though
we're stuck with it for the moment.


> I actually agree that VACUUM is way too unconcerned about interfering
> with concurrent activity in terms of how it manages free space in the
> FSM. But this seems like just about the least important example of
> that (outside the context of your RelationGetBufferForTuple work). The
> really important cases (that VACUUM gets wrong) all involve recently
> dead tuples. But I don't think that you want to talk about that right
> now. You want to talk about RelationGetBufferForTuple.

That's indeed the background. By now I'd also like to add a few comments
explaining why we do what we currently do, because I don't find all of it
obvious.


> > The reason I'm looking at this is that there's a lot of complexity at the
> > bottom of RelationGetBufferForTuple(), related to needing to release the lock
> > on the newly extended page and then needing to recheck whether there still is
> > free space on the page. And apparently that's not complicated enough yet
> > (cf. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
> >
> > As far as I can tell, if we went back to not entering new/empty pages into
> > either VM or FSM, we could rely on the page not getting filled while just
> > pinning, not locking it.
>
> What you're essentially arguing for is inventing a new rule that makes
> the early lifetime of a page (what we currently call a PageIsEmpty()
> page, and new pages) special, to avoid interference from VACUUM. I
> have made similar arguments myself on quite a few occasions, so I'm
> actually sympathetic. I just think that you should own it. And no, I'm
> not just reflexively defending my work in 44fa84881fff; I actually
> think that framing the problem as a case of restoring a previous
> behavior is confusing and ahistorical. If there was a useful behavior
> that was lost, then it was quite an accidental behavior all along. The
> difference matters because now you have to reconcile what you're
> saying with the lazy_vacuum_heap_page no-cleanup-lock behavior added
> in 14.

I really don't have a position to own yet - I'm not on firm enough ground.


> I think that you must be arguing for making the early lifetime of a
> heap page special to VACUUM, since AFAICT you want to change VACUUM's
> behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not*
> with pages that have one remaining LP_UNUSED item, but are otherwise
> empty (which could be set all-visible/all-frozen in either the first
> or second heap pass, even if we disabled the lazy_scan_new_or_empty()
> behavior you're complaining about). You seem to want to distinguish
> between very new pages (that also happen to be empty), and old pages
> that happen to be empty. Right?

I think that might be worthwhile, yes. The retry code in
RelationGetBufferForTuple() is quite hairy and almost impossible to test. If
we can avoid the complexity, at a fairly bounded cost (vacuum needing to
re-visit new/empty pages if they're currently pinned), it'd imo be more than
worth the price.


But perhaps the better path forward is to just bite the bullet and introduce a
shared memory table of open files that contains "content size" and "physical
size" for each relation.  We've had a lot of things over the years that'd have
benefitted from that.
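As a sketch of the shape such a table could take (entirely hypothetical - the
struct and field names here are invented):

    /* Hypothetical entry in a shared hash table keyed by relation */
    typedef struct SharedRelSizeEntry
    {
        RelFileLocator rlocator;      /* which relation */
        BlockNumber content_size;     /* blocks that may contain data */
        BlockNumber physical_size;    /* blocks allocated on disk (>=) */
    } SharedRelSizeEntry;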

To address the RelationGetBufferForTuple() case, vacuum would simply not scan
beyond the "content size", so it'd never encounter the page that
RelationGetBufferForTuple() is currently dealing with.  And we'd not need to
add bulk extended pages into the FSM.

This would also, I think, be the basis for teaching vacuum to truncate
relations without acquiring an AEL - which IME causes a lot of operational
issues.

It'd not do anything about losing track of free space though :/.


> > I also do wonder whether the different behaviour of PageIsEmpty() and
> > PageIsNew() pages makes sense. The justification for not marking PageIsNew()
> > pages as all-visible holds just as true for empty pages. There's just as much
> > free space there.
>
> What you say here makes a lot of sense to me. I'm just not sure what
> I'd prefer to do about it.

You and me both...


I wonder what it'd take to make the FSM "more crashsafe". Leaving aside the
cases of "just extended" new/empty pages: we already call
XLogRecordPageWithFreeSpace() for XLOG_HEAP2_VACUUM, XLOG_HEAP2_VISIBLE and
XLOG_HEAP2_PRUNE as well as insert/multi_insert/update. However, we don't set
the LSN of FSM pages, which means we'll potentially flush dirty FSM buffers to
disk before the corresponding WAL records make it to disk.

ISTM that even if we just used the LSN of the last WAL record for
RecordPageWithFreeSpace()/XLogRecordPageWithFreeSpace(), and perhaps also
called XLogRecordPageWithFreeSpace() during XLOG_HEAP2_FREEZE replay, we'd
*drastically* shrink the chance of losing track of free space. Obviously not
free, but ISTM that it can't add a lot of overhead.
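In spirit, something as small as this (a hypothetical sketch, not a patch):

    /*
     * Hypothetical: stamp an FSM page with a conservative LSN so that the
     * buffer manager's WAL-before-data rule also protects FSM buffers.
     */
    static void
    fsm_stamp_page_lsn(Buffer fsmbuf)
    {
        /* LSN of the most recent WAL insertion - a conservative choice */
        XLogRecPtr  lsn = GetXLogInsertRecPtr();

        PageSetLSN(BufferGetPage(fsmbuf), lsn);
    }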

I think we can lose the contents of individual leaf FSM pages, e.g. due to
checksum failures - but perhaps we could address that on access, e.g. by
removing the frozen bit for the corresponding heap pages, which'd lead us to
eventually rediscover the free space?

That'd still leave us with upper-level corruption, but I guess we could just
recompute those levels in some circumstances?



Hm - it'd sure be nice if pg_buffercache would show the LSN of valid pages...

Greetings,

Andres Freund



Re: Why mark empty pages all visible?

From:
Peter Geoghegan
Date:
On Tue, Mar 28, 2023 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
> Why is that? It's as clear a signal of concurrent activity on the buffer
> you're going to get.

Not immediately getting a cleanup lock in VACUUM is informative to the
extent that you only care about what is happening that very
nanosecond. If you look at which pages it happens to in detail, what
you seem to end up with is a whole bunch of noise, which (on its own)
tells you exactly nothing about what VACUUM really ought to be doing
with those pages. In almost all cases we could get a cleanup lock by
waiting for one millisecond and retrying.

I suspect that the cleanup lock thing might be a noisy, unreliable
proxy for the condition that you actually care about, in the context
of your work on relation extension. I bet there is a better signal to
go on, if you look for one.

> It's interesting to understand *why* we are doing what we are. I think it'd
> make sense to propose changing how things work around this, but I just don't
> feel like I have a good enough grasp of why we do some of the things we
> do. And given there aren't a lot of comments around it, and some of the comments
> that do exist are inconsistent with themselves, looking at the history seems
> like the next best thing?

I think that I know why Heikki had no comments about PageIsEmpty()
pages when the VM first went in, back in 2009: because it was just so
obvious that you'd treat them the same as any other initialized page,
it didn't seem to warrant a comment at all. The difference between a
page with 0 tuples and 1 tuple is the same difference between a page
with 1 tuple and a page with 2 tuples. A tiny difference (one extra
tuple), of no particular consequence.

I think that you don't see it that way now because you're focussed on
the hio.c view of things. That code looked very different back in
2009, and in any case is very far removed from vacuumlazy.c.

I can tell you what I was thinking of with lazy_scan_new_or_empty: I
hate special cases. I will admit to being a zealot about it.

> > I gather that you *don't* want to do anything about the
> > lazy_vacuum_heap_page behavior with setting empty pages all-visible (just
> > the lazy_scan_new_or_empty behavior).
>
> Not in the short-medium term, at least. In the long term I do suspect we might
> want to do something about it.  We have a *crapton* of contention in, and
> caused by, the FSM in bulk workloads. With my relation extension patch,
> disabling the FSM nearly doubles concurrent load speed.

I've seen the same effect myself. There is no question that that's a
big problem.

I think that the problem is that the system doesn't have any firm
understanding of pages as things that are owned by particular
transactions and/or backends, at least to a limited, scoped extent.
It's all really low level, when it actually needs to be high level and
take lots of context that comes from the application into account.

> At the same time, the fact that we might lose knowledge about all the
> existing free space in case of promotion or crash and never rediscover that
> space (because the pages are frozen), seems decidedly not great.

Unquestionably.

> I don't know what the path forward is, but it seems somewhat clear that we
> ought to do something. I suspect having a not crash-safe FSM isn't really
> acceptable anymore - it probably is fine to not persist a *reduction* in free
> space, but we can't permanently lose track of free space, no matter how many
> crashes.

Strongly agreed. It's a terrible false economy. If we bit the bullet
and made relation extension and the FSM crash safe, we'd gain so much
more than we'd lose.

> One thing I am fairly certain about is that using the FSM to tell other
> backends about newly bulk extended pages is not a good solution, even though
> we're stuck with it for the moment.

Strongly agreed.

> > I think that you must be arguing for making the early lifetime of a
> > heap page special to VACUUM, since AFAICT you want to change VACUUM's
> > behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not*
> > with pages that have one remaining LP_UNUSED item, but are otherwise
> > empty (which could be set all-visible/all-frozen in either the first
> > or second heap pass, even if we disabled the lazy_scan_new_or_empty()
> > behavior you're complaining about). You seem to want to distinguish
> > between very new pages (that also happen to be empty), and old pages
> > that happen to be empty. Right?
>
> I think that might be worthwhile, yes. The retry code in
> RelationGetBufferForTuple() is quite hairy and almost impossible to test. If
> we can avoid the complexity, at a fairly bounded cost (vacuum needing to
> re-visit new/empty pages if they're currently pinned), it'd imo be more than
> worth the price.

Short term, you could explicitly say that PageIsEmpty() means that the
page is qualitatively different to other empty pages that were left
behind by VACUUM's second phase.

> But perhaps the better path forward is to just bite the bullet and introduce a
> shared memory table of open files that contains "content size" and "physical
> size" for each relation.  We've had a lot of things over the years that'd have
> benefitted from that.

Strongly agreed on this.

> To address the RelationGetBufferForTuple() case, vacuum would simply not scan
> beyond the "content size", so it'd never encounter the page that
> RelationGetBufferForTuple() is currently dealing with.  And we'd not need to
> add bulk extended pages into the FSM.
>
> This would also, I think, be the basis for teaching vacuum to truncate
> relations without acquiring an AEL - which IME causes a lot of operational
> issues.

I have said the same exact thing myself at least once. Again, it's a
question of marrying high level and low level information. That is key
here.

> I wonder what it'd take to make the FSM "more crashsafe". Leaving aside the
> cases of "just extended" new/empty pages: We already call
> XLogRecordPageWithFreeSpace() for HEAP2_VACUUM, HEAP2_VISIBLE,
> XLOG_HEAP2_PRUNE as well as insert/multi_insert/update. However, we don't set
> the LSN of FSM pages. Which means we'll potentially flush dirty FSM buffers to
> disk, before the corresponding WAL records make it to disk.

Part of the problem is that we remember the amount of free space in
each heap page with way too much granularity. That adds to the
contention problem, because backends fight it out in a mad dash to
locate minuscule amounts of free space. Moreover, if there were (say)
only 5 or 7 distinct increments of free space that the FSM could
represent for each heap page, then true crash safety would become a lot
cheaper.
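For contrast, the current FSM tracks free space in 32-byte steps
(fsm_space_avail_to_cat() maps an 8kB page onto 256 categories); a
coarse-bucket scheme might look like this hypothetical sketch:

    /* Hypothetical coarse categories - not the existing FSM code */
    static uint8
    fsm_space_to_coarse_cat(Size avail)
    {
        if (avail < BLCKSZ / 16)
            return 0;           /* effectively full */
        if (avail < BLCKSZ / 8)
            return 1;
        if (avail < BLCKSZ / 4)
            return 2;
        if (avail < BLCKSZ / 2)
            return 3;
        return 4;               /* at least half empty */
    }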

I'll say it again: high level and low level information need to be combined.

> That'd still leave us with upper-level corruption, but I guess we could just
> recompute those levels in some circumstances?

I think that we should just bite the bullet and come up with a way to
make it fully crash safe. No ifs, no buts.

--
Peter Geoghegan