Discussion: Why mark empty pages all visible?
Hi,

visibilitymap.c currently marks empty pages as all visible, including WAL
logging them:

    if (PageIsEmpty(page))
    ...
        /*
         * Empty pages are always all-visible and all-frozen (note that
         * the same is currently not true for new pages, see above).
         */
        if (!PageIsAllVisible(page))
        ...
            visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
                              vmbuffer, InvalidTransactionId,
                              VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);

It seems odd that we enter the page into the VM at this point. That means
that use of that page will now require a bit more work (including
RelationGetBufferForTuple() pinning it).

Note that we do *not* do so for new pages:

    if (PageIsNew(page))
    {
        /*
         * All-zeroes pages can be left over if either a backend extends the
         * relation by a single page, but crashes before the newly initialized
         * page has been written out, or when bulk-extending the relation
         * (which creates a number of empty pages at the tail end of the
         * relation), and then enters them into the FSM.
         *
         * Note we do not enter the page into the visibilitymap. That has the
         * downside that we repeatedly visit this page in subsequent vacuums,
         * but otherwise we'll never discover the space on a promoted standby.
         * The harm of repeated checking ought to normally not be too bad. The
         * space usually should be used at some point, otherwise there
         * wouldn't be any regular vacuums.
         ...
        return true;
    }

That standby reasoning seems to hold just as well for empty pages? In fact,
there might very well be many more empty pages than new pages.

Which of course is also the only argument for putting empty pages into the
VM - there could be many of them, so we might not want to rescan them on a
regular basis. But there's actually also no real bound on the number of new
pages, so I'm not sure that argument goes all that far?

The standby argument also doesn't just apply to the standby, but also to
crashes on the primary, as the FSM is not crash-safe...
I traced that through the versions - that behaviour originates in the
original commit adding the visibilitymap (608195a3a365). There are no
comments from back then explaining why this behaviour was chosen.

This got a bit stranger with 44fa84881fff, because now we add the page into
the VM even if it currently is pinned:

    if (!ConditionalLockBufferForCleanup(buf))
    ...
        /* Check for new or empty pages before lazy_scan_noprune call */
        if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true, vmbuffer))
        ...

It seems quite odd to set a page to all visible that we could not currently
get a cleanup lock on - obviously evidence of another backend trying to do
something with the page. The main way to encounter this situation, afaict,
is when RelationGetBufferForTuple() briefly releases the lock on a newly
extended page, to acquire the lock on the source page. The buffer is
pinned, but not locked in that situation.

I started to look into this in the context of
https://postgr.es/m/20230325025740.wzvchp2kromw4zqz%40awork3.anarazel.de
and
https://postgr.es/m/20221029025420.eplyow6k7tgu6he3%40awork3.anarazel.de
which both would ever so slightly extend the window in which we don't hold
a lock on the page (to do a visibilitymap_pin() and
RecordPageWithFreeSpace() respectively).

It seems pretty clear that we shouldn't enter a currently-in-use page into
the VM or freespacemap. All that's going to do is to "disturb" the backend
trying to use that page (by directing other backends to it) and to make its
job more expensive.

It's less clear, but imo worth discussing, whether we should continue to
set empty pages to all-visible.

Greetings,

Andres Freund
On Mon, Mar 27, 2023 at 6:48 PM Andres Freund <andres@anarazel.de> wrote:
> It seems odd that we enter the page into the VM at this point. That means that
> use of that page will now require a bit more work (including
> RelationGetBufferForTuple() pinning it).

I think that it's fairly obvious that it's *not* odd at all. If it didn't
do this then the pages would have to be scanned by VACUUM.

You haven't said anything about the leading cause of marking empty pages
all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop marking
empty pages all-frozen?

Actually it isn't technically an empty page according to PageIsEmpty(),
since I wrote PageTruncateLinePointerArray() in a way that made it leave a
heap page with at least a single LP_UNUSED item. But it'll essentially
leave behind an empty page in many cases. The regression tests mark pages
all-frozen in this path quite a bit more often than any other path
according to gcov.

> This got a bit stranger with 44fa84881fff, because now we add the page into
> the VM even if it currently is pinned:

> It seems quite odd to set a page to all visible that we could not currently
> get a cleanup lock on - obviously evidence of another backend trying to do
> something with the page.

You can say the same thing about lazy_vacuum_heap_page(), too. Including
the part about cleanup locking.

> It seems pretty clear that we shouldn't enter a currently-in-use page into the
> VM or freespacemap. All that's going to do is to "disturb" the backend trying
> to use that page (by directing other backends to it) and to make its job more
> expensive.

I don't think that it's clear. What about the case where there is only one
tuple, on a page that we cannot cleanup lock? Where do you draw the line?

--
Peter Geoghegan
Hi,

On 2023-03-27 20:12:11 -0700, Peter Geoghegan wrote:
> On Mon, Mar 27, 2023 at 6:48 PM Andres Freund <andres@anarazel.de> wrote:
> > It seems odd that we enter the page into the VM at this point. That means that
> > use of that page will now require a bit more work (including
> > RelationGetBufferForTuple() pinning it).
>
> I think that it's fairly obvious that it's *not* odd at all. If it
> didn't do this then the pages would have to be scanned by VACUUM.

Yes - just like in the case of new pages.

> You haven't said anything about the leading cause of marking empty
> pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> marking empty pages all-frozen?

It's not obvious that it should - but it's not as clear a case as the
ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
latter, we know
a) that we don't have to do any work to be able to advance the horizon
b) that somebody else has the page pinned

What's the point in marking it all-visible at that point? The pin will
quite likely be from RelationGetBufferForTuple() having extended the
relation and then briefly needing to release the lock (to acquire the lock
on otherBuffer or in GetVisibilityMapPins()).

> > This got a bit stranger with 44fa84881fff, because now we add the page into
> > the VM even if it currently is pinned:
>
> > It seems quite odd to set a page to all visible that we could not currently
> > get a cleanup lock on - obviously evidence of another backend trying to do
> > something with the page.
>
> You can say the same thing about lazy_vacuum_heap_page(), too.
> Including the part about cleanup locking.

I don't follow. In the ConditionalLockBufferForCleanup() ->
lazy_scan_new_or_empty() case we are dealing with a new or empty page,
whereas lazy_vacuum_heap_page() deals with a page that definitely has dead
tuples on it. How are those two cases comparable?
> > It seems pretty clear that we shouldn't enter a currently-in-use page into the
> > VM or freespacemap. All that's going to do is to "disturb" the backend trying
> > to use that page (by directing other backends to it) and to make its job more
> > expensive.
>
> I don't think that it's clear. What about the case where there is only
> one tuple, on a page that we cannot cleanup lock? Where do you draw
> the line?

I don't see how that's comparable? For one, we might need to clean up that
tuple for vacuum to be able to advance the horizon. And as far as the
non-cleanup lock path goes, it actually can perform work there - and it
doesn't even need to acquire an exclusive lock.

Greetings,

Andres Freund
On Mon, Mar 27, 2023 at 9:32 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-03-27 20:12:11 -0700, Peter Geoghegan wrote:
> > On Mon, Mar 27, 2023 at 6:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > It seems odd that we enter the page into the VM at this point. That means that
> > > use of that page will now require a bit more work (including
> > > RelationGetBufferForTuple() pinning it).
> >
> > I think that it's fairly obvious that it's *not* odd at all. If it
> > didn't do this then the pages would have to be scanned by VACUUM.
>
> Yes - just like in the case of new pages.

I'm not saying that the status quo is free of contradictions. Only that
there seem to be contradictions in what you're saying now.

> > You haven't said anything about the leading cause of marking empty
> > pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> > marking empty pages all-frozen?
>
> It's not obvious that it should - but it's not as clear a case as the
> ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
> latter, we know
> a) that we don't have to do any work to be able to advance the horizon
> b) that somebody else has the page pinned
>
> What's the point in marking it all-visible at that point? The pin will
> quite likely be from RelationGetBufferForTuple() having extended the
> relation and then briefly needing to release the lock (to acquire the lock
> on otherBuffer or in GetVisibilityMapPins()).

I think that there is significant value in avoiding special cases, on
general principle. If we stopped doing this in lazy_scan_new_or_empty()
we'd be inventing a new thing that cleanup locks are supposed to protect
against. Maybe something like that would make sense, but if so then make
that argument, and make it explicitly represented in the code.

> I don't follow. In the ConditionalLockBufferForCleanup() ->
> lazy_scan_new_or_empty() case we are dealing with a new or empty
> page.
> Whereas lazy_vacuum_heap_page() deals with a page that definitely has
> dead tuples on it. How are those two cases comparable?

It doesn't have dead tuples anymore, though.

ISTM that there is an issue here with the definition of an empty page.
You're concerned about PageIsEmpty() pages, which actually aren't quite the
same thing as an empty page left behind by lazy_vacuum_heap_page(). It's
just that this distinction isn't quite acknowledged anywhere, and probably
didn't exist at all at some point. Maybe that should be addressed.

> > > It seems pretty clear that we shouldn't enter a currently-in-use page into the
> > > VM or freespacemap. All that's going to do is to "disturb" the backend trying
> > > to use that page (by directing other backends to it) and to make its job more
> > > expensive.
> >
> > I don't think that it's clear. What about the case where there is only
> > one tuple, on a page that we cannot cleanup lock? Where do you draw
> > the line?
>
> I don't see how that's comparable? For one, we might need to clean up that
> tuple for vacuum to be able to advance the horizon. And as far as the
> non-cleanup lock path goes, it actually can perform work there. And it
> doesn't even need to acquire an exclusive lock.

So we should put space in the FSM if it has one tuple, but not if it has
zero tuples? Though not if it has zero tuples following processing by
lazy_vacuum_heap_page()?

--
Peter Geoghegan
Hi,

On 2023-03-27 21:51:09 -0700, Peter Geoghegan wrote:
> On Mon, Mar 27, 2023 at 9:32 PM Andres Freund <andres@anarazel.de> wrote:
> > > You haven't said anything about the leading cause of marking empty
> > > pages all-frozen, by far: lazy_vacuum_heap_page(). Should it also stop
> > > marking empty pages all-frozen?
> >
> > It's not obvious that it should - but it's not as clear a case as the
> > ConditionalLockBufferForCleanup() -> lazy_scan_new_or_empty() one. In the
> > latter, we know
> > a) that we don't have to do any work to be able to advance the horizon
> > b) that somebody else has the page pinned
> >
> > What's the point in marking it all-visible at that point? The pin will
> > quite likely be from RelationGetBufferForTuple() having extended the
> > relation and then briefly needing to release the lock (to acquire the
> > lock on otherBuffer or in GetVisibilityMapPins()).
>
> I think that there is significant value in avoiding special cases, on
> general principle. If we stopped doing this in
> lazy_scan_new_or_empty() we'd be inventing a new thing that cleanup
> locks are supposed to protect against. Maybe something like that would
> make sense, but if so then make that argument, and make it explicitly
> represented in the code.

I will probably make that argument - so far I was just trying to understand
the intent of the current code. There aren't really comments explaining why
we want to mark currently-pinned empty/new pages all-visible and enter them
into the FSM.

Historically we did *not* enter currently pinned empty/new pages into the
FSM / VM. Afaics that's new as of 44fa84881fff.

The reason I'm looking at this is that there's a lot of complexity at the
bottom of RelationGetBufferForTuple(), related to needing to release the
lock on the newly extended page and then needing to recheck whether there
still is free space on the page. And that it's not complicated enough
(cf. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
As far as I can tell, if we went back to not entering new/empty pages into
either VM or FSM, we could rely on the page not getting filled while just
pinning, not locking it.

> > I don't follow. In the ConditionalLockBufferForCleanup() ->
> > lazy_scan_new_or_empty() case we are dealing with a new or empty
> > page. Whereas lazy_vacuum_heap_page() deals with a page that definitely
> > has dead tuples on it. How are those two cases comparable?
>
> It doesn't have dead tuples anymore, though.
>
> ISTM that there is an issue here with the definition of an empty page.
> You're concerned about PageIsEmpty() pages.

And not just any PageIsEmpty() page, ones that are currently pinned.

I also do wonder whether the different behaviour of PageIsEmpty() and
PageIsNew() pages makes sense. The justification for not marking
PageIsNew() pages as all-visible holds just as true for empty pages.
There's just as much free space there.

Greetings,

Andres Freund
On Tue, Mar 28, 2023 at 12:01 PM Andres Freund <andres@anarazel.de> wrote:
> I will probably make that argument - so far I was just trying to understand
> the intent of the current code. There aren't really comments explaining why we
> want to mark currently-pinned empty/new pages all-visible and enter them into
> the FSM.

I don't think that not being able to immediately get a cleanup lock on a
page signifies much of anything. I never have.

> Historically we did *not* enter currently pinned empty/new pages into the FSM
> / VM. Afaics that's new as of 44fa84881fff.

Of course that's true, but I don't know why that history is particularly
important. Either way, holding a pin was never supposed to work as an
interlock against a page being concurrently set all-visible, or having its
space recorded in the FSM. It's true that my work on VACUUM has shaken out
a couple of bugs where we accidentally relied on that being true. But those
were all due to the change in lazy_vacuum_heap_page() made in Postgres 14 --
not the addition of lazy_scan_new_or_empty() in Postgres 15.

I actually think that I might agree with the substance of much of what
you're saying, but at the same time I don't think that you're defining the
problem in a way that's particularly helpful.

I gather that you *don't* want to do anything about the
lazy_vacuum_heap_page() behavior with setting empty pages all-visible (just
the lazy_scan_new_or_empty() behavior). So clearly this isn't really about
marking empty pages all-visible, with or without a cleanup lock. It's
actually about something rather more specific: the interactions with
RelationGetBufferForTuple().

I actually agree that VACUUM is way too unconcerned about interfering with
concurrent activity in terms of how it manages free space in the FSM. But
this seems like just about the least important example of that (outside the
context of your RelationGetBufferForTuple() work). The really important
cases (that VACUUM gets wrong) all involve recently dead tuples.
But I don't think that you want to talk about that right now. You want to
talk about RelationGetBufferForTuple().

> The reason I'm looking at this is that there's a lot of complexity at the
> bottom of RelationGetBufferForTuple(), related to needing to release the lock
> on the newly extended page and then needing to recheck whether there still is
> free space on the page. And that it's not complicated enough
> (cf. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
>
> As far as I can tell, if we went back to not entering new/empty pages into
> either VM or FSM, we could rely on the page not getting filled while just
> pinning, not locking it.

What you're essentially arguing for is inventing a new rule that makes the
early lifetime of a page (what we currently call a PageIsEmpty() page, and
new pages) special, to avoid interference from VACUUM. I have made similar
arguments myself on quite a few occasions, so I'm actually sympathetic. I
just think that you should own it. And no, I'm not just reflexively
defending my work in 44fa84881fff; I actually think that framing the
problem as a case of restoring a previous behavior is confusing and
ahistorical. If there was a useful behavior that was lost, then it was
quite an accidental behavior all along. The difference matters because now
you have to reconcile what you're saying with the lazy_vacuum_heap_page()
no-cleanup-lock behavior added in 14.

I think that you must be arguing for making the early lifetime of a heap
page special to VACUUM, since AFAICT you want to change VACUUM's behavior
with strictly PageIsEmpty()/PageIsNew() pages only -- and *not* with pages
that have one remaining LP_UNUSED item, but are otherwise empty (which
could be set all-visible/all-frozen in either the first or second heap
pass, even if we disabled the lazy_scan_new_or_empty() behavior you're
complaining about).
You seem to want to distinguish between very new pages (that also happen to
be empty), and old pages that happen to be empty. Right?

> I also do wonder whether the different behaviour of PageIsEmpty() and
> PageIsNew() pages makes sense. The justification for not marking PageIsNew()
> pages as all-visible holds just as true for empty pages. There's just as much
> free space there.

What you say here makes a lot of sense to me. I'm just not sure what I'd
prefer to do about it.

--
Peter Geoghegan
Hi,

On 2023-03-28 13:05:19 -0700, Peter Geoghegan wrote:
> On Tue, Mar 28, 2023 at 12:01 PM Andres Freund <andres@anarazel.de> wrote:
> > I will probably make that argument - so far I was just trying to understand
> > the intent of the current code. There aren't really comments explaining why we
> > want to mark currently-pinned empty/new pages all-visible and enter them into
> > the FSM.
>
> I don't think that not being able to immediately get a cleanup lock on
> a page signifies much of anything. I never have.

Why is that? It's as clear a signal of concurrent activity on the buffer as
you're going to get.

> > Historically we did *not* enter currently pinned empty/new pages into the FSM
> > / VM. Afaics that's new as of 44fa84881fff.
>
> Of course that's true, but I don't know why that history is
> particularly important.

It's interesting to understand *why* we are doing what we are. I think it'd
make sense to propose changing how things work around this, but I just
don't feel like I have a good enough grasp of why we do some of the things
we do. And given that there aren't a lot of comments around it, and some of
the comments that do exist are inconsistent with themselves, looking at the
history seems like the next best thing?

> I actually think that I might agree with the substance of much of what
> you're saying, but at the same time I don't think that you're defining
> the problem in a way that's particularly helpful.

Likely because the goals of the existing code aren't clear to me. So I
don't feel like I have a firm grasp...

> I gather that you *don't* want to do anything about the
> lazy_vacuum_heap_page behavior with setting empty pages all-visible (just
> the lazy_scan_new_or_empty behavior).

Not in the short-medium term, at least. In the long term I do suspect we
might want to do something about it. We have a *crapton* of contention in
the FSM, and caused by the FSM, in bulk workloads.
With my relation extension patch, disabling the FSM nearly doubles
concurrent load speed.

At the same time, the fact that we might lose knowledge about all the
existing free space in case of promotion or crash, and never rediscover
that space (because the pages are frozen), seems decidedly not great.

I don't know what the path forward is, but it seems somewhat clear that we
ought to do something. I suspect having a not-crash-safe FSM isn't really
acceptable anymore - it probably is fine to not persist a *reduction* in
free space, but we can't permanently lose track of free space, no matter
how many crashes.

I know that you strongly dislike the way the FSM works, although I forgot
some of the details. One thing I am fairly certain about is that using the
FSM to tell other backends about newly bulk extended pages is not a good
solution, even though we're stuck with it for the moment.

> I actually agree that VACUUM is way too unconcerned about interfering
> with concurrent activity in terms of how it manages free space in the
> FSM. But this seems like just about the least important example of
> that (outside the context of your RelationGetBufferForTuple work). The
> really important cases (that VACUUM gets wrong) all involve recently
> dead tuples. But I don't think that you want to talk about that right
> now. You want to talk about RelationGetBufferForTuple.

That's indeed the background. By now I'd also like to add a few comments
explaining why we do what we currently do, because I don't find all of it
obvious.

> > The reason I'm looking at this is that there's a lot of complexity at the
> > bottom of RelationGetBufferForTuple(), related to needing to release the lock
> > on the newly extended page and then needing to recheck whether there still is
> > free space on the page. And that it's not complicated enough
> > (cf. INSERT_FROZEN doing visibilitymap_pin() with the page locked).
> > As far as I can tell, if we went back to not entering new/empty pages into
> > either VM or FSM, we could rely on the page not getting filled while just
> > pinning, not locking it.
>
> What you're essentially arguing for is inventing a new rule that makes
> the early lifetime of a page (what we currently call a PageIsEmpty()
> page, and new pages) special, to avoid interference from VACUUM. I
> have made similar arguments myself on quite a few occasions, so I'm
> actually sympathetic. I just think that you should own it. And no, I'm
> not just reflexively defending my work in 44fa84881fff; I actually
> think that framing the problem as a case of restoring a previous
> behavior is confusing and ahistorical. If there was a useful behavior
> that was lost, then it was quite an accidental behavior all along. The
> difference matters because now you have to reconcile what you're
> saying with the lazy_vacuum_heap_page no-cleanup-lock behavior added
> in 14.

I really don't have a position to own yet - I'm not on firm enough ground.

> I think that you must be arguing for making the early lifetime of a
> heap page special to VACUUM, since AFAICT you want to change VACUUM's
> behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not*
> with pages that have one remaining LP_UNUSED item, but are otherwise
> empty (which could be set all-visible/all-frozen in either the first
> or second heap pass, even if we disabled the lazy_scan_new_or_empty()
> behavior you're complaining about). You seem to want to distinguish
> between very new pages (that also happen to be empty), and old pages
> that happen to be empty. Right?

I think that might be worthwhile, yes. The retry code in
RelationGetBufferForTuple() is quite hairy and almost impossible to test.
If we can avoid the complexity, at a fairly bounded cost (vacuum needing to
re-visit new/empty pages if they're currently pinned), it'd imo be more
than worth the price.
But perhaps the better path forward is to just bite the bullet and
introduce a shared memory table of open files, that contains "content size"
and "physical size" for each relation. We've had a lot of things over the
years that'd have benefitted from that.

To address the RelationGetBufferForTuple() case, vacuum would simply not
scan beyond the "content size", so it'd never encounter the page that
RelationGetBufferForTuple() is currently dealing with. And we'd not need to
add bulk extended pages into the FSM.

This would also, I think, be the basis for teaching vacuum to truncate
relations without acquiring an AEL - which IME causes a lot of operational
issues.

It'd not do anything about losing track of free space though :/.

> > I also do wonder whether the different behaviour of PageIsEmpty() and
> > PageIsNew() pages makes sense. The justification for not marking PageIsNew()
> > pages as all-visible holds just as true for empty pages. There's just as much
> > free space there.
>
> What you say here makes a lot of sense to me. I'm just not sure what
> I'd prefer to do about it.

You and me both...

I wonder what it'd take to make the FSM "more crash-safe". Leaving aside
the cases of "just extended" new/empty pages: we already call
XLogRecordPageWithFreeSpace() for XLOG_HEAP2_VACUUM, XLOG_HEAP2_VISIBLE and
XLOG_HEAP2_PRUNE, as well as insert/multi_insert/update. However, we don't
set the LSN of FSM pages. Which means we'll potentially flush dirty FSM
buffers to disk before the corresponding WAL records make it to disk.

ISTM that even if we just used the LSN of the last WAL record for
RecordPageWithFreeSpace()/LogRecordPageWithFreeSpace(), and perhaps also
called LogRecordPageWithFreeSpace() during XLOG_HEAP2_FREEZE replay, we'd
*drastically* shrink the chance of losing track of free space. Obviously
not free, but ISTM that it can't add a lot of overhead.

I think we can lose the contents of individual leaf FSM pages, e.g.
due to checksum failures - but perhaps we could address that on access,
e.g. by removing the frozen bit for the corresponding heap pages, which'd
lead us to eventually rediscover the free space?

That'd still leave us with upper-level corruption, but I guess we could
just recompute those in some circumstances?

Hm - it'd sure be nice if pg_buffercache would show the LSN of valid
pages...

Greetings,

Andres Freund
On Tue, Mar 28, 2023 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:
> Why is that? It's as clear a signal of concurrent activity on the buffer
> as you're going to get.

Not immediately getting a cleanup lock in VACUUM is informative only to the
extent that you care about what is happening that very nanosecond. If you
look at which pages it happens to in detail, what you seem to end up with
is a whole bunch of noise, which (on its own) tells you exactly nothing
about what VACUUM really ought to be doing with those pages. In almost all
cases we could get a cleanup lock by waiting for one millisecond and
retrying.

I suspect that the cleanup lock thing might be a noisy, unreliable proxy
for the condition that you actually care about, in the context of your work
on relation extension. I bet there is a better signal to go on, if you look
for one.

> It's interesting to understand *why* we are doing what we are. I think it'd
> make sense to propose changing how things work around this, but I just don't
> feel like I have a good enough grasp of why we do some of the things we
> do. And given that there aren't a lot of comments around it, and some of the
> comments that do exist are inconsistent with themselves, looking at the
> history seems like the next best thing?

I think that I know why Heikki had no comments about PageIsEmpty() pages
when the VM first went in, back in 2009: because it was just so obvious
that you'd treat them the same as any other initialized page, it didn't
seem to warrant a comment at all. The difference between a page with 0
tuples and 1 tuple is the same as the difference between a page with 1
tuple and a page with 2 tuples: a tiny difference (one extra tuple), of no
particular consequence.

I think that you don't see it that way now because you're focussed on the
hio.c view of things. That code looked very different back in 2009, and in
any case is very far removed from vacuumlazy.c.
I can tell you what I was thinking of with lazy_scan_new_or_empty(): I hate
special cases. I will admit to being a zealot about it.

> > I gather that you *don't* want to do anything about the
> > lazy_vacuum_heap_page behavior with setting empty pages all-visible (just
> > the lazy_scan_new_or_empty behavior).
>
> Not in the short-medium term, at least. In the long term I do suspect we might
> want to do something about it. We have a *crapton* of contention in the FSM,
> and caused by the FSM, in bulk workloads. With my relation extension patch,
> disabling the FSM nearly doubles concurrent load speed.

I've seen the same effect myself. There is no question that that's a big
problem.

I think that the problem is that the system doesn't have any firm
understanding of pages as things that are owned by particular transactions
and/or backends, at least to a limited, scoped extent. It's all really low
level, when it actually needs to be high level and take lots of context
that comes from the application into account.

> At the same time, the fact that we might lose knowledge about all the
> existing free space in case of promotion or crash and never rediscover that
> space (because the pages are frozen), seems decidedly not great.

Unquestionably.

> I don't know what the path forward is, but it seems somewhat clear that we
> ought to do something. I suspect having a not-crash-safe FSM isn't really
> acceptable anymore - it probably is fine to not persist a *reduction* in free
> space, but we can't permanently lose track of free space, no matter how many
> crashes.

Strongly agreed. It's a terrible false economy. If we bit the bullet and
made relation extension and the FSM crash safe, we'd gain so much more than
we'd lose.

> One thing I am fairly certain about is that using the FSM to tell other
> backends about newly bulk extended pages is not a good solution, even though
> we're stuck with it for the moment.

Strongly agreed.
> > I think that you must be arguing for making the early lifetime of a
> > heap page special to VACUUM, since AFAICT you want to change VACUUM's
> > behavior with strictly PageIsEmpty/PageIsNew pages only -- and *not*
> > with pages that have one remaining LP_UNUSED item, but are otherwise
> > empty (which could be set all-visible/all-frozen in either the first
> > or second heap pass, even if we disabled the lazy_scan_new_or_empty()
> > behavior you're complaining about). You seem to want to distinguish
> > between very new pages (that also happen to be empty), and old pages
> > that happen to be empty. Right?
>
> I think that might be worthwhile, yes. The retry code in
> RelationGetBufferForTuple() is quite hairy and almost impossible to test. If
> we can avoid the complexity, at a fairly bounded cost (vacuum needing to
> re-visit new/empty pages if they're currently pinned), it'd imo be more
> than worth the price.

Short term, you could explicitly say that PageIsEmpty() means that the page
is qualitatively different to other empty pages that were left behind by
VACUUM's second phase.

> But perhaps the better path forward is to just bite the bullet and introduce a
> shared memory table of open files, that contains "content size" and "physical
> size" for each relation. We've had a lot of things over the years that'd have
> benefitted from that.

Strongly agreed on this.

> To address the RelationGetBufferForTuple() case, vacuum would simply not scan
> beyond the "content size", so it'd never encounter the page that
> RelationGetBufferForTuple() is currently dealing with. And we'd not need to
> add bulk extended pages into the FSM.
>
> This would also, I think, be the basis for teaching vacuum to truncate
> relations without acquiring an AEL - which IME causes a lot of operational
> issues.

I have said the same exact thing myself at least once. Again, it's a
question of marrying high level and low level information. That is key
here.
> I wonder what it'd take to make the FSM "more crash-safe". Leaving aside the
> cases of "just extended" new/empty pages: we already call
> XLogRecordPageWithFreeSpace() for XLOG_HEAP2_VACUUM, XLOG_HEAP2_VISIBLE and
> XLOG_HEAP2_PRUNE, as well as insert/multi_insert/update. However, we don't
> set the LSN of FSM pages. Which means we'll potentially flush dirty FSM
> buffers to disk before the corresponding WAL records make it to disk.

Part of the problem is that we remember the amount of free space in each
heap page with way too much granularity. That adds to the contention
problem, because backends fight it out in a mad dash to locate minuscule
amounts of free space. Moreover, if there were (say) only 5 or 7 distinct
increments of free space that the FSM could represent for each heap page,
then true crash safety becomes a lot cheaper.

I'll say it again: high level and low level information need to be
combined.

> That'd still leave us with upper-level corruption, but I guess we could just
> recompute those in some circumstances?

I think that we should just bite the bullet and come up with a way to make
it fully crash safe. No ifs, no buts.

--
Peter Geoghegan