Discussion: Re: Lowering the ever-growing heap->pd_lower
On Wed, Mar 10, 2021 at 6:01 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> > The case I was concerned about back when is that there are various bits of
> > code that may visit a page with a predetermined TID in mind to look at.
> > An index lookup is an obvious example, and another one is chasing an
> > update chain's t_ctid link. You might argue that if the tuple was dead
> > enough to be removed, there should be no such in-flight references to
> > worry about, but I think such an assumption is unsafe. There is not, for
> > example, any interlock that ensures that a process that has obtained a TID
> > from an index will have completed its heap visit before a VACUUM that
> > subsequently removed that index entry also removes the heap tuple.
>
> I am aware of this problem. I will admit that I did not detect all
> potentially problematic accesses, so I'll show you my work.

Attached is a trivial rebase of your v3, which I've called v4. I am
interested in getting this patch into Postgres 14.

> In my search for problematic accesses, I made the following assumptions:
> * PageRepairFragmentation as found in bufpage is only applicable to
> heap pages; this function is not applied to other pages in core
> postgres. So, any problems that occur are due to access with an
> OffsetNumber > PageGetMaxOffsetNumber.
> * Items [on heap pages] are only extracted after using PageGetItemId
> for that item on the page, whilst holding a lock.
>
> The 3 problem cases were classified based on the origin of the
> potentially invalid pointer.
>
> Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup

I think that it boils down to this: 100% of the cases where this could
be a problem either involve old TIDs or old line pointers -- in
principle these could be invalidated in some way, like your backwards
scan example. But that's it. Bear in mind that we always call
PageRepairFragmentation() with a super-exclusive lock.

> This is in the replay of transaction logs, in heap_xlog_freeze_page.
> As I am unaware whether or not pages to which these transaction logs
> are applied can contain changes from the xlog-generating instance, I
> flagged this as a potential problem. The application of the xlogs is
> probably safe (it assumes the existence of a HeapTupleHeader for that
> ItemId), but my limited knowledge put this on the 'potential problem'
> list.
>
> Please advise on this; I cannot find the right documentation.

Are you aware of wal_consistency_checking?

I think that this should be fine. There are differences that are
possible between a replica and primary, but they're very minor and
don't seem relevant.

> Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
> heap_get_root_tuples
>
> The heap pruning mechanism currently assumes that all redirects are
> valid. Currently, if a HOT root points to !ItemIdIsNormal, it breaks
> out of the loop, but doesn't actually fail. This code is already
> potentially problematic because it has no bounds or sanity checking
> for the nextoffnum, but with shrinking pd_linp it would also add the
> failure mode of HOT tuples pointing into what is now arbitrary data.

heap_prune_chain() is less trusting than heap_get_root_tuples(),
though -- it doesn't trust redirects (because there is a generic
offnum sanity check at the start of its loop). I think that the
inconsistency between these two functions probably isn't hugely
significant.

Ideally it would be 100% clear which of the defenses in code like this
is merely extra hardening. The assumptions should be formalized. There
is nothing wrong with hardening, but we should know it when we see it.

> This failure mode is now accompanied by an Assert, which fails when
> the redirect is to an invalid OffsetNumber. This is not enough to
> avoid exhibiting arbitrary behaviour when accessing corrupted data
> with non-assert builds, but I think that that is fine; we already do
> not have guaranteed behaviour for this failure mode.

What about my "set would-be LP_UNUSED space to all-0x7F bytes in
CLOBBER_FREED_MEMORY builds" idea? Did you think about that?

--
Peter Geoghegan
Attachments
On Wed, 31 Mar 2021 at 05:35, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Mar 10, 2021 at 6:01 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > > The case I was concerned about back when is that there are various bits of
> > > code that may visit a page with a predetermined TID in mind to look at.
> > > An index lookup is an obvious example, and another one is chasing an
> > > update chain's t_ctid link. You might argue that if the tuple was dead
> > > enough to be removed, there should be no such in-flight references to
> > > worry about, but I think such an assumption is unsafe. There is not, for
> > > example, any interlock that ensures that a process that has obtained a TID
> > > from an index will have completed its heap visit before a VACUUM that
> > > subsequently removed that index entry also removes the heap tuple.
> >
> > I am aware of this problem. I will admit that I did not detect all
> > potentially problematic accesses, so I'll show you my work.
>
> Attached is a trivial rebase of your v3, which I've called v4. I am
> interested in getting this patch into Postgres 14.

Thanks, and that'd be great! PFA v5, which fixes one issue mentioned
later, and adds some comments on existing checks that are now in a
critical path.

> > In my search for problematic accesses, I made the following assumptions:
> > * PageRepairFragmentation as found in bufpage is only applicable to
> > heap pages; this function is not applied to other pages in core
> > postgres. So, any problems that occur are due to access with an
> > OffsetNumber > PageGetMaxOffsetNumber.
> > * Items [on heap pages] are only extracted after using PageGetItemId
> > for that item on the page, whilst holding a lock.
> >
> > The 3 problem cases were classified based on the origin of the
> > potentially invalid pointer.
> >
> > Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup
>
> I think that it boils down to this: 100% of the cases where this could
> be a problem either involve old TIDs or old line pointers -- in
> principle these could be invalidated in some way, like your backwards
> scan example. But that's it. Bear in mind that we always call
> PageRepairFragmentation() with a super-exclusive lock.

Yeah, that's the gist of what I found out. All accesses using old line
pointers need revalidation, and there were some cases in which this
was not yet done correctly.

> > This is in the replay of transaction logs, in heap_xlog_freeze_page.
> > As I am unaware whether or not pages to which these transaction logs
> > are applied can contain changes from the xlog-generating instance, I
> > flagged this as a potential problem. The application of the xlogs is
> > probably safe (it assumes the existence of a HeapTupleHeader for that
> > ItemId), but my limited knowledge put this on the 'potential problem'
> > list.
> >
> > Please advise on this; I cannot find the right documentation.
>
> Are you aware of wal_consistency_checking?

I was vaguely aware that an option with that name exists, but that was
about the extent of it. Thanks for pointing me in that direction.

> I think that this should be fine. There are differences that are
> possible between a replica and primary, but they're very minor and
> don't seem relevant.

OK, then I'll assume that WAL replay shouldn't cause problems here.

> > Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
> > heap_get_root_tuples
> >
> > The heap pruning mechanism currently assumes that all redirects are
> > valid. Currently, if a HOT root points to !ItemIdIsNormal, it breaks
> > out of the loop, but doesn't actually fail. This code is already
> > potentially problematic because it has no bounds or sanity checking
> > for the nextoffnum, but with shrinking pd_linp it would also add the
> > failure mode of HOT tuples pointing into what is now arbitrary data.
>
> heap_prune_chain() is less trusting than heap_get_root_tuples(),
> though -- it doesn't trust redirects (because there is a generic
> offnum sanity check at the start of its loop). I think that the
> inconsistency between these two functions probably isn't hugely
> significant.
>
> Ideally it would be 100% clear which of the defenses in code like this
> is merely extra hardening. The assumptions should be formalized. There
> is nothing wrong with hardening, but we should know it when we see it.

I realized one of my Assert()s was incorrectly asserting an actually
valid page state, so I've updated and documented that case.

> > This failure mode is now accompanied by an Assert, which fails when
> > the redirect is to an invalid OffsetNumber. This is not enough to
> > avoid exhibiting arbitrary behaviour when accessing corrupted data
> > with non-assert builds, but I think that that is fine; we already do
> > not have guaranteed behaviour for this failure mode.
>
> What about my "set would-be LP_UNUSED space to all-0x7F bytes in
> CLOBBER_FREED_MEMORY builds" idea? Did you think about that?

I had implemented it locally, but was waiting for some more feedback
before posting it, and got busy with other stuff since; it's now
attached.

I've also played around with marking the free space on the page as
undefined for valgrind, but later realized that that would make the
test for defined memory in PageAddItemExtended fail. This is
documented in the commit message of the attached patch 0002.

With regards,

Matthias van de Meent
Attachments
On Wed, Mar 31, 2021 at 2:49 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I had implemented it locally, but was waiting for some more feedback
> before posting it, and got busy with other stuff since; it's now
> attached.
>
> I've also played around with marking the free space on the page as
> undefined for valgrind, but later realized that that would make the
> test for defined memory in PageAddItemExtended fail. This is
> documented in the commit message of the attached patch 0002.

I would like to deal with this work within the scope of the project
we're discussing over on the "New IndexAM API controlling index vacuum
strategies" thread. The latest revision of that patch series includes
a modified version of your patch:

https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com

Please take discussion around this project over to that other thread.
There are a variety of issues that can only really be discussed in
that context.

Note that I've revised the patch so that it runs during VACUUM's
second heap pass only -- not during pruning/defragmentation. This
means that the line pointer array truncation mechanism will only ever
kick in during a VACUUM operation.

--
Peter Geoghegan
On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I would like to deal with this work within the scope of the project
> we're discussing over on the "New IndexAM API controlling index vacuum
> strategies" thread. The latest revision of that patch series includes
> a modified version of your patch:
>
> https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
>
> Please take discussion around this project over to that other thread.
> There are a variety of issues that can only really be discussed in
> that context.
Since that work has been committed as of 3c3b8a4b2689, I've marked this CF entry as committed.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, 3 May 2021 at 16:26, John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I would like to deal with this work within the scope of the project
> > we're discussing over on the "New IndexAM API controlling index vacuum
> > strategies" thread. The latest revision of that patch series includes
> > a modified version of your patch:
> >
> > https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
> >
> > Please take discussion around this project over to that other thread.
> > There are a variety of issues that can only really be discussed in
> > that context.
>
> Since that work has been committed as of 3c3b8a4b2689, I've marked
> this CF entry as committed.

I disagree that this work has been fully committed. A derivative was
committed that solves part of the problem, but it doesn't cover all
problem cases. I believe that I voiced such concern in the other
thread as well.

As such, I am planning on fixing this patch sometime before the next
commit fest so that we can truncate the LP array during HOT pruning as
well, instead of only doing so in the 2nd VACUUM pass. This is
especially relevant on pages where HOT is highly effective, but vacuum
can't keep up and many unused (former HOT) line pointers now exist on
the page.

With regards,

Matthias van de Meent
On Mon, 3 May 2021 at 16:39, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> I am planning on fixing this patch sometime before the next commit
> fest so that we can truncate the LP array during HOT pruning as well,
> instead of only doing so in the 2nd VACUUM pass.

PFA the updated version of this patch. Apart from adding line pointer
truncation in PageRepairFragmentation (as in the earlier patches), I
also altered PageTruncateLinePointerArray to clean up all trailing
line pointers, even if it was the last item on the page.

This means that for 32-bit systems, pages that have once had tuples
(but have been cleared since) can now be used again for
MaxHeapTupleSize insertions. Without this patch, an emptied page would
always have at least one line pointer left, which still equates to
MaxHeapTupleSize of actual free space, but PageGetFreeSpace always
subtracts sizeof(ItemIdData), leaving the perceived free space as
reported to the FSM less than MaxHeapTupleSize if the page has any
line pointers. For 64-bit systems, this is not as much of a problem,
because MaxHeapTupleSize is 4 bytes smaller on those systems, which
leaves us with 1 line pointer as margin for the FSM to recognise the
page as free enough for one MaxHeapTupleSize item.

With regards,

Matthias van de Meent
Attachments
On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> PFA the updated version of this patch. Apart from adding line pointer
> truncation in PageRepairFragmentation (as in the earlier patches), I
> also altered PageTruncateLinePointerArray to clean up all trailing
> line pointers, even if it was the last item on the page.

Can you show a practical benefit to this patch, such as an improvement
in throughput or in efficiency for a given workload?

It was easy to see that having something was better than having
nothing at all. But things are of course different now that we have
PageTruncateLinePointerArray().

--
Peter Geoghegan
On Tue, 18 May 2021 at 20:33, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > PFA the updated version of this patch. Apart from adding line pointer
> > truncation in PageRepairFragmentation (as in the earlier patches), I
> > also altered PageTruncateLinePointerArray to clean up all trailing
> > line pointers, even if it was the last item on the page.
>
> Can you show a practical benefit to this patch, such as an improvement
> in throughput or in efficiency for a given workload?
>
> It was easy to see that having something was better than having
> nothing at all. But things are of course different now that we have
> PageTruncateLinePointerArray().

There does seem to be utility in Matthias' patch, which currently does
two things:

1. Allow the same thing as PageTruncateLinePointerArray() during HOT cleanup

That is going to have a clear benefit for HOT workloads, which by
their nature will use a lot of line pointers. Many applications are
updated much more frequently than they are vacuumed.

Peter - what is your concern about doing this more frequently? Why
would we *not* do this?

2. Reduce the number of line pointers to 0 in some cases.

Matthias - I don't think you've made a full case for doing this, nor
looked at the implications. The comment clearly says "it seems like a
good idea to avoid leaving a PageIsEmpty()" page behind. So I would be
inclined to remove that from the patch and consider that as a separate
issue, or close this.

--
Simon Riggs
http://www.EnterpriseDB.com/
On Tue, 3 Aug 2021 at 08:57, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> On Tue, 18 May 2021 at 20:33, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > PFA the updated version of this patch. Apart from adding line pointer
> > > truncation in PageRepairFragmentation (as in the earlier patches), I
> > > also altered PageTruncateLinePointerArray to clean up all trailing
> > > line pointers, even if it was the last item on the page.
> >
> > Can you show a practical benefit to this patch, such as an improvement
> > in throughput or in efficiency for a given workload?
> >
> > It was easy to see that having something was better than having
> > nothing at all. But things are of course different now that we have
> > PageTruncateLinePointerArray().
>
> There does seem to be utility in Matthias' patch, which currently does
> two things:
>
> 1. Allow the same thing as PageTruncateLinePointerArray() during HOT cleanup
> That is going to have a clear benefit for HOT workloads, which by
> their nature will use a lot of line pointers.
> Many applications are updated much more frequently than they are vacuumed.
> Peter - what is your concern about doing this more frequently? Why
> would we *not* do this?

One clear reason why we _do_ want this is that the current shrinking
only happens in the second phase of vacuum. Shrinking the LP array in
heap_page_prune decreases the chance that a tuple which could fit on
the page (thanks to removed HOT chain items) fails to fit only
because vacuum hasn't come around yet, whilst adding only little
overhead. Additionally, heap_page_prune is also executed if more empty
space on the page is required for a new tuple that currently doesn't
fit, and in such cases I think clearing as much space as possible is
useful.

> 2. Reduce number of line pointers to 0 in some cases.
> Matthias - I don't think you've made a full case for doing this, nor
> looked at the implications.

I have looked at the implications (see upthread), and I haven't found
any implications other than those mentioned below.

> The comment clearly says "it seems like a good idea to avoid leaving a
> PageIsEmpty()" page behind.

Do note that that comment is based on (to the best of my knowledge)
unmeasured, but somewhat informed, guesswork ('it seems like a good
idea'), which I also commented on in the thread discussing the patch
that resulted in that commit [0].

If I recall correctly, the decision to keep at least 1 line pointer on
the page was made because this feature was to be committed late in the
development cycle of pg14, and as such there would be little time to
check the impact of fully clearing pages. To go forward with the
feature in pg14 at that point, it was safer not to completely empty
pages, so that we'd not be changing the paths we were hitting during
e.g. vacuum too significantly, reducing the chances of significant
bugs that would require the patch to be reverted [1]. I agreed at that
point that that was a safer bet, but right now it's early in the pg15
development cycle, and I've had the time to get more experience around
the vacuum and line pointer machinery.

That being the case, I consider this a re-visit of the topic 'is it OK
to truncate the LP array to 0', where previously the answer was 'we
don't know, and it's late in the release cycle', and after looking
through the code base now I argue that the answer is yes.

One more point for going to 0 is that for 32-bit systems, a single
line pointer is enough to block a page from being 'empty' enough to
fit a MaxHeapTupleSize-sized tuple (when requesting pages through the
FSM). Additionally, there are some other optimizations we can only
apply to empty pages:

- vacuum (with disable_page_skipping = on) will process these empty
pages faster, as it won't need to do any pruning on those pages. With
page skipping enabled this won't matter, because empty pages are
all_visible and therefore vacuum won't access them.
- the pgstattuple contrib extension processes empty pages (slightly)
faster in pgstattuple_approx.
- various loops won't need to check whether the remaining item is
unused, saving some cycles in those loops when the page is accessed.

And further future optimizations might include:

- Full-page WAL logging of empty pages produced in the checkpointer
could potentially be optimized to only log 'it's an empty page'
instead of writing out the full 8kB page, which would help in reducing
WAL volume. Previously this optimization would never be hit on
heapam pages because pages could not become empty again, but right now
this has real potential for applying an optimization.

Kind regards,

Matthias van de Meent

[0] https://www.postgresql.org/message-id/CAEze2Wh-nXjkp0bLN_vQwgHttC8CRH%3D1ewcrWk%2B7RX5B93YQPQ%40mail.gmail.com
[1] https://www.postgresql.org/message-id/CAH2-WznCxtWL4B995y2KJWj-%2BjrjahH4n6gD2R74SyQJo6Y63w%40mail.gmail.com
On Tue, 3 Aug 2021 at 17:15, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> and further future optimizations might include
>
> - Full-page WAL logging of empty pages produced in the checkpointer
> could potentially be optimized to only log 'it's an empty page'
> instead of writing out the full 8kB page, which would help in reducing
> WAL volume. Previously this optimization would never be hit on
> heapam pages because pages could not become empty again, but right now
> this has real potential for applying an optimization.

So what you are saying is your small change will cause multiple
additional FPIs in WAL. I don't call that a future optimization, I
call that essential additional work.

+1 on committing the first part of the patch, -1 on the second.

I suggest you split the patch and investigate the second part further.

--
Simon Riggs
http://www.EnterpriseDB.com/
On Tue, 3 Aug 2021 at 20:37, Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> On Tue, 3 Aug 2021 at 17:15, Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > and further future optimizations might include
> >
> > - Full-page WAL logging of empty pages produced in the checkpointer
> > could potentially be optimized to only log 'it's an empty page'
> > instead of writing out the full 8kB page, which would help in reducing
> > WAL volume. Previously this optimization would never be hit on
> > heapam pages because pages could not become empty again, but right now
> > this has real potential for applying an optimization.
>
> So what you are saying is your small change will cause multiple
> additional FPIs in WAL. I don't call that a future optimization, I
> call that essential additional work.

I think you misunderstood my statement. The patch does not change more
pages than before. The patch only ensures that empty heapam pages are
truly empty according to the relevant PageIsEmpty() macro, which
hypothetically allows for optimizations in the checkpointer process
that currently (as in, since its inception) writes all changed pages
as full page writes (unless turned off).

This change makes it easier and more worthwhile to implement a further
optimization for the checkpointer and/or buffer manager to determine
that 1) this page is now empty, and that 2) we can therefore write a
specialized WAL record specifically tuned for empty pages instead of
FPI records. No additional pages are changed, because each time the
line pointer array is shrunk, we've already either marked dead tuples
as unused (2nd phase vacuum) or removed HOT line pointers / truncated
dead tuples to lp_dead line pointers (heap_page_prune).

If, instead, you are suggesting that this checkpointer FPI
optimization should be part of the patch just because the potential is
there, then I disagree. Please pardon me if this was not your intended
message, but "you suggested this might be possible after your patch,
thus you must implement this" seems like a great way to burn
contributor goodwill.

The scope of the checkpointer is effectively PostgreSQL's WAL, plus
the page formats of all access methods that use the Page-based storage
manager (not just table access methods, but also those of indexes).
I'm not yet comfortable with hacking in those (sub)systems, nor do I
expect to have the time/effort capacity soon to go through all the
logic of these access methods to validate the hypothesis that such an
optimization can be both correctly implemented and worthwhile.

Patch 2, as I see it, just clears up some leftover stuff from the end
of the pg14 release cycle with new insights and research I didn't have
at that point in time. As it is a behaviour change for wal-logged
actions, it cannot realistically be backported; therefore it is
included as an improvement for pg15.

Kind regards,

Matthias van de Meent
On Mon, Aug 2, 2021 at 11:57 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> 1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup
> That is going to have a clear benefit for HOT workloads, which by
> their nature will use a lot of line pointers.

Why do you say that?

> Many applications are updated much more frequently than they are vacuumed.
> Peter - what is your concern about doing this more frequently? Why
> would we *not* do this?

What I meant before was that this argument worked back when we limited
the technique to VACUUM's second heap pass. Doing line pointer array
truncation at that point alone meant that it never happened outside of
VACUUM proper. Prior to that point we literally did nothing about
LP_UNUSED items at the end of each line pointer array, so we were
going from doing nothing to doing something.

This time it's quite different: we're truncating the line pointer
array during pruning. Pruning often occurs opportunistically, during
regular query processing. In fact I'd say that it's far more common
than pruning by VACUUM. So the chances of line pointer array
truncation hurting rather than helping seem higher. Plus now we might
break things like DDL, that would naturally not have been affected
before because we were only doing line pointer truncation during
VACUUM proper.

Of course it might well be true that there is a significant benefit to
this patch. I don't think that we should assume that that's the case,
though. We have yet to see a test case showing any benefit. Maybe
that's an easy thing to produce, and maybe Matthias has assumed that I
must already know what to look at. But I don't. It's not obvious to me
how to assess this patch now.

I don't claim to have any insight about what we should or should not
do. At least not right now. When I committed the original (commit
3c3b8a4b), I did so because I thought that it was very likely to
improve certain cases and very unlikely to hurt any other cases.
Nothing more.

> 2. Reduce number of line pointers to 0 in some cases.
> Matthias - I don't think you've made a full case for doing this, nor
> looked at the implications.
> The comment clearly says "it seems like a good idea to avoid leaving a
> PageIsEmpty()" page behind.
> So I would be inclined to remove that from the patch and consider that
> as a separate issue, or close this.

This was part of that earlier commit because of sheer paranoia;
nothing more. I actually think that it's highly unlikely to protect us
from bugs in practice.

Though I am, in a certain sense, likely to be wrong about
"PageIsEmpty() defensiveness", it does not bother me in the slightest.
It seems like the right approach in the absence of new information
about a significant downside. If my paranoia really did turn out to be
justified, then I would expect that there'd be a subtle, nasty bug.
That possibility is what I was really thinking of. And so it almost
doesn't matter to me how unlikely we might think such a bug is now,
unless and until somebody can demonstrate a real practical downside to
my defensive approach.

--
Peter Geoghegan
On Tue, Aug 3, 2021 at 12:27 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> This change makes it easier and more worthwhile to implement a further
> optimization for the checkpointer and/or buffer manager to determine
> that 1) this page is now empty, and that 2) we can therefore write a
> specialized WAL record specifically tuned for empty pages instead of
> FPI records. No additional pages are changed, because each time the
> line pointer array is shrunk, we've already either marked dead tuples
> as unused (2nd phase vacuum) or removed HOT line pointers / truncated
> dead tuples to lp_dead line pointers (heap_page_prune).

We generate an FPI the first time a page is modified after a
checkpoint. The FPI consists of the page *after* it has been modified.

Presumably this optimization would need the heap page to be 100%
empty, so we're left with what seems to me to be a very narrow target
for optimization; something that is naturally rare.

A fully-empty page seems quite unlikely in the case of xl_heap_vacuum
records, and impossible in the case of xl_heap_prune records. Even
with all the patches, working together. Have I missed something?

--
Peter Geoghegan
On Wed, 4 Aug 2021 at 03:51, Peter Geoghegan <pg@bowt.ie> wrote:
>
> We generate an FPI the first time a page is modified after a
> checkpoint. The FPI consists of the page *after* it has been modified.

In that case, I misremembered when FPIs were written in relation to
checkpoints. Thanks for reminding me. The point of non-FPIs as a
replacement could still be valid, except that the point below makes
this yet more unlikely.

> Presumably this optimization would need the heap page to be 100%
> empty, so we're left with what seems to me to be a very narrow target
> for optimization; something that is naturally rare.

Yes.

> A fully-empty page seems quite unlikely in the case of xl_heap_vacuum
> records, and impossible in the case of xl_heap_prune records. Even
> with all the patches, working together. Have I missed something?

No, you're correct. xl_heap_prune cannot ever empty pages, as it
leaves at least 1 dead, or 1 redirect + 1 normal, line pointer on the
page. Furthermore, it is indeed quite unlikely that the 2nd pass of
vacuum will be the first page modification after a checkpoint; it is
quite a bit more likely that the page was first modified by the 1st
vacuum pass. The chance that the first modification since a checkpoint
is made by the second pass is increased somewhat by indexless tables
(however unlikely, they exist in practice) and by the autovacuum index
cleanup delay mechanisms, which allow pages with only dead item
pointers to remain dead for more than just this one vacuum run, but
the chances of fully clearing the page are indeed very, very slim.

Kind regards,

Matthias van de Meent
On Wed, 4 Aug 2021 at 02:43, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Aug 2, 2021 at 11:57 PM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > 2. Reduce number of line pointers to 0 in some cases.
> > Matthias - I don't think you've made a full case for doing this, nor
> > looked at the implications.
> > The comment clearly says "it seems like a good idea to avoid leaving a
> > PageIsEmpty()" page behind.
> > So I would be inclined to remove that from the patch and consider that
> > as a separate issue, or close this.
>
> This was part of that earlier commit because of sheer paranoia;
> nothing more. I actually think that it's highly unlikely to protect us
> from bugs in practice. Though I am, in a certain sense, likely to be
> wrong about "PageIsEmpty() defensiveness", it does not bother me in
> the slightest. It seems like the right approach in the absence of new
> information about a significant downside. If my paranoia really did
> turn out to be justified, then I would expect that there'd be a
> subtle, nasty bug. That possibility is what I was really thinking of.
> And so it almost doesn't matter to me how unlikely we might think such
> a bug is now, unless and until somebody can demonstrate a real
> practical downside to my defensive approach.

As I believe I have mentioned before, there is one significant
downside: 32-bit systems cannot reuse pages that contain only a single
unused line pointer for max-sized FSM requests.

A fresh page has 8168 bytes free (8kB minus the 24-byte page header),
which then becomes 8164 when returned from PageGetFreeSpace (it
accounts for the space used by the line pointer needed when inserting
an item onto the page). On 64-bit systems, MaxHeapTupleSize is 8160,
and on 32-bit systems MaxHeapTupleSize is 8164. When we leave one more
unused line pointer on the page, the page will have a PageGetFreeSpace
of 8160, 4 bytes less than the MaxHeapTupleSize on 32-bit systems.

As such, on those systems there will never be FSM entries of the
largest category for pages that have had data, and consequently those
systems will need to add pages for each request of the largest
category, meaning that all tuples larger than 8128 bytes (the largest
request that would still fall in the 254 category) will always be put
on a new page, regardless of the actual availability of free space in
the table.

You might argue that this is a problem in the FSM subsystem, but in
this case it actively hinders us from reusing pages for the largest
category of FSM requests. If you would argue 'PageGetHeapFreeSpace
should keep free line pointers in mind when calculating free space',
then I would argue 'yes, but isn't it better, then, to also actually
fully mark that space as unused'.

All in all, I'd rather remove the distinction between once-used pages
and fresh pages completely by truncating the LP array to 0 than leave
this bloating behaviour in the system.

Kind regards,

Matthias van de Meent
On Tue, Aug 3, 2021 at 8:44 PM Peter Geoghegan <pg@bowt.ie> wrote: > This time it's quite different: we're truncating the line pointer > array during pruning. Pruning often occurs opportunistically, during > regular query processing. In fact I'd say that it's far more common > than pruning by VACUUM. So the chances of line pointer array > truncation hurting rather than helping seems higher. How would it hurt? It's easy to see the harm caused by not shortening the line pointer array when it is possible to do so: we're using up space in the page that could have been made free. It's not so obvious to me what the downside of shortening it might be. I suppose there is a risk that we shorten it and get no benefit, or even shorten it and have to lengthen it again almost immediately. But neither of those things really matters unless shortening is expensive. If we can piggy-back it on an existing WAL record, I don't really see what the problem is. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 4 Aug 2021 at 01:43, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Aug 2, 2021 at 11:57 PM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > 1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup > > That is going to have a clear benefit for HOT workloads, which by > > their nature will use a lot of line pointers. > > Why do you say that? Truncating line pointers can make extra space on the page, so it could be the difference between a HOT and a non-HOT update. My understanding is that these just-in-time actions have a beneficial effect in other circumstances, so we can do that here also. If we truncate line pointers more frequently then the root pointers will tend to be lower in the array, which will make truncation even more effective. > > Many applications are updated much more frequently than they are vacuumed. > > Peter - what is your concern about doing this more frequently? Why > > would we *not* do this? > > What I meant before was that this argument worked back when we limited > the technique to VACUUM's second heap pass. Doing line pointer array > truncation at that point alone meant that it only ever happened > outside of VACUUM proper. Prior to that point we literally did nothing > about LP_UNUSED items at the end of each line pointer array, so we > were going from doing nothing to doing something. > > This time it's quite different: we're truncating the line pointer > array during pruning. Pruning often occurs opportunistically, during > regular query processing. In fact I'd say that it's far more common > than pruning by VACUUM. So the chances of line pointer array > truncation hurting rather than helping seems higher. If the only thing we do to a page is truncate line pointers then it may not be worth it. But dirtying a page to reclaim space is precisely the time when reclaiming line pointers makes sense. So if the page is dirtied by cleaning, then that is also the time to reclaim line pointers. 
Again, why would we reclaim space to avoid bloat but then ignore any line pointer bloat? It's not clear why truncating unused line pointers would break DDL. -- Simon Riggs http://www.EnterpriseDB.com/
On Wed, Aug 4, 2021 at 12:09 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > Truncating line pointers can make extra space on the page, so it could > be the difference between a HOT and a non-HOT update. My understanding > is that these just-in-time actions have a beneficial effect in other > circumstances, so we can do that here also. I would prefer if the arguments in favor were a little more concrete. Maybe in general they don't have to be. But that would be my preference, and what I would look for if I were to commit such a patch. > If the only thing we do to a page is truncate line pointers then it > may not be worth it. But dirtying a page to reclaim space is also the > precise time when reclaiming line pointers makes sense also. So if the > page is dirtied by cleaning, then that is the time to reclaim line > pointers also. > > Again, why would we reclaim space to avoid bloat but then ignore any > line pointer bloat? I don't think that my mental model is significantly different to yours here. Like everybody else, I can easily imagine how this *might* have value. All I'm really saying is that a burden of proof exists for this patch (or any performance-oriented patch). In my opinion that burden has yet to be met by this patch. My guess is that Matthias can show a clear, measurable benefit with a little more work. If that happens then all of my marginal concerns about the risk of bugs become relatively unimportant. It might even be okay to commit the patch on the basis of "what could the harm be?" if there was some effort to demonstrate empirically that the performance downside really was zero. > It's not clear why truncating unused line pointers would break DDL. I'm just saying that it's obviously not possible now, with the VACUUM-only PageTruncateLinePointerArray()/lazy_vacuum_heap_page() code path added to Postgres 14 -- because VACUUM's relation-level lock makes sure of that. That property has some value. 
Certainly not enough value to block progress on a feature that is clearly useful, but enough to give me pause. -- Peter Geoghegan
On Wed, Aug 4, 2021 at 7:39 AM Robert Haas <robertmhaas@gmail.com> wrote: > How would it hurt? > > It's easy to see the harm caused by not shortening the line pointer > array when it is possible to do so: we're using up space in the page > that could have been made free. It's not so obvious to me what the > downside of shortening it might be. I think that that's probably true. That in itself doesn't seem like a good enough reason to commit the patch. Maybe this really won't be difficult for Matthias. I just want to see some concrete testing, maybe with pgbench, or with an artificial test case. Maybe something synthetic that shows a benefit measurable in on-disk table size. Or at least the absence of any regressions. Something. -- Peter Geoghegan
On Wed, 4 Aug 2021 at 15:39, Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Aug 3, 2021 at 8:44 PM Peter Geoghegan <pg@bowt.ie> wrote: > > This time it's quite different: we're truncating the line pointer > > array during pruning. Pruning often occurs opportunistically, during > > regular query processing. In fact I'd say that it's far more common > > than pruning by VACUUM. So the chances of line pointer array > > truncation hurting rather than helping seems higher. > > How would it hurt? > > It's easy to see the harm caused by not shortening the line pointer > array when it is possible to do so: we're using up space in the page > that could have been made free. It's not so obvious to me what the > downside of shortening it might be. I suppose there is a risk that we > shorten it and get no benefit, or even shorten it and have to lengthen > it again almost immediately. But neither of those things really > matters unless shortening is expensive. If we can piggy-back it on an > existing WAL record, I don't really see what the problem is. Hmm, there is no information in WAL to describe the line pointers being truncated by PageTruncateLinePointerArray(). We just truncate every time we see an XLOG_HEAP2_VACUUM record and presume it does the same thing as the original change. If that is safe, then we don't need to put the truncation on a WAL record at all; we just truncate after every XLOG_HEAP2_PRUNE record. If that is not safe... then we have a PG14 bug. If we do want to see this in WAL, both xl_heap_vacuum and xl_heap_prune lend themselves to just adding one more OffsetNumber onto the existing array, to represent the highest offset after truncation. So we don't need to change the WAL format. -- Simon Riggs http://www.EnterpriseDB.com/
On Thu, Aug 5, 2021 at 6:28 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > Hmm, there is no information in WAL to describe the line pointers > being truncated by PageTruncateLinePointerArray(). We just truncate > every time we see a XLOG_HEAP2_VACUUM record and presume it does the > same thing as the original change. > > If that is safe, then we don't need to put the truncation on a WAL > record at all, we just truncate after every XLOG_HEAP2_PRUNE record. I agree that that's how we'd do it. This approach is no different to assuming that PageRepairFragmentation() reliably produces a final defragmented page image deterministically when called after we prune. These days we automatically verify assumptions like this via wal_consistency_checking. It would absolutely be able to catch any bugs in PageTruncateLinePointerArray(), since the truncate code path has plenty of coverage within the regression tests. -- Peter Geoghegan
This thread has stalled, and the request for benchmark/test has gone unanswered so I'm marking this patch Returned with Feedback. Please feel free to resubmit this patch if it is picked up again. -- Daniel Gustafsson https://vmware.com/
On Thu, 2 Dec 2021 at 11:17, Daniel Gustafsson <daniel@yesql.se> wrote: > > This thread has stalled, and the request for benchmark/test has gone unanswered > so I'm marking this patch Returned with Feedback. Please feel free to resubmit > this patch if it is picked up again. Well then, here we go. It took me some time to find the time and examples, but here we are. Attached is v7 of the patchset, which is a rebase on top of the latest release, with some updated comments. Peter Geoghegan asked for good arguments for the two changes implemented. My arguments are detailed below, with adversarial loads that show the problematic behaviour of the line pointer array that is fixed with the patch. Kind regards, Matthias van de Meent Truncating lp_array to 0 line pointers =========================== On 32-bit systems, pre-patch, the heap grows without limit; post-patch the relation doesn't grow beyond 16kiB (assuming no other sessions in the database): > -- setup > CREATE TABLE tst (filler text); > ALTER TABLE tst SET (autovacuum_enabled = off, fillfactor = 10); -- disable autovacuum, and trigger pruning more often > ALTER TABLE tst ALTER COLUMN filler SET STORAGE PLAIN; > INSERT INTO tst VALUES (''); > -- processing load > VACUUM tst; UPDATE tst SET filler = repeat('1', 8134); -- ~ max size heaptuple in MAXIMUM_ALIGNOF = 4 systems On 64-bit systems this load is not an issue due to slightly larger tolerances in the FSM. Truncating lp_array during pruning =========================== The following adversarial load grows the heap relation, but with the patch the relation keeps its size. The point is that HOT updates can temporarily inflate the LP array significantly, and this patch can actively combat that issue while we're waiting for the 2nd pass of vacuum to arrive. 
> -- Initialize experiment > TRUNCATE tst; > INSERT INTO tst SELECT null; > UPDATE tst SET filler = null; > VACUUM tst; > > -- start workload by filling all line pointer slots with minsized heap tuples > do language plpgsql $$ > begin > for i in 1..289 loop > update tst set filler = null; > end loop; > end; > $$; > -- all line pointers on page 0 are now filled with hot updates of 1st line pointer > > -- hot-update hits the page; pruning is first applied > -- All but first and last LP are now empty. new tuple is inserted at > -- offset=2 > UPDATE tst SET filler = null; > > -- Insert large tuple, filling most of the now free space between the end of > -- the max size line pointer array > UPDATE tst SET filler = repeat('1', 7918); > > -- insert slightly smaller tuple, that is ~ the size of the unused space in > -- the LP array > UPDATE tst SET filler = repeat('1', 1144); > > -- reset experiment to initialized state > UPDATE tst SET filler = null;
Attachments
On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Peter Geoghegan asked for good arguments for the two changes > implemented. Below are my arguments detailed, with adversarial loads > that show the problematic behaviour of the line pointer array that is > fixed with the patch. Why is it okay that lazy_scan_prune() still calls PageGetMaxOffsetNumber() once for the page, before it ever calls heap_page_prune()? Won't lazy_scan_prune() need to reestablish maxoff now, if only so that its scan-page-items loop doesn't get confused when it goes on to read "former line pointers"? This is certainly possible with the CLOBBER_FREED_MEMORY stuff in place (which will memset the truncated line pointer space with a 0x7F7F7F7F pattern). -- Peter Geoghegan
On Wed, 16 Feb 2022 at 20:54, Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > Peter Geoghegan asked for good arguments for the two changes > > implemented. Below are my arguments detailed, with adversarial loads > > that show the problematic behaviour of the line pointer array that is > > fixed with the patch. > > Why is it okay that lazy_scan_prune() still calls > PageGetMaxOffsetNumber() once for the page, before it ever calls > heap_page_prune()? Won't lazy_scan_prune() need to reestablish maxoff > now, if only so that its scan-page-items loop doesn't get confused > when it goes on to read "former line pointers"? This is certainly > possible with the CLOBBER_FREED_MEMORY stuff in place (which will > memset the truncated line pointer space with a 0x7F7F7F7F pattern). Good catch, it is not. Attached a version that re-establishes maxoff after each prune operation. -Matthias
Attachments
On Wed, 16 Feb 2022 at 21:14, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Wed, 16 Feb 2022 at 20:54, Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent > > <boekewurm+postgres@gmail.com> wrote: > > > Peter Geoghegan asked for good arguments for the two changes > > > implemented. Below are my arguments detailed, with adversarial loads > > > that show the problematic behaviour of the line pointer array that is > > > fixed with the patch. > > > > Why is it okay that lazy_scan_prune() still calls > > PageGetMaxOffsetNumber() once for the page, before it ever calls > > heap_page_prune()? Won't lazy_scan_prune() need to reestablish maxoff > > now, if only so that its scan-page-items loop doesn't get confused > > when it goes on to read "former line pointers"? This is certainly > > possible with the CLOBBER_FREED_MEMORY stuff in place (which will > > memset the truncated line pointer space with a 0x7F7F7F7F pattern). > > Good catch, it is not. Attached a version that re-establishes maxoff > after each prune operation. I double-checked the changes, and to me it seems like that was the only place in the code where PageGetMaxOffsetNumber was not handled correctly. This was fixed in the latest patch (v8). Peter, would you have time to further review this patch and/or commit it? - Matthias
On Thu, Mar 10, 2022 at 5:49 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > I double-checked the changes, and to me it seems like that was the > only place in the code where PageGetMaxOffsetNumber was not handled > correctly. This was fixed in the latest patch (v8). > > Peter, would you have time to further review this patch and/or commit it? I'll definitely review it some more before too long. -- Peter Geoghegan
On Tue, Feb 15, 2022 at 10:48 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > # Truncating lp_array during pruning > =========================== > > The following adversarial load grows the heap relation, but with the > patch the relation keeps its size. The point being that HOT updates > can temporarily inflate the LP array significantly, and this patch can > actively combat that issue while we're waiting for the 2nd pass of > vacuum to arrive. I am sympathetic to the idea that giving the system a more accurate picture of how much free space is available on each heap page is an intrinsic good. This might help us in a few different areas. For example, the FSM cares about relatively small differences in available free space *among* heap pages that are "in competition" in RelationGetBufferForTuple(). Plus we have a heuristic based on PageGetHeapFreeSpace() in heap_page_prune_opt() to consider. We should definitely increase MaxHeapTuplesPerPage before too long, for a variety of reasons that I have talked about in the past. Its current value is 291 on all mainstream platforms, a value that's derived from accidental historic details -- which predate HOT. Obviously an increase in MaxHeapTuplesPerPage is likely to make the problem that the patch proposes to solve worse. I lean towards committing the patch now as work in that direction, in fact. It helps that this patch now seems relatively low risk. -- Peter Geoghegan
On Mon, Apr 4, 2022 at 7:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > I am sympathetic to the idea that giving the system a more accurate > picture of how much free space is available on each heap page is an > intrinsic good. This might help us in a few different areas. For > example, the FSM cares about relatively small differences in available > free space *among* heap pages that are "in competition" in > RelationGetBufferForTuple(). Plus we have a heuristic based on > PageGetHeapFreeSpace() in heap_page_prune_opt() to consider. Pushed a slightly revised version of this just now. Differences: * Rewrote the comments, and adjusted related comments in vacuumlazy.c. Mostly just made them shorter. * I eventually decided that it was fine to just accept the issue with maxoff in lazy_scan_prune (the pruning path used by VACUUM). There seemed to be no need to reestablish a maxoff for the page here following further reflection. I changed my mind. Setting reclaimed line pointer array space to a pattern of 0x7F bytes wasn't adding much here. Pruning either runs in VACUUM, or opportunistically. When it runs in VACUUM things are highly constrained already. Opportunistic pruning for heap_page_prune_opt() callers doesn't even require that the caller start out with a buffer lock. Pruning only goes ahead when we successfully acquire a cleanup lock -- so callers can't be relying on maxoff not changing. * Didn't keep the changes to PageTruncateLinePointerArray(). There is at least no reason to tie this question about VACUUM to how pruning behaves. I still see some small value in avoiding creating a new path that allows PageIsEmpty() pages to come into existence in a new way, which is no less true with the patch I pushed. Thanks -- Peter Geoghegan
Hi, On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote: > We should definitely increase MaxHeapTuplesPerPage before too long, > for a variety of reasons that I have talked about in the past. Its > current value is 291 on all mainstream platforms, a value that's > derived from accidental historic details -- which predate HOT. I'm on-board with that - but I think we should rewrite a bunch of places that use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using several KB of stack at the current value already (*), but if it grows further... Greetings, Andres Freund * lazy_scan_prune() has OffsetNumber deadoffsets[MaxHeapTuplesPerPage]; xl_heap_freeze_tuple frozen[MaxHeapTuplesPerPage]; which already works out to 4074 bytes.
On Thu, Apr 7, 2022 at 4:01 PM Andres Freund <andres@anarazel.de> wrote: > I'm on-board with that - but I think we should rewrite a bunch of places that > use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using > several KB of stack at the current value already (*), but if it grows > further... No arguments here. There are probably quite a few places that won't need to be fixed, because it just doesn't matter, but lazy_scan_prune() will. -- Peter Geoghegan
On Fri, 8 Apr 2022 at 01:01, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote: > > We should definitely increase MaxHeapTuplesPerPage before too long, > > for a variety of reasons that I have talked about in the past. Its > > current value is 291 on all mainstream platforms, a value that's > > derived from accidental historic details -- which predate HOT. > > I'm on-board with that - but I think we should rewrite a bunch of places that > use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using > several KB of stack at the current value already (*), but if it grows > further... Yeah, I think we should definitely support more line pointers on a heap page, but abusing MaxHeapTuplesPerPage for that is misleading: the current value is the physical limit for heap tuples, as we have at most 1 heap tuple per line pointer, and thus MaxHeapTuplesPerPage won't change. A macro MaxHeapLinePointersPerPage would probably be more useful, which could be as follows (assuming we don't want to allow filling a page with effectively only dead line pointers): #define MaxHeapLinePointersPerPage \ ((int) (((BLCKSZ - SizeOfPageHeaderData) / \ (MAXALIGN(SizeofHeapTupleHeader) + 2 * sizeof(ItemIdData))) * 2)) This accounts for the worst case of one redirect + one min-sized live heap tuple, and fills the page with it. Although it is impossible to put a page in such a state, that would be the worst case of live line pointers on a page. For the default BLCKSZ of 8kB, that results in 510 line pointers used-but-not-dead, an increase of ~ 70% over what's currently available. -Matthias
On Thu, Apr 7, 2022 at 7:01 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote: > > We should definitely increase MaxHeapTuplesPerPage before too long, > > for a variety of reasons that I have talked about in the past. Its > > current value is 291 on all mainstream platforms, a value that's > > derived from accidental historic details -- which predate HOT. > > I'm on-board with that - but I think we should rewrite a bunch of places that > use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using > several KB of stack at the current value already (*), but if it grows > further... I agree that the value of 291 is pretty much accidental, but it also seems fairly generous to me. The bigger you make it, the more space you can waste. I must have missed (or failed to understand) previous discussions about why raising it would be a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 8, 2022 at 4:38 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Yeah, I think we should definately support more line pointers on a > heap page, but abusing MaxHeapTuplesPerPage for that is misleading: > the current value is the physical limit for heap tuples, as we have at > most 1 heap tuple per line pointer and thus the MaxHeapTuplesPerPage > won't change. A macro MaxHeapLinePointersPerPage would probably be > more useful, which could be as follows (assuming we don't want to > allow filling a page with effectively only dead line pointers): That's a good point. Sounds like it might be the right approach. I suppose that it will depend on how much use of MaxHeapTuplesPerPage remains once it is split in two like this. -- Peter Geoghegan
On Fri, Apr 8, 2022 at 6:17 AM Robert Haas <robertmhaas@gmail.com> wrote: > I agree that the value of 291 is pretty much accidental, but it also > seems fairly generous to me. The bigger you make it, the more space > you can waste. I must have missed (or failed to understand) previous > discussions about why raising it would be a good idea. What do you mean about wasting space? Wasting space on the stack? I can't imagine you meant wasting space on the page, since being able to accommodate more items on each heap page seems like it would be strictly better, barring any unintended weird FSM issues. As far as I know the only real downside to increasing it is the impact on tidbitmap.c. Increasing the number of valid distinct TID values might have a negative impact on performance during bitmap scans, which will need to be managed. However, I don't think that increased stack space usage will be a problem, with a little work. It either won't matter at all (e.g. an array of offset numbers on the stack still won't be very big), or it can be fixed locally where it turns out to matter (like in lazy_scan_prune). We used to routinely use MaxOffsetNumber for arrays of item offset numbers. I cut down on that in the B-Tree code, reducing it to MaxIndexTuplesPerPage (which is typically 407) in a few places. So anything close to our current MaxIndexTuplesPerPage ought to be fine for most individual arrays stored on the stack. -- Peter Geoghegan
On Fri, Apr 8, 2022 at 9:44 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Apr 8, 2022 at 4:38 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > Yeah, I think we should definately support more line pointers on a > > heap page, but abusing MaxHeapTuplesPerPage for that is misleading: > > the current value is the physical limit for heap tuples, as we have at > > most 1 heap tuple per line pointer and thus the MaxHeapTuplesPerPage > > won't change. A macro MaxHeapLinePointersPerPage would probably be > > more useful, which could be as follows (assuming we don't want to > > allow filling a page with effectively only dead line pointers): > > That's a good point. Sounds like it might be the right approach. > > I suppose that it will depend on how much use of MaxHeapTuplesPerPage > remains once it is split in two like this. Thinking about this some more, I wonder if it would make sense to split MaxHeapTuplesPerPage into two new constants (a true MaxHeapTuplesPerPage, plus MaxHeapLinePointersPerPage), for the reasons discussed, but also as a way of getting a *smaller* effective MaxHeapTuplesPerPage than 291 in some contexts only. There are some ways in which the current MaxHeapTuplesPerPage isn't enough, but also some ways in which it is excessive. It might be useful if PageGetHeapFreeSpace() usually considered a heap page to have no free space if the number of tuples with storage (or some cheap proxy thereof) was about 227, which is the largest number of distinct heap tuples that can *plausibly* ever be stored on an 8KiB page (it ignores zero column tables). Most current PageGetHeapFreeSpace() callers (including VACUUM) would continue to call that function in the same way as today, and get this lower limit. A few of the existing PageGetHeapFreeSpace() callers could store more line pointers than that (MaxHeapLinePointersPerPage, which might be 510 in practice) -- but only those involved in updates. 
The overall idea is to recognize that free space is not interchangeable -- updates should have some kind of advantage over plain inserts when it comes to the space on the page of the tuple that they're updating. We might even want to make our newly defined, lower MaxHeapTuplesPerPage into a tunable storage param. -- Peter Geoghegan
On Fri, Apr 8, 2022 at 12:57 PM Peter Geoghegan <pg@bowt.ie> wrote: > What do you mean about wasting space? Wasting space on the stack? I > can't imagine you meant wasting space on the page, since being able to > accommodate more items on each heap page seems like it would be > strictly better, barring any unintended weird FSM issues. I meant wasting space in the page. I think that's a real concern. Imagine you allowed 1000 line pointers per page. Each one consumes 4 bytes. So now you could have ~50% of each page in the table storing dead line pointers. That sounds awful, and just running VACUUM won't fix it once it's happened, because the still-live line pointers are likely to be at the end of the line pointer array and thus truncating it won't necessarily be possible. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Apr 8, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > I meant wasting space in the page. I think that's a real concern. > Imagine you allowed 1000 line pointers per page. Each one consumes 2 > bytes. So now you could have ~25% of each page in the table storing > dead line pointers. That sounds awful, and just running VACUUM won't > fix it once it's happened, because the still-live line pointers are > likely to be at the end of the line pointer array and thus truncating > it won't necessarily be possible. I see. That's a legitimate concern, though one that I believe can be addressed. I have learned to dread any kind of bloat that's irreversible, no matter how minor it might seem when seen as an isolated event, so I'm certainly sympathetic to these concerns. You can make a similar argument in favor of a higher MaxHeapLinePointersPerPage limit, though -- and that's why I believe an increase of some kind makes sense. The argument goes like this: What if we miss the opportunity to systematically keep successor versions of a given logical row on the same heap page over time, due only to the current low MaxHeapLinePointersPerPage limit of 291? If we had only been able to "absorb" just a few extra versions in the short term, we would have had stability (in the sense of being able to preserve locality among related logical rows) in the long term. We could have kept everything together, if only we didn't overreact to what were actually short term, rare perturbations. -- Peter Geoghegan
On Fri, Apr 8, 2022 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote: > What if we miss the opportunity to systematically keep successor > versions of a given logical row on the same heap page over time, due > only to the current low MaxHeapLinePointersPerPage limit of 291? If we > had only been able to "absorb" just a few extra versions in the short > term, we would have had stability (in the sense of being able to > preserve locality among related logical rows) in the long term. We > could have kept everything together, if only we didn't overreact to > what were actually short term, rare perturbations. Hmm. I wonder if we could teach the system to figure out which of those things is happening. In the case that I'm worried about, when we're considering growing the line pointer array, either the line pointers will be dead or the line pointers will be used but the tuples to which they point will be dead. In the case you describe here, there should be very few dead tuples or line pointers in the page. Maybe when the number of line pointers starts to get big, we refuse to add more without checking the number of dead tuples and dead line pointers and verifying that those numbers are still small. Or, uh, something. One fly in the ointment is that if we refuse to expand the line pointer array, we might extend the relation instead, which is another kind of bloat and thus not great. But if the line pointer array is somehow filling up with tons of dead tuples, we're going to have to extend the relation anyway. I suspect that in some circumstances it's better to just accept that outcome and hope that it leads to some pages becoming empty, thus allowing their line pointer arrays to be reset. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2022-04-08 09:17:40 -0400, Robert Haas wrote:
> I agree that the value of 291 is pretty much accidental, but it also
> seems fairly generous to me. The bigger you make it, the more space
> you can waste. I must have missed (or failed to understand) previous
> discussions about why raising it would be a good idea.

It's not hard to hit scenarios where pages are effectively unusable,
because they have close to 291 dead items, without autovacuum
triggering (or autovacuum just taking a while). You basically just need
updates / deletes to concentrate in a certain range of the table and
have indexing that prevents HOT updates. Because the overall percentage
of dead tuples is low, no autovacuum is triggered, yet a range of the
table contains little but dead items. At which point you basically
waste 7k bytes (1164 bytes for dead items IIRC) until a vacuum finally
kicks in - way more than what you'd waste if the number of line items
were limited at e.g. 2 x MaxHeapTuplesPerPage.

This has become a bit more pronounced with vacuum skipping index
cleanup when there's "just a few" dead items - if all your updates
concentrate in a small region, 2% of the whole relation size isn't
actually that small.

I wonder if we could reduce the real-world space wastage of the line
pointer array if we changed the logic about which OffsetNumbers to use
during inserts / updates and made a few tweaks to pruning.

1) It's kind of OK for heap-only tuples to get a high OffsetNumber - we
can reclaim them during pruning once they're dead. They don't leave
behind a dead item that's unreclaimable until the next vacuum with an
index cleanup pass.

2) Arguably the OffsetNumber of a redirect target can be changed. It
might break careless uses of WHERE ctid = ... though (which likely are
already broken, just harder to hit).

This leads me to a few potential improvements:

a) heap_page_prune_opt() should take the number of used items into
account when deciding whether to prune.
Right now we trigger hot pruning based on the number of items only if
PageGetMaxOffsetNumber(page) >= MaxHeapTuplesPerPage. But because it
requires a vacuum to reclaim an ItemId used for a root tuple, we should
trigger HOT pruning when it might lower which OffsetNumbers get used.

b) heap_page_prune_opt() should be triggered in more paths. E.g. when
inserting / updating, we should prune if it allows us to avoid using a
high OffsetNumber.

c) What if we left some percentage of ItemIds unused, when looking for
the OffsetNumber of a new HOT row version? That'd make it more likely
for non-HOT updates and inserts to fit onto the page, without
permanently increasing the size of the line pointer array.

d) If we think 2) is acceptable, we could move the targets of redirects
to make space for new root tuples, without increasing the permanent
size of the line pointer array. Crazy?

Greetings,

Andres Freund
Hi,

On 2022-04-08 15:04:37 -0400, Robert Haas wrote:
> I meant wasting space in the page. I think that's a real concern.
> Imagine you allowed 1000 line pointers per page. Each one consumes 2
> bytes.

It's 4 bytes per line pointer, right?

struct ItemIdData {
        unsigned int       lp_off:15;    /*     0: 0  4 */
        unsigned int       lp_flags:2;   /*     0:15  4 */
        unsigned int       lp_len:15;    /*     0:17  4 */

        /* size: 4, cachelines: 1, members: 3 */
        /* last cacheline: 4 bytes */
};

Or am I confusing myself somehow?

I do wish the length of the tuple weren't in ItemIdData, but part of
the tuple, so we'd not waste this space for dead items (I think it'd
also simplify more code than it'd complicate). But ...

- Andres
On Fri, Apr 8, 2022 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> It's 4 bytes per line pointer, right?

Yeah, it's 4 bytes in Postgres. Most other DB systems only need 2
bytes, which is implemented in exactly the way that you're imagining.

--
Peter Geoghegan
On Fri, Apr 8, 2022 at 2:06 PM Andres Freund <andres@anarazel.de> wrote:
> It's not hard to hit scenarios where pages are effectively unusable,
> because they have close to 291 dead items, without autovacuum
> triggering (or autovacuum just taking a while).

I think that this is mostly a problem with HOT updates, and regular
updates to a lesser degree. Deletes seem less troublesome.

I find that it's useful to think in terms of the high watermark number
of versions required for a given logical row over time. It's probably
quite rare for most individual logical rows to truly require more than
2 or 3 versions per row at the same time, to serve queries. Even in
update-heavy tables. And without doing anything fancy with the
definition of HeapTupleSatisfiesVacuum(). There are important
exceptions, certainly, but overall I think that we're still not doing
well enough with these easier cases.

The high watermark number of versions is probably going to be
significantly greater than the typical number of versions for the same
row. So maybe we give up on keeping a row on its original heap block
today, all because of a once-off (or very rare) event where we needed
slightly more space for only a fraction of a second.

The tell-tale sign of these kinds of problems can sometimes be seen
with synthetic, rate-limited benchmarks. If it takes a very long time
for the problem to grow, but nothing about the workload really ever
changes, then that suggests problems that have this quality. The
probability of any given logical row being moved to another heap block
is very low. And yet it is inevitable that many (even all) will be,
given enough time, given enough opportunities to get unlucky.

> This has become a bit more pronounced with vacuum skipping index
> cleanup when there's "just a few" dead items - if all your updates
> concentrate in a small region, 2% of the whole relation size isn't
> actually that small.
The 2% threshold was chosen based on the observation that it was below
the effective threshold where autovacuum just won't ever launch
anything on a moderately sized table (unless you set
autovacuum_vacuum_scale_factor to something absurdly low). That is the
real problem, IMV. That's why I think that we need to drive it based
primarily on page-level characteristics, while effectively ignoring
pages that are all-visible when deciding if enough bloat is present to
necessitate vacuuming.

> 1) It's kind of OK for heap-only tuples to get a high OffsetNumber -
> we can reclaim them during pruning once they're dead. They don't
> leave behind a dead item that's unreclaimable until the next vacuum
> with an index cleanup pass.

I like the general direction here, but this particular idea doesn't
seem like a winner.

> 2) Arguably the OffsetNumber of a redirect target can be changed. It
> might break careless uses of WHERE ctid = ... though (which likely
> are already broken, just harder to hit).

That makes perfect sense to me, though.

> a) heap_page_prune_opt() should take the number of used items into
> account when deciding whether to prune. Right now we trigger hot
> pruning based on the number of items only if
> PageGetMaxOffsetNumber(page) >= MaxHeapTuplesPerPage. But because it
> requires a vacuum to reclaim an ItemId used for a root tuple, we
> should trigger HOT pruning when it might lower which OffsetNumbers
> get used.

Unsure about this.

> b) heap_page_prune_opt() should be triggered in more paths. E.g. when
> inserting / updating, we should prune if it allows us to avoid using
> a high OffsetNumber.

Unsure about this too. I prototyped a design that gives individual
backends soft ownership of heap blocks that were recently allocated,
and later prunes the heap page when it fills [1]. Useful for aborted
transactions, where it preserves locality -- leaving aborted tuples
behind means their space is ultimately reused for unrelated inserts,
which is bad.
But eager pruning allows the inserter to leave behind more or less
pristine heap pages, which don't need to be pruned later on.

> c) What if we left some percentage of ItemIds unused, when looking
> for the OffsetNumber of a new HOT row version? That'd make it more
> likely for non-HOT updates and inserts to fit onto the page, without
> permanently increasing the size of the line pointer array.

That sounds promising.

[1] https://postgr.es/m/CAH2-Wzm-VhVeQYTH8hLyYho2wdG8Ecrm0uPQJWjap6BOVfe9Og@mail.gmail.com

--
Peter Geoghegan
On Fri, Apr 8, 2022 at 1:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm. I wonder if we could teach the system to figure out which of
> those things is happening. In the case that I'm worried about, when
> we're considering growing the line pointer array, either the line
> pointers will be dead or the line pointers will be used but the tuples
> to which they point will be dead. In the case you describe here, there
> should be very few dead tuples or line pointers in the page. Maybe
> when the number of line pointers starts to get big, we refuse to add
> more without checking the number of dead tuples and dead line pointers
> and verifying that those numbers are still small. Or, uh, something.

It seems like the central idea is that we think in terms of "versions
per logical row", even in low-level code that traditionally hasn't made
those kinds of distinctions.

Ideally we could structure pruning a little bit more like a B-Tree page
split, where there is an explicit "incoming tuple" that won't fit
(without splitting the page, or maybe doing some kind of deletion). If
the so-called incoming tuple that we'd rather like to fit on the page
is an insert of an unrelated row, don't allow it (don't prune, give
up). But if it's an update (especially a HOT update), be much more
permissive about allowing it, and/or going ahead with pruning in order
to make sure it happens.

I like Andres' idea of altering LP_REDIRECTs just to be able to use up
lower line pointers first. Or preserving a few extra LP_UNUSED items on
insert. Those seem promising to me.

> One fly in the ointment is that if we refuse to expand the line
> pointer array, we might extend the relation instead, which is another
> kind of bloat and thus not great. But if the line pointer array is
> somehow filling up with tons of dead tuples, we're going to have to
> extend the relation anyway.
> I suspect that in some circumstances it's better to just accept that
> outcome and hope that it leads to some pages becoming empty, thus
> allowing their line pointer arrays to be reset.

I agree. Sometimes the problem is that we don't cut our losses when we
should -- sometimes just accepting a limited downside is the right
thing to do. Like with the FSM; we diligently use every last scrap of
free space, without concern for the bigger picture. It's penny-wise,
pound-foolish.

--
Peter Geoghegan