Re: [PoC] Improve dead tuple storage for lazy vacuum
From | Masahiko Sawada
Subject | Re: [PoC] Improve dead tuple storage for lazy vacuum
Date |
Msg-id | CAD21AoAYA0fnPC7ww2v1USvn3VQ52Zn5XK6gyhWbaXuXn16Unw@mail.gmail.com
In reply to | Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
List | pgsql-hackers
On Mon, Dec 19, 2022 at 4:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Dec 12, 2022 at 7:14 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > > >
> > > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
> > > >
> > > > The minimum requirement is 2MB. In the PoC patch, TIDStore checks how
> > > > big the radix tree is using dsa_get_total_size(). If the size returned
> > > > by dsa_get_total_size() (+ some memory used by TIDStore meta
> > > > information) exceeds maintenance_work_mem, lazy vacuum starts to do
> > > > index vacuum and heap vacuum. However, when allocating DSA memory for
> > > > radix_tree_control at creation, we allocate 1MB
> > > > (DSA_INITIAL_SEGMENT_SIZE) of DSM memory and take the memory required
> > > > for radix_tree_control from it. So dsa_get_total_size() returns 1MB
> > > > even if no TIDs have been collected yet.
> > >
> > > 2MB makes sense.
> > >
> > > If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
> > >
> > > > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: stop, insert into the store, and check the store's memory usage before continuing.
> > > >
> > > > Right, I think it's no problem in slab cases. In DSA cases, the new
> > > > segment size follows a geometric series that approximately doubles the
> > > > total storage each time we create a new segment. This behavior comes
> > > > from the fact that the underlying DSM system isn't designed for large
> > > > numbers of segments.
> > >
> > > And taking a look, the size of a new segment can get quite large. It seems we could test whether the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >= 8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
> > >
> > > maintenance_work_mem = 256MB, so stop if we go over 128MB:
> > >
> > > 2*(1+2+4+8+16+32) = 126MB -> keep going
> > > 126MB + 64MB = 190MB -> stop
> > >
> > > That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
> >
> > Right. In this case, even if we allocate 64MB, we will use only 2088
> > bytes at maximum. So I think the memory space used for vacuum is
> > practically limited to half.
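
As an aside, the limit check described above (comparing the whole DSA area size plus TIDStore metadata against maintenance_work_mem) is straightforward to express. The following is only a sketch of that description, not code from the PoC patch; the TidStore struct and its fields are hypothetical stand-ins:

#include "postgres.h"
#include "miscadmin.h"      /* maintenance_work_mem, in kilobytes */
#include "utils/dsa.h"

/* Hypothetical handle; the PoC's real TIDStore struct differs. */
typedef struct TidStore
{
    dsa_area   *area;       /* DSA area backing the radix tree */
    size_t      meta_bytes; /* TIDStore meta information */
} TidStore;

/*
 * Would lazy vacuum have to stop collecting TIDs and run a round of
 * index/heap vacuuming?  Charges the whole DSA area, including the
 * 1MB initial segment, as described in the message above.
 */
static bool
tidstore_over_limit(TidStore *ts)
{
    size_t      limit = (size_t) maintenance_work_mem * 1024;  /* KB -> bytes */

    return dsa_get_total_size(ts->area) + ts->meta_bytes > limit;
}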
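
The segment-growth arithmetic in John's example can be verified with a small standalone program. This sketch hard-codes the DSA behavior described in the thread (a 1MB initial segment, and two segments at each size before doubling); the halved limit is the heuristic proposed above, not current PostgreSQL behavior:

#include <stdio.h>

int
main(void)
{
    const long  mwm = 256;          /* maintenance_work_mem, in MB */
    const long  limit = mwm / 2;    /* proposed: stop once we exceed 128MB */
    long        seg_size = 1;       /* DSA's initial segment size: 1MB */
    long        total = 0;
    int         nsegs_at_size = 0;

    while (total <= limit)
    {
        total += seg_size;
        printf("new %2ldMB segment -> total %3ldMB\n", seg_size, total);

        /* DSA creates two segments of each size before doubling */
        if (++nsegs_at_size == 2)
        {
            seg_size *= 2;
            nsegs_at_size = 0;
        }
    }
    return 0;
}

With maintenance_work_mem = 256MB, the last two totals printed are 126MB (keep going) and 190MB (stop), matching the example.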
> > >
> > > And as for the minimum possible maintenance_work_mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal is set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
> >
> > Looking at other code using DSA, such as tidbitmap.c and nodeHash.c, it
> > seems that they look only at memory that is actually dsa_allocate'd. To
> > be exact, we estimate the number of hash buckets based on work_mem (and
> > hash_mem_multiplier) and use it as the upper limit. So I've confirmed
> > that the result of dsa_get_total_size() can exceed the limit. I'm not
> > sure this is a known and legitimate usage. If we can follow such usage,
> > we can probably track how much dsa_allocate'd memory is used in the
> > radix tree.
>
> I've experimented with this idea. The newly added 0008 patch changes
> the radix tree so that it counts the memory usage for both local and
> shared cases.

I've attached updated version patches to make cfbot happy.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
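
As an illustrative aside, the accounting idea described for the 0008 patch (counting the bytes actually handed out by the allocator rather than the total segment size) might look roughly like the following minimal sketch; all type and function names here are hypothetical, not taken from the patch:

#include <stdbool.h>
#include <stddef.h>

/*
 * Per-tree accounting state.  In the shared case allocations come from
 * a DSA area and in the local case from a backend-local allocator, but
 * the bookkeeping is the same either way.
 */
typedef struct rt_mem_accounting
{
    bool        shared;     /* backed by a DSA area? */
    size_t      mem_used;   /* bytes currently handed out */
} rt_mem_accounting;

/* Call wherever the tree allocates a node of 'size' bytes. */
static inline void
rt_count_alloc(rt_mem_accounting *acc, size_t size)
{
    acc->mem_used += size;
}

/* Call wherever a node of 'size' bytes is freed. */
static inline void
rt_count_free(rt_mem_accounting *acc, size_t size)
{
    acc->mem_used -= size;
}

/*
 * What vacuum would compare against maintenance_work_mem instead of
 * dsa_get_total_size(), so unused segment space isn't charged.
 */
static inline size_t
rt_memory_usage(const rt_mem_accounting *acc)
{
    return acc->mem_used;
}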
Attachments
- v15-0008-PoC-calculate-memory-usage-in-radix-tree.patch
- v15-0005-tool-for-measuring-radix-tree-performance.patch
- v15-0007-PoC-DSA-support-for-radix-tree.patch
- v15-0009-PoC-lazy-vacuum-integration.patch
- v15-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch
- v15-0004-Use-bitmapword-for-node-125.patch
- v15-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
- v15-0003-Add-radix-implementation.patch
- v15-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch