Making the initial and maximum DSA segment sizes configurable

From: Masahiko Sawada
Date:
Hi all,

I started this new thread from another thread[1] where we're
discussing a new storage for TIDs, TidStore, since we found a
difficulty with the memory usage limit for TidStores on DSA.

TidStore is a new data structure to efficiently store TIDs, backed by
a radix tree. In the patch series proposed on that thread, in addition
to the radix tree and TidStore, there is another patch for lazy
(parallel) vacuum to replace the array of dead tuple TIDs with a
TidStore. To support parallel vacuum, the radix tree (and TidStore)
can be created in local memory as well as on DSA. It also has memory
usage limit functionality; we can pass a memory limit (e.g.,
maintenance_work_mem) to the TidStoreCreate() function. Once the total
DSA segment size (area->control->total_segment_size) exceeds the
limit, TidStoreIsFull() returns true. Lazy vacuum can continue
scanning heap blocks to collect dead tuple TIDs until TidStoreIsFull()
returns true. Currently lazy vacuum is the sole user of TidStore, but
it could be used by other code such as tidbitmap.c, where memory usage
would be limited by work_mem.
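
To give a concrete picture, here is a rough sketch of how lazy vacuum
might drive this; the function and helper names are only placeholders
for whatever the in-progress patch actually exposes:

/* Rough sketch -- signatures and helper names are placeholders. */
TidStore   *ts;

ts = TidStoreCreate((size_t) maintenance_work_mem * 1024, dsa);

for (blkno = 0; blkno < nblocks; blkno++)
{
    /* ... prune the page and record its dead tuple TIDs ... */
    TidStoreSetBlockOffsets(ts, blkno, offsets, noffsets);

    /* Stop collecting once the total DSA segment size exceeds the limit. */
    if (TidStoreIsFull(ts))
        break;
}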

During development, we found that DSA memory growth is unpredictable,
which makes it hard to enforce the memory limit precisely.

DSA is built on top of DSM segments and manages a set of DSM segments,
adding new segments as required and detaching them when they are no
longer needed. The first DSA segment is 1MB in size, and each new
segment is at least big enough to follow a geometric series that
approximately doubles the total storage each time a new segment is
created. Because of this, it's not effective to simply compare the
memory limit to the total segment size. For example, if
maintenance_work_mem is 512MB, the total segment size evolves like this:

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) = 510MB -> less than the
limit, continue heap scan.

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + 256 = 766MB -> stop
(exceeds the limit by 254MB).
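
The figures above follow from a simple model of the growth rule, where
each segment size is used twice and then doubles, starting at 1MB:

/* Simplified model of the segment growth, ignoring per-segment overhead. */
size_t      total = 0;      /* total segment size so far, in MB */
size_t      seg = 1;        /* next segment size, in MB */
int         nsegs = 0;

while (total <= 512)        /* the maintenance_work_mem limit, in MB */
{
    total += seg;           /* create one more segment */
    if (++nsegs % 2 == 0)
        seg *= 2;           /* double the size every second segment */
}
/* Ends with total = 766MB after 17 segments, 254MB over the limit. */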

One might think we could use dsa_set_size_limit(), but that doesn't
work here: lazy vacuum would end up with an allocation error. And if we
pass DSA_ALLOC_NO_OOM, we might end up stopping an insertion halfway
through.

Besides over-allocating memory, since the initial DSM segment size is
fixed at 1MB, the memory usage of a shared TidStore starts from 1MB or
more. This is higher than the minimum values of both work_mem (64kB)
and maintenance_work_mem (1MB). Increasing the minimum m_w_m to 2MB
might be acceptable, but that is not an option for work_mem.

Researching possible solutions, we found that aset.c has a similar
characteristic: it allocates an 8kB block (by default) upon the first
allocation in a context, and doubles that size for each successive
block request. However, callers can specify both the initial block size
and the maximum block size. This led to the idea of letting callers
specify both values for DSA as well, calculated based on m_w_m. I've
attached a patch for this idea. The changes to dsa.c are
straightforward since dsa.c already uses the macros
DSA_INITIAL_SEGMENT_SIZE and DSA_MAX_SEGMENT_SIZE; I just made these
values configurable.
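
For instance, creating an area with custom sizes might look roughly
like this (the function name and argument order here are illustrative,
not necessarily what the attached patch uses):

/* Illustrative only -- the patch may name or order things differently. */
dsa_area   *area;

area = dsa_create_extended(tranche_id,
                           1024 * 1024,         /* initial segment size: 1MB */
                           64 * 1024 * 1024);   /* maximum segment size: 64MB */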

FYI, with this patch we can create a DSA area in parallel_vacuum_init()
with initial and maximum segment sizes derived from m_w_m as follows
(a sketch of the calculation is shown below):

initial segment size = min(m_w_m / 4, 1MB)
maximum segment size = max(m_w_m / 8, 8MB)
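
In C, since maintenance_work_mem is a GUC measured in kilobytes, the
calculation might look like this (a sketch only; Min/Max are the usual
PostgreSQL macros):

/* Sketch of the size calculation in parallel_vacuum_init(); illustrative only. */
size_t      limit = (size_t) maintenance_work_mem * 1024;   /* m_w_m is in kB */
size_t      init_segment_size = Min(limit / 4, 1024 * 1024);       /* <= 1MB */
size_t      max_segment_size = Max(limit / 8, 8 * 1024 * 1024);    /* >= 8MB */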

In most cases we can start with a 1MB initial segment, the same as
before. For larger limits, the heap scan stops after DSA has allocated
at most about 1.25 times m_w_m. For example, if m_w_m = 512MB, the
initial and maximum segment sizes are 1MB and 64MB respectively, and
DSA allocates segments as follows until heap scanning stops:

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 4) = 510MB -> less than the
limit, continue heap scan.

2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 5) = 574MB -> stop
(62MB over the limit).

It also works with smaller limits; if the limit is 1MB, we start with a
256kB initial segment and heap scanning stops after DSA has allocated
1.5MB (= 256kB + 256kB + 512kB + 512kB).

There is room to consider better formulas for the initial and maximum
segment sizes, but making both values configurable seems like a
promising idea, and behaving analogously to aset.c should be good for
readability and maintainability. There is another test result where I
used this idea on top of the radix tree[2].

We do need to keep an eye on the total number of allocated DSA
segments, since the total number of DSM segments available on the
system is fixed[3]. But that doesn't seem problematic even with this
patch, since we allocate only a few additional segments (17 vs. 19
segments in the examples above). There was also no big performance
difference[2].

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDBmD5q%3DeO%2BK%3DgyuVt53XvwpJ2dgxPwrtZ-eVOjVmtJjg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD21AoDKr%3D4YHphy6cRojE5eyT6E2ao8xb44E309eTrUEOC6xw%40mail.gmail.com
[3] from dsm.c, the total number of DSM segments available on the
system is calculated by:
#define PG_DYNSHMEM_FIXED_SLOTS         64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND   5
maxitems = PG_DYNSHMEM_FIXED_SLOTS
    + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments

Re: Making the initial and maximum DSA segment sizes configurable

From: Masahiko Sawada
Date:
Hi,

On Wed, Mar 22, 2023 at 12:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Hi all,
>
> I started this new thread from another thread[1] where we're
> discussing a new storage for TIDs, TidStore, since we found a
> difficulty with the memory usage limit for TidStores on DSA.
>
> TidStore is a new data structure to efficiently store TIDs, backed by
> a radix tree. In the patch series proposed on that thread, in addition
> to the radix tree and TidStore, there is another patch for lazy
> (parallel) vacuum to replace the array of dead tuple TIDs with a
> TidStore. To support parallel vacuum, the radix tree (and TidStore)
> can be created in local memory as well as on DSA. It also has memory
> usage limit functionality; we can pass a memory limit (e.g.,
> maintenance_work_mem) to the TidStoreCreate() function. Once the
> total DSA segment size (area->control->total_segment_size) exceeds
> the limit, TidStoreIsFull() returns true. Lazy vacuum can continue
> scanning heap blocks to collect dead tuple TIDs until
> TidStoreIsFull() returns true. Currently lazy vacuum is the sole user
> of TidStore, but it could be used by other code such as tidbitmap.c,
> where memory usage would be limited by work_mem.
>
> During development, we found that DSA memory growth is unpredictable,
> which makes it hard to enforce the memory limit precisely.
>
> DSA is built on top of DSM segments and manages a set of DSM
> segments, adding new segments as required and detaching them when
> they are no longer needed. The first DSA segment is 1MB in size, and
> each new segment is at least big enough to follow a geometric series
> that approximately doubles the total storage each time a new segment
> is created. Because of this, it's not effective to simply compare the
> memory limit to the total segment size. For example, if
> maintenance_work_mem is 512MB, the total segment size evolves like
> this:
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) = 510MB -> less than the
> limit, continue heap scan.
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + 256 = 766MB -> stop
> (exceeds the limit by 254MB).
>
> One might think we could use dsa_set_size_limit(), but that doesn't
> work here: lazy vacuum would end up with an allocation error. And if
> we pass DSA_ALLOC_NO_OOM, we might end up stopping an insertion
> halfway through.
>
> Besides over-allocating memory, since the initial DSM segment size is
> fixed at 1MB, the memory usage of a shared TidStore starts from 1MB
> or more. This is higher than the minimum values of both work_mem
> (64kB) and maintenance_work_mem (1MB). Increasing the minimum m_w_m
> to 2MB might be acceptable, but that is not an option for work_mem.
>
> Researching possible solutions, we found that aset.c has a similar
> characteristic: it allocates an 8kB block (by default) upon the first
> allocation in a context, and doubles that size for each successive
> block request. However, callers can specify both the initial block
> size and the maximum block size. This led to the idea of letting
> callers specify both values for DSA as well, calculated based on
> m_w_m. I've attached a patch for this idea. The changes to dsa.c are
> straightforward since dsa.c already uses the macros
> DSA_INITIAL_SEGMENT_SIZE and DSA_MAX_SEGMENT_SIZE; I just made these
> values configurable.
>
> FYI, with this patch we can create a DSA area in
> parallel_vacuum_init() with initial and maximum segment sizes derived
> from m_w_m as follows:
>
> initial segment size = min(m_w_m / 4, 1MB)
> maximum segment size = max(m_w_m / 8, 8MB)
>
> In most cases we can start with a 1MB initial segment, the same as
> before. For larger limits, the heap scan stops after DSA has
> allocated at most about 1.25 times m_w_m. For example, if m_w_m =
> 512MB, the initial and maximum segment sizes are 1MB and 64MB
> respectively, and DSA allocates segments as follows until heap
> scanning stops:
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 4) = 510MB -> less than
> the limit, continue heap scan.
>
> 2 * (1 + 2 + 4 + 8 + 16 + 32 + 64) + (64 * 5) = 574MB -> stop
> (62MB over the limit).
>
> It also works with smaller limits; if the limit is 1MB, we start with
> a 256kB initial segment and heap scanning stops after DSA has
> allocated 1.5MB (= 256kB + 256kB + 512kB + 512kB).
>
> There is room to consider better formulas for the initial and maximum
> segment sizes, but making both values configurable seems like a
> promising idea, and behaving analogously to aset.c should be good for
> readability and maintainability. There is another test result where I
> used this idea on top of the radix tree[2].
>
> We do need to keep an eye on the total number of allocated DSA
> segments, since the total number of DSM segments available on the
> system is fixed[3]. But that doesn't seem problematic even with this
> patch, since we allocate only a few additional segments (17 vs. 19
> segments in the examples above). There was also no big performance
> difference[2].
>

The last time I posted this, the timing was not great since it was
close to the feature freeze, and the email was very long. The TidStore
and radix tree development is still in progress[1], and this change is
still necessary. I'd like to summarize the problem and proposal:

* Both the initial DSA segment size and the maximum DSA segment size
are currently fixed values: 1MB and 1TB respectively.
* The total allocated DSA segment size follows a geometric series.
* The patch makes both the initial and maximum DSA segment sizes
configurable, which helps to:
    * minimize wasted memory when the total DSA segment size reaches
the limit set by the caller;
    * create a data structure with a small amount of memory, for
example 64kB, the minimum value of work_mem.

According to the recent discussion, it might be sufficient to make
only the maximum DSA segment size configurable.

I'll register this item for the next commit fest.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoDjTbp2SHn4hRzAxWNeYArn4Yd4UdH9XRoNzdrYWNgExw%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



Re: Making the initial and maximum DSA segment sizes configurable

From: Tomas Vondra
Date:
Hello Masahiko-san,

I'm not super-familiar with the DSA/DSM stuff, but I think your proposal
makes sense.

I agree with your observation that DSA is a bit like AllocSet, so if
that allows specifying min/max block size, maybe DSA should allow the
same thing for segments ...

However, does it actually address the problem you've described? If I
understand it correctly, the problem is that with the doubling logic,
we're likely to overshoot the limit. For example with m_w_m=512MB we
first undershoot it a bit (510MB), and then overshoot it a lot (766MB).
Which is not great, I agree (especially the overshooting).

But is the modification you propose much better? I mean, we still
overshoot the limit, right? By a smaller amount (just 62MB instead of
254MB), but it's still more than the limit ...

Moreover, this really depends on the caller using a lower init/max
segment size, right? I'd bet most places would just hard-code
something, which means it won't respond to changes in the m_w_m value.


Could we instead allow specifying the expected size / memory limit,
calculate the maximum segment size in the DSA code, and also modify how
the segment size evolves over time so that it decreases as we get
closer to the expected size / limit?

For example, let's say we'd create DSA with 512MB limit. Then we could
do this:

1MB, 2MB, 4MB, ..., 128MB, 256MB, 1MB, 1MB, ...

because after 256MB we have 511MB of segments (in total), and we have to
go all the way back to the smallest segment to not exceed the limit (or
to minimize how much we exceed it). If the limit was set to 600MB, we'd
go back to 64MB, then 16MB, etc.

Or maybe we could be smarter and calculate an "optimal point" at which
to start decreasing the segment size, roughly half-way through. So we'd
end up with something like

1MB, 2MB, 4MB, ..., 128MB, 128MB, 64MB, 32MB, 16MB, ..., 1MB

But maybe that's unnecessarily complicated ... or maybe I'm missing some
details that make this impossible for the DSA/DSM code.

FWIW the aset.c code has the same problem - it's not aware of limits
like work_mem / maintenance_work_mem, and with hard-coded block sizes
we may easily exceed those limits (if we get to sufficiently large
blocks, although in most cases the max block is 8MB, which limits how
much we overshoot the limit). Not sure if that's an issue in practice,
maybe the virtual memory thing deals with this for us.


If you choose to go with passing the min/max segment size to DSA, maybe
this should do a similar thing to aset.c and define a couple "good"
values (like ALLOCSET_DEFAULT_SIZES, ALLOCSET_SMALL_SIZES, ...) and/or a
macro to calculate good segment sizes for a given limit.
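
Purely as an illustration, such a macro might look like this (the name
is made up; the formulas are just the ones from your earlier mail):

/* Hypothetical -- this macro does not exist; it only shows the idea of
 * deriving (initial, maximum) DSA segment sizes from a caller-supplied
 * memory limit in bytes. */
#define DSA_SEGMENT_SIZES_FOR_LIMIT(limit) \
    Min((limit) / 4, 1024 * 1024), Max((limit) / 8, 8 * 1024 * 1024)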

Also, there's a comment:

 * See dsa_create() for a note about the tranche arguments.

which should probably reference dsa_create_extended() instead.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Making the initial and maximum DSA segment sizes configurable

From: Masahiko Sawada
Date:
Hi,

Thank you for the comments!

On Mon, Feb 26, 2024 at 6:58 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> Hello Masahiko-san,
>
> I'm not super-familiar with the DSA/DSM stuff, but I think your proposal
> makes sense.
>
> I agree with your observation that DSA is a bit like AllocSet, so if
> that allows specifying min/max block size, maybe DSA should allow the
> same thing for segments ...
>
> However, does it actually address the problem you've described? If I
> understand it correctly, the problem is that with the doubling logic,
> we're likely to overshoot the limit. For example with m_w_m=512MB we
> first undershoot it a bit (510MB), and then overshoot it a lot (766MB).
> Which is not great, I agree (especially the overshooting).
>
> But is the modification you propose much better? I mean, we still
> overshoot the limit, right? By a smaller amount (just 62MB instead of
> 254MB), but it's still more than the limit ...
>
> Moreover, this really depends on the caller using a lower init/max
> segment size, right? I'd bet most places would just hard-code
> something, which means it won't respond to changes in the m_w_m value.
>
>
> Could we instead allow specifying the expected size / memory limit,
> calculate the maximum segment size in the DSA code, and also modify
> how the segment size evolves over time so that it decreases as we get
> closer to the expected size / limit?
>
> For example, let's say we'd create DSA with 512MB limit. Then we could
> do this:
>
> 1MB, 2MB, 4MB, ..., 128MB, 256MB, 1MB, 1MB, ...
>
> because after 256MB we have 511MB of segments (in total), and we have to
> go all the way back to the smallest segment to not exceed the limit (or
> to minimize how much we exceed it). If the limit was set to 600MB, we'd
> go back to 64MB, then 16MB, etc.

Interesting idea. In fact, since we use each segment size two times, we
would get something like:

1MB, 1MB, 2MB, 2MB, ... 128MB, 128MB = 510MB (continue)

then, stepping back down to small segments:

2MB, 1MB = 513MB (stop)

With a 600MB limit, we would get:

1MB, 1MB, 2MB, 2MB, ... 128MB, 128MB = 510MB (continue)
64MB + 16MB + 8MB + 2MB + 1MB = 601MB (stop)
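
A rough sketch of that "largest segment that still fits" rule could
look like this (illustrative only; the real dsa.c code would also have
to deal with segment headers and its power-of-two sizing):

/* Choose the next segment size (in MB): the largest power-of-two size
 * that keeps the total at or under the limit, falling back to the 1MB
 * minimum when nothing fits anymore. */
static size_t
next_segment_size_mb(size_t total_mb, size_t limit_mb, size_t max_seg_mb)
{
    size_t      size = max_seg_mb;

    while (size > 1 && total_mb + size > limit_mb)
        size /= 2;

    return size;    /* caller stops once the total finally exceeds the limit */
}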

>
> Or maybe we could be smarter and calculate an "optimal point" at
> which to start decreasing the segment size, roughly half-way through.
> So we'd end up with something like
>
> 1MB, 2MB, 4MB, ..., 128MB, 128MB, 64MB, 32MB, 16MB, ..., 1MB

I remember John proposed a similar idea[1]. Quoting from the email:

m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to
the DSA area.

2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so
"stairstep down" the new segment sizes:

766 + 2*(128) + 64 = 1086MB -> stop


Both are interesting ideas. The reason I proposed my idea is its
simplicity; it is simple and its usage is analogous to aset.c.

I guess the latter idea (the soft limit idea) could also be implemented
in a simple way. It's worth trying.

>
> But maybe that's unnecessarily complicated ... or maybe I'm missing some
> details that make this impossible for the DSA/DSM code.
>
> FWIW the aset.c code has the same problem - it's not aware of limits
> like work_mem / maintenance_work_mem, and with hard-coded block sizes
> we may easily exceed those limits (if we get to sufficiently large
> blocks, although in most cases the max block is 8MB, which limits how
> much we overshoot the limit). Not sure if that's an issue in practice,
> maybe the virtual memory thing deals with this for us.

Right, we use an 8MB max block size in most cases and it works well in
aset.c. On the other hand, since dsm.c has a limit on the total number
of DSM segments, we cannot use unnecessarily small segments.

>
> If you choose to go with passing the min/max segment size to DSA, maybe
> this should do a similar thing to aset.c and define a couple "good"
> values (like ALLOCSET_DEFAULT_SIZES, ALLOCSET_SMALL_SIZES, ...) and/or a
> macro to calculate good segment sizes for a given limit.

Agreed.

>
> Also, there's a comment:
>
>  * See dsa_create() for a note about the tranche arguments.
>
> which should probably reference dsa_create_extended() instead.

Thanks, will fix it.

Regards,

[1] https://www.postgresql.org/message-id/CAFBsxsGiiyY%2BwykVLBbN9hFUMiNHqEr_Kqg9Mpc%3Duv4sg8eagQ%40mail.gmail.com


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com