Discussion: Proposal: Limitations of palloc inside checkpointer
Hi, hackers!

Historically, the checkpointer process uses palloc() in the
AbsorbSyncRequests() function. Therefore, the checkpointer does not
expect to receive a request larger than 1 GB.

We encountered a case where the database went into recovery; after
applying all the WAL, the checkpointer process generated an "invalid
memory alloc request size" error and entered a loop. But it is quite
normal for recovery to produce such a large allocation request.

A simple solution to this problem is to use palloc_extended() instead
of palloc(). But is it safe to allow the checkpointer to allocate so
much memory at once? I propose to update this memory allocation, but I
need your ideas and advice on how to do it in an appropriate way. As
one idea, we could replace the array with a list of arrays, to allocate
memory in chunks. As a bad idea, we could process a temporary array
without locking.

I would be glad to hear your ideas and suggestions about this topic.
Have a nice day!

--
Ekaterina Sokolova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
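For context, a standalone sketch of the limits in play (the constants
mirror src/include/utils/memutils.h; the 24-byte request size is an
assumption for illustration, not taken from the sources):

    #include <stdio.h>

    #define MaxAllocSize      ((size_t) 0x3fffffff)   /* ~1 GB, plain palloc() limit */
    #define MaxAllocHugeSize  ((size_t) -1 >> 1)      /* limit with MCXT_ALLOC_HUGE */

    int
    main(void)
    {
        size_t nrequests = 50 * 1000 * 1000;  /* queue depth on a huge shared_buffers */
        size_t reqsize   = 24;                /* assumed sizeof(CheckpointerRequest) */
        size_t total     = nrequests * reqsize;   /* 1.2 GB */

        if (total > MaxAllocSize)
            printf("palloc(%zu) fails: invalid memory alloc request size\n", total);
        if (total <= MaxAllocHugeSize)
            printf("palloc_extended(%zu, MCXT_ALLOC_HUGE) would be accepted\n", total);
        return 0;
    }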
Hi!

On Tue, 25 Feb 2025 at 22:44, Ekaterina Sokolova <e.sokolova@postgrespro.ru> wrote:
> Historically, the checkpointer process uses palloc() in the
> AbsorbSyncRequests() function. Therefore, the checkpointer does not
> expect to receive a request larger than 1 GB.

Yeah. And the most unpleasant thing is it won't simply fail with an
error or a helpful message suggesting a workaround (reduce the amount
of shared memory). The checkpointer will just get "stuck".

AFAICS, we have a few options:
1. Leave it as it is, but FATAL on allocation of a chunk larger than 1 GB.
2. Use palloc_extended() with the MCXT_ALLOC_HUGE flag.
3. Do not use any allocation and use CheckpointerShmem->requests
directly in case the required allocation is larger than 1 GB.

Case (3) is not an option, in my opinion. So, we're left with (1) or
(2). Personally, I'm for (2). PFA v0 patch.
--
Best regards,
Maxim Orlov

Attachments
Hi,

On 2025-02-26 11:46:45 +0300, Maxim Orlov wrote:
> On Tue, 25 Feb 2025 at 22:44, Ekaterina Sokolova <e.sokolova@postgrespro.ru> wrote:
> > Hi, hackers!
> >
> > Historically, the checkpointer process uses palloc() in the
> > AbsorbSyncRequests() function. Therefore, the checkpointer does not
> > expect to receive a request larger than 1 GB.
>
> Yeah. And the most unpleasant thing is it won't simply fail with an
> error or a helpful message suggesting a workaround (reduce the amount
> of shared memory). The checkpointer will just get "stuck".
>
> AFAICS, we have a few options:
> 1. Leave it as it is, but FATAL on allocation of a chunk larger than 1 GB.
> 2. Use palloc_extended() with the MCXT_ALLOC_HUGE flag.
> 3. Do not use any allocation and use CheckpointerShmem->requests
> directly in case the required allocation is larger than 1 GB.

4) Do compaction incrementally, instead of doing it for all requests
at once.

That'd probably be better, because

a) it'll take some time to compact tens to hundreds of millions of
requests, which makes it much more likely that backends will have to
perform syncs themselves and the lock will be held for an extended
period of time;

b) allocating gigabytes of memory obviously makes it more likely that
you'll fail with out-of-memory at runtime or even get OOM killed.

Greetings,

Andres Freund
I think I figured it out. Here is v4.

If the requests fit into less than 1 GB, the algorithm stays the same
as before. If we need to process more, we do it incrementally, in
slices of 1 GB.

Best regards,
Maxim Orlov
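If I read that right, the slice loop has roughly this shape (a
standalone sketch with stand-in types, not the patch itself; the real
code does the copying and memmove() under CheckpointerCommLock):

    #include <stdlib.h>
    #include <string.h>

    #define MaxAllocSize ((size_t) 0x3fffffff)

    typedef struct CheckpointerRequest { char bytes[24]; } CheckpointerRequest; /* stand-in */

    static void
    absorb_in_slices(CheckpointerRequest *queue, size_t *num_requests)
    {
        const size_t slice_cap = MaxAllocSize / sizeof(CheckpointerRequest);

        while (*num_requests > 0)
        {
            size_t n = *num_requests < slice_cap ? *num_requests : slice_cap;
            CheckpointerRequest *slice = malloc(n * sizeof(CheckpointerRequest)); /* palloc() in PG */

            /* Copy one slice out, then close the gap so the queue stays dense. */
            memcpy(slice, queue, n * sizeof(CheckpointerRequest));
            memmove(queue, queue + n, (*num_requests - n) * sizeof(CheckpointerRequest));
            *num_requests -= n;

            /* ... release the lock, RememberSyncRequest() each slice entry ... */
            free(slice);
        }
    }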
Hi,

The patch itself looks ok to me. I'm curious about the trade-offs
between this incremental approach and the alternative of using
palloc_extended() with the MCXT_ALLOC_HUGE flag. The approach of
splitting the requests into fixed-size slices avoids OOM failures or
process termination by the OOM killer, which is good. However, it does
add some overhead with additional lock acquisition/release cycles and
memory movement operations via memmove(). The natural question is
whether the added safety justifies the cost.

Regarding the slice size of 1 GB: is it derived from the MaxAllocSize
limit, or was it chosen for performance reasons? I wonder whether a
different size might offer better performance under typical workloads.
On 12/03/2025 13:00, Maxim Orlov wrote:
> On Wed, 12 Mar 2025 at 10:27, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>> The patch itself looks ok to me. I'm curious about the trade-offs
>> between this incremental approach and the alternative of using
>> palloc_extended() with the MCXT_ALLOC_HUGE flag. The approach of
>> splitting the requests into fixed-size slices avoids OOM failures or
>> process termination by the OOM killer, which is good.

+1

>> However, it does add some overhead with additional lock
>> acquisition/release cycles and memory movement operations via
>> memmove(). The natural question is whether the added safety
>> justifies the cost.

Hmm, if you turned the array into a ring buffer, you could absorb part
of the queue without the memmove().

With that, I'd actually suggest using a much smaller allocation, maybe
10000 entries or even less. That should be enough to keep the lock
acquire/release overhead acceptable.

>> Regarding the slice size of 1 GB: is it derived from the
>> MaxAllocSize limit, or was it chosen for performance reasons?
>
> I think 1 GB is derived purely from MaxAllocSize. This "palloc" is a
> relatively old one, and no one expected the number of requests to
> exceed 1 GB. Now we have the ability to set shared_buffers to a huge
> number (without discussing now whether this makes any real sense),
> thus this limit for palloc becomes a problem.

Yeah. Frankly, even 1 GB seems excessive for this. We set max_requests
= NBuffers, which makes some sense so that you can fit the worst-case
scenario of quickly evicting all pages from the buffer cache. But even
then, I'd expect the checkpointer to be able to absorb the changes
before the queue fills up. And we have the compaction logic now, too.

So I wonder if we should cap max_requests at, say, 10 million requests now?

CompactCheckpointerRequestQueue() makes a pretty large allocation too,
BTW, although much smaller than the one in AbsorbSyncRequests(). The
hash table it builds could grow quite large, though.

--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,

I've updated and rebased Maxim's v4 patch with the optimizations Heikki
proposed: a ring buffer and a capped max_requests. Here is a summary of
the changes in v5 (a simplified sketch of the ring-buffer absorb loop
follows the list):
• Replaced the linear array model in AbsorbSyncRequests() with a bounded ring buffer to avoid large memmove() operations when processing sync requests.
• Introduced a tunable batch size (CKPT_REQ_BATCH_SIZE, default: 10,000) to incrementally absorb requests while respecting MaxAllocSize.
• Capped max_requests with a hard upper bound (MAX_CHECKPOINT_REQUESTS = 10,000,000) to avoid pathological memory growth when shared_buffers is very large.
• Updated CompactCheckpointerRequestQueue() to support the ring buffer layout.
• Retained the existing compacting logic but modified it to operate in-place over the ring.
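A minimal standalone sketch of that loop, assuming a 24-byte request
struct (the real CheckpointerRequest layout and the locking are elided;
CKPT_REQ_BATCH_SIZE matches the patch). Consuming from the head only
advances an index, so no memmove() is needed:

    #include <stdlib.h>

    #define CKPT_REQ_BATCH_SIZE 10000   /* matches the patch */

    typedef struct CheckpointerRequest { char bytes[24]; } CheckpointerRequest; /* stand-in */

    typedef struct
    {
        CheckpointerRequest *requests;  /* ring of max_requests slots */
        int head;                       /* index of the oldest pending request */
        int num_requests;               /* live entries */
        int max_requests;
    } Queue;

    static void
    absorb_batches(Queue *q)
    {
        CheckpointerRequest *batch = malloc(CKPT_REQ_BATCH_SIZE * sizeof(CheckpointerRequest));

        while (q->num_requests > 0)
        {
            /* ... acquire CheckpointerCommLock exclusively ... */
            int n = q->num_requests < CKPT_REQ_BATCH_SIZE ? q->num_requests
                                                          : CKPT_REQ_BATCH_SIZE;

            for (int i = 0; i < n; i++)
                batch[i] = q->requests[(q->head + i) % q->max_requests];

            q->head = (q->head + n) % q->max_requests;  /* just advance head: no memmove() */
            q->num_requests -= n;

            /* ... release the lock, then RememberSyncRequest() each batch[i] ... */
        }
        free(batch);
    }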
>> The patch itself looks ok to me. I'm curious about the trade-offs
>> between this incremental approach and the alternative of using
>> palloc_extended() with the MCXT_ALLOC_HUGE flag. The approach of
>> splitting the requests into fixed-size slices avoids OOM failures or
>> process termination by the OOM killer, which is good.

+1

>> However, it does add some overhead with additional lock
>> acquisition/release cycles and memory movement operations via
>> memmove(). The natural question is whether the added safety
>> justifies the cost.
Hmm, if you turned the array into a ring buffer, you could absorb part
of the queue without the memmove().

With that, I'd actually suggest using a much smaller allocation, maybe
10000 entries or even less. That should be enough to keep the lock
acquire/release overhead acceptable.
>> Regarding the slice size of 1 GB: is it derived from the
>> MaxAllocSize limit, or was it chosen for performance reasons?
>> I wonder whether a different size might offer better performance
>> under typical workloads.
>
> I think 1 GB is derived purely from MaxAllocSize. This "palloc" is a
> relatively old one, and no one expected the number of requests to exceed
> 1 GB. Now we have the ability to set the shared_buffers to a huge number
> (without discussing now whether this makes any real sense), thus this
> limit for palloc becomes a problem.
Yeah. Frankly even 1 GB seems excessive for this. We set max_requests =
NBuffers, which makes some sense so that you can fit the worst case
scenario of quickly evicting all pages from the buffer cache. But even
then I'd expect the checkpointer to be able to absorb the changes before
the queue fills up. And we have the compaction logic now too.
So I wonder if we should cap max_requests at, say, 10 million requests now?
CompactCheckpointerRequestQueue() makes a pretty large allocation too,
BTW, although much smaller than the one in AbsorbSyncRequests(). The
hash table it builds could grow quite large though.
I'm not certain what the optimal cap for max_requests should be: 10
million (which would result in a peak temporary allocation of roughly
700 MB within CompactCheckpointerRequestQueue), 1 million, or some
other value. But introducing an upper bound seems important. Without a
cap, max_requests could theoretically scale with NBuffers, potentially
resulting in excessive temporary memory allocations.
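As a back-of-the-envelope check on those numbers (a standalone sketch;
the ~64 bytes per hash entry and one byte per skip slot are assumptions
for illustration, not measured values):

    #include <stdio.h>

    int
    main(void)
    {
        const double cap = 10 * 1000 * 1000;  /* proposed MAX_CHECKPOINT_REQUESTS */
        const double hash_entry = 64;         /* assumed bytes per hash table entry */
        const double skip_slot  = 1;          /* one bool per queue slot */

        /* Peak temporary memory inside CompactCheckpointerRequestQueue() */
        printf("hash table : ~%.0f MB\n", cap * hash_entry / 1e6);  /* ~640 MB */
        printf("skip_slot[]: ~%.0f MB\n", cap * skip_slot / 1e6);   /* ~10 MB  */
        printf("total      : ~%.0f MB\n", cap * (hash_entry + skip_slot) / 1e6);
        return 0;
    }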
As this is my first patch submission, it might be somewhat naive and
experimental; I appreciate your patience and feedback.
Attachments
I've moved this entry to the July CommitFest. The CI reported an unused
variable warning in patch v5, so I've addressed it by removing the
unused variable. Sorry for not catching the issue locally.
Hi, Xuneng Zhou!

On Tue, Apr 15, 2025 at 7:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> I've moved this entry to the July CommitFest. The CI reported an
> unused variable warning in patch v5, so I've addressed it by removing
> the unused variable. Sorry for not catching the issue locally.

Thank you for your work on this subject! I have a few notes about that:

1) Should we make CompactCheckpointerRequestQueue() process the queue
of checkpoint requests in smaller parts, for the same reason we do this
in AbsorbSyncRequests()? That would require a significant redesign of
the algorithm, but still.

2) That's pretty independent of the changes by the patch, but should
CompactCheckpointerRequestQueue() fill the gaps with entries from the
tail instead of rewriting the whole queue? That might be a bit faster.

3) For sure, we wouldn't backpatch this. Can we prepare some simple
solution for back branches? Perhaps just the introduction of
MAX_CHECKPOINT_REQUESTS is enough to save us from allocations larger
than 1 GB.

------
Regards,
Alexander Korotkov
Supabase
Hi Alexander,

Thanks a lot for reviewing!

> 1) Should we make CompactCheckpointerRequestQueue() process the queue
> of checkpoint requests in smaller parts, for the same reason we do
> this in AbsorbSyncRequests()? That would require a significant
> redesign of the algorithm, but still.

In AbsorbSyncRequests(), we process requests incrementally in batches
to avoid allocating more than 1 GB of memory, which would lead to
repeated failure. I think this is less concerning in
CompactCheckpointerRequestQueue(), because if we cap num_requests at 10
million, the hash table peaks at ~500 MB and skip_slot[] at ~10 MB,
both under 1 GB.

> 2) That's pretty independent of the changes by the patch, but should
> CompactCheckpointerRequestQueue() fill the gaps with entries from the
> tail instead of rewriting the whole queue? That might be a bit
> faster.

This optimization would be quite helpful for compacting large queues.
For small ones, it may also add extra cost. Could we use a hybrid
approach? And if it's independent, should we create a standalone patch
for it?

> 3) For sure, we wouldn't backpatch this. Can we prepare some simple
> solution for back branches? Perhaps just the introduction of
> MAX_CHECKPOINT_REQUESTS is enough to save us from allocations larger
> than 1 GB.

I think this would work well for back branches.
On Tue, Jun 3, 2025 at 1:16 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> In AbsorbSyncRequests(), we process requests incrementally in batches
> to avoid allocating more than 1 GB of memory, which would lead to
> repeated failure. I think this is less concerning in
> CompactCheckpointerRequestQueue(), because if we cap num_requests at
> 10 million, the hash table peaks at ~500 MB and skip_slot[] at ~10
> MB, both under 1 GB.

Right, but another point is to avoid lengthy holding of
CheckpointerCommLock. What do you think about that?

> This optimization would be quite helpful for compacting large queues.
> For small ones, it may also add extra cost. Could we use a hybrid
> approach? And if it's independent, should we create a standalone
> patch for it?

Why do you think this would add extra cost for small queues? Given that
we're basically rewriting the algorithm of
CompactCheckpointerRequestQueue(), I think it's OK to integrate this
into one patch.

> I think this would work well for back branches.

OK.

------
Regards,
Alexander Korotkov
Supabase
Hi all,

Sorry, I forgot to Cc the list on my previous message. Resending here
so it's on the thread:

On Wed, Jun 4, 2025 at 11:07 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Alexander,
>
> Thanks again for the feedback!
>
> 1) Batch-processing CompactCheckpointerRequestQueue() and AbsorbSyncRequests()?
>
> After some thought, I realized my previous take was incomplete; sorry
> for the confusion. Heikki suggested capping num_requests at 10
> million [1]. With that limit, the largest hash table is ~500 MB and
> the skip_slot[] array is ~10 MB in CompactCheckpointerRequestQueue(),
> and the maximum size of the request array in AbsorbSyncRequests() is
> well under 400 MB, so we never exceed 1 GB. Even without batching,
> compaction stays under the cap. Batching in AbsorbSyncRequests() may
> still help by amortizing memory allocation, but it adds extra
> lock/unlock overhead. I'm not sure that overhead is worth it under
> the cap.
>
> Of course, all of this depends on having a cap in place. Picking the
> right cap size can be tricky (see point 2). If we decide not to
> enforce a cap now or in future versions, then batching both
> CompactCheckpointerRequestQueue() (maybe?) and AbsorbSyncRequests()
> becomes essential. We also need to consider the batch size. Heikki
> suggested 10k for AbsorbSyncRequests(), but I'm not sure whether that
> suits typical or extreme workloads.
>
> > Right, but another point is to avoid lengthy holding of
> > CheckpointerCommLock. What do you think about that?
>
> I am not clear on this. Could you elaborate on it?
>
> [1] https://www.postgresql.org/message-id/c1993b75-a5bc-42fd-bbf1-6f06a1b37107%40iki.fi
>
> 2) Back-branch fixes with MAX_CHECKPOINT_REQUESTS?
>
> This is simple and effective, but it can be hard to get the value
> right. I think we should think more about it. For very large-scale
> use cases, like hundreds of GB of shared_buffers, 10 million seems
> small if the checkpointer is not able to absorb the changes before
> the queue fills up. In that case, making compaction more efficient,
> as in 3), would be helpful. However, if we do this for back branches
> as well, the solution is no longer that simple.
>
> 3) Fill gaps by pulling from the tail instead of rewriting the whole queue?
>
> I misunderstood at first: this is a generally helpful optimization.
> I'll integrate it into the current patch.
Hi,

Thanks for the feedback!

> I think it would be a good starting point to use the same batch size
> in CompactCheckpointerRequestQueue() and AbsorbSyncRequests().

So we'll keep both batch processing and the request cap in place for now.

> > > Right, but another point is to avoid lengthy holding of
> > > CheckpointerCommLock. What do you think about that?
> >
> > I am not clear on this. Could you elaborate on it?
>
> See [1] for a more detailed description of this.
>
> Links.
> 1. https://www.postgresql.org/message-id/flat/db4534f83a22a29ab5ee2566ad86ca92%40postgrespro.ru

I read the thread but didn't find a specific explanation of avoiding
long lock holds. My understanding is: when compaction processes a very
large queue in one go, it holds CheckpointerCommLock the entire time,
blocking all ForwardSyncRequest() callers. Batch processing would
release the lock after each chunk, allowing other backends to make
progress. Is that correct?
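If that reading is right, the difference is roughly the following
pattern (a sketch using pthread mutexes as stand-ins for LWLocks; the
function and variable names are hypothetical):

    #include <pthread.h>

    static pthread_mutex_t comm_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for CheckpointerCommLock */

    /* One-go compaction: the lock is held for the whole queue. */
    static void
    compact_all(int num_requests)
    {
        pthread_mutex_lock(&comm_lock);
        for (int i = 0; i < num_requests; i++)
            ;   /* scan/compact entry i; ForwardSyncRequest() callers block */
        pthread_mutex_unlock(&comm_lock);
    }

    /* Batched: other backends can slip in between chunks. */
    static void
    compact_batched(int num_requests, int batch)
    {
        for (int done = 0; done < num_requests; done += batch)
        {
            pthread_mutex_lock(&comm_lock);
            /* compact entries [done, done + batch) */
            pthread_mutex_unlock(&comm_lock);
        }
    }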
> 3) Fill gaps by pulling from the tail instead of rewriting the whole queue?
>
> I misunderstood at first: this is a generally helpful optimization.
> I'll integrate it into the current patch.
Great, thank you.
I dug deeper into the “fill gaps from the tail” optimization and implemented a version of it. The tricky part is not the copy itself but guaranteeing that the queue ends up hole-free and that tail really points at the slot after the last live request. With a twin-cursor gap-fill we refuse to move SYNC_FORGET_REQUEST / SYNC_FILTER_REQUEST (they’re order-sensitive fences).
If the final survivor is one of those barriers, the cursors meet while a hole still exists immediately before the barrier:
head → A [hole] FILTER(X) …unused…
If we then compute tail = (head + remaining_requests) % max_requests, the value lands inside the live region (on the barrier itself). The invariant (head + num_requests) % max_requests == tail is broken, so the next enqueue overwrites live data or the checkpointer under-scans the queue.
Alternatively, we could allow relocating SYNC_FORGET_REQUEST and
SYNC_FILTER_REQUEST entries, but ensuring their ordering semantics
remain correct would be quite challenging. That concern is why the
implementation uses a forward-scan compaction, as the source comment
notes.
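To illustrate, a standalone sketch of the twin-cursor gap-fill and the
spot where a fence strands a hole (the types and names are simplified
stand-ins for the checkpointer's):

    #include <stdbool.h>

    typedef enum { SYNC_REQUEST, SYNC_FORGET_REQUEST, SYNC_FILTER_REQUEST } SyncRequestType;
    typedef struct { SyncRequestType type; } Request;

    static bool
    is_barrier(const Request *r)
    {
        /* Order-sensitive fences: moving them past neighbouring requests
         * could change which requests they cancel or filter. */
        return r->type == SYNC_FORGET_REQUEST || r->type == SYNC_FILTER_REQUEST;
    }

    /* Twin-cursor gap-fill: `write` trails `read`, sliding survivors back
     * over holes (skip[] marks duplicates).  Returns the survivor count. */
    static int
    gap_fill(Request *ring, const bool *skip, int head, int n, int max)
    {
        int read = head, write = head, kept = 0;

        for (int i = 0; i < n; i++, read = (read + 1) % max)
        {
            if (skip[read])
                continue;                 /* hole: the next survivor slides back */

            if (is_barrier(&ring[read]) && write != read)
            {
                /*
                 * Can't relocate the fence, so the hole(s) before it remain:
                 *     head -> A [hole] FILTER(X) ...
                 * Now (head + kept) % max stops short of the real tail, and
                 * the invariant (head + num_requests) % max_requests == tail
                 * is broken.
                 */
                write = read;             /* leave the barrier in place */
            }

            ring[write] = ring[read];
            kept++;
            write = (write + 1) % max;
        }
        return kept;
    }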
Hi, Xuneng!

On Thu, Jun 26, 2025 at 4:31 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Patch v7 modifies CompactCheckpointerRequestQueue() to process
> requests incrementally in batches of CKPT_REQ_BATCH_SIZE (10,000),
> similar to the approach used in AbsorbSyncRequests(). This limits
> memory usage from O(num_requests) to O(batch_size) for both hash
> tables and skip arrays.
>
> - Hash table memory bounded by batch size regardless of total queue size
> - Skip array allocation limited to batch size instead of max_requests
> - Prevents potential OOM conditions with very large request queues
>
> Trade-offs
>
> Cross-batch duplicate detection: the incremental approach won't
> detect duplicates spanning batch boundaries. This limitation seems
> acceptable since:
>
> - The main issue to be addressed is preventing memory allocation failures
> - Remaining duplicates are still handled by the RememberSyncRequest()
>   function in the sync subsystem
> - The purpose of this function is to make some room for new requests,
>   not to remove all duplicates
>
> Lock holding duration
>
> Andres pointed out [1] that compacting a very large queue takes
> considerable time, and holding the exclusive lock for an extended
> period makes it much more likely that backends will have to perform
> syncs themselves - which is exactly what
> CompactCheckpointerRequestQueue() is trying to avoid in the first
> place. However, releasing the lock between batches would introduce
> race conditions that would make the design much more complex. Given
> that the primary goal of this patch is to avoid large memory
> allocations, I keep the lock held for the whole function for
> simplicity now.
>
> [1] https://postgrespro.com/list/id/bjno37ickfafixkqmd2lcyopsajnuig5mm4rg6tn2ackpqyiba@w3sjfo3usuos

I've reviewed v7 of the patch. I can note the following.

1) The patch makes CompactCheckpointerRequestQueue() process the queue
in batches, but doesn't release the lock between batches. The algorithm
becomes more complex, and it holds the lock for the same (or probably
longer) time. We trade the possibility of missing some duplicates for
lower memory consumption. However, with the MAX_CHECKPOINT_REQUESTS
limit, maximal memory consumption shouldn't be too high anyway.

2) Even if we manage to release the lock between the batches, we still
need some additional coordination to prevent two processes doing
CompactCheckpointerRequestQueue() simultaneously.

3) Another problem with releasing the lock is that, unlike in
AbsorbSyncRequests(), there is no work to do while not holding the
lock. Releasing the lock and immediately taking it back has little use:
other processes have almost no chance to grab the lock.

Given all of the above, I think it's not reasonable to do
CompactCheckpointerRequestQueue() in batches. Sorry for the confusion.

Regarding the gap filling, I've got from [1] that the request order
matters. Then the gap-filling strategy is impossible or too complex.
The point is withdrawn.

I think v6 is basically good enough. The v8 is attached. It's basically
the same as v6, but has a revised commit message and comments.

The patch for stable branches is also attached: it just limits the
maximum size of the checkpointer requests queue.

Links.
1. https://www.postgresql.org/message-id/CABPTF7XSSecQ-k7k9cQJsA3ACHmCVwdoRfv4DxOMom4cNQL%3D5Q%40mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase
Attachments
Hi, Alexander,

Thanks for the review and feedback!

> The v8 is attached. It's basically the same as v6, but has a revised
> commit message and comments.
>
> The patch for stable branches is also attached: it just limits the
> maximum size of the checkpointer requests queue.

LGTM.

Best,
Xuneng
On Fri, Jul 25, 2025 at 3:23 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Thanks for the review and feedback!
>
> > The v8 is attached. It's basically the same as v6, but has a revised
> > commit message and comments.
> >
> > The patch for stable branches is also attached: it just limits the
> > maximum size of the checkpointer requests queue.
>
> LGTM.

Good, thank you. I'm going to push this if no objections.

------
Regards,
Alexander Korotkov
Supabase
Alexander Korotkov <aekorotkov@gmail.com> writes:
> I'm going to push this if no objections.

I looked at these patches while preparing release notes, and found an
oversight. CheckpointerShmemInit does

    CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);

but CheckpointerShmemSize still does

    size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));

So if NBuffers is more than MAX_CHECKPOINT_REQUESTS, we will allocate
extra CheckpointerRequest array entries that we will never use, wasting
shared memory. Admittedly the amount is small relative to the shared
buffers themselves, but at the very least this is confusing.

The comment in CheckpointerShmemSize needs adjustment, too.

Also, the reason I was studying it so carefully is that I could not
figure out who to credit for the back-patched fix. It looks like the
original suggestion for the minimal fix was Alexander's, but other
people seem to have had their fingers in the actual patch writing ---
or am I misreading the thread?

            regards, tom lane
Hi, Tom!

Thanks for catching this.

On Thu, Aug 7, 2025 at 2:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I looked at these patches while preparing release notes, and found an
> oversight. CheckpointerShmemInit does
>
>     CheckpointerShmem->max_requests = Min(NBuffers, MAX_CHECKPOINT_REQUESTS);
>
> but CheckpointerShmemSize still does
>
>     size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
>
> So if NBuffers is more than MAX_CHECKPOINT_REQUESTS, we will allocate
> extra CheckpointerRequest array entries that we will never use,
> wasting shared memory. Admittedly the amount is small relative to the
> shared buffers themselves, but at the very least this is confusing.
>
> The comment in CheckpointerShmemSize needs adjustment, too.

I attached a patch to fix it.

Best,
Xuneng
Attachments
Hi,

On Thu, Aug 7, 2025 at 3:16 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> On Thu, Aug 7, 2025 at 2:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > So if NBuffers is more than MAX_CHECKPOINT_REQUESTS, we will
> > allocate extra CheckpointerRequest array entries that we will never
> > use, wasting shared memory. Admittedly the amount is small relative
> > to the shared buffers themselves, but at the very least this is
> > confusing.
> >
> > The comment in CheckpointerShmemSize needs adjustment, too.
>
> I attached a patch to fix it.

Sorry for the error in this patch. I forgot to take the min of NBuffers
and MAX_CHECKPOINT_REQUESTS for the memory size calculation. Many
thanks to Alexander for correcting and pushing it.
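Presumably the corrected calculation clamps the array length the same
way CheckpointerShmemInit() does; a sketch pieced together from the
lines Tom quoted (not the committed patch itself):

    Size
    CheckpointerShmemSize(void)
    {
        Size        size;

        /*
         * The struct header plus the request array.  Clamp the array length
         * the same way CheckpointerShmemInit() clamps max_requests, so we
         * don't reserve shared memory for slots that can never be used.
         */
        size = offsetof(CheckpointerShmemStruct, requests);
        size = add_size(size,
                        mul_size(Min(NBuffers, MAX_CHECKPOINT_REQUESTS),
                                 sizeof(CheckpointerRequest)));

        return size;
    }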
On Thu, Aug 7, 2025 at 4:01 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Sorry for the error in this patch. I forgot to take the min of
> NBuffers and MAX_CHECKPOINT_REQUESTS for the memory size calculation.
> Many thanks to Alexander for correcting and pushing it.

No problem. Thank you for the patch. And thanks to Tom for catching this.

------
Regards,
Alexander Korotkov
Supabase