Thread: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet


Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Priya V
Date:
Hello Postgres community,

We operate a large PostgreSQL fleet (~15,000 databases) on dedicated Linux hosts.
Each host runs multiple PostgreSQL instances (multi-instance setup, not just multiple DBs inside one instance).

Environment:

  • PostgreSQL Versions: Mix of 13.13 and 15.12 (upgrades to 15.12 are in progress; currently both versions are actively in use)

  • OS / Kernel: RHEL 7 & RHEL 8 variants, kernels in the 4.14–4.18 range

  • RAM: 256 GiB (varies slightly)

  • Swap: Currently none

  • Workload: Highly mixed — OLTP-style internal apps with unpredictable query patterns and connection counts

  • Goal: Uniform, safe memory settings across the fleet to avoid kernel or database instability

We’re reviewing vm.overcommit_* settings because we’ve seen conflicting guidance:

  • vm.overcommit_memory = 2 gives predictability but can reject allocations early

  • vm.overcommit_memory = 1 is more flexible but risks OOM kills if many backends hit peak memory usage at once

We’re considering:

  • vm.overcommit_memory = 2 for strict accounting

  • Increasing vm.overcommit_ratio from 50 → 80 or 90 to better reflect actual PostgreSQL usage (e.g., work_mem reservations that aren’t fully used)
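
For context, our understanding is that with vm.overcommit_memory = 2 the kernel computes CommitLimit roughly as swap + RAM * overcommit_ratio / 100 (huge page reservations, if any, are excluded first), so on one of our 256 GiB hosts with no swap:

    # CommitLimit ≈ swap + RAM * overcommit_ratio / 100
    # ratio 50 (default): 0 + 256 GiB * 0.50 = 128 GiB committable
    # ratio 80:           0 + 256 GiB * 0.80 ≈ 205 GiB committable
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo   # kernel's live numbers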

Our questions for those running large PostgreSQL fleets:

  1. What overcommit_ratio do you find safe for PostgreSQL without causing kernel memory crunches?

  2. Do you prefer overcommit_memory = 1 or = 2 for production stability?

  3. How much swap (if any) do you keep on large-memory servers where PostgreSQL is the primary workload? Is having swap configured a good idea or not?

  4. Any real-world cases where kernel accounting was too strict or too loose for PostgreSQL?

  5. What settings should we go with if we are not planning to use swap?

We’d like to avoid both extremes:

  • Too low a ratio → PostgreSQL backends failing allocations even with free RAM

  • Too high a ratio → OOM killer terminating PostgreSQL under load spikes

Any operational experiences, tuning recommendations, or kernel/PG interaction pitfalls would be very helpful.

TIA



Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Joe Conway
Date:
On 8/5/25 13:01, Priya V wrote:
> *Environment:*
>     *PostgreSQL Versions:* Mix of 13.13 and 15.12 (upgrades in progress
>     to be at 15.12 currently both are actively in use)

PostgreSQL 13 reaches end of life on November 13, 2025.

>     *OS / Kernel:* RHEL 7 & RHEL 8 variants, kernels in the 4.14–4.18 range

RHEL 7 has been EOL for quite a while now. Note that you have to watch 
out for collation issues/corrupted indexes after OS upgrades due to 
collations changing with newer glibc versions.
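
One way to check for that damage (a sketch, assuming libc collations and PG 13/15, where pg_collation_actual_version() is available):

psql -c "SELECT collname, collversion,
                pg_collation_actual_version(oid) AS os_version
           FROM pg_collation
          WHERE collprovider = 'c'
            AND collversion IS DISTINCT FROM pg_collation_actual_version(oid)"
# any rows mean on-disk index order may no longer match the OS locale
# data; REINDEX whatever uses those collations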

>     *Swap:* Currently none

bad idea

>     *Workload:* Highly mixed — OLTP-style internal apps with
>     unpredictable query patterns and connection counts
> 
>     *Goal:* Uniform, safe memory settings across the fleet to avoid
>     kernel or database instability

> We’re considering:
>     *|vm.overcommit_memory = 2|* for strict accounting

yes

>     Increasing |vm.overcommit_ratio| from 50 → 80 or 90 to better
>     reflect actual PostgreSQL usage (e.g., |work_mem| reservations that
>     aren’t fully used)

work_mem does not reserve memory -- it is a maximum that might be used 
in memory for a particular operation

> *Our questions for those running large PostgreSQL fleets:*
>  1.
>     What |overcommit_ratio| do you find safe for PostgreSQL without
>     causing kernel memory crunches?

Read this:
https://www.cybertec-postgresql.com/en/what-you-should-know-about-linux-memory-overcommit-in-postgresql/

>  2.
>     Do you prefer |overcommit_memory = 1| or |= 2| for production stability?

Use overcommit_memory = 2 for production stability

>  3.
>     How much swap (if any) do you keep in large-memory servers where
>     PostgreSQL is the primary workload? Is having swap configured a good
>     idea or not ?

You don't necessarily need a large amount of swap, but you definitely 
should not disable it.

Some background on that:
https://chrisdown.name/2018/01/02/in-defence-of-swap.html

>  4.
>     Any real-world cases where kernel accounting was too strict or too
>     loose for PostgreSQL?

In my experience the biggest issues are when postgres is running in a 
memory constrained cgroup. If you want to constrain memory with cgroups, 
use cgroup v2 (not 1) and use memory.high to constrain it, not memory.max.
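
On systemd with cgroup v2 that can be as simple as the following (a sketch; the unit name is just an example):

# soft limit: reclaim/throttle above 48G rather than OOM-kill at a hard ceiling
systemctl set-property postgresql.service MemoryHigh=48G
systemctl show postgresql.service -p MemoryHigh    # verify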

>  5. What settings to go with if we are not planning on using swap ?

IMHO do not disable swap on Linux, at least not on production, ever.

> We’d like to avoid both extremes:
>     Too low a ratio → PostgreSQL backends failing allocations even with
>     free RAM

Have you actually seen this or are you theorizing?

>     Too high a ratio → OOM killer terminating PostgreSQL under load spikes

If overcommit_memory = 2, overcommit_ratio is reasonable (less than 100, 
maybe 80 or so as you suggested), and swap is not disabled, and you are 
not running in a memory constrained cgroup, I would be very surprised if 
you will ever get hit by the OOM killer. And if you do, things are so 
bad the database was probably dying anyway.
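
Concretely, something along these lines (a sketch; refine the ratio with your own testing):

sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80
# persist across reboots:
printf 'vm.overcommit_memory = 2\nvm.overcommit_ratio = 80\n' \
    > /etc/sysctl.d/90-overcommit.conf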

HTH,

-- 
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com



Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Frits Hoogland
Date:
Joe, 

Can you name any technical reason why not having swap for a database is an actual bad idea?

Memory is always limited. Swap was invented for situations where the (incidental) usage of paged-in memory could (regularly) grow higher than physical memory would allow, providing the (clear) workaround of 'cushioning' the memory shortage by allowing a "second level" memory store on disk.
Still, this does not make memory unlimited. Swap extends the physical memory available by the amount of swap. You can still run out of memory with swap added, simply by paging in more memory than physical memory plus swap can hold.

Today, most systems are no longer memory constrained; put differently, it is possible to get a server with enough physical memory to hold your total expected memory need.
And given the latency-sensitive nature of databases in general, which includes postgres, for any serious deployment you should get a server with enough memory to host your workload, and configure postgres not to overload that memory.

If you do oversubscribe on (physical) memory, you will get pain somewhere; there is no way around that.
The article in defence of swap is in essence saying that if you happen to oversubscribe on memory, sharing the pain between anonymous and file pages is better.
I would say you are already in a bad place if that happens, which is especially bad for databases, and databases should allow you to make memory usage predictable.

However, what I have found is that with 4+ kernels (4.18 to be precise; RHEL 8), the kernel can favour file pages in certain situations, causing anonymous memory to get paged out even if swappiness is set to 1 or 0 and there is a wealth of inactive file memory. It seems to be related to workingset protection(?) mechanisms, but given the lack of clear statistics I can't be sure about that. What it leads to in my situations is a constant rate of swapping in and out, while there is no technical reason for linux to do so because there is enough available memory.

My point of view has been that vm.overcommit_memory set to 2 was the way to go, because it makes linux enforce a fixed limit at allocation time, which looked like a guaranteed way to make the database never run out of memory.
It does guarantee that linux itself never runs out of memory, absolutely.
However, this limit is hard, and is applied to a process in both user mode and system mode (kernel level), and can therefore refuse memory at times when it is not safe to do so, corrupting execution. I have to be honest: I have not seen this myself, but trustworthy sources have reported it repeatedly, and I am inclined to believe them. This means postgres execution can corrupt/terminate in unlucky situations, which impacts availability.

 
Frits Hoogland







Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Priya V
Date:
Hi Frits, Joe, 

Thank you both for your insights.

Current situation:

$ cat /proc/sys/vm/overcommit_memory
0

$ cat /proc/sys/vm/overcommit_ratio
50

$ cat /proc/sys/vm/swappiness
60

Workload: Multi-tenant PostgreSQL

$ uname -r
4.18.0-477.83.1.el8_8.x86_64

$ free -h
              total    used    free   shared  buff/cache  available
Mem:          249Gi    4.3Gi   1.7Gi  22Gi    243Gi       221Gi
Swap:            0B      0B      0B

If we set overcommit_memory = 2, what should we set the overcommit_ratio value to? Can you please suggest?
Is there a rule of thumb to go with?

Our goal is to avoid OOM issues, avoid wasting memory, and not starve the kernel.

Thanks! 








Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Joe Conway
Date:
(Both: please trim and reply inline on these lists as I have done;
  Frits, please reply all not just to the list -- I never received your
  reply to me)

On 8/6/25 11:51, Priya V wrote:
> *cat /proc/sys/vm/overcommit_ratio*
> 50
> $ *cat /proc/sys/vm/swappiness*
> 60
> 
> *Workload*: Multi-tenant PostgreSQL
> 
> *uname -r*
> 4.18.0-477.83.1.el8_8.x86_64

IMHO you should strongly consider getting on a more recent distro with a 
newer kernel.

> *free -h*
> total used free shared buff/cache available
> Mem: 249Gi 4.3Gi 1.7Gi 22Gi 243Gi 221Gi
> Swap: 0B 0B 0B

As I said, do not disable swap. You don't need a huge amount, but maybe 
16 GB or so would do it.
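
If repartitioning is a pain, a swap file is enough (a sketch; size and path are examples):

fallocate -l 16G /swapfile    # or dd if the filesystem dislikes fallocate
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap defaults 0 0' >> /etc/fstab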

> If we set overcommit_memory = 2, what should we set the
> overcommit_ratio value to? Can you please suggest?
> Is there a rule of thumb to go with?

There is no rule of thumb that I am aware of. Every workload is 
different. Start with something like 80 and do your own testing to 
refine that number.
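
While testing, it is worth watching how close you actually get to the limit, e.g.:

# sample commit accounting under representative load
while sleep 60; do
    date '+%F %T'
    grep -E 'CommitLimit|Committed_AS' /proc/meminfo
done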

> Our goal is to avoid OOM issues, avoid wasting memory, and not starve
> the kernel.

With overcommit_memory = 2, swap on (and reasonably sized), and 
overcommit_ratio to something reasonable (certainly below 100), I think 
you will have a difficult time getting an OOM kill even if you try 
during testing. But you have to do your own testing for your workloads 
-- there is no magic button here.

That is, unless you run postgres in a cgroup with memory.limit (cgroup 
v1) or memory.max (cgroup v2) set. Note, running in containers with 
memory limits set e.g. via Kubernetes will do that under the covers. 
That is a completely different story.

> On Wed, Aug 6, 2025 at 3:47 AM Frits Hoogland <frits.hoogland@gmail.com 
> <mailto:frits.hoogland@gmail.com>> wrote:
>     Can you name any technical reason why not having swap for a database
>     is an actual bad idea?

Did you read the blog I linked? Do your own experiments.

* Swap is what is used when anonymous memory must be reclaimed to allow 
for an allocation of anonymous memory.

* The Linux kernel will aggressively use all available memory for file 
buffers, pushing usage against the limits.

* Especially in the older 4-series kernels, file buffers often 
cannot be reclaimed fast enough.

* With no swap and a large-ish anonymous memory request, it is easy to 
push over the limit to cause the OOM killer to strike.

* On the other hand, with swap enabled anon memory can be reclaimed 
giving the kernel more time to deal with file buffer reclamation.

At least that is what I have observed.
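
All of this is visible with standard counters if you want to reproduce it under pressure, e.g.:

vmstat 1    # si/so columns: pages swapped in/out per second
grep -E '^(pswpin|pswpout|pgscan_kswapd|pgscan_direct)' /proc/vmstat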

HTH,

-- 
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com



Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Frits Hoogland
Date:
> As I said, do not disable swap. You don't need a huge amount, but maybe 16 GB or so would do it.

Joe, please, can you state a technical reason for saying this?
All you are saying is ‘don’t do this’.

I’ve stated my reasons for why this doesn’t make sense, and you don’t give any reason.

The article you cite does seem to point to general usage, not database usage.

Frits




Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Joe Conway
Date:
On 8/6/25 17:14, Frits Hoogland wrote:
>> As I said, do not disable swap. You don't need a huge amount, but
>> maybe 16 GB or so would do it.

> Joe, please, can you state a technical reason for saying this?
> All you are saying is ‘don’t do this’.
> 
> I’ve stated my reasons for why this doesn’t make sense, and you don’t give any reason.

What do you call the below?

>> On 6 Aug 2025, at 18:33, Joe Conway <mail@joeconway.com> wrote:

>> * Swap is what is used when anonymous memory must be reclaimed to
>> allow for an allocation of anonymous memory.
>> 
>> * The Linux kernel will aggressively use all available memory for
>> file buffers, pushing usage against the limits.
>> 
>> * Especially in the older 4 series kernels, file buffers often
>> cannot be reclaimed fast enough
>> 
>> * With no swap and a large-ish anonymous memory request, it is
>> easy to push over the limit to cause the OOM killer to strike.
>> 
>> * On the other hand, with swap enabled anon memory can be
>> reclaimed giving the kernel more time to deal with file buffer
>> reclamation.
>> 
>> At least that is what I have observed.

If you don't think that is adequate technical reason, feel free to 
ignore my advice.

-- 
Joe Conway
PostgreSQL Contributors Team
Amazon Web Services: https://aws.amazon.com



Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Frits Hoogland
Date:
Joe, I am trying to help, and to make people think about these things correctly.

The linux kernel is constantly changing, sometimes subtly and sometimes less so, and there is a general lack of clear statistics exposing the more nuanced memory operations, and of documentation about them.
And: there are a lot of myths about memory management, some of them myths because they describe a situation that was once true but, given the changes in the kernel code, no longer is, and some of them simply myths.

The best technical description of recent memory management that I could find is: https://lpc.events/event/11/contributions/896/attachments/793/1493/slides-r2.pdf

On 6 Aug 2025, at 18:33, Joe Conway <mail@joeconway.com> wrote:

> * Swap is what is used when anonymous memory must be reclaimed to
> allow for an allocation of anonymous memory.

Correct. Swapped-out pages are exclusively anonymous memory pages.
This is the result of memory reclaim of anonymous pages, which, unlike (non-dirty, non-pinned) file pages, cannot simply be discarded; their page contents have to be saved.

> * The Linux kernel will aggressively use all available memory for
> file buffers, pushing usage against the limits.

It's an explicit design choice of the linux kernel not to reclaim file pages when they are unpinned/no longer used, leaving them as cached pages.
(Anonymous pages are freed explicitly when released by the owner and put on the free list.)
There is no aggressive push; file pages are simply left behind after use, so there is no pushing of usage against the limits.
It's the swapper ('page daemon') that eventually, based on LRU, frees file pages once free memory drops below a zone limit called 'memory low' (vm.min_free_kbytes * 2), and when free memory gets down to vm.min_free_kbytes * 1 (called 'pages min') it forces tasks to free memory themselves (called 'direct reclaim').

> * Especially in the older 4 series kernels, file buffers often
> cannot be reclaimed fast enough

I am not sure what is being described here, or whether this is about the swapper or about direct reclaim.
There is no need to do this 'fast enough'; see the slide deck above.
This is probably aimed at the swapper not reclaiming 'fast enough'; however, that is not how this works: if memory requests push free memory down to 'pages min', a task will perform 'direct reclaim' itself.

> * With no swap and a large-ish anonymous memory request, it is
> easy to push over the limit to cause the OOM killer to strike.

I am afraid that this is not a correct representation of the actual mechanism; again, look at the slide deck and the explanations above.
The swapper frees memory, which is used by tasks requesting pages at page fault, and for that it doesn't matter whether the request is for anonymous memory or file memory.
If free memory gets down to pages min, the swapper did not reclaim memory fast enough, and a task will perform direct reclaim.

The decision of what memory type to reclaim during direct reclaim is between file memory and anonymous memory.
If there is no swap, the option of anonymous memory is not available, because anonymous pages cannot be discarded the way non-dirty, unpinned file pages can; they have to be preserved.
If swappiness is set to 0 but swap is available, some documentation suggests the kernel will never touch anonymous memory; however, I found this not to be true: linux might still choose anonymous memory to reclaim. Obviously, the lower the swappiness, the less likely reclaim is to choose anonymous memory pages.

What you seem to suggest is that with no swap, and thus without the option of reclaiming anonymous pages, the reclaim mechanism depends on the speed of (file) reclaim, possibly by the swapper. I hope it's clear that this is not true.

Obviously, when there is swap, the total number of pages potentially available for reclaim is higher, because anonymous pages can be reclaimed up to the size of swap.
But if that amount is set low (as suggested: 'You don't need a huge amount'), the actual increase in pages available for reclaim is negligible, and so is the protection it provides against running out of memory.

> * On the other hand, with swap enabled anon memory can be
> reclaimed giving the kernel more time to deal with file buffer
> reclamation.

See the explanations with the previous comments. Time is not a component in failing to find pages for a task that page-faults for additional memory, because a task will do direct reclaim itself if it exhausts the free memory provided by the swapper.

> At least that is what I have observed.

The kernel code for direct reclaim shows that when direct reclaim has finished scanning memory pages (only file pages with no swap, or both file and anonymous pages with swap) and was not able to satisfy the request for the pages it needs, it triggers the kernel out-of-memory thread, because it has run out of available pages.

Again, as I mentioned at the beginning, there are lots and lots of nuances and mechanisms in play; this is a reasonably basic explanation of the mechanism based on the above slide deck and on reading the kernel code.
One thing that can very easily be misleading is that memory is not a general, system-wide pool, but is instead separated into zones. This can lead to a situation where there is still memory available for reclaim system-wide, but not in the zone the process is scanning, so the system can seem to run out of memory and trigger the OOM killer while there is still memory, which is very confusing if you're not aware of these details.
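
The zone watermarks this mechanism works against are directly visible, for example:

# free pages versus the min/low/high watermarks, per zone
awk '/^Node/ || /pages free/ || $1 ~ /^(min|low|high)$/' /proc/zoneinfo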

I have read, experimented, searched, tested and diagnosed a lot of issues, and this is what I have come up with; it fits the kernel code and the documentation that I trust.

Based on these mechanisms, and especially for database systems, removing swap takes away a mechanism that has no benefit for databases on modern, high-memory systems.
That does not mean it's not beneficial in other cases. If memory usage is very dynamic, memory is more constrained, and the operation is less latency sensitive, it might be a good idea to have an overflow, with all the downsides that it brings.


Frits Hoogland


Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Bruce Momjian
Date:
On Wed, Aug  6, 2025 at 11:14:34PM +0200, Frits Hoogland wrote:
> > As I said, do not disable swap. You don't need a huge amount, but maybe 16 GB or so would do it.
> 
> Joe, please, can you state a technical reason for saying this?
> All you are saying is ‘don’t do this’. 
> 
> I’ve stated my reasons for why this doesn’t make sense, and you don’t give any reason.
> 
> The article you cite does seem to point to general usage, not database usage.

Here is a blog entry about it from 2012:

    https://momjian.us/main/blogs/pgblog/2012.html#July_25_2012

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Jorge Rodriguez
Date:
available 



Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Frédéric Yhuel
Date:

On 8/8/25 10:21, Frits Hoogland wrote:
> If swappiness is set to 0, but swap is available, some documentation 
> suggests it will never use anonymous memory, however I found this not to 
> be true, linux might still choose anonymous memory to reclaim.


A bug in RHEL8 meant that swappiness was not taken into account unless 
cgroupv2 was configured or vm.force_cgroup_v2_swappiness was set to 1. 
See references [1] and [2]. Could this be the cause of your observation?

[1] https://access.redhat.com/solutions/6785021
[2] https://github.com/systemd/systemd/issues/9276
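
(To check whether a given host is affected, something like this should do; the vm.force_cgroup_v2_swappiness sysctl only exists on RHEL kernels carrying the workaround:)

cat /proc/sys/vm/force_cgroup_v2_swappiness   # 1 = vm.swappiness honoured
stat -f -c %T /sys/fs/cgroup                  # 'cgroup2fs' = cgroup v2 mounted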



Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Frits Hoogland
Date:
Thank you for your message, Frédéric.

I am very much aware of that issue. It's actually incorrect to call it a bug: that is how cgroups v1, which is bundled with RHEL 8, works. However, it is very counter-intuitive. For that reason Red Hat created the force_cgroup_v2_swappiness parameter themselves; it's not a common Linux parameter.

The specific issue I see in certain cases, leading to unreasonable swap usage, is Linux workingset detection kicking in, which can choose anonymous memory despite lots of file memory being available, leading to swapping, and sometimes to a thrashing situation.

It is funny to see how emotionally people react to removing swap, and how they go to great efforts to carefully wrap that emotion in a technical reason, or to point to people having said something that agrees with it. I should say that I understand the reluctance; it's not weird to feel anxious.

The kernel has no inherent swap requirement. Of course, removing swap cannot be blindly applied; you have to carefully make it suit your environment, usage and intention. And there ARE cases where swap makes sense (if you have memory usage that exceeds physical memory, and you add enough swap to sustain that). But a database in general typically responds badly to swapping (or to anything that fluctuates latency), and when swap removal is sensibly done, it prevents anonymous (including less frequently used ;-)) memory from getting swapped.

I will not convince everybody, but I hope I can make some people who understand the technology think about it and consider the arguments.

Friendly regards,

Frits






Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

From: Frédéric Yhuel
Date:

On 8/19/25 17:37, Frits Hoogland wrote:
> The specific issue I see in certain cases leading to unreasonable swap
> usage is Linux workingset detection kicking in

Do you have a way to highlight that precisely? I mean, can you prove 
that it is Linux workingset detection that is causing the swapping? I've 
also encountered surprising cases where swap fills up despite there 
being plenty of available memory (lots of page cache). However, those 
cases were not associated with slowdowns or other problems; I only 
became aware of them because a client was anxious about her swap usage.
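
The closest thing to direct evidence I know of are the workingset counters, plus per-process swap usage (a sketch; counter names vary somewhat between kernel versions):

grep ^workingset /proc/vmstat     # refault/activate/restore activity
# which processes actually hold swapped pages
awk '/VmSwap/ {print FILENAME, $2, $3}' /proc/[0-9]*/status 2>/dev/null |
    sort -k2 -rn | head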