Обсуждение: /proc/self/oom_adj is deprecated in newer Linux kernels
While testing 9.1 RPMs on Fedora 15 (2.6.40 kernel), I notice messages like these in the kernel log: Sep 11 13:38:56 rhl kernel: [ 415.308092] postgres (18040): /proc/18040/oom_adj is deprecated, please use /proc/18040/oom_score_adjinstead. These don't show up on every single PG process launch, but that probably just indicates there's a rate-limiter in the kernel reporting mechanism. So it looks like it behooves us to cater for oom_score_adj in the future. The simplest, least risky change that I can think of is to copy-and-paste the relevant #ifdef code block in fork_process.c. If we do that, then it would be up to the packager whether to #define LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior he wants. That would be good enough for my own purposes in building Fedora/RHEL packages, since I can predict with confidence which kernel versions a given build is likely to be used with. I think probably the same would be true for most other distro-specific builds. Does anyone want to argue for doing something more complicated, and if so what exactly? regards, tom lane
On Fri, Sep 16, 2011 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Does anyone want > to argue for doing something more complicated, and if so what exactly? > Well there's no harm trying to write to oom_score_adj and if that fails with EEXISTS trying to write to oom_adj. -- greg
Greg Stark <stark@mit.edu> writes: > On Fri, Sep 16, 2011 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Does anyone want >> to argue for doing something more complicated, and if so what exactly? > Well there's no harm trying to write to oom_score_adj and if that > fails with EEXISTS trying to write to oom_adj. Well, (1) what value are you going to write (they need to be different for the two files), and (2) the main point of the exercise, at present, is to avoid kernel log messages. I'm not sure that trying to create random files under /proc isn't going to draw bleats in the kernel log. regards, tom lane
Excerpts from Tom Lane's message of vie sep 16 13:37:46 -0300 2011: > Greg Stark <stark@mit.edu> writes: > > On Fri, Sep 16, 2011 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Does anyone want > >> to argue for doing something more complicated, and if so what exactly? > > > Well there's no harm trying to write to oom_score_adj and if that > > fails with EEXISTS trying to write to oom_adj. > > Well, (1) what value are you going to write (they need to be different > for the two files), and (2) the main point of the exercise, at present, > is to avoid kernel log messages. I'm not sure that trying to create > random files under /proc isn't going to draw bleats in the kernel log. I guess the question is what semantics the new code has. In the old badness() world, child processes inherited the oom_adj value of its parent. What the code in fork_process was used for was resetting the value back to 0 (meaning "kernel is allowed to kill this process on OOM"), so that you could set the oom_adj in the start script for postmaster (to a value meaning "never kill this one"), and the backends would see their values reset to zero. The new oom_score_adj has a scale of -1000 to +1000, with zero being neutral and -1000 being "never kill". So what we want to do here in most cases, it seems, is set the value to zero whether it's oom_adj or oom_score_adj -- assuming the new code is still causing children processes to inherit the "adj" value from the parent. Now the problem is that we have defined the LINUX_OOM_ADJ symbol as meaning the value we're going to write. Maybe this wasn't the best choice. I mean, it's very flexible, but it doesn't seem to offer any benefit over a plain boolean choice. Is your proposal to create a new LINUX_OOM_SCORE_ADJ cpp symbol with the same semantics? The most thorough documentation on this option seems to be this commit: https://github.com/mirrors/linux-2.6/commit/a63d83f427fbce97a6cea0db2e64b0eb8435cd10#include/linux/oom.h -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Now the problem is that we have defined the LINUX_OOM_ADJ symbol as > meaning the value we're going to write. Maybe this wasn't the best > choice. I mean, it's very flexible, but it doesn't seem to offer any > benefit over a plain boolean choice. > Is your proposal to create a new LINUX_OOM_SCORE_ADJ cpp symbol with the > same semantics? Yes, that's what I was thinking. We could avoid that if we were going to hard-wire a decision that zero is the thing to write, but I see no reason to place such a restriction on users. Who's to say they might not want backends to adopt some other value? regards, tom lane
On Fri, Sep 16, 2011 at 10:37, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <stark@mit.edu> writes: >> On Fri, Sep 16, 2011 at 3:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Does anyone want >>> to argue for doing something more complicated, and if so what exactly? > >> Well there's no harm trying to write to oom_score_adj and if that >> fails with EEXISTS trying to write to oom_adj. Yeah, I don't really like the idea of a compile time option that is kernel version dependent... But i don't feel too strongly about it either (all my kernels are new enough that they support oom_score_adj). I'll also note that on my system we are in the good company of ssd and chromium: sshd (978): /proc/978/oom_adj is deprecated, please use /proc/978/oom_score_adj instead. chromium-sandbo (1377): /proc/1375/oom_adj is deprecated, please use /proc/1375/oom_score_adj instead. [ It quite annoying that soon after we decided to stick -DLINUX_OOM_ADJ in they changed it. Whatever happened to a stable userspace API :-( ]
On fre, 2011-09-16 at 10:57 -0400, Tom Lane wrote: > So it looks like it behooves us to cater for oom_score_adj in the > future. The simplest, least risky change that I can think of is to > copy-and-paste the relevant #ifdef code block in fork_process.c. > If we do that, then it would be up to the packager whether to #define > LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior > he wants. There are lots of reasons why that won't work: backports, forward ports, derivatives, custom kernels, distribution upgrades, virtual hosting. ISTM that we could at least query the currently used kernel version.
Peter Eisentraut <peter_e@gmx.net> writes: > On fre, 2011-09-16 at 10:57 -0400, Tom Lane wrote: >> So it looks like it behooves us to cater for oom_score_adj in the >> future. The simplest, least risky change that I can think of is to >> copy-and-paste the relevant #ifdef code block in fork_process.c. >> If we do that, then it would be up to the packager whether to #define >> LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior >> he wants. > There are lots of reasons why that won't work: backports, forward ports, > derivatives, custom kernels, distribution upgrades, virtual hosting. [ shrug... ] These are all hypothetical reasons. A packager who foresaw needing that could just turn on both write attempts, or for that matter patch the code to do whatever else he saw fit. In practice, we've not had requests for anything significantly smarter than what is there. But having said that, it wouldn't be very hard to arrange things so that if you did have both symbols defined, the code would only attempt to write oom_adj if it had failed to write oom_score_adj; which is about as close as you're likely to get to a kernel version test for this. regards, tom lane
On sön, 2011-09-18 at 12:21 -0400, Tom Lane wrote: > Peter Eisentraut <peter_e@gmx.net> writes: > > On fre, 2011-09-16 at 10:57 -0400, Tom Lane wrote: > >> So it looks like it behooves us to cater for oom_score_adj in the > >> future. The simplest, least risky change that I can think of is to > >> copy-and-paste the relevant #ifdef code block in fork_process.c. > >> If we do that, then it would be up to the packager whether to #define > >> LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior > >> he wants. > > > There are lots of reasons why that won't work: backports, forward ports, > > derivatives, custom kernels, distribution upgrades, virtual hosting. > > [ shrug... ] These are all hypothetical reasons. A packager who > foresaw needing that could just turn on both write attempts, or for that > matter patch the code to do whatever else he saw fit. In practice, > we've not had requests for anything significantly smarter than what is > there. > > But having said that, it wouldn't be very hard to arrange things so that > if you did have both symbols defined, the code would only attempt to > write oom_adj if it had failed to write oom_score_adj; which is about as > close as you're likely to get to a kernel version test for this. Why is this feature not a run-time configuration variable or at least a configure option? It's awfully well hidden now. I doubt a lot of people are using this even though they might wish to.
Peter Eisentraut <peter_e@gmx.net> writes: > On sön, 2011-09-18 at 12:21 -0400, Tom Lane wrote: >> But having said that, it wouldn't be very hard to arrange things so that >> if you did have both symbols defined, the code would only attempt to >> write oom_adj if it had failed to write oom_score_adj; which is about as >> close as you're likely to get to a kernel version test for this. > Why is this feature not a run-time configuration variable or at least a > configure option? It's awfully well hidden now. I doubt a lot of > people are using this even though they might wish to. See the thread in which the feature was designed originally: http://archives.postgresql.org/pgsql-hackers/2010-01/msg00170.php The key point is that to get useful behavior, you need cooperation between both a root-privileged startup script and the PG executable. That tends to throw the problem into the domain of packagers, more than end users, and definitely puts a big crimp in the idea that run-time configuration of just half of the behavior could be helpful. So far, no Linux packagers have complained that the design is inadequate (a position that I also hold when wearing my red fedora) so I do not feel a need to complicate it further. regards, tom lane
This is one way to prevent the kernel warning message without having to introduce a new constant. Scale the old oom_adj-style value the same way the kernel internally does it and use that instead if oom_score_adj is available for writing. Signed-off-by: Dan McGee <dan@archlinux.org> --- This addresses some of the concerns raised on the ML and will at least keep those of us running modern kernels happy. Alternatively one could switch the symbol used to be the new style and have the old one computed; however this is a pain for legacy vs. current versions. src/backend/postmaster/fork_process.c | 22 +++++++++++++++++++++-1 files changed, 21 insertions(+), 1 deletions(-) diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c index b2fe9a1..3cded54 100644 --- a/src/backend/postmaster/fork_process.c +++ b/src/backend/postmaster/fork_process.c @@ -81,16 +81,36 @@ fork_process(void) * Use open() not stdio, to ensure we control the open flags. Some * Linux security environments reject anything but O_WRONLY. */ - int fd = open("/proc/self/oom_adj", O_WRONLY, 0); + int fd = open("/proc/self/oom_score_adj", O_WRONLY, 0); /* We ignore all errors */ if (fd >= 0) { char buf[16]; + int oom_score_adj; + /* + * The compile-time value is the old style oom_adj; + * scale it the same way the kernel does to + * convert to the new style oom_score_adj. This + * should become a constant at compile time. + * Valid values range from -17 (never kill) to + * 15; no attempt of validation is done. + */ + oom_score_adj = LINUX_OOM_ADJ * 1000 / 17; snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ); (void) write(fd, buf, strlen(buf)); close(fd); + } else if (errno == EEXIST) { + int fd = open("/proc/self/oom_adj", O_WRONLY, 0); + if (fd >= 0) + { + char buf[16]; + + snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ); + (void) write(fd, buf, strlen(buf)); + close(fd); + } } }#endif /* LINUX_OOM_ADJ */ -- 1.7.6.1
On Mon, Sep 19, 2011 at 3:11 PM, Dan McGee <dan@archlinux.org> wrote: > This is one way to prevent the kernel warning message without having to > introduce a new constant. Scale the old oom_adj-style value the same way > the kernel internally does it and use that instead if oom_score_adj is > available for writing. > > Signed-off-by: Dan McGee <dan@archlinux.org> > --- > > This addresses some of the concerns raised on the ML and will at least keep > those of us running modern kernels happy. > > Alternatively one could switch the symbol used to be the new style and have the > old one computed; however this is a pain for legacy vs. current versions. > > src/backend/postmaster/fork_process.c | 22 +++++++++++++++++++++- > 1 files changed, 21 insertions(+), 1 deletions(-) > > diff --git a/src/backend/postmaster/fork_process.c b/src/backend/postmaster/fork_process.c > index b2fe9a1..3cded54 100644 > --- a/src/backend/postmaster/fork_process.c > +++ b/src/backend/postmaster/fork_process.c > @@ -81,16 +81,36 @@ fork_process(void) > * Use open() not stdio, to ensure we control the open flags. Some > * Linux security environments reject anything but O_WRONLY. > */ > - int fd = open("/proc/self/oom_adj", O_WRONLY, 0); > + int fd = open("/proc/self/oom_score_adj", O_WRONLY, 0); > > /* We ignore all errors */ > if (fd >= 0) > { > char buf[16]; > + int oom_score_adj; > > + /* > + * The compile-time value is the old style oom_adj; > + * scale it the same way the kernel does to > + * convert to the new style oom_score_adj. This > + * should become a constant at compile time. > + * Valid values range from -17 (never kill) to > + * 15; no attempt of validation is done. > + */ > + oom_score_adj = LINUX_OOM_ADJ * 1000 / 17; > snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ); Of course it would help to actually use the computed value here; this should be: snprintf(buf,sizeof(buf), "%d\n", oom_score_adj); > (void) write(fd, buf, strlen(buf)); > close(fd); > + } else if (errno == EEXIST) { > + int fd = open("/proc/self/oom_adj", O_WRONLY, 0); > + if (fd >= 0) > + { > + char buf[16]; > + > + snprintf(buf, sizeof(buf), "%d\n", LINUX_OOM_ADJ); > + (void) write(fd, buf, strlen(buf)); > + close(fd); > + } > } > } > #endif /* LINUX_OOM_ADJ */ > -- > 1.7.6.1 > >
On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee <dan@archlinux.org> wrote: > [ patch ] I suppose it's Tom who really needs to comment on this, but I'm not too enthusiastic about this approach. Duplicating the Linux kernel calculation into our code means that we could drift if the formula changes again. I like Tom's previous suggestion (I think) of allowing both constants to be defined - if they are, then we try oom_score_adj first and fall back to oom_adj if that fails. For bonus points, we could have postmaster stat() its own oom_score_adj file before forking and set a global variable to indicate the results. That way we'd only ever need to test once per postmaster startup (at least until someone figures out a way to swap out the running kernel without stopping the server...!). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Sep 23, 2011 at 2:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee <dan@archlinux.org> wrote: >> [ patch ] > > I suppose it's Tom who really needs to comment on this, but I'm not > too enthusiastic about this approach. Duplicating the Linux kernel > calculation into our code means that we could drift if the formula > changes again. Why would the formula ever change? This seems like a different excuse way of really saying you don't appreciate the hacky approach, which I can understand completely. However, it simply doesn't make sense for them to change this formula, as it scales the -17 to 16 old range to the new -1000 to 1000 range. Those endpoints won't be changing unless a third method is introduced, in which case this whole thing is mute and we'd need to fix it yet again. > I like Tom's previous suggestion (I think) of allowing both constants > to be defined - if they are, then we try oom_score_adj first and fall > back to oom_adj if that fails. For bonus points, we could have > postmaster stat() its own oom_score_adj file before forking and set a > global variable to indicate the results. That way we'd only ever need > to test once per postmaster startup (at least until someone figures > out a way to swap out the running kernel without stopping the > server...!). This would be fine, it just seems unreasonably complicated, not to mention unnecessary. I might as well point this out [1]. I don't think a soul out there has built without defining this to 0 (if they define it at all), and not even that many people are using it. Is it all that bad of an idea to just force it to 0 for both knobs and be done with it? -Dan McGee [1] http://www.google.com/codesearch#search/&q=LINUX_OOM_ADJ=&type=cs - Slackware and Fedora are the only hits not in the PG codebase, and both define it to 0.
On Fri, Sep 23, 2011 at 4:05 PM, Dan McGee <dan@archlinux.org> wrote: > On Fri, Sep 23, 2011 at 2:49 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee <dan@archlinux.org> wrote: >>> [ patch ] >> >> I suppose it's Tom who really needs to comment on this, but I'm not >> too enthusiastic about this approach. Duplicating the Linux kernel >> calculation into our code means that we could drift if the formula >> changes again. > Why would the formula ever change? This seems like a different excuse > way of really saying you don't appreciate the hacky approach, which I > can understand completely. However, it simply doesn't make sense for > them to change this formula, as it scales the -17 to 16 old range to > the new -1000 to 1000 range. Those endpoints won't be changing unless > a third method is introduced, in which case this whole thing is mute > and we'd need to fix it yet again. > >> I like Tom's previous suggestion (I think) of allowing both constants >> to be defined - if they are, then we try oom_score_adj first and fall >> back to oom_adj if that fails. For bonus points, we could have >> postmaster stat() its own oom_score_adj file before forking and set a >> global variable to indicate the results. That way we'd only ever need >> to test once per postmaster startup (at least until someone figures >> out a way to swap out the running kernel without stopping the >> server...!). > This would be fine, it just seems unreasonably complicated, not to > mention unnecessary. I might as well point this out [1]. I don't think > a soul out there has built without defining this to 0 (if they define > it at all), and not even that many people are using it. Is it all that > bad of an idea to just force it to 0 for both knobs and be done with > it? Did we do anything about this? Anyone else have an opinion on what ought to be done? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: >> [ oom_score_adj business ] > Did we do anything about this? Anyone else have an opinion on what > ought to be done? I held off doing anything because it didn't seem like we had consensus. OTOH, it may well be that it's not important enough to demand real consensus, and he who does the work gets to make the choices. regards, tom lane
On Mon, Oct 24, 2011 at 1:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >>> [ oom_score_adj business ] > >> Did we do anything about this? Anyone else have an opinion on what >> ought to be done? > > I held off doing anything because it didn't seem like we had consensus. > OTOH, it may well be that it's not important enough to demand real > consensus, and he who does the work gets to make the choices. Half a loaf is better than none. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I was reminded today that we still haven't done anything about this: Tom Lane <tgl@sss.pgh.pa.us> writes: > While testing 9.1 RPMs on Fedora 15 (2.6.40 kernel), I notice > messages like these in the kernel log: > Sep 11 13:38:56 rhl kernel: [ 415.308092] postgres (18040): /proc/18040/oom_adj is deprecated, please use /proc/18040/oom_score_adjinstead. At this point there are no shipping Fedora versions that don't emit this gripe, and F15 is even about to go EOL. The previous discussion thread at http://archives.postgresql.org/pgsql-hackers/2011-09/msg00794.php went off into the weeds of what was in my opinion over-design. I still think it's sufficient to do what I suggested initially: > ... The simplest, least risky change that I can think of is to > copy-and-paste the relevant #ifdef code block in fork_process.c. > If we do that, then it would be up to the packager whether to #define > LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior > he wants. and would like to squeeze that into 9.2 so that we're only a year late and not two years late in responding to this issue :-(. Objections? regards, tom lane
On Tue, Jun 12, 2012 at 12:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I was reminded today that we still haven't done anything about this: > > Tom Lane <tgl@sss.pgh.pa.us> writes: >> While testing 9.1 RPMs on Fedora 15 (2.6.40 kernel), I notice >> messages like these in the kernel log: >> Sep 11 13:38:56 rhl kernel: [ 415.308092] postgres (18040): /proc/18040/oom_adj is deprecated, please use /proc/18040/oom_score_adjinstead. > > At this point there are no shipping Fedora versions that don't emit this > gripe, and F15 is even about to go EOL. > > The previous discussion thread at > http://archives.postgresql.org/pgsql-hackers/2011-09/msg00794.php > went off into the weeds of what was in my opinion over-design. > I still think it's sufficient to do what I suggested initially: > >> ... The simplest, least risky change that I can think of is to >> copy-and-paste the relevant #ifdef code block in fork_process.c. >> If we do that, then it would be up to the packager whether to #define >> LINUX_OOM_ADJ or LINUX_OOM_SCORE_ADJ or both depending on the behavior >> he wants. > > and would like to squeeze that into 9.2 so that we're only a year late > and not two years late in responding to this issue :-(. > > Objections? I think my position, and the position of the people who responded to the original thread, was that it seems like you're architecting a kludge when it wouldn't be that hard to architect a nicer solution. That having been said, I don't think it's such a large kludge that we should all have massive dry heaves at the thought of it living in our repository. And then too, this isn't the time to be architecting new 9.2 feature anyway. So I say go for it. If it makes your life easier, back-patch it. Code that requires a #define to enable it won't break anything for anyone else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 12, 2012 at 12:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I still think it's sufficient to do what I suggested initially: >>> ... The simplest, least risky change that I can think of is to >>> copy-and-paste the relevant #ifdef code block in fork_process.c. > I think my position, and the position of the people who responded to > the original thread, was that it seems like you're architecting a > kludge when it wouldn't be that hard to architect a nicer solution. I'd be the first to agree it's a kluge. But the end result of the previous discussion was that it wasn't so obvious how to architect a nicer solution. Nor is there all that much room for people to use a nicer solution, given that we need help from not just fork_process.c but the root-privileged startup script (or lately in Fedora it's a systemd unit script doing the privilege-increasing end of this). In the short run, if we don't have consensus on this, I'll probably just carry a Fedora patch like so: - int fd = open("/proc/self/oom_adj", O_WRONLY, 0); + int fd = open("/proc/self/oom_score_adj", O_WRONLY, 0); But it seems a tad silly to be carrying such a patch for a block of code that is only of interest to Linux packagers anyway, and nearly all such packagers are facing this same issue either now or in the very near future. regards, tom lane
On Tue, Jun 12, 2012 at 1:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jun 12, 2012 at 12:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I still think it's sufficient to do what I suggested initially: >>>> ... The simplest, least risky change that I can think of is to >>>> copy-and-paste the relevant #ifdef code block in fork_process.c. > >> I think my position, and the position of the people who responded to >> the original thread, was that it seems like you're architecting a >> kludge when it wouldn't be that hard to architect a nicer solution. > > I'd be the first to agree it's a kluge. But the end result of the > previous discussion was that it wasn't so obvious how to architect > a nicer solution. Nor is there all that much room for people to use a > nicer solution, given that we need help from not just fork_process.c > but the root-privileged startup script (or lately in Fedora it's a > systemd unit script doing the privilege-increasing end of this). Well, I don't think a GUC or two would be such a bad way of doing it, but... > In the short run, if we don't have consensus on this, I'll probably > just carry a Fedora patch like so: > > - int fd = open("/proc/self/oom_adj", O_WRONLY, 0); > + int fd = open("/proc/self/oom_score_adj", O_WRONLY, 0); > > But it seems a tad silly to be carrying such a patch for a block of > code that is only of interest to Linux packagers anyway, and nearly > all such packagers are facing this same issue either now or in the > very near future. ...I also agree with this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Sep 19, 2011 at 10:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Eisentraut <peter_e@gmx.net> writes: >> On sön, 2011-09-18 at 12:21 -0400, Tom Lane wrote: >>> But having said that, it wouldn't be very hard to arrange things so that >>> if you did have both symbols defined, the code would only attempt to >>> write oom_adj if it had failed to write oom_score_adj; which is about as >>> close as you're likely to get to a kernel version test for this. > >> Why is this feature not a run-time configuration variable or at least a >> configure option? It's awfully well hidden now. I doubt a lot of >> people are using this even though they might wish to. > > See the thread in which the feature was designed originally: > http://archives.postgresql.org/pgsql-hackers/2010-01/msg00170.php > > The key point is that to get useful behavior, you need cooperation > between both a root-privileged startup script and the PG executable. > That tends to throw the problem into the domain of packagers, more > than end users, and definitely puts a big crimp in the idea that > run-time configuration of just half of the behavior could be helpful. > So far, no Linux packagers have complained that the design is inadequate > (a position that I also hold when wearing my red fedora) so I do not > feel a need to complicate it further. Startup scripts are not solely in the domain of packagers. End users can also be expected to develop/edit their own startup scripts. Providing it as GUC would have given end users both the peices, but with a compile-time option they have only one half of the solution; except if they go compile their own binaries, which forces them into being packagers. I am not alone in feeling that if Postgres wishes to provide a control over child backend's oom_score_adj, it should be a GUC parameter rather than a compile-time option. Yesterday a customer wanted to leverage this and couldn't because they refuse to maintain their own fork of Postgres code. Please find attached a patch to turn it into a GUC, #ifdef'd by __linux__ macro. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Вложения
Gurjeet Singh <gurjeet@singh.im> writes: > Startup scripts are not solely in the domain of packagers. End users > can also be expected to develop/edit their own startup scripts. > Providing it as GUC would have given end users both the peices, but > with a compile-time option they have only one half of the solution; > except if they go compile their own binaries, which forces them into > being packagers. I don't find that this argument holds any water at all. Anyone who's developing their own start script can be expected to manage recompiling Postgres. Extra GUCs do not have zero cost, especially not ones that are as complicated-to-explain as this would be. I would also argue that there's a security issue with making it a GUC. A non-root DBA should not have the authority to decide whether or not postmaster child processes run with nonzero OOM adjustment; that decision properly belongs to whoever has authorized the root-owned startup script to change the adjustment in the first place. So seeing this as two independent pieces is not only wrong but dangerous. regards, tom lane
On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Gurjeet Singh <gurjeet@singh.im> writes: >> Startup scripts are not solely in the domain of packagers. End users >> can also be expected to develop/edit their own startup scripts. > >> Providing it as GUC would have given end users both the peices, but >> with a compile-time option they have only one half of the solution; >> except if they go compile their own binaries, which forces them into >> being packagers. > > I don't find that this argument holds any water at all. Anyone who's > developing their own start script can be expected to manage recompiling > Postgres. Huh? Lots of people install PostgreSQL via, say, RPMs, but may still want to change their startup script locally. I don't think those users should have to give up the benefits of RPM packaging because they want a different oom_adj value. Then they have to be responsible for updating the packages every time there's a new minor release, instead of just typing 'yum update'. That's a LOT of extra work. > Extra GUCs do not have zero cost, especially not ones that are as > complicated-to-explain as this would be. NOT having them isn't free either. > I would also argue that there's a security issue with making it a GUC. > A non-root DBA should not have the authority to decide whether or not > postmaster child processes run with nonzero OOM adjustment; that decision > properly belongs to whoever has authorized the root-owned startup script > to change the adjustment in the first place. So seeing this as two > independent pieces is not only wrong but dangerous. I think the only possible issue is if the DBA doesn't even have shell access. If he doesn't have root but does have shell access, he could have recompiled anyway - it's just more work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Gurjeet Singh <gurjeet@singh.im> writes: >> Startup scripts are not solely in the domain of packagers. End users >> can also be expected to develop/edit their own startup scripts. > >> Providing it as GUC would have given end users both the peices, but >> with a compile-time option they have only one half of the solution; >> except if they go compile their own binaries, which forces them into >> being packagers. > > I don't find that this argument holds any water at all. Anyone who's > developing their own start script can be expected to manage recompiling > Postgres. I respectfully disagree. Writing and managing init/start scripts requires much different set of expertise than compiling and managing builds of a critical software like a database product. I would consider adding `echo -1000 > /proc/self/oom_score_adj` to a start script as a deployment specific tweak, and not 'developing own start script'. > Extra GUCs do not have zero cost, especially not ones that are as > complicated-to-explain as this would be. > > I would also argue that there's a security issue with making it a GUC. > A non-root DBA should not have the authority to decide whether or not > postmaster child processes run with nonzero OOM adjustment; that decision > properly belongs to whoever has authorized the root-owned startup script > to change the adjustment in the first place. So seeing this as two > independent pieces is not only wrong but dangerous. From experiments last night, I see that child process' oom_score_adj is capped by the parent process' setting that is forking. So I don't think it's a security risk from that perspective. I'll retest and provide the exact findings. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote: > Providing it as GUC would have given end users both the peices, but > with a compile-time option they have only one half of the solution; > except if they go compile their own binaries, which forces them into > being packagers. > > I am not alone in feeling that if Postgres wishes to provide a control > over child backend's oom_score_adj, it should be a GUC parameter > rather than a compile-time option. Yesterday a customer wanted to > leverage this and couldn't because they refuse to maintain their own > fork of Postgres code. Independent of the rest of the discussion, I think there's one more point: Trying to keep your system stable by *increasing* the priority of normal backends is a bad idea. If you system gets into OOM land you need to fix that, not whack who gets killed around. The reason it makes sense to increase the priority of the postmaster is that that *does* increase the stability by cleaning up resources and restarting everything. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 10, 2014 at 10:32 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote: >> Providing it as GUC would have given end users both the peices, but >> with a compile-time option they have only one half of the solution; >> except if they go compile their own binaries, which forces them into >> being packagers. >> >> I am not alone in feeling that if Postgres wishes to provide a control >> over child backend's oom_score_adj, it should be a GUC parameter >> rather than a compile-time option. Yesterday a customer wanted to >> leverage this and couldn't because they refuse to maintain their own >> fork of Postgres code. > > Independent of the rest of the discussion, I think there's one more > point: Trying to keep your system stable by *increasing* the priority of > normal backends is a bad idea. If you system gets into OOM land you need > to fix that, not whack who gets killed around. > The reason it makes sense to increase the priority of the postmaster is > that that *does* increase the stability by cleaning up resources and > restarting everything. But the priority is inherited by child processes, so to decrease the priority of JUST the postmaster you need to decrease its priority and then set the priority of each child back to the original value. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 10, 2014 at 10:32 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote: >> Providing it as GUC would have given end users both the peices, but >> with a compile-time option they have only one half of the solution; >> except if they go compile their own binaries, which forces them into >> being packagers. >> >> I am not alone in feeling that if Postgres wishes to provide a control >> over child backend's oom_score_adj, it should be a GUC parameter >> rather than a compile-time option. Yesterday a customer wanted to >> leverage this and couldn't because they refuse to maintain their own >> fork of Postgres code. > > Independent of the rest of the discussion, I think there's one more > point: Trying to keep your system stable by *increasing* the priority of > normal backends is a bad idea. If you system gets into OOM land you need > to fix that, not whack who gets killed around. > The reason it makes sense to increase the priority of the postmaster is > that that *does* increase the stability by cleaning up resources and > restarting everything. As it stands right now, a user can decrease the likelyhood of Postmaster being killed by adjusting the start script, but that decreases the likelyhood of al the child processes, too, making the Postmaster just as likely as a kill-candidate, maybe even higher because it's the parent, as any backend. This patch gives the user a control to let the backend's likelyhood of being killed be different/higher than that of the postmaster. The users were already capable of doing that, but were required to custom-compile Postgres to get the benefits. This patch just removes that requirement. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
On 2014-06-10 10:42:41 -0400, Robert Haas wrote: > On Tue, Jun 10, 2014 at 10:32 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-06-10 07:56:01 -0400, Gurjeet Singh wrote: > >> Providing it as GUC would have given end users both the peices, but > >> with a compile-time option they have only one half of the solution; > >> except if they go compile their own binaries, which forces them into > >> being packagers. > >> > >> I am not alone in feeling that if Postgres wishes to provide a control > >> over child backend's oom_score_adj, it should be a GUC parameter > >> rather than a compile-time option. Yesterday a customer wanted to > >> leverage this and couldn't because they refuse to maintain their own > >> fork of Postgres code. > > > > Independent of the rest of the discussion, I think there's one more > > point: Trying to keep your system stable by *increasing* the priority of > > normal backends is a bad idea. If you system gets into OOM land you need > > to fix that, not whack who gets killed around. > > The reason it makes sense to increase the priority of the postmaster is > > that that *does* increase the stability by cleaning up resources and > > restarting everything. > > But the priority is inherited by child processes, so to decrease the > priority of JUST the postmaster you need to decrease its priority and > then set the priority of each child back to the original value. Gurjeet argues for making the absolute value of child processes priority configurable though? My point is that attacking the problem from that angle is wrong. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I don't find that this argument holds any water at all. Anyone who's >> developing their own start script can be expected to manage recompiling >> Postgres. > Huh? Lots of people install PostgreSQL via, say, RPMs, but may still > want to change their startup script locally. So? The RPM packager could probably be expected to have compiled with the oom-adjust-reset option enabled. If not, why aren't these people lobbying the packager to meet their expectation? I remain of the opinion that allowing nonprivileged people to decide whether that code is active or not is unsafe from a system level. regards, tom lane
On Tue, Jun 10, 2014 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jun 10, 2014 at 10:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I don't find that this argument holds any water at all. Anyone who's >>> developing their own start script can be expected to manage recompiling >>> Postgres. > >> Huh? Lots of people install PostgreSQL via, say, RPMs, but may still >> want to change their startup script locally. > > So? The RPM packager could probably be expected to have compiled with the > oom-adjust-reset option enabled. If not, why aren't these people lobbying > the packager to meet their expectation? Because that might take years to happen, or the packager might never do it at all on the theory that what is good for customers in general is different than what's good for one particular customer, or on the theory that it's just not a high enough priority for that packager. Are you seriously saying that you've never needed to customize a startup script on a RHEL box somewhere? > I remain of the opinion that allowing nonprivileged people to decide > whether that code is active or not is unsafe from a system level. On what factual basis? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Andres Freund <andres@2ndquadrant.com> writes: > Independent of the rest of the discussion, I think there's one more > point: Trying to keep your system stable by *increasing* the priority of > normal backends is a bad idea. If you system gets into OOM land you need > to fix that, not whack who gets killed around. > The reason it makes sense to increase the priority of the postmaster is > that that *does* increase the stability by cleaning up resources and > restarting everything. That's half of the reason. The other half is that, at least back when we added this code, the Linux kernel's victim-selection code disproportionately chose to kill the postmaster rather than any child backend, an outcome definitely not to be preferred. IIRC this was because it blamed the postmaster for the sum of childrens' memory sizes *including shared memory*, counting the shared memory over again for each child. I don't know whether our switch away from SysV shmem has helped that, or if recent kernel versions have less brain-dead OOM victim selection. I'm not terribly hopeful though. But anyway, yeah, the point of this feature is that the OOM priority of the postmaster, and *only* the postmaster, should be raised. Allowing unprivileged people to break that is not attractive on any level. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 10, 2014 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So? The RPM packager could probably be expected to have compiled with the >> oom-adjust-reset option enabled. If not, why aren't these people lobbying >> the packager to meet their expectation? > Because that might take years to happen, ... unlike adding a GUC? > or the packager might never > do it at all on the theory that what is good for customers in general > is different than what's good for one particular customer, or on the > theory that it's just not a high enough priority for that packager. > Are you seriously saying that you've never needed to customize a > startup script on a RHEL box somewhere? Sure, but what's that have to do with this? Any Red Hat or PGDG RPM will come with this code already enabled in the build, so there is no need for anyone to have a GUC to play around with the behavior. >> I remain of the opinion that allowing nonprivileged people to decide >> whether that code is active or not is unsafe from a system level. > On what factual basis? Because it would convert the intended behavior (postmaster and only postmaster is exempt from OOM kill) into a situation where possibly all of the database processes are exempt from OOM kill, at the whim of somebody who should not have the privilege to decide that. In my view, the root-owned startup script grants OOM exemption to the postmaster because it *knows* that the postmaster's children will drop the exemption. If that trust can be violated because some clueless DBA decided to frob a GUC setting, I'd be a lot more hesitant about giving the postmaster the exemption in the first place. regards, tom lane
> Andres Freund <andres@2ndquadrant.com> writes: > > Independent of the rest of the discussion, I think there's one more > > point: Trying to keep your system stable by *increasing* the priority of > > normal backends is a bad idea. If you system gets into OOM land you need > > to fix that, not whack who gets killed around. > > The reason it makes sense to increase the priority of the postmaster is > > that that *does* increase the stability by cleaning up resources and > > restarting everything. > > That's half of the reason. The other half is that, at least back when > we added this code, the Linux kernel's victim-selection code > disproportionately chose to kill the postmaster rather than any child > backend, an outcome definitely not to be preferred. IIRC this was because > it blamed the postmaster for the sum of childrens' memory sizes *including > shared memory*, counting the shared memory over again for each child. That's essentially the same thing. We want the postmaster survive, not the children... > I don't know whether our switch away from SysV shmem has helped that, or > if recent kernel versions have less brain-dead OOM victim selection. > I'm not terribly hopeful though. It has gotten much better in the latest few versions. But it'll be a fair bit till those will be wildly used. And even then, from the kernel's view, there's just no reason strongly prefer not to kill the postmaster over other processes without the user providing the information. On 2014-06-10 11:04:52 -0400, Tom Lane wrote: > But anyway, yeah, the point of this feature is that the OOM priority > of the postmaster, and *only* the postmaster, should be raised. Allowing > unprivileged people to break that is not attractive on any level. A superuser could already write a function that echoes into /proc? I fail to see how a GUC changes the landscape significantly here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-06-10 11:04:52 -0400, Tom Lane wrote: >> But anyway, yeah, the point of this feature is that the OOM priority >> of the postmaster, and *only* the postmaster, should be raised. Allowing >> unprivileged people to break that is not attractive on any level. > A superuser could already write a function that echoes into /proc? Maybe I'm mistaken, but I thought once the fork_process code has reset our process's setting to zero it's not possible to lower it again (without privileges we'd not have). regards, tom lane
On 2014-06-10 11:14:43 -0400, Tom Lane wrote: > Sure, but what's that have to do with this? Any Red Hat or PGDG RPM will > come with this code already enabled in the build, so there is no need for > anyone to have a GUC to play around with the behavior. That's imo a fair point. Unless I misunderstand things Gurjeet picked the topic up again because he wants to increase the priority of the children. Is that correct Gurjeet? I do think a GUC might be a nicer interface for this than a #define. Possibly even with a absolute reset value for the children. But only because there's no way, afaics, to explicitly reset to the default value. > >> I remain of the opinion that allowing nonprivileged people to decide > >> whether that code is active or not is unsafe from a system level. > > > On what factual basis? > > Because it would convert the intended behavior (postmaster and only > postmaster is exempt from OOM kill) into a situation where possibly > all of the database processes are exempt from OOM kill, at the whim > of somebody who should not have the privilege to decide that. Meh^3. By that argument we need to forbid superusers to create any form of untrusted functions. Forbid anything that does malloc(), system(), fork(), whatever from a user's influence. Superusers can execute code in the postmaster using shared_preload_libraries. Feigning that that's not possible isn't buying us anything. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Gurjeet Singh <gurjeet@singh.im> writes: > As it stands right now, a user can decrease the likelyhood of > Postmaster being killed by adjusting the start script, but that > decreases the likelyhood of al the child processes, too, making the > Postmaster just as likely as a kill-candidate, maybe even higher > because it's the parent, as any backend. Exactly. > This patch gives the user a control to let the backend's likelyhood of > being killed be different/higher than that of the postmaster. If you think your users might want to give the postmaster OOM-exemption, why don't you just activate the existing code when you build? Resetting the OOM setting to zero is safe whether or not the startup script did anything to the postmaster's setting. In short, I don't see what a GUC adds here, except uncertainty, which is not a good thing. regards, tom lane
Re: /proc/self/oom_adj is deprecated in newer Linux kernels
От
andres@anarazel.de (Andres Freund)
Дата:
On 2014-06-10 11:20:28 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-06-10 11:04:52 -0400, Tom Lane wrote: > >> But anyway, yeah, the point of this feature is that the OOM priority > >> of the postmaster, and *only* the postmaster, should be raised. Allowing > >> unprivileged people to break that is not attractive on any level. > > > A superuser could already write a function that echoes into /proc? > > Maybe I'm mistaken, but I thought once the fork_process code has reset our > process's setting to zero it's not possible to lower it again (without > privileges we'd not have). No, doesn't look that way. It's possible to reset it to the value set at process start. So unless we introduce double forks for every backend start it can be reset by ordinary processes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 10, 2014 at 11:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jun 10, 2014 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> So? The RPM packager could probably be expected to have compiled with the >>> oom-adjust-reset option enabled. If not, why aren't these people lobbying >>> the packager to meet their expectation? > >> Because that might take years to happen, > > ... unlike adding a GUC? /me shrugs. Sure, it'll take a release cycle, but once we've made this configurable, it'll always be configurable. New packagers who don't have this set up properly can show up at any time. >> or the packager might never >> do it at all on the theory that what is good for customers in general >> is different than what's good for one particular customer, or on the >> theory that it's just not a high enough priority for that packager. > >> Are you seriously saying that you've never needed to customize a >> startup script on a RHEL box somewhere? > > Sure, but what's that have to do with this? Any Red Hat or PGDG RPM will > come with this code already enabled in the build, so there is no need for > anyone to have a GUC to play around with the behavior. Well, clearly, somebody hasn't got it right, or there wouldn't be this complaint. I'll grant you that "somebody" may be EnterpriseDB's own packaging in this instance, but I wouldn't like to bet that no one else has ever got this wrong nor ever will. Peter asked upthread why this wasn't a GUC with the comment that "Why is this feature not a run-time configuration variable or at least a configure option? It's awfully well hidden now. I doubt a lot of people are using this even though they might wish to." I think that's quite right, and note that Peter is in no way affiliated with EnterpriseDB and made that comment (rather presciently) long before Gurjeet's recent report. >>> I remain of the opinion that allowing nonprivileged people to decide >>> whether that code is active or not is unsafe from a system level. > >> On what factual basis? > > Because it would convert the intended behavior (postmaster and only > postmaster is exempt from OOM kill) into a situation where possibly > all of the database processes are exempt from OOM kill, at the whim > of somebody who should not have the privilege to decide that. Gurjeet already refused that argument. Apparently, the child processes can only increase their chance of being OOM-killed, not decrease it, so you have this exactly backwards: right now, an individual system administrator can exempt all of the database processes from OOM kill, but cannot exempt the postmaster alone. > In my view, the root-owned startup script grants OOM exemption to > the postmaster because it *knows* that the postmaster's children > will drop the exemption. If that trust can be violated because some > clueless DBA decided to frob a GUC setting, I'd be a lot more hesitant > about giving the postmaster the exemption in the first place. How about using an environment variable? It seems to me that would address the concern about DBAs without shell access. They might be able to frob GUCs, but presumably not the postmaster's starting environment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 10, 2014 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> This patch gives the user a control to let the backend's likelyhood of >> being killed be different/higher than that of the postmaster. > > If you think your users might want to give the postmaster OOM-exemption, > why don't you just activate the existing code when you build? Resetting > the OOM setting to zero is safe whether or not the startup script did > anything to the postmaster's setting. The whole scenario here is that the user *doesn't want to recompile*. You seem to be trying to relitigate an argument that Gurjeet already discussed in his original post and I already refuted once after that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-06-10 11:14:43 -0400, Tom Lane wrote: >> Because it would convert the intended behavior (postmaster and only >> postmaster is exempt from OOM kill) into a situation where possibly >> all of the database processes are exempt from OOM kill, at the whim >> of somebody who should not have the privilege to decide that. > Meh^3. By that argument we need to forbid superusers to create any form > of untrusted functions. Forbid anything that does malloc(), system(), > fork(), whatever from a user's influence. That's utter and complete nonsense. We're discussing an operation that is root-privileged (ie, lowering a process's OOM score), not random stuff that unprivileged processes can do. regards, tom lane
On 2014-06-10 11:35:23 -0400, Robert Haas wrote: > > Because it would convert the intended behavior (postmaster and only > > postmaster is exempt from OOM kill) into a situation where possibly > > all of the database processes are exempt from OOM kill, at the whim > > of somebody who should not have the privilege to decide that. > > Gurjeet already refused that argument. Only that he was wrong: ~# echo -500 > /proc/$BASHPID/oom_score_adj ~# su andres ~$ echo -1000 > /proc/$BASHPID/oom_score_adj bash: echo: write error: Permission denied # so I can't decrease the likelihood ~$ echo -400 > /proc/$BASHPID/oom_score_adj # but I can increase it ~$ echo -500 > /proc/$BASHPID/oom_score_adj # but also *reset* it Since we have to do the adjustment after the fork - otherwise the postmaster would be vulnerable for a time - normal backends can reset their score. > Apparently, the child > processes can only increase their chance of being OOM-killed, not > decrease it, so you have this exactly backwards: right now, an > individual system administrator can exempt all of the database > processes from OOM kill, but cannot exempt the postmaster alone. Well, if you compile postgres with the #define it'll reset the likelihood after the fork? That's the reset? Or do you mean without compiling postgres with the option? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 10, 2014 at 11:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > In my view, the root-owned startup script grants OOM exemption to > the postmaster because it *knows* that the postmaster's children > will drop the exemption. If that trust can be violated because some > clueless DBA decided to frob a GUC setting, I'd be a lot more hesitant > about giving the postmaster the exemption in the first place. Even if the clueless DBA tries to set the oom_score_adj below that of Postmaster, Linux kernel prevents that from being done. I demonstrate that in the below screen share. I used GUC as well as plain command line to try and set a child's badness (oom_score_adj) to be lower than that of Postmaster's, and Linux disallows doing that, unless I use root's powers. So the argument that this GUC is a security concern, can be ignored. Root user (one with control of start script) still controls the lowest badness setting of all Postgres processes. If done at fork_process time, the child process simply inherits parent's badness setting. Best regards, # Set postmaster badness: -1000 $ sudo echo -1000 > /proc/$$/oom_score_adj ; cat /proc/self/oom_score_adj -1000 # Set backend badness the same as postmaster: -1000 $ perl -pi -e 's/linux_oom_score_adj = .*/linux_oom_score_adj = -1000/' $PGDATA/postgresql.conf ; grep oom $PGDATA/postgresql.conf linux_oom_score_adj = -1000 $ pgstop; pgstart pg_ctl: server is running (PID: 65355) /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres "-D" "/home/gurjeet/dev/pgdbuilds/oom_guc/db/data" waiting for server to shut down.... done server stopped pg_ctl: no server running waiting for server to start.... done server started Database cluster state: in production # Backends and the postmaster have the same badness; lower than default 0 $ for p in $(echo $(pgserverPIDList)| cut -d , --output-delimiter=' ' -f 1-); do echo $p $(cat /proc/$p/oom_score_adj) $(cat /proc/$p/cmdline); done 537 -1000 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data 539 -1000 postgres: checkpointer process 540 -1000 postgres: writer process 541 -1000 postgres: wal writer process 542 -1000 postgres: autovacuum launcher process 543 -1000 postgres: stats collector process # Set postmaster badness: -100 $ sudo echo -100 > /proc/$$/oom_score_adj ; cat /proc/self/oom_score_adj -100 # Set backend badness the lower than postmaster: -500 vs. -100 $ perl -pi -e 's/linux_oom_score_adj = .*/linux_oom_score_adj = -500... linux_oom_score_adj = -500 $ pgstop; pgstart ... # Backends are capped to parent's badness. $ for p in $(echo $(pgserverPIDList)| cut -d , ... 716 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data 718 -100 postgres: checkpointer process 719 -100 postgres: writer process 720 -100 postgres: wal writer process 721 -100 postgres: autovacuum launcher process 722 -100 postgres: stats collector process # Set postmaster badness: -100 $ sudo echo -100 > /proc/$$/oom_score_adj ; cat /proc/self/oom_score_adj -100 # Set backend badness the higher than postmaster: +500 vs. -100 $ perl -pi -e 's/linux_oom_score_adj = .*/linux_oom_score_adj = 500/' $PGDATA/postgresql.conf ; grep oom $PGDATA/postgresql.conf linux_oom_score_adj = 500 $ pgstop; pgstart ... # Backends have higher badness, hence higher likelyhood to be killed than the Postmaster. $ for p in $(echo $(pgserverPIDList)| cut -d , ... 1602 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data 1604 500 postgres: checkpointer process 1605 500 postgres: writer process 1606 500 postgres: wal writer process 1607 500 postgres: autovacuum launcher process 1608 500 postgres: stats collector process # Set postmaster badness: -100 $ sudo echo -100 > /proc/$$/oom_score_adj ; cat /proc/self/oom_score_adj -100 # Reset backend badness to default: 0 $ perl -pi -e 's/linux_oom_score_adj = .*/linux_oom_score_adj = 0/' $PGDATA/postgresql.conf ; grep oom $PGDATA/postgresql.conf linux_oom_score_adj = 0 $ pgstop; pgstart ... # Backends have higher badness, hence higher likelyhood to be killed than the Postmaster. $ for p in $(echo $(pgserverPIDList)| cut -d , ... 1835 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data 1837 0 postgres: checkpointer process 1838 0 postgres: writer process 1839 0 postgres: wal writer process 1840 0 postgres: autovacuum launcher process 1841 0 postgres: stats collector process # Lower checkpointer's badness, but keep it higher than postmaster's $ echo -1 > /proc/1837/oom_score_adj $ for p in $(echo $(pgserverPIDList)| cut -d , ... 1835 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data 1837 -1 postgres: checkpointer process ... # Lower checkpointer's badness, but keep it same as postmaster's $ echo -100 > /proc/1837/oom_score_adj $ for p in $(echo $(pgserverPIDList)| cut -d , ... 1835 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data 1837 -100 postgres: checkpointer process ... # Lower checkpointer's badness, and try to lower it below postmaster's $ echo -101 > /proc/1837/oom_score_adj bash: echo: write error: Permission denied $ for p in $(echo $(pgserverPIDList)| cut -d , ... 1835 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data 1837 -100 postgres: checkpointer process ... # As root, force child process' badness to be lower than postnaster's $ sudo echo -101 > /proc/1837/oom_score_adj [sudo] password for gurjeet: $ for p in $(echo $(pgserverPIDList)| cut -d , ... 1835 -100 /home/gurjeet/dev/pgdbuilds/oom_guc/db/bin/postgres-D/home/gurjeet/dev/pgdbuilds/oom_guc/db/data 1837 -101 postgres: checkpointer process ... -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Robert Haas <robertmhaas@gmail.com> writes: > Well, clearly, somebody hasn't got it right, or there wouldn't be this > complaint. I'll grant you that "somebody" may be EnterpriseDB's own > packaging in this instance, but I wouldn't like to bet that no one > else has ever got this wrong nor ever will. Peter asked upthread why > this wasn't a GUC with the comment that "Why is this feature not a > run-time configuration variable or at least a configure option? It's > awfully well hidden now. I doubt a lot of people are using this even > though they might wish to." I think that's quite right, and note that > Peter is in no way affiliated with EnterpriseDB and made that comment > (rather presciently) long before Gurjeet's recent report. I'd be okay with a configure option, if you think that would make this issue more visible to packagers. It's delegating the responsibility to the DBA level that I'm unhappy about. >> Because it would convert the intended behavior (postmaster and only >> postmaster is exempt from OOM kill) into a situation where possibly >> all of the database processes are exempt from OOM kill, at the whim >> of somebody who should not have the privilege to decide that. > Gurjeet already refused that argument. He can refuse it all he likes, but that doesn't make his opinion correct. > How about using an environment variable? It seems to me that would > address the concern about DBAs without shell access. They might be > able to frob GUCs, but presumably not the postmaster's starting > environment. Hmmm ... yeah, that might work. It would solve the problem I'm worried about, which is making sure that the startup script has control of what happens. regards, tom lane
On 2014-06-10 11:40:25 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-06-10 11:14:43 -0400, Tom Lane wrote: > >> Because it would convert the intended behavior (postmaster and only > >> postmaster is exempt from OOM kill) into a situation where possibly > >> all of the database processes are exempt from OOM kill, at the whim > >> of somebody who should not have the privilege to decide that. > > > Meh^3. By that argument we need to forbid superusers to create any form > > of untrusted functions. Forbid anything that does malloc(), system(), > > fork(), whatever from a user's influence. > > That's utter and complete nonsense. We're discussing an operation that is > root-privileged (ie, lowering a process's OOM score), not random stuff > that unprivileged processes can do. Oh, comeon. Tom. You a) conveniently left of the part where I said that the user can execute code from the postmaster. b) fork() can be used to escape the oom killer. c) Lots of much worse things can be done to the system with arbitrary system calls than adjusting oom_score_adj. The postmaster can currently change oom_score_adj. Users can run code as a postmaster. Simple as that. Besides, as demonstrated in http://www.postgresql.org/message-id/20140610154536.GN8406@alap3.anarazel.de postmaster children can already reset their score. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 10, 2014 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'd be okay with a configure option, if you think that would make this > issue more visible to packagers. It's delegating the responsibility to > the DBA level that I'm unhappy about. > [...] > >> How about using an environment variable? It seems to me that would >> address the concern about DBAs without shell access. They might be >> able to frob GUCs, but presumably not the postmaster's starting >> environment. > > Hmmm ... yeah, that might work. It would solve the problem I'm worried > about, which is making sure that the startup script has control of what > happens. A configure option wouldn't address the issue from my perspective; you still have to recompile if you don't like what your packager did, and your packager might still be stupid. However, an environment variable seems like it would be just fine. It might even be more convenient for packagers that having to compile the desired value into the binary. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
andres@anarazel.de (Andres Freund) writes: > On 2014-06-10 11:20:28 -0400, Tom Lane wrote: >> Maybe I'm mistaken, but I thought once the fork_process code has reset our >> process's setting to zero it's not possible to lower it again (without >> privileges we'd not have). > No, doesn't look that way. It's possible to reset it to the value set at > process start. So unless we introduce double forks for every backend > start it can be reset by ordinary processes. That's kind of annoying --- I wonder why they went to the trouble of doing that? But anyway, it's probably not worth the cost of double-forking to prevent it. I still say though that this is not a reason to make it as easy as change-a-GUC to break the intended behavior. Robert's idea of having the start script set an environment variable to control the OOM adjustment reset seems like it would satisfy my concern. regards, tom lane
On Tue, Jun 10, 2014 at 11:24 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-06-10 11:14:43 -0400, Tom Lane wrote: >> Sure, but what's that have to do with this? Any Red Hat or PGDG RPM will >> come with this code already enabled in the build, so there is no need for >> anyone to have a GUC to play around with the behavior. > > That's imo a fair point. Unless I misunderstand things Gurjeet picked > the topic up again because he wants to increase the priority of the > children. Is that correct Gurjeet? Yes. A DBA would like to prevent the postmaster from being killed by OOM killer, but let the child processes be still subject to OOM killer's whim. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 10, 2014 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If you think your users might want to give the postmaster OOM-exemption, >> why don't you just activate the existing code when you build? Resetting >> the OOM setting to zero is safe whether or not the startup script did >> anything to the postmaster's setting. > The whole scenario here is that the user *doesn't want to recompile*. Yeah, I understood that. The question is why EDB isn't satisfied to just add "-DLINUX_OOM_ADJ=0" to their build options, but instead would like to dump a bunch of uncertainty on other packagers who might not like the implications of a GUC. regards, tom lane
On 2014-06-10 11:52:17 -0400, Tom Lane wrote: > andres@anarazel.de (Andres Freund) writes: > > On 2014-06-10 11:20:28 -0400, Tom Lane wrote: > >> Maybe I'm mistaken, but I thought once the fork_process code has reset our > >> process's setting to zero it's not possible to lower it again (without > >> privileges we'd not have). > > > No, doesn't look that way. It's possible to reset it to the value set at > > process start. So unless we introduce double forks for every backend > > start it can be reset by ordinary processes. > > That's kind of annoying --- I wonder why they went to the trouble of doing > that? My guess is that they want to allow a process to only temporarily reduce the likelihood of getting killed while it's doing something important. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 10, 2014 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > andres@anarazel.de (Andres Freund) writes: >> On 2014-06-10 11:20:28 -0400, Tom Lane wrote: >>> Maybe I'm mistaken, but I thought once the fork_process code has reset our >>> process's setting to zero it's not possible to lower it again (without >>> privileges we'd not have). > >> No, doesn't look that way. It's possible to reset it to the value set at >> process start. So unless we introduce double forks for every backend >> start it can be reset by ordinary processes. > > That's kind of annoying --- I wonder why they went to the trouble of doing > that? But anyway, it's probably not worth the cost of double-forking to > prevent it. I still say though that this is not a reason to make it as > easy as change-a-GUC to break the intended behavior. > > Robert's idea of having the start script set an environment variable to > control the OOM adjustment reset seems like it would satisfy my concern. I'm fine with this solution. Should this be a constant 0, or be configurable based on env. variable's value? -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Gurjeet Singh <gurjeet@singh.im> writes: > On Tue, Jun 10, 2014 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert's idea of having the start script set an environment variable to >> control the OOM adjustment reset seems like it would satisfy my concern. > I'm fine with this solution. Should this be a constant 0, or be > configurable based on env. variable's value? If we're going to do this, I would say that we should also take the opportunity to get out from under the question of which kernel API we're dealing with. So perhaps a design like this: 1. If the environment variable PG_OOM_ADJUST_FILE is defined, it's the name of a file to write something into after forking. The typical value would be "/proc/self/oom_score_adj" or "/proc/self/oom_adj". If not set, nothing happens. 2. If the environment variable PG_OOM_ADJUST_VALUE is defined, that's the string we write, otherwise we write "0". regards, tom lane
On Tue, Jun 10, 2014 at 11:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jun 10, 2014 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> If you think your users might want to give the postmaster OOM-exemption, >>> why don't you just activate the existing code when you build? Resetting >>> the OOM setting to zero is safe whether or not the startup script did >>> anything to the postmaster's setting. > >> The whole scenario here is that the user *doesn't want to recompile*. > > Yeah, I understood that. The question is why EDB isn't satisfied to just > add "-DLINUX_OOM_ADJ=0" to their build options, but instead would like to > dump a bunch of uncertainty on other packagers who might not like the > implications of a GUC. Well, I think we should do that, too, and I'm going to propose it to Dave & team. If we're not already doing the right thing here (and I don't know off-hand whether we are), then that is definitely our issue to fix and there's no reason to change PostgreSQL on that account. But I think the point is that this is a pretty subtle and well-concealed thing which a PostgreSQL packager might fail to do correctly in every instance - or where an individual user might want to install a different policy than whatever the packager's default is. Making that easy to do without recompiling seems to me to be a general good. Magnus once commented to me that every GUC which is PGC_POSTMASTER is a bug - "perhaps not an easy bug to fix, but a bug all the same" (his words), because requiring people to restart the postmaster to change settings is often problematic. I think this is even more true of recompiling. IMO, it should be an explicit goal of the project to make reconfiguration of an already-installed system - perhaps even already in production - as easy as possible. We will doubtless fail to achieve perfection but that doesn't mean we shouldn't try. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
In short: I can accept the idea that picking reasonable specific values is impossible so let's just ensure that children are always killed before the parent (basically the default behavior). If you then say that any system that is capable of implementing that rule should have it set then leaving it unexposed makes sense. That line of thought does not require the abstract cost of a GUC to factor into the equation. However, in considering system configuration and concurrently running processes.... Tom Lane-2 wrote > Extra GUCs do not have zero cost, especially not ones that are as > complicated-to-explain as this would be. The explain factor does little for me since if given a reasonable default the GUC can be ignored until a problem manifests. At that point not having a GUC makes resolving the problem require someone to stop using packaging and instead compile their own source. This results in a poor user experience. So, if someone where to provide a complete patch that introduced a GUC to enable/disable as well as set priorities - to include documentation and reasonable postgresql.conf defaults - what specific ongoing GUC costs would prevent applying said patch? You mention a security hazard but that is not a cost of GUCs generally but this one specifically; and one that seems to have been deemed no riskier than other DBA capabilities. Help me understand the cost side of the equation since the benefit side seems clear-cut enough to me. The OOM problem is real and making PostgreSQL fit within the overall memory management architecture of a given server is something that is the responsibility of the system admin in conjunction with the DBA - not us or the packager. I'll buy that because this is a linux issue that the implementation could be packager selectable but given the dynamics of Linux centralizing a reasonable default implementation into core makes sense. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/proc-self-oom-adj-is-deprecated-in-newer-Linux-kernels-tp4810971p5806697.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On 06/10/2014 07:02 AM, Tom Lane wrote: > Gurjeet Singh <gurjeet@singh.im> writes: >> Startup scripts are not solely in the domain of packagers. End users >> can also be expected to develop/edit their own startup scripts. > >> Providing it as GUC would have given end users both the peices, but >> with a compile-time option they have only one half of the solution; >> except if they go compile their own binaries, which forces them into >> being packagers. > > I don't find that this argument holds any water at all. Anyone who's > developing their own start script can be expected to manage recompiling > Postgres. This is certainly not true in any fashion. Production systems require a degree of flexibility and configuration that does not require the maintaining or compiling of source code. Sincerely, Joshua D. Drake -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, @cmdpromptinc "If we send our children to Caesar for their education, we should not be surprised when they come back as Romans."
Gurjeet Singh-4 wrote > Even if the clueless DBA tries to set the oom_score_adj below that of > Postmaster, Linux kernel prevents that from being done. I demonstrate > that in the below screen share. I used GUC as well as plain command > line to try and set a child's badness (oom_score_adj) to be lower than > that of Postmaster's, and Linux disallows doing that, unless I use > root's powers. > > So the argument that this GUC is a security concern, can be ignored. > Root user (one with control of start script) still controls the lowest > badness setting of all Postgres processes. If done at fork_process > time, the child process simply inherits parent's badness setting. The counter point here is that the postmaster can be set to "no kill" and the >= condition allows for children to achieve the same while it is our explicit intent that the children be strictly > parent. To that end, should the adjustment value be provided as an offset to the postmasters instead of an absolute value - and disallow <= zero offset values in the process? I get and generally agree with the environment variable proposal and it's stated goal to restrict whom can makes changes. But how much less cost does an environment variable have than a GUC if one GUC argument is still its maintenance overhead? David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/proc-self-oom-adj-is-deprecated-in-newer-Linux-kernels-tp4810971p5806700.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
<p dir="ltr"><br /> On Jun 10, 2014 7:05 PM, "Joshua D. Drake" <<a href="mailto:jd@commandprompt.com">jd@commandprompt.com</a>>wrote:<br /> ><br /> ><br /> > On 06/10/2014 07:02AM, Tom Lane wrote:<br /> >><br /> >> Gurjeet Singh <<a href="mailto:gurjeet@singh.im">gurjeet@singh.im</a>>writes:<br /> >>><br /> >>> Startup scripts arenot solely in the domain of packagers. End users<br /> >>> can also be expected to develop/edit their own startupscripts.<br /> >><br /> >><br /> >>> Providing it as GUC would have given end users both thepeices, but<br /> >>> with a compile-time option they have only one half of the solution;<br /> >>>except if they go compile their own binaries, which forces them into<br /> >>> being packagers.<br />>><br /> >><br /> >> I don't find that this argument holds any water at all. Anyone who's<br /> >>developing their own start script can be expected to manage recompiling<br /> >> Postgres.<br /> ><br />><br /> > This is certainly not true in any fashion. Production systems require a degree of flexibility and configurationthat does not require the maintaining or compiling of source code.<br /> ><p dir="ltr">As JD surely understands,it pains me a lot but I have to agree with him here.<p dir="ltr">There are also a non-trivial number of organizationsthat just don't allow compilers on production boxes, adding another hurdle for those. We may think that's ridiculous,and it may be, but it's reality.<p dir="ltr">That said, I agree that a guc is probably a bad idea. I like theidea of an environment variable if we need it to be changeable. Or perhaps what we need is just better documentation ofwhat's there already. Perhaps the documentation should have a section specifically aimed at packagers? <p dir="ltr">/Magnus
On Tue, Jun 10, 2014 at 1:13 PM, David G Johnston <david.g.johnston@gmail.com> wrote: > Gurjeet Singh-4 wrote >> So the argument that this GUC is a security concern, can be ignored. >> Root user (one with control of start script) still controls the lowest >> badness setting of all Postgres processes. If done at fork_process >> time, the child process simply inherits parent's badness setting. > > The counter point here is that the postmaster can be set to "no kill" and Only the root user can do that, since he/she has control over the start script. All that the DBA could do with a GUC is to set backends' badness worse than postmaster's, but bot better. > the >= condition allows for children to achieve the same while it is our > explicit intent that the children be strictly > parent. I don't think anyone argued for that behaviour. > To that end, should the adjustment value be provided as an offset to the > postmasters instead of an absolute value - and disallow <= zero offset > values in the process? Seems unnecessary, given current knowledge. > I get and generally agree with the environment variable proposal and it's > stated goal to restrict whom can makes changes. But how much less cost does > an environment variable have than a GUC if one GUC argument is still its > maintenance overhead? Having it as a GUC would have meant that two entities are required to get the configuration right: one who controls start scripts, and the other who controls GUC settings. With the environment variable approach, a root user alone can control the behaviour like so in start script: echo -200 > /proc/self/oom_score_adj export PG_OOM_ADJUST_FILE=oom_score_adj export PG_OOM_ADJUST_VALUE=-100 Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
On Tue, Jun 10, 2014 at 12:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > If we're going to do this, I would say that we should also take the > opportunity to get out from under the question of which kernel API > we're dealing with. So perhaps a design like this: > > 1. If the environment variable PG_OOM_ADJUST_FILE is defined, it's > the name of a file to write something into after forking. The typical > value would be "/proc/self/oom_score_adj" or "/proc/self/oom_adj". > If not set, nothing happens. > > 2. If the environment variable PG_OOM_ADJUST_VALUE is defined, that's > the string we write, otherwise we write "0". Please find attached the patch. It includes the doc changes as well. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Вложения
Gurjeet Singh <gurjeet@singh.im> writes: > Please find attached the patch. It includes the doc changes as well. What exactly is the point of the static state you added here? There is no situation where that could possibly be useful, because this code is executed at most once per process, and not at all in the postmaster. AFAICS, that's just complication (and permanent waste of a kilobyte of static data space) for no return. Also, this seems to try to write the file whether or not the environment variable was set, which wasn't the plan. I also don't really see the point of parsing the value variable into an int. Why not just spit it out as the original string? It's not ours to legislate whether the kernel will take a value that's not an integer. regards, tom lane
Gurjeet Singh <gurjeet@singh.im> writes: > On Tue, Jun 10, 2014 at 12:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If we're going to do this, I would say that we should also take the >> opportunity to get out from under the question of which kernel API >> we're dealing with. So perhaps a design like this: >> >> 1. If the environment variable PG_OOM_ADJUST_FILE is defined, it's >> the name of a file to write something into after forking. The typical >> value would be "/proc/self/oom_score_adj" or "/proc/self/oom_adj". >> If not set, nothing happens. >> >> 2. If the environment variable PG_OOM_ADJUST_VALUE is defined, that's >> the string we write, otherwise we write "0". > Please find attached the patch. It includes the doc changes as well. Applied with some editorialization. regards, tom lane
On Wed, Jun 18, 2014 at 8:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Gurjeet Singh <gurjeet@singh.im> writes: > >> Please find attached the patch. It includes the doc changes as well. > > Applied with some editorialization. Thanks! would it be possible to include this in 9.4 as well? Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Gurjeet Singh <gurjeet@singh.im> writes: > would it be possible to include this in 9.4 as well? While this is clearly an improvement over what we had before, it's impossible to argue that it's a bug fix, and we are way past the 9.4 feature freeze deadline. In particular, packagers who've already done their 9.4 development work might be blindsided by us slipping this into 9.4 release. So while I wouldn't have a problem with putting this change into 9.4 from a technical standpoint, it's hard to argue that it'd meet project norms from a development-process standpoint. regards, tom lane
On Mon, Jun 23, 2014 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Gurjeet Singh <gurjeet@singh.im> writes: >> would it be possible to include this in 9.4 as well? > > While this is clearly an improvement over what we had before, it's > impossible to argue that it's a bug fix, and we are way past the 9.4 > feature freeze deadline. In particular, packagers who've already done > their 9.4 development work might be blindsided by us slipping this into > 9.4 release. So while I wouldn't have a problem with putting this change > into 9.4 from a technical standpoint, it's hard to argue that it'd meet > project norms from a development-process standpoint. While I'd love to reduce the number of future installations without this fix in place, I respect the decision to honor project policy. At the same time, this change does not break anything. It introduces new environment variables which change the behaviour, but behaves the old way in the absence of those variables. So I guess it's not changing the default behaviour in incompatible way. BTW, does the project publish the feature-freeze deadlines and other dates somewhere (apart from on this list). I was looking the other day and didn't find any reference. Either the commitfest app or the 'Developer' section of the wiki [1] seem to be good candidates for this kind of information. [1]: https://wiki.postgresql.org/wiki/Development_information Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Gurjeet Singh <gurjeet@singh.im> writes: > On Mon, Jun 23, 2014 at 11:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> While this is clearly an improvement over what we had before, it's >> impossible to argue that it's a bug fix, and we are way past the 9.4 >> feature freeze deadline. In particular, packagers who've already done >> their 9.4 development work might be blindsided by us slipping this into >> 9.4 release. So while I wouldn't have a problem with putting this change >> into 9.4 from a technical standpoint, it's hard to argue that it'd meet >> project norms from a development-process standpoint. > While I'd love to reduce the number of future installations without > this fix in place, I respect the decision to honor project policy. At > the same time, this change does not break anything. It introduces new > environment variables which change the behaviour, but behaves the old > way in the absence of those variables. Uh, no, it doesn't. We removed the dependence on -DLINUX_OOM_SCORE_ADJ. If a packager is expecting that to still work in 9.4, he's going to be unpleasantly surprised, because the system will silently fail to do what he's expecting: it will run all the backend processes at no-OOM-kill priority, which is likely to be bad. It is possible to make packages that will work either way, along the lines of the advice I sent to the Red Hat guys: https://bugzilla.redhat.com/show_bug.cgi?id=1110969 but I think it's a bit late in the cycle to be telling packagers to do that for 9.4. > BTW, does the project publish the feature-freeze deadlines and other > dates somewhere (apart from on this list). It's usually in the dev meeting minutes https://wiki.postgresql.org/wiki/PgCon_2014_Developer_Meeting#9.5_Schedule regards, tom lane
On Mon, Jun 23, 2014 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Gurjeet Singh <gurjeet@singh.im> writes: >> While I'd love to reduce the number of future installations without >> this fix in place, I respect the decision to honor project policy. At >> the same time, this change does not break anything. It introduces new >> environment variables which change the behaviour, but behaves the old >> way in the absence of those variables. > > Uh, no, it doesn't. We removed the dependence on -DLINUX_OOM_SCORE_ADJ. > If a packager is expecting that to still work in 9.4, he's going to be > unpleasantly surprised, because the system will silently fail to do what > he's expecting: it will run all the backend processes at no-OOM-kill > priority, which is likely to be bad. True. I didn't think from a packager's perspective. > It is possible to make packages that will work either way, along the lines > of the advice I sent to the Red Hat guys: > https://bugzilla.redhat.com/show_bug.cgi?id=1110969 > > but I think it's a bit late in the cycle to be telling packagers to do > that for 9.4. Understandable. >> BTW, does the project publish the feature-freeze deadlines and other >> dates somewhere (apart from on this list). > > It's usually in the dev meeting minutes > https://wiki.postgresql.org/wiki/PgCon_2014_Developer_Meeting#9.5_Schedule Thanks. But it doesn't spell out the specific dates, or even the month of feature-freeze. Best regards, -- Gurjeet Singh http://gurjeet.singh.im/ EDB www.EnterpriseDB.com
Re: Tom Lane 2014-06-23 <17054.1403542164@sss.pgh.pa.us> > > While I'd love to reduce the number of future installations without > > this fix in place, I respect the decision to honor project policy. At > > the same time, this change does not break anything. It introduces new > > environment variables which change the behaviour, but behaves the old > > way in the absence of those variables. > > Uh, no, it doesn't. We removed the dependence on -DLINUX_OOM_SCORE_ADJ. > If a packager is expecting that to still work in 9.4, he's going to be > unpleasantly surprised, because the system will silently fail to do what > he's expecting: it will run all the backend processes at no-OOM-kill > priority, which is likely to be bad. Ok I'm late to the party, but the reason I'm still joining is we have proper unit tests which just told me the 9.5 packages have changed OOM handling. So it wouldn't just slip through if you changed that in 9.4 as well, but would get fixed. I have two comments on the patch: The choice to make the behavior depend first on PG_OOM_ADJUST_FILE and only secondly on PG_OOM_ADJUST_VALUE seems the wrong way round to me. On every modestly new kernel oom_score_adj is present, so PG_OOM_ADJUST_FILE should default to using it. On the other hand, what people really want to achieve (or tune) with this feature is to set the OOM adjustment back to 0 (or some other value), so to me, it would be much more natural to set PG_OOM_ADJUST_VALUE=0 to activate the feature, and only have to mess with PG_OOM_ADJUST_FILE if my kernel is older. (Which it isn't on any kernel supported by Debian and Ubuntu LTS.) The other bit is the non-deprecation of OOM_ADJ in contrib/start-scripts/linux. It took me a while to understand the logic of the variables used there - the interface would be much clearer if it just was like this: ... set default PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj ... and then this in the configurable part of the script: PG_MASTER_OOM_SCORE_ADJ=-1000 PG_OOM_SCORE_ADJ=0 # Older Linux kernels may not have /proc/self/oom_score_adj, but instead # /proc/self/oom_adj, which works similarly except the disable value is -17. # For such a system, uncomment these three lines instead. #PG_OOM_ADJUST_FILE=/proc/self/oom_adj #PG_MASTER_OOM_SCORE_ADJ=-17 #PG_OOM_SCORE_ADJ=0 ... and then use PG_OOM_ADJUST_FILE below instead of manually figuring out which proc file to write to by looking at OOM.*_ADJ. Christoph -- cb@df7cb.de | http://www.df7cb.de/
On Tue, Jul 1, 2014 at 8:22 AM, Christoph Berg <cb@df7cb.de> wrote: > Re: Tom Lane 2014-06-23 <17054.1403542164@sss.pgh.pa.us> >> > While I'd love to reduce the number of future installations without >> > this fix in place, I respect the decision to honor project policy. At >> > the same time, this change does not break anything. It introduces new >> > environment variables which change the behaviour, but behaves the old >> > way in the absence of those variables. >> >> Uh, no, it doesn't. We removed the dependence on -DLINUX_OOM_SCORE_ADJ. >> If a packager is expecting that to still work in 9.4, he's going to be >> unpleasantly surprised, because the system will silently fail to do what >> he's expecting: it will run all the backend processes at no-OOM-kill >> priority, which is likely to be bad. > > Ok I'm late to the party, but the reason I'm still joining is we have > proper unit tests which just told me the 9.5 packages have changed OOM > handling. So it wouldn't just slip through if you changed that in 9.4 > as well, but would get fixed. > > I have two comments on the patch: > > The choice to make the behavior depend first on PG_OOM_ADJUST_FILE and > only secondly on PG_OOM_ADJUST_VALUE seems the wrong way round to me. > On every modestly new kernel oom_score_adj is present, so > PG_OOM_ADJUST_FILE should default to using it. On the other hand, what > people really want to achieve (or tune) with this feature is to set > the OOM adjustment back to 0 (or some other value), so to me, it would > be much more natural to set PG_OOM_ADJUST_VALUE=0 to activate the > feature, and only have to mess with PG_OOM_ADJUST_FILE if my kernel is > older. (Which it isn't on any kernel supported by Debian and Ubuntu > LTS.) Of course, we have no guarantee that the Linux kernel guys won't change this again. Apparently "we don't break userspace" is a somewhat selectively-enforced principle. > The other bit is the non-deprecation of OOM_ADJ in > contrib/start-scripts/linux. It took me a while to understand the > logic of the variables used there - the interface would be much > clearer if it just was like this: > > ... set default PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj > > ... and then this in the configurable part of the script: > > PG_MASTER_OOM_SCORE_ADJ=-1000 > PG_OOM_SCORE_ADJ=0 > # Older Linux kernels may not have /proc/self/oom_score_adj, but instead > # /proc/self/oom_adj, which works similarly except the disable value is -17. > # For such a system, uncomment these three lines instead. > #PG_OOM_ADJUST_FILE=/proc/self/oom_adj > #PG_MASTER_OOM_SCORE_ADJ=-17 > #PG_OOM_SCORE_ADJ=0 > > ... and then use PG_OOM_ADJUST_FILE below instead of manually figuring > out which proc file to write to by looking at OOM.*_ADJ. I can't help but agree with this, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jul 1, 2014 at 8:22 AM, Christoph Berg <cb@df7cb.de> wrote: >> I have two comments on the patch: >> >> The choice to make the behavior depend first on PG_OOM_ADJUST_FILE and >> only secondly on PG_OOM_ADJUST_VALUE seems the wrong way round to me. > Of course, we have no guarantee that the Linux kernel guys won't > change this again. Apparently "we don't break userspace" is a > somewhat selectively-enforced principle. Yeah, I'm unexcited about this proposal. In any case, given the two existing APIs we have to deal with, allowing PG_OOM_ADJUST_VALUE to default to "0" is sane in both APIs but a default for the file name can work for only one. >> The other bit is the non-deprecation of OOM_ADJ in >> contrib/start-scripts/linux. It took me a while to understand the >> logic of the variables used there - the interface would be much >> clearer if it just was like this: >> ... and then use PG_OOM_ADJUST_FILE below instead of manually figuring >> out which proc file to write to by looking at OOM.*_ADJ. > I can't help but agree with this, though. Fair enough. I went for a minimum-change approach when hacking that script, but we could change it some more in the name of readability. Will do something about that. regards, tom lane
Re: Tom Lane 2014-07-01 <20654.1404247905@sss.pgh.pa.us> > Yeah, I'm unexcited about this proposal. In any case, given the two > existing APIs we have to deal with, allowing PG_OOM_ADJUST_VALUE to > default to "0" is sane in both APIs but a default for the file name > can work for only one. Nod. > Fair enough. I went for a minimum-change approach when hacking that > script, but we could change it some more in the name of readability. > Will do something about that. Thanks, it's much nicer now. There's one uglyness left though: The name PG_CHILD_OOM_SCORE_ADJ should match what actually gets passed to the backends, DAEMON_ENV="PG_OOM_ADJUST_FILE=$PG_OOM_ADJUST_FILE PG_OOM_ADJUST_VALUE=$PG_CHILD_OOM_SCORE_ADJ" would better be PG_OOM_ADJUST_VALUE=$PG_OOM_ADJUST_VALUE. (Possibly the smart way to fix this would be to change src/backend/postmaster/fork_process.c to use PG_CHILD_OOM_SCORE_ADJ instead.) Christoph -- cb@df7cb.de | http://www.df7cb.de/
On Tue, 1 Jul 2014 15:57:25 -0400 Robert Haas <robertmhaas@gmail.com> wrote: > Of course, we have no guarantee that the Linux kernel guys won't > change this again. Apparently "we don't break userspace" is a > somewhat selectively-enforced principle. It's selectively enforced in that kernel developers don't (and can't) scan the world for software that might break. OTOH, if somebody were to respond to a proposed change by saying "this breaks PostgreSQL," I would be amazed if the change were to be merged in that form. In other words, it's important to scream. There were concerns during the last round of messing with the OOM logic, but nobody pointed to something that actually broke. I predict that somebody will certainly try to rewrite the OOM code again in the future. The nature of that code is such that it is never going to work as well as people would like. But if application developers make it clear that they are dependent on the current interface working, kernel developers will be constrained to keep it working. jon
Christoph Berg <cb@df7cb.de> writes: > Re: Tom Lane 2014-07-01 <20654.1404247905@sss.pgh.pa.us> >> Fair enough. I went for a minimum-change approach when hacking that >> script, but we could change it some more in the name of readability. >> Will do something about that. > Thanks, it's much nicer now. There's one uglyness left though: The > name PG_CHILD_OOM_SCORE_ADJ should match what actually gets passed to > the backends, Meh ... I don't find that to be important. Making a parallel with PG_MASTER_OOM_SCORE_ADJ seemed more helpful here. regards, tom lane
On Wed, Jul 2, 2014 at 9:08 AM, Jonathan Corbet <corbet@lwn.net> wrote: > On Tue, 1 Jul 2014 15:57:25 -0400 > Robert Haas <robertmhaas@gmail.com> wrote: >> Of course, we have no guarantee that the Linux kernel guys won't >> change this again. Apparently "we don't break userspace" is a >> somewhat selectively-enforced principle. > > It's selectively enforced in that kernel developers don't (and can't) > scan the world for software that might break. OTOH, if somebody were > to respond to a proposed change by saying "this breaks PostgreSQL," I > would be amazed if the change were to be merged in that form. > > In other words, it's important to scream. There were concerns during > the last round of messing with the OOM logic, but nobody pointed to > something that actually broke. > > I predict that somebody will certainly try to rewrite the OOM code > again in the future. The nature of that code is such that it is never > going to work as well as people would like. But if application > developers make it clear that they are dependent on the current > interface working, kernel developers will be constrained to keep it > working. Well, of course the reality is that if you don't break some things, you can never change anything. And clearly we want Linux to be able to change things, to make them better. On the flip side, interface changes do cause some pain, and it's not realistic to expect every software project to monitor LKML. This one is not really a big deal, though; I think we just need to keep in mind, as you say, that it might change again. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company