Обсуждение: Re: DSA_ALLOC_NO_OOM doesn't work

Поиск
Список
Период
Сортировка

Re: DSA_ALLOC_NO_OOM doesn't work

От
Heikki Linnakangas
Дата:
(moving to pgsql-hackers)

On 29/01/2024 14:06, Heikki Linnakangas wrote:
> If you call dsa_allocate_extended(DSA_ALLOC_NO_OOM), it will still
> ereport an error if you run out of space (originally reported at [0]).
> 
> Attached patch adds code to test_dsa.c to demonstrate that:
> 
> postgres=# select test_dsa_basic();
> ERROR:  could not resize shared memory segment "/PostgreSQL.1312700148"
> to 1075843072 bytes: No space left on device
> 
> [0] https://github.com/pgvector/pgvector/issues/434#issuecomment-1912744489

I wrote the attached patch to address this, in a fairly straightforward 
or naive way. The problem was that even though dsa_allocate() had code 
to check the return value of dsm_create(), and return NULL instead of 
ereport(ERROR) if the DSA_ALLOC_NO_OOM was set, dsm_create() does not in 
fact return NULL on OOM. To fix, I added a DSM_CREATE_NO_OOM option to 
dsm_create(), and I also had to punch that through to dsm_impl_op().

This is a little unpolished, but if we want to backpatch something 
narrow now, this would probably be the right approach.

However, I must say that the dsm_impl_op() interface is absolutely 
insane. It's like someone looked at ioctl() and thought, "hey that's a 
great idea!". It mixes all different operations like creating or 
destroying a DSM segment together into one function call, and the return 
value is just a boolean, even though the function could fail for many 
different reasons, and the callers do in fact care about the reason. In 
a more natural interface, the different operations would have very 
different function signatures.

I think we must refactor that. It might be best to leave this 
DSA_ALLOC_NO_OOM bug unfixed in backpatches, and fix it on top of the 
refactorings on master only. Later, we can backpatch the refactorings 
too if we're comfortable with it; extensions shouldn't be using the 
dsm_impl_op() interface directly.

(I skimmed through the thread where the DSM code was added, but didn't 
see any mention of why dsm_impl_op() signature is like that: 

https://www.postgresql.org/message-id/CA%2BTgmoaDqDUgt%3D4Zs_QPOnBt%3DEstEaVNP%2B5t%2Bm%3DFPNWshiPR3A%40mail.gmail.com)

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Вложения

Re: DSA_ALLOC_NO_OOM doesn't work

От
Thomas Munro
Дата:
On Wed, Feb 14, 2024 at 3:23 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 29/01/2024 14:06, Heikki Linnakangas wrote:
> > If you call dsa_allocate_extended(DSA_ALLOC_NO_OOM), it will still
> > ereport an error if you run out of space (originally reported at [0]).
> >
> > Attached patch adds code to test_dsa.c to demonstrate that:
> >
> > postgres=# select test_dsa_basic();
> > ERROR:  could not resize shared memory segment "/PostgreSQL.1312700148"
> > to 1075843072 bytes: No space left on device
> >
> > [0] https://github.com/pgvector/pgvector/issues/434#issuecomment-1912744489

Right, DSA_ALLOC_NO_OOM handles the case where there aren't any more
DSM slots (which used to be more common before we ramped up some
constants) and the case where max_total_segment_size (as self-imposed
limit) would be exceeded, but there is nothing to deal with failure to
allocate at the DSM level, and yeah that just isn't a thing it can do.
Not surprisingly that this observation comes from a Docker user: its
64MB default size limit on the /dev/shm mountpoint breaks parallel
query as discussed on list a few times (see also
https://github.com/docker-library/postgres/issues/416).

This is my mistake, introduced in commit 16be2fd10019 where I failed
to pass that down into DSM code.  The only user of DSA_ALLOC_NO_OOM in
core code so far is in dshash.c, where we just re-throw after some
cleanup, commit 4569715b, so you could leak that control object due to
this phenomenon.

> I wrote the attached patch to address this, in a fairly straightforward
> or naive way. The problem was that even though dsa_allocate() had code
> to check the return value of dsm_create(), and return NULL instead of
> ereport(ERROR) if the DSA_ALLOC_NO_OOM was set, dsm_create() does not in
> fact return NULL on OOM. To fix, I added a DSM_CREATE_NO_OOM option to
> dsm_create(), and I also had to punch that through to dsm_impl_op().

Yeah, makes total sense.

> This is a little unpolished, but if we want to backpatch something
> narrow now, this would probably be the right approach.
>
> However, I must say that the dsm_impl_op() interface is absolutely
> insane. It's like someone looked at ioctl() and thought, "hey that's a
> great idea!". It mixes all different operations like creating or
> destroying a DSM segment together into one function call, and the return
> value is just a boolean, even though the function could fail for many
> different reasons, and the callers do in fact care about the reason. In
> a more natural interface, the different operations would have very
> different function signatures.

Yeah.  It also manages to channel some of shmat() et al's negative beauty.

Anyway, that leads to this treatment of errnos in your patch:

+            errno = 0;
             if (dsm_impl_op(DSM_OP_CREATE, seg->handle, size,
&seg->impl_private,
-                            &seg->mapped_address, &seg->mapped_size, ERROR))
+                            &seg->mapped_address, &seg->mapped_size,
ERROR, impl_flags))
                 break;
+            if ((flags & DSM_CREATE_NO_OOM) != 0 && (errno == ENOMEM
|| errno == ENOSPC))

... which seems reasonable given the constraints.  Another idea might
be to write something different in one of those output parameters to
distinguish out-of-memory, but nothing really seems to fit...

> I think we must refactor that. It might be best to leave this
> DSA_ALLOC_NO_OOM bug unfixed in backpatches, and fix it on top of the
> refactorings on master only. Later, we can backpatch the refactorings
> too if we're comfortable with it; extensions shouldn't be using the
> dsm_impl_op() interface directly.

Yeah, that sounds true.  We don't do a good job of nailing down the
public API of PostgreSQL but dsm_impl_op() is a rare case that is very
obviously not intended to be called by anyone else.  On the other
hand, if we really want to avoid changing the function prototype on
principle, perhaps we could make a new operation DSM_OP_CREATE_NO_OOM
instead?



Re: DSA_ALLOC_NO_OOM doesn't work

От
Robert Haas
Дата:
On Tue, Feb 13, 2024 at 7:53 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> However, I must say that the dsm_impl_op() interface is absolutely
> insane. It's like someone looked at ioctl() and thought, "hey that's a
> great idea!".

As the person who wrote that code, this made me laugh.

I agree it's not the prettiest interface, but I thought that was OK
considering that it should only ever have a very limited number of
callers. I believe I did it this way in the interest of code
compactness. Since there are four DSM implementations, I wanted the
implementation-specific code to be short and all in one place, and
jamming it all into one function served that purpose. Also, there's a
bunch of logic that is shared by multiple operations - detach and
destroy tend to be similar, and so do create and attach, and there are
even things that are shared across all operations, like the snprintf
at the top of dsm_impl_posix() or the slightly larger amount of
boilerplate at the top of dsm_impl_sysv().

I'm not particularly opposed to refactoring this to make it nicer, but
my judgement was that splitting it up into one function per operation
per implementation, say, would have involved a lot of duplication of
small bits of code that might then get out of sync with each other
over time. By doing it this way, the logic is a bit tangled -- or
maybe more than a bit -- but there's very little duplication because
each implementation gets jammed into the smallest possible box. I'm OK
with somebody deciding that I got the trade-offs wrong there, but I
will be interested to see the number of lines added vs. removed in any
future refactoring patch.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: DSA_ALLOC_NO_OOM doesn't work

От
Heikki Linnakangas
Дата:
On 14/02/2024 09:23, Robert Haas wrote:
> On Tue, Feb 13, 2024 at 7:53 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> However, I must say that the dsm_impl_op() interface is absolutely
>> insane. It's like someone looked at ioctl() and thought, "hey that's a
>> great idea!".
> 
> As the person who wrote that code, this made me laugh.
> 
> I agree it's not the prettiest interface, but I thought that was OK
> considering that it should only ever have a very limited number of
> callers. I believe I did it this way in the interest of code
> compactness. Since there are four DSM implementations, I wanted the
> implementation-specific code to be short and all in one place, and
> jamming it all into one function served that purpose. Also, there's a
> bunch of logic that is shared by multiple operations - detach and
> destroy tend to be similar, and so do create and attach, and there are
> even things that are shared across all operations, like the snprintf
> at the top of dsm_impl_posix() or the slightly larger amount of
> boilerplate at the top of dsm_impl_sysv().
> 
> I'm not particularly opposed to refactoring this to make it nicer, but
> my judgement was that splitting it up into one function per operation
> per implementation, say, would have involved a lot of duplication of
> small bits of code that might then get out of sync with each other
> over time. By doing it this way, the logic is a bit tangled -- or
> maybe more than a bit -- but there's very little duplication because
> each implementation gets jammed into the smallest possible box. I'm OK
> with somebody deciding that I got the trade-offs wrong there, but I
> will be interested to see the number of lines added vs. removed in any
> future refactoring patch.

That's fair, I can see those reasons. Nevertheless, I do think it was a 
bad tradeoff. A little bit of repetition would be better here, or we can 
extract the common parts to smaller functions.

I came up with the attached:

   25 files changed, 1710 insertions(+), 1113 deletions(-)

So yeah, it's more code, and there's some repetition, but I think this 
is more readable. Some of that is extra boilerplate because I split the 
implementations to separate files, and I also added tests.

I'm not 100% wedded on all of the decisions here, but I think this is 
the right direction overall. For example, I decided to separate 
dsm_handle used by the high-level interface in dsm.c and the 
dsm_impl_handle used by the low-level interace in dsm_impl_*.c (more on 
that below). That feels slightly better to me, but could be left out if 
we don't want that.

Notable changes:

- Split the single multiplexed dsm_impl_op() function into multiple 
functions for different operations. This allows more natural function 
signatures for the different operations.

- The create() function is now responsible for generating the handle, 
instead of having the caller generate it. Those implementations that 
need to generate a random handle and retry if it's already in use, now 
do that retry within the implementation.

- The destroy() function no longer detaches the segment; you must call 
detach() first if the segment is still attached. This avoids having to 
pass "junk" values when destroying a segment that's not attached, and in 
case of error, makes it more clear what failed.

- Separate dsm_handle, used by backend code to interact with the high 
level interface in dsm.c, from dsm_impl_handle, which is used to 
interact with the low-level functions in dsm_impl.c. This gets rid of 
the convention in dsm.c of reserving odd numbers for DSM segments stored 
in the main shmem area. There is now an explicit flag for that the 
control slot. For generating dsm_handles, we now use the same scheme we 
used to use for main-area shm segments for all DSM segments, which 
includes the slot number in the dsm_handle. The implementations use 
their own mechanisms for generating the low-level dsm_impl_handles (all 
but the SysV implementation generate a random handle and retry on 
collision).

- Use IPC_PRIVATE in the SysV implementation to have the OS create a 
unique identifier for us. Use the shmid directly as the (low-level) 
handle, so that we don't need to use shmget() to convert a key to shmid, 
and don't need the "cache" for that.

- create() no longer returns the mapped_size. The old Windows 
implementation had some code to read the actual mapped size after 
creating the mapping, and returned that in *mapped_size. Others just 
returned the requested size. In principle reading the actual size might 
be useful; the caller might be able to make use of the whole mapped size 
when it's larger than requested. In practice, the callers didn't do 
that. Also, POSIX shmem on FreeBSD has similar round-up-to-page-size 
behavior but the implementation did not query the actual mapped size 
after creating the segment, so you could not rely on it.

- Added a test that exercises basic create, detach, attach functionality 
using all the different implementations supported on the current platform.

- Change datatype of the opaque types in dsm_impl.c from "void *" to 
typedefs over uintptr_t. It's easy to make mistakes with "void *", as 
you can pass any pointer without getting warnings from the compiler. 
Dedicated typedefs give a bit more type checking. (This is in the first 
commit, all the other changes are bundled together in the second commit.)

Overall, I don't think this is backpatchable. The handle changes and use 
of IPC_PRIVATE in particular: they could lead to failure to clean up old 
segments if you upgraded the binary without a clean shutdown. A slightly 
different version of this possibly would be, but I'd like to focus on 
what's best for master for now.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Вложения

Re: DSA_ALLOC_NO_OOM doesn't work

От
Thomas Munro
Дата:
On Thu, Feb 22, 2024 at 8:19 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> - Separate dsm_handle, used by backend code to interact with the high
> level interface in dsm.c, from dsm_impl_handle, which is used to
> interact with the low-level functions in dsm_impl.c. This gets rid of
> the convention in dsm.c of reserving odd numbers for DSM segments stored
> in the main shmem area. There is now an explicit flag for that the
> control slot. For generating dsm_handles, we now use the same scheme we
> used to use for main-area shm segments for all DSM segments, which
> includes the slot number in the dsm_handle. The implementations use
> their own mechanisms for generating the low-level dsm_impl_handles (all
> but the SysV implementation generate a random handle and retry on
> collision).

Could we use slot number and generation number, instead of slot number
and random number?  I have never liked the random number thing, which
IIUC was needed because of SysV key space management problems 'leaking'
up to the handle level (yuck).  With generations, you could keep
collisions arbitrarily far apart (just decide how many bits to use).
Collisions aren't exactly likely, but if there is no need for that
approach, I'm not sure why we'd keep it.  (I remember dealing with
actual collisions in the wild due to lack of PRNG seeding in workers,
which admittedly should be vanishingly rare now).

If the slot number is encoded into the handle, why do we still need a
linear search for the slot?

> - create() no longer returns the mapped_size. The old Windows
> implementation had some code to read the actual mapped size after
> creating the mapping, and returned that in *mapped_size. Others just
> returned the requested size. In principle reading the actual size might
> be useful; the caller might be able to make use of the whole mapped size
> when it's larger than requested. In practice, the callers didn't do
> that. Also, POSIX shmem on FreeBSD has similar round-up-to-page-size
> behavior but the implementation did not query the actual mapped size
> after creating the segment, so you could not rely on it.

I think that is an interesting issue with the main shmem area.  There,
we can set huge_page_size to fantastically large sizes up to 16GB on
some architectures, but we have nothing to make sure we don't waste
some or most of the final page.  But I agree that there's not much
point in worrying about this for DSM.

> - Added a test that exercises basic create, detach, attach functionality
> using all the different implementations supported on the current platform.

I wonder how we could test the cleanup-after-crash behaviour.



Re: DSA_ALLOC_NO_OOM doesn't work

От
Thomas Munro
Дата:
On Thu, Feb 22, 2024 at 10:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> collisions arbitrarily far apart (just decide how many bits to use).

. o O ( Perhaps if you also allocated slots using a FIFO freelist,
instead of the current linear search for the first free slot, you
could maximise the time before a slot is reused, improving the
collision-avoiding power of a generation scheme? )



Re: DSA_ALLOC_NO_OOM doesn't work

От
Heikki Linnakangas
Дата:
On 22/02/2024 01:03, Thomas Munro wrote:
> On Thu, Feb 22, 2024 at 10:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> collisions arbitrarily far apart (just decide how many bits to use).
> 
> . o O ( Perhaps if you also allocated slots using a FIFO freelist,
> instead of the current linear search for the first free slot, you
> could maximise the time before a slot is reused, improving the
> collision-avoiding power of a generation scheme? )

We could also enlarge dsm_handle from 32-bits to 64-bits, if we're 
worried about collisions.

I actually experimented with something like that too: I encoded the "is 
this in main region" in one of the high bits and let the implementation 
use the low bits. One small issue with that is that we have a few places 
that pass a DSM handle as the 'bgw_main' argument when launching a 
worker process, and on 32-bit platforms that would not be wide enough. 
Those could be changed to use the wider 'bgw_extra' field instead, though.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: DSA_ALLOC_NO_OOM doesn't work

От
Robert Haas
Дата:
On Thu, Feb 22, 2024 at 12:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> That's fair, I can see those reasons. Nevertheless, I do think it was a
> bad tradeoff. A little bit of repetition would be better here, or we can
> extract the common parts to smaller functions.
>
> I came up with the attached:
>
>    25 files changed, 1710 insertions(+), 1113 deletions(-)
>
> So yeah, it's more code, and there's some repetition, but I think this
> is more readable. Some of that is extra boilerplate because I split the
> implementations to separate files, and I also added tests.

Adding tests is great. I'm unenthusiastic about the rest, but I don't
really care enough to argue.

What's the goal here, anyway? I mean, why bother?

--
Robert Haas
EDB: http://www.enterprisedb.com