Re: Vectored I/O in bulk_write.c

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: Vectored I/O in bulk_write.c
Дата
Msg-id 71aae2e2-d99b-4f98-b4f9-186a0ed3309a@iki.fi
обсуждение исходный текст
Ответ на Re: Vectored I/O in bulk_write.c  (Thomas Munro <thomas.munro@gmail.com>)
Ответы Re: Vectored I/O in bulk_write.c  (Thomas Munro <thomas.munro@gmail.com>)
Список pgsql-hackers
(Replying to all your messages in this thread together)

> This made me wonder why smgrwritev() and smgrextendv() shouldn't be
> backed by the same implementation, since they are essentially the same
> operation.

+1 to the general idea of merging the write and extend functions.

> The differences are some assertions which might as well be
> moved up to the smgr.c level as they must surely apply to any future
> smgr implementation too, right?, and the segment file creation policy
> which can be controlled with a new argument.  I tried that here.

Currently, smgrwrite/smgrextend just calls through to mdwrite/mdextend. 
I'd like to keep it that way. Otherwise we need to guess what a 
hypothetical smgr implementation might look like.

For example this assertion:

>     /* This assert is too expensive to have on normally ... */
> #ifdef CHECK_WRITE_VS_EXTEND
>     Assert(blocknum >= mdnblocks(reln, forknum));
> #endif

I think that should continue to be md.c's internal affair. For example, 
imagine that you had a syscall like write() but which always writes to 
the end of the file and also returns the offset that the data was 
written to. You would want to assert the returned offset instead of the 
above.

So -1 on moving up the assertions to smgr.c.

Let's bite the bullet and merge the smgrwrite and smgrextend functions 
at the smgr level too. I propose the following signature:

#define SWF_SKIP_FSYNC        0x01
#define SWF_EXTEND        0x02
#define SWF_ZERO        0x04

void smgrwritev(SMgrRelation reln, ForkNumber forknum,
        BlockNumber blocknum,
        const void **buffer, BlockNumber nblocks,
        int flags);

This would replace smgwrite, smgrextend, and smgrzeroextend. The 
mdwritev() function would have the same signature. A single 'flags' arg 
looks better in the callers than booleans, because you don't need to 
remember what 'true' or 'false' means. The callers would look like this:

/* like old smgrwrite(reln, MAIN_FORKNUM, 123, buf, false) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1, 0);

/* like old smgrwrite(reln, MAIN_FORKNUM, 123, buf, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1, SWF_SKIP_FSYNC);

/* like old smgrextend(reln, MAIN_FORKNUM, 123, buf, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1,
            SWF_EXTEND | SWF_SKIP_FSYNC);

/* like old smgrzeroextend(reln, MAIN_FORKNUM, 123, 10, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, NULL, 10,
            SWF_EXTEND | SWF_ZERO | SWF_SKIP_FSYNC);

> While thinking about that I realised that an existing write-or-extend
> assertion in master is wrong because it doesn't add nblocks.
> 
> Hmm, it's a bit weird that we have nblocks as int or BlockNumber in
> various places, which I think should probably be fixed.

+1

> Here also is a first attempt at improving the memory allocation and
> memory layout.
> ...
> +typedef union BufferSlot
> +{
> +    PGIOAlignedBlock buffer;
> +    dlist_node    freelist_node;
> +}            BufferSlot;
> +

If you allocated the buffers in one large contiguous chunk, you could 
often do one large write() instead of a gathered writev() of multiple 
blocks. That should be even better, although I don't know much of a 
difference it makes. The above layout wastes a fair amount memory too, 
because 'buffer' is I/O aligned.

> I wonder if bulk logging should trigger larger WAL writes in the "Have
> to write it ourselves" case in AdvanceXLInsertBuffer(), since writing
> 8kB of WAL at a time seems like an unnecessarily degraded level of
> performance, especially with wal_sync_method=open_datasync.  Of course
> the real answer is "make sure wal_buffers is high enough for your
> workload" (usually indirectly by automatically scaling from
> shared_buffers), but this problem jumps out when tracing bulk_writes.c
> with default settings.  We write out the index 128kB at a time, but
> the WAL 8kB at a time.

Makes sense.

On 12/03/2024 23:38, Thomas Munro wrote:
> One more observation while I'm thinking about bulk_write.c... hmm, it
> writes the data out and asks the checkpointer to fsync it, but doesn't
> call smgrwriteback().  I assume that means that on Linux the physical
> writeback sometimes won't happen until the checkpointer eventually
> calls fsync() sequentially, one segment file at a time.  I see that
> it's difficult to decide how to do that though; unlike checkpoints,
> which have rate control/spreading, bulk writes could more easily flood
> the I/O subsystem in a burst.  I expect most non-Linux systems'
> write-behind heuristics to fire up for bulk sequential writes, but on
> Linux where most users live, there is no write-behind heuristic AFAIK
> (on the most common file systems anyway), so you have to crank that
> handle if you want it to wake up and smell the coffee before it hits
> internal limits, but then you have to decide how fast to crank it.
> 
> This problem will come into closer focus when we start talking about
> streaming writes.  For the current non-streaming bulk_write.c coding,
> I don't have any particular idea of what to do about that, so I'm just
> noting the observation here.

It would be straightforward to call smgrwriteback() from sgmr_bulk_flush 
every 1 GB of written data for example. It would be nice to do it in the 
background though, and not stall the bulk writing for it. With the AIO 
patches, I presume we could easily start an asynchronous writeback and 
not wait for it to finish.

> Sorry for the sudden wall of text/monologue; this is all a sort of
> reaction to bulk_write.c that I should perhaps have sent to the
> bulk_write.c thread, triggered by a couple of debugging sessions.

Thanks for looking! This all makes sense. The idea of introducing the 
bulk write interface was that now we have a natural place to put all 
these heuristics and optimizations in. That seems to be a success, 
AFAICS none of the things discussed here will change the bulk_write API, 
just the implementation.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: session username in default psql prompt?
Следующее
От: Dean Rasheed
Дата:
Сообщение: Re: MERGE ... RETURNING