Here also is a first attempt at improving the memory allocation and
memory layout.
I wonder if bulk logging should trigger larger WAL writes in the "Have
to write it ourselves" case in AdvanceXLInsertBuffer(), since writing
8kB of WAL at a time seems like an unnecessarily degraded level of
performance, especially with wal_sync_method=open_datasync. Of course
the real answer is "make sure wal_buffers is high enough for your
workload" (usually indirectly by automatically scaling from
shared_buffers), but this problem jumps out when tracing bulk_writes.c
with default settings. We write out the index 128kB at a time, but
the WAL 8kB at a time.