Обсуждение: Getting rid of some more lseek() calls

Поиск
Список
Период
Сортировка

Getting rid of some more lseek() calls

От
Thomas Munro
Дата:
Hello,

Continuing the work done in commits 0dc8ead4 and c24dcd0c, here are a
few more places where we could throw away some code by switching to
pg_pread() and pg_pwrite().

Вложения

Re: Getting rid of some more lseek() calls

От
Andres Freund
Дата:
Hi,

On 2020-02-07 12:38:27 +1300, Thomas Munro wrote:
> Continuing the work done in commits 0dc8ead4 and c24dcd0c, here are a
> few more places where we could throw away some code by switching to
> pg_pread() and pg_pwrite().

Nice.



> From 5723976510f30385385628758d7118042c4e4bf6 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Fri, 7 Feb 2020 12:04:43 +1300
> Subject: [PATCH 1/3] Use pg_pread() and pg_pwrite() in slru.c.
> 
> This avoids lseek() system calls at every SLRU I/O, as was
> done for relation files in commit c24dcd0c.
> ---
>  src/backend/access/transam/slru.c | 25 ++++---------------------
>  1 file changed, 4 insertions(+), 21 deletions(-)
> 
> diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
> index d5b7a08f73..f9efb22311 100644
> --- a/src/backend/access/transam/slru.c
> +++ b/src/backend/access/transam/slru.c
> @@ -646,7 +646,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
>      SlruShared    shared = ctl->shared;
>      int            segno = pageno / SLRU_PAGES_PER_SEGMENT;
>      int            rpageno = pageno % SLRU_PAGES_PER_SEGMENT;
> -    int            offset = rpageno * BLCKSZ;
> +    off_t        offset = rpageno * BLCKSZ;
>      char        path[MAXPGPATH];
>      int            fd;
>  
> @@ -676,17 +676,9 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
>          return true;
>      }
>  
> -    if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
> -    {
> -        slru_errcause = SLRU_SEEK_FAILED;
> -        slru_errno = errno;
> -        CloseTransientFile(fd);
> -        return false;
> -    }
> -
>      errno = 0;
>      pgstat_report_wait_start(WAIT_EVENT_SLRU_READ);
> -    if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
> +    if (pg_pread(fd, shared->page_buffer[slotno], BLCKSZ, offset) != BLCKSZ)
>      {
>          pgstat_report_wait_end();
>          slru_errcause = SLRU_READ_FAILED;
> @@ -726,7 +718,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
>      SlruShared    shared = ctl->shared;
>      int            segno = pageno / SLRU_PAGES_PER_SEGMENT;
>      int            rpageno = pageno % SLRU_PAGES_PER_SEGMENT;
> -    int            offset = rpageno * BLCKSZ;
> +    off_t        offset = rpageno * BLCKSZ;
>      char        path[MAXPGPATH];
>      int            fd = -1;
>  
> @@ -836,18 +828,9 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
>          }
>      }
>  
> -    if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
> -    {
> -        slru_errcause = SLRU_SEEK_FAILED;
> -        slru_errno = errno;
> -        if (!fdata)
> -            CloseTransientFile(fd);
> -        return false;
> -    }
> -
>      errno = 0;
>      pgstat_report_wait_start(WAIT_EVENT_SLRU_WRITE);
> -    if (write(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
> +    if (pg_pwrite(fd, shared->page_buffer[slotno], BLCKSZ, offset) != BLCKSZ)
>      {
>          pgstat_report_wait_end();
>          /* if write didn't set errno, assume problem is no disk space */
> -- 
> 2.23.0

Hm. This still leaves us with one source of SLRU_SEEK_FAILED. And that's
really just for getting the file size. Should we rename that?

Perhaps we should just replace lseek(SEEK_END) with fstat()? Or at least
one wrapper function for getting the size? It seems ugly to change fd
positions just for the purpose of getting file sizes (and also implies
more kernel level locking, I believe).


> From 95d7187172f2ac6c08dc92f1043e1662b0dab4ac Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Fri, 7 Feb 2020 12:04:57 +1300
> Subject: [PATCH 2/3] Use pg_pwrite() in rewriteheap.c.
> 
> This removes an lseek() call.
> ---
>  src/backend/access/heap/rewriteheap.c | 9 +--------
>  1 file changed, 1 insertion(+), 8 deletions(-)
> 
> diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
> index 5869922ff8..9c29bc0e0f 100644
> --- a/src/backend/access/heap/rewriteheap.c
> +++ b/src/backend/access/heap/rewriteheap.c
> @@ -1156,13 +1156,6 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
>                          path, (uint32) xlrec->offset)));
>      pgstat_report_wait_end();
>  
> -    /* now seek to the position we want to write our data to */
> -    if (lseek(fd, xlrec->offset, SEEK_SET) != xlrec->offset)
> -        ereport(ERROR,
> -                (errcode_for_file_access(),
> -                 errmsg("could not seek to end of file \"%s\": %m",
> -                        path)));
> -
>      data = XLogRecGetData(r) + sizeof(*xlrec);
>  
>      len = xlrec->num_mappings * sizeof(LogicalRewriteMappingData);
> @@ -1170,7 +1163,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
>      /* write out tail end of mapping file (again) */
>      errno = 0;
>      pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_WRITE);
> -    if (write(fd, data, len) != len)
> +    if (pg_pwrite(fd, data, len, xlrec->offset) != len)
>      {
>          /* if write didn't set errno, assume problem is no disk space */
>          if (errno == 0)

lgtm.


> From da6d712eeef2e3257d7fa672d95f2901bbe62887 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Fri, 7 Feb 2020 12:05:12 +1300
> Subject: [PATCH 3/3] Use pg_pwrite() in walreceiver.c.
> 
> This gets rid of an lseek() call.  While there was code to avoid
> it in most cases, it's better to lose the call AND the global state
> and code required to avoid it.
> ---
>  src/backend/replication/walreceiver.c | 28 +++------------------------
>  1 file changed, 3 insertions(+), 25 deletions(-)
> 
> diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
> index a5e85d32f3..2ab15c3cbb 100644
> --- a/src/backend/replication/walreceiver.c
> +++ b/src/backend/replication/walreceiver.c
> @@ -85,14 +85,13 @@ WalReceiverFunctionsType *WalReceiverFunctions = NULL;
>  #define NAPTIME_PER_CYCLE 100    /* max sleep time between cycles (100ms) */
>  
>  /*
> - * These variables are used similarly to openLogFile/SegNo/Off,
> + * These variables are used similarly to openLogFile/SegNo,
>   * but for walreceiver to write the XLOG. recvFileTLI is the TimeLineID
>   * corresponding the filename of recvFile.
>   */
>  static int    recvFile = -1;
>  static TimeLineID recvFileTLI = 0;
>  static XLogSegNo recvSegNo = 0;
> -static uint32 recvOff = 0;
>  
>  /*
>   * Flags set by interrupt handlers of walreceiver for later service in the
> @@ -945,7 +944,6 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
>              use_existent = true;
>              recvFile = XLogFileInit(recvSegNo, &use_existent, true);
>              recvFileTLI = ThisTimeLineID;
> -            recvOff = 0;
>          }
>  
>          /* Calculate the start offset of the received logs */
> @@ -956,29 +954,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
>          else
>              segbytes = nbytes;
>  
> -        /* Need to seek in the file? */
> -        if (recvOff != startoff)
> -        {
> -            if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
> -            {
> -                char        xlogfname[MAXFNAMELEN];
> -                int            save_errno = errno;
> -
> -                XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
> -                errno = save_errno;
> -                ereport(PANIC,
> -                        (errcode_for_file_access(),
> -                         errmsg("could not seek in log segment %s to offset %u: %m",
> -                                xlogfname, startoff)));
> -            }
> -
> -            recvOff = startoff;
> -        }
> -
>          /* OK to write the logs */
>          errno = 0;
>  
> -        byteswritten = write(recvFile, buf, segbytes);
> +        byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
>          if (byteswritten <= 0)
>          {
>              char        xlogfname[MAXFNAMELEN];
> @@ -995,13 +974,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
>                      (errcode_for_file_access(),
>                       errmsg("could not write to log segment %s "
>                              "at offset %u, length %lu: %m",
> -                            xlogfname, recvOff, (unsigned long) segbytes)));
> +                            xlogfname, startoff, (unsigned long) segbytes)));
>          }
>  
>          /* Update state for write */
>          recptr += byteswritten;
>  
> -        recvOff += byteswritten;
>          nbytes -= byteswritten;
>          buf += byteswritten;

lgtm.


There's still a few more lseek(SEEK_SET) calls in the backend after this
(walsender, miscinit, pg_stat_statements). It'd imo make sense to just
try to get rid of all of them in one series this time round?

Greetings,

Andres Freund



Re: Getting rid of some more lseek() calls

От
Thomas Munro
Дата:
On Fri, Feb 7, 2020 at 1:37 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. This still leaves us with one source of SLRU_SEEK_FAILED. And that's
> really just for getting the file size. Should we rename that?
>
> Perhaps we should just replace lseek(SEEK_END) with fstat()? Or at least
> one wrapper function for getting the size? It seems ugly to change fd
> positions just for the purpose of getting file sizes (and also implies
> more kernel level locking, I believe).

lseek(SEEK_END) seems to be nearly twice as fast as fstat() if you
just call it in a big loop, on Linux and FreeBSD (though I didn't
investigate exactly why, mitigations etc, it certainly returns more
stuff so there's that).  I don't think that's a problem here (I mean,
we open and close the file every time so we can't be too concerned
about the overheads), so I'm in favour of creating a pg_fstat_size(fd)
function on aesthetic grounds.  Here's a patch like that; better names
welcome.

For the main offender, namely md.c via fd.c's FileSize(), I'd hold off
on changing that until we figure out how to cache the sizes[1].

> There's still a few more lseek(SEEK_SET) calls in the backend after this
> (walsender, miscinit, pg_stat_statements). It'd imo make sense to just
> try to get rid of all of them in one series this time round?

Ok, I pushed changes for all the cases discussed except slru.c and
walsender.c, which depend on the bikeshed colour discussion about
whether we want pg_fstat_size().  See attached.

[1]
https://www.postgresql.org/message-id/flat/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com

Вложения

Re: Getting rid of some more lseek() calls

От
Michael Paquier
Дата:
On Tue, Feb 11, 2020 at 06:04:09PM +1300, Thomas Munro wrote:
> lseek(SEEK_END) seems to be nearly twice as fast as fstat() if you
> just call it in a big loop, on Linux and FreeBSD (though I didn't
> investigate exactly why, mitigations etc, it certainly returns more
> stuff so there's that).

Interesting.  What of Windows?  We've had for some time now problem
with fetching the size of files larger than 4GB (COPY, dumps..).  I am
wondering if we could not take advantage of that for those cases:
https://www.postgresql.org/message-id/15858-9572469fd3b73263@postgresql.org
--
Michael

Вложения

Re: Getting rid of some more lseek() calls

От
Thomas Munro
Дата:
On Wed, Feb 12, 2020 at 6:42 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Feb 11, 2020 at 06:04:09PM +1300, Thomas Munro wrote:
> > lseek(SEEK_END) seems to be nearly twice as fast as fstat() if you
> > just call it in a big loop, on Linux and FreeBSD (though I didn't
> > investigate exactly why, mitigations etc, it certainly returns more
> > stuff so there's that).
>
> Interesting.  What of Windows?  We've had for some time now problem
> with fetching the size of files larger than 4GB (COPY, dumps..).  I am
> wondering if we could not take advantage of that for those cases:
> https://www.postgresql.org/message-id/15858-9572469fd3b73263@postgresql.org

Hmm.  Well, on Unix we have to choose between "tell me the size but
also change the position that I either don't care about or have to
undo", and "tell me the size but also tell me all this other stuff I
don't care about".  Since Windows apparently has GetFileSizeEx(), why
not use that when that's exactly what you want?  It apparently
understands large files.

https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfilesizeex



Re: Getting rid of some more lseek() calls

От
Alvaro Herrera
Дата:
On 2020-Feb-12, Thomas Munro wrote:

> Hmm.  Well, on Unix we have to choose between "tell me the size but
> also change the position that I either don't care about or have to
> undo", and "tell me the size but also tell me all this other stuff I
> don't care about".  Since Windows apparently has GetFileSizeEx(), why
> not use that when that's exactly what you want?  It apparently
> understands large files.

I was already thinking that it might be better to make the new function
just "tell me the file size" without leaking the details of *how* we do
it, before reading about this Windows call.  That reinforces it, IMO.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Getting rid of some more lseek() calls

От
Thomas Munro
Дата:
On Thu, Feb 13, 2020 at 5:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Feb-12, Thomas Munro wrote:
> > Hmm.  Well, on Unix we have to choose between "tell me the size but
> > also change the position that I either don't care about or have to
> > undo", and "tell me the size but also tell me all this other stuff I
> > don't care about".  Since Windows apparently has GetFileSizeEx(), why
> > not use that when that's exactly what you want?  It apparently
> > understands large files.
>
> I was already thinking that it might be better to make the new function
> just "tell me the file size" without leaking the details of *how* we do
> it, before reading about this Windows call.  That reinforces it, IMO.

Ok, how about this?

Вложения

Re: Getting rid of some more lseek() calls

От
Michael Paquier
Дата:
On Thu, Feb 13, 2020 at 02:51:44PM +1300, Thomas Munro wrote:
> Ok, how about this?

Alvaro's point sounds sensible to me.  I like the approach you are
taking in 0001.  At least it avoids more issues with WIN32 and stat()
(I hope to work on that at some point, we'll see..).

+/*
+ * pg_file_size --- return the size of a file
+ */
+int64
+pg_file_size(int fd)
+{
This routine has nothing really dependent on the backend.  Would it
make sense to put it in a different place where it can be used by the
frontend?  The function should include at least a comment about why we
have a special path for Windows, aka not falling into the trap of the
4GB limit for stat().

The commit message of 0001 mentions pg_read(), and that should be
pg_pread().

There are two combinations of lseek/read that could be replaced: one
in pg_receivewal.c:FindStreamingStart(), and one in
SimpleXLogPageRead() for parsexlog.c as of pg_rewind.

Patch 0002 looks good to me.  This actually removes a confusion when
failing to seek the end of the file as the offset referenced to would
be 0.  Patch 0003 is also a very good thing.
--
Michael

Вложения

Re: Getting rid of some more lseek() calls

От
Alvaro Herrera
Дата:
On 2020-Feb-13, Thomas Munro wrote:

> On Thu, Feb 13, 2020 at 5:30 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > On 2020-Feb-12, Thomas Munro wrote:
> > > Hmm.  Well, on Unix we have to choose between "tell me the size but
> > > also change the position that I either don't care about or have to
> > > undo", and "tell me the size but also tell me all this other stuff I
> > > don't care about".  Since Windows apparently has GetFileSizeEx(), why
> > > not use that when that's exactly what you want?  It apparently
> > > understands large files.
> >
> > I was already thinking that it might be better to make the new function
> > just "tell me the file size" without leaking the details of *how* we do
> > it, before reading about this Windows call.  That reinforces it, IMO.
> 
> Ok, how about this?

So, you said lseek() is faster than fstat, and we would only use the
latter because we want to avoid the file position jumping ahead, even
though it's slower.  But if the next read/write is not going to care
about the file position because pread/pwrite, then why not just do one
lseek() and not worry about the file position jumping ahead?  Maybe the
API could offer this as an option; caller can say "I do care about file
position" (a boolean flag) and then we use fstat; or they can say "I
don't care" and then we just do a single lseek(SEEK_END).  Of course, in
Windows we ignore the flag since we can do it in the fast way.

pg_file_size(int fd, bool careful)

Let's have the function comment indicate that -1 is returned in case of
failure.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Getting rid of some more lseek() calls

От
Thomas Munro
Дата:
On Fri, Feb 14, 2020 at 4:47 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> So, you said lseek() is faster than fstat, and we would only use the
> latter because we want to avoid the file position jumping ahead, even
> though it's slower.  But if the next read/write is not going to care
> about the file position because pread/pwrite, then why not just do one
> lseek() and not worry about the file position jumping ahead?  Maybe the
> API could offer this as an option; caller can say "I do care about file
> position" (a boolean flag) and then we use fstat; or they can say "I
> don't care" and then we just do a single lseek(SEEK_END).  Of course, in
> Windows we ignore the flag since we can do it in the fast way.
>
> pg_file_size(int fd, bool careful)
>
> Let's have the function comment indicate that -1 is returned in case of
> failure.

Reviving an old thread... yeah, we should probably figure out if we
still want a central fd size function and how it should look, but in
the meantime, we might as well have slru.c using pread/pwrite as
already agreed, so I went ahead and pushed that part.