Thread: win32 _dosmaperr()

win32 _dosmaperr()

From
"Qingqing Zhou"
Date:
There have been several reports of "unable to read/write" errors on the
Pg8.0.x win32 port:

http://archives.postgresql.org/pgsql-bugs/2005-02/msg00181.php

I have hit this several times and finally caught the GetLastError()
number. It is

   32, ERROR_SHARING_VIOLATION
   The process cannot access the file because it is being used by another
   process.

But the PG server error message is "invalid parameter", which makes this
error difficult to understand and track down. After examining the win32
CRT's _dosmaperr() implementation, I found that it fails to translate
ERROR_SHARING_VIOLATION, so errno is set to the default, EINVAL. To solve
this, we can call our own _dosmaperr(GetLastError()) again if a read/write
fails. Unfortunately our _dosmaperr() does not translate it either, so here
is a patch to error.c. I also raised the error level to NOTICE for better
bug reports. If this is acceptable, I will patch FileRead()/FileWrite() etc.

However, I am not sure why this happens. That is, who opens the data file
in a non-sharing mode? There are many possibilities; the common consensus
is [anti-]virus software. Yes, I do have one installed. If we can confirm
this, then we could at least print a hint message.


Regards,
Qingqing


Index: error.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/port/win32/error.c,v
retrieving revision 1.4
diff -c -r1.4 error.c
*** error.c 31 Dec 2004 22:00:37 -0000 1.4
--- error.c 13 Jul 2005 09:04:57 -0000
***************
*** 72,77 ****
--- 72,80 ----
          ERROR_NO_MORE_FILES, ENOENT
      },
      {
+         ERROR_SHARING_VIOLATION, EACCES
+     },
+     {
          ERROR_LOCK_VIOLATION, EACCES
      },
      {
***************
*** 180,188 ****
          }
      }

!     ereport(DEBUG4,
!             (errmsg_internal("Unknown win32 error code: %i",
!                              (int) e)));
      errno = EINVAL;
      return;
  }
--- 183,192 ----
          }
      }

!     ereport(NOTICE,
!             (errmsg_internal("Unknown win32 error code: %i. "
!                              "Please report to <pgsql-bugs@postgresql.org>.",
!                              (int) e)));
      errno = EINVAL;
      return;
  }
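
A minimal sketch of the table-driven mapping that the patch above extends.
It is not the actual error.c: the W32_ constants are hypothetical names that
carry the Win32 numeric codes so the sketch builds without <windows.h>, and
map_win32_error() is a made-up function that returns the errno value instead
of setting errno and logging.

```c
#include <errno.h>
#include <stddef.h>

/* Win32 error codes as plain constants (hypothetical W32_ names). */
enum
{
    W32_ERROR_FILE_NOT_FOUND = 2,
    W32_ERROR_ACCESS_DENIED = 5,
    W32_ERROR_NO_MORE_FILES = 18,
    W32_ERROR_SHARING_VIOLATION = 32,
    W32_ERROR_LOCK_VIOLATION = 33
};

/* One row per known Win32 code and the errno value it maps to. */
static const struct
{
    unsigned long winerr;
    int           doserr;
} errmap[] = {
    {W32_ERROR_FILE_NOT_FOUND, ENOENT},
    {W32_ERROR_ACCESS_DENIED, EACCES},
    {W32_ERROR_NO_MORE_FILES, ENOENT},
    {W32_ERROR_SHARING_VIOLATION, EACCES},  /* the row the patch adds */
    {W32_ERROR_LOCK_VIOLATION, EACCES}
};

/* Return the errno value for a Win32 error code, or EINVAL if unknown. */
int
map_win32_error(unsigned long e)
{
    for (size_t i = 0; i < sizeof(errmap) / sizeof(errmap[0]); i++)
    {
        if (errmap[i].winerr == e)
            return errmap[i].doserr;
    }
    return EINVAL;  /* same fallback that hides unmapped codes */
}
```

Without the ERROR_SHARING_VIOLATION row, code 32 falls through to the EINVAL
fallback, which is how a sharing problem ends up reported as "invalid
parameter".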




Re: win32 _dosmaperr()

From
"Magnus Hagander"
Date:
> There have been several reports of "unable to read/write" errors on
> the Pg8.0.x win32 port:
>
> http://archives.postgresql.org/pgsql-bugs/2005-02/msg00181.php
>
> I have hit this several times and finally caught the
> GetLastError() number. It is
>
>     32, ERROR_SHARING_VIOLATION
>     The process cannot access the file because it is being
>     used by another process.
>
> But the PG server error message is "invalid parameter", which
> makes this error difficult to understand and track down. After
> examining the win32 CRT's _dosmaperr() implementation, I found
> that it fails to translate ERROR_SHARING_VIOLATION, so errno is
> set to the default, EINVAL. To solve this, we can call our own
> _dosmaperr(GetLastError()) again if a read/write fails.
> Unfortunately our _dosmaperr() does not translate it either, so
> here is a patch to error.c. I also raised the error level to
> NOTICE for better bug reports. If this is acceptable, I will
> patch FileRead()/FileWrite() etc.

Seems reasonable.


> However, I am not sure why this happens. That is, who
> opens the data file in a non-sharing mode? There are many
> possibilities; the common consensus is [anti-]virus software.
> Yes, I do have one installed. If we can confirm this, then we
> could at least print a hint message.

I would suspect either AV software and/or backup software not excluding
the pg data files.

I suggest you try using Process Explorer from www.sysinternals.com to
figure out who has the file open. Most of the time it should be able to
tell you exactly who has locked the file - at least as long as it's done
from userspace. I'm not 100% sure on how it deals with kernel level
locks.


//Magnus


Re: win32 _dosmaperr()

From
"Qingqing Zhou"
Date:
"Magnus Hagander" <mha@sollentuna.net> writes:
>
> I suggest you try using Process Explorer from www.sysinternals.com to
> figure out who has the file open. Most of the time it should be able to
> tell you exactly who has locked the file - at least as long as it's done
> from userspace. I'm not 100% sure on how it deals with kernel level
> locks.
>

Yes, "handle" (also on the Sysinternals site) might be an alternative. I've
added a call to "handle" to capture all the open handles under DataDir when
the error is trapped.

Regards,
Qingqing




Re: win32 _dosmaperr()

From
"Merlin Moncure"
Date:
Qingqing wrote:
> There have been several reports of "unable to read/write" errors on the
> Pg8.0.x win32 port:
>
> http://archives.postgresql.org/pgsql-bugs/2005-02/msg00181.php
>
> I have hit this several times and finally caught the GetLastError()
> number. It is
>
>     32, ERROR_SHARING_VIOLATION
>     The process cannot access the file because it is being used by
>     another process.
>
> But the PG server error message is "invalid parameter", which makes this
> error difficult to understand and track down. After examining the win32
> CRT's _dosmaperr() implementation, I found that it fails to translate
> ERROR_SHARING_VIOLATION, so errno is set to the default, EINVAL. To solve
> this, we can call our own _dosmaperr(GetLastError()) again if a
> read/write fails. Unfortunately our _dosmaperr() does not translate it
> either, so here is a patch to error.c. I also raised the error level to
> NOTICE for better bug reports. If this is acceptable, I will patch
> FileRead()/FileWrite() etc.
>
> However, I am not sure why this happens. That is, who opens the data
> file in a non-sharing mode? There are many possibilities; the common
> consensus is [anti-]virus software. Yes, I do have one installed. If we
> can confirm this, then we could at least print a hint message.

I had similar problems since the early days of the win32 port, random
restarts of the stats collector and other unexplainable things.  This
only ever happened under heavy loads (1000+/sec sustained query
processing) with statement level stats on.  This played havoc with my
user diagnostic tools because it randomly restarted the stats collector
so I've had to keep row level stats off.

Merlin


Re: win32 _dosmaperr()

From
"Qingqing Zhou"
Date:
"Magnus Hagander" <mha@sollentuna.net> writes:
>
> I suggest you try using Process Explorer from www.sysinternals.com to
> figure out who has the file open. Most of the time it should be able to
> tell you exactly who has locked the file - at least as long as it's done
> from userspace. I'm not 100% sure on how it deals with kernel level
> locks.
>

After running the PG win32 (8.0.1) server for a while and mixing in some
heavy operations such as checkpoint and vacuum, I encountered another
problem that appears to be in the same category. PG reports:

   "could not unlink 0000xxxx, continuing to try"

at dirmod.c/pgunlink() and loops there forever. Using the Process Explorer
tool you mentioned, I found that only 3 processes hold a handle to the
problematic xlog segment, and all of them are postgres backends. Using the
FileMon tool from the same website, I found that the bgwriter tried to OPEN
the xlog segment with ALL ACCESS but failed with result DELETE PEND.

That is to say, under some conditions, even if the file was opened with the
SHARED_DELETE flag (FILE_SHARE_DELETE in the Win32 API), it cannot be
removed while it is open? I ran some tests, but every time I deleted or
renamed an open file, it succeeded.

Things could get worse, because the whole database cluster may stop working
while waiting for the buffer the bgwriter is working on, but the bgwriter is
waiting (in the pgunlink loop) for those postgres processes to move on (so
that they can close the problematic xlog segment), which is a deadlock.
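
To illustrate the looping behaviour described above, here is a minimal sketch
of a bounded retry around file removal. It is not the dirmod.c code:
remove_with_retry(), its max_attempts limit and the pause_fn callback are
hypothetical, and it assumes remove() sets errno on failure (true on POSIX
systems, not guaranteed by ISO C).

```c
#include <errno.h>
#include <stdio.h>

/*
 * Try to remove "path" up to max_attempts times.  Only EACCES (the errno a
 * sharing violation is mapped to) is treated as retryable.  pause_fn, if
 * not NULL, is called between attempts so the caller can sleep.
 * Returns 0 on success, -1 on failure with errno describing the last error.
 */
int
remove_with_retry(const char *path, int max_attempts, void (*pause_fn)(void))
{
    if (max_attempts <= 0)
    {
        errno = EINVAL;
        return -1;
    }

    for (int attempt = 1; attempt <= max_attempts; attempt++)
    {
        if (remove(path) == 0)
            return 0;

        if (errno != EACCES)
            return -1;          /* not a sharing problem; retrying won't help */

        if (attempt < max_attempts && pause_fn != NULL)
            pause_fn();         /* wait for the other holder to close the file */
    }

    return -1;                  /* still EACCES after max_attempts tries */
}
```

A loop with no attempt limit never returns while another process keeps the
file open; giving up after a fixed number of tries lets the caller report
the failure instead.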

Regards,
Qingqing








Re: win32 _dosmaperr()

From
Bruce Momjian
Date:
Interesting. Are you sure all those processes were using our standard
flags?  Seems unusual and you are right, it shouldn't be happening.

---------------------------------------------------------------------------

Qingqing Zhou wrote:
> 
> "Magnus Hagander" <mha@sollentuna.net> writes:
> >
> > I suggest you try using Process Explorer from www.sysinternals.com to
> > figure out who has the file open. Most of the time it should be able to
> > tell you exactly who has locked the file - at least as long as it's done
> > from userspace. I'm not 100% sure on how it deals with kernel level
> > locks.
> >
> 
> After running the PG win32 (8.0.1) server for a while and mixing in some
> heavy operations such as checkpoint and vacuum, I encountered another
> problem that appears to be in the same category. PG reports:
> 
>     "could not unlink 0000xxxx, continuing to try"
> 
> at dirmod.c/pgunlink() and loops there forever. Using the Process Explorer
> tool you mentioned, I found that only 3 processes hold a handle to the
> problematic xlog segment, and all of them are postgres backends. Using the
> FileMon tool from the same website, I found that the bgwriter tried to OPEN
> the xlog segment with ALL ACCESS but failed with result DELETE PEND.
> 
> That is to say, under some conditions, even if the file was opened with the
> SHARED_DELETE flag (FILE_SHARE_DELETE in the Win32 API), it cannot be
> removed while it is open? I ran some tests, but every time I deleted or
> renamed an open file, it succeeded.
> 
> Things could get worse, because the whole database cluster may stop working
> while waiting for the buffer the bgwriter is working on, but the bgwriter is
> waiting (in the pgunlink loop) for those postgres processes to move on (so
> that they can close the problematic xlog segment), which is a deadlock.
> 
> Regards,
> Qingqing

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: win32 _dosmaperr()

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Qingqing Zhou wrote:
>> Things could get worse, because the whole database cluster may stop working
>> while waiting for the buffer the bgwriter is working on, but the bgwriter is
>> waiting (in the pgunlink loop) for those postgres processes to move on (so
>> that they can close the problematic xlog segment), which is a deadlock.

I think that analysis is bogus.  The bgwriter only tries to unlink xlog
segments during post-checkpoint cleanup, at which point it isn't holding
any buffer locks.  Likewise, while backends might wait trying to remove
a table file because the bgwriter has the file open, in that state they
aren't blocking the bgwriter either.

In the latter case, the backends will have to wait till the bgwriter
closes the file, which it'll do not later than the next checkpoint.
I wonder whether the complaints are coming from people who don't know
about that, and didn't wait long enough?

There could be a deadlock if a backend is holding open an old xlog
segment while it executes a CHECKPOINT command, because then it'll
wait for the bgwriter, and the bgwriter might think it could remove
the xlog file during the checkpoint.

Another form could only happen between two backends: A is trying to
unlink file F, which backend B has open, and then for some unrelated
reason B has to wait for a lock held by A.  The bgwriter doesn't take
nor wait for locks so this doesn't apply to it.
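
The two-backend case above can be modelled as a cycle in a "waits for"
relation. The following is only an illustrative sketch, assuming each
process waits on at most one other process; in_wait_cycle() and the
waits_for array are hypothetical and do not correspond to anything in the
PostgreSQL source.

```c
#include <stdbool.h>

/*
 * waits_for[i] is the index of the process that process i is blocked on,
 * or -1 if process i is not waiting.  Returns true if following the chain
 * from "start" leads back to "start", i.e. start is part of a deadlock.
 */
bool
in_wait_cycle(const int waits_for[], int nprocs, int start)
{
    if (start < 0 || start >= nprocs)
        return false;

    int cur = start;

    for (int step = 0; step < nprocs; step++)
    {
        cur = waits_for[cur];
        if (cur < 0 || cur >= nprocs)
            return false;       /* chain ends at a process that is not waiting */
        if (cur == start)
            return true;        /* came back around: deadlock */
    }

    return false;               /* start is not itself on a cycle */
}
```

With A = 0 waiting on B (to close file F) and B = 1 waiting on A (for a
lock), the array is {1, 0} and both are in a cycle. A backend that waits on
a bgwriter which waits on nobody, {1, -1}, is merely delayed.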

But none of this should be happening because we're supposedly always
opening all these files with the magic sharing flag.
        regards, tom lane