Thread: win32 _dosmaperr()
There were several reports of "unable to read/write" on the
Pg 8.0.x win32 port:

http://archives.postgresql.org/pgsql-bugs/2005-02/msg00181.php

I encountered this several times and finally caught the
GetLastError() number. It is

    32, ERROR_SHARING_VIOLATION
    The process cannot access the file because it is being
    used by another process.

But the PG server error message is "invalid parameter", which makes this
error difficult to understand and track. After examining the win32 CRT's
_dosmaperr() implementation, I found that it fails to translate
ERROR_SHARING_VIOLATION, so the default errno is set to EINVAL. To solve
this, we can call our own _dosmaperr(GetLastError()) again if read/write
fails. Unfortunately our _dosmaperr() fails to translate it as well, so
here is a patch for error.c. Also, I raised the error level to NOTICE for
better bug reports. If this is acceptable, I will patch
FileRead()/FileWrite() etc.

However, I am not sure why this could happen. That is, who uses the data
file in a non-sharing mode? There are many possibilities; a common
consensus is [anti-]virus software. Yes, I do have one installed. If we
can confirm this, then we could at least print a hint message.

Regards,
Qingqing

Index: error.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/port/win32/error.c,v
retrieving revision 1.4
diff -c -r1.4 error.c
*** error.c	31 Dec 2004 22:00:37 -0000	1.4
--- error.c	13 Jul 2005 09:04:57 -0000
***************
*** 72,77 ****
--- 72,80 ----
  		ERROR_NO_MORE_FILES, ENOENT
  	},
  	{
+ 		ERROR_SHARING_VIOLATION, EACCES
+ 	},
+ 	{
  		ERROR_LOCK_VIOLATION, EACCES
  	},
  	{
***************
*** 180,188 ****
  		}
  	}

! 	ereport(DEBUG4,
! 			(errmsg_internal("Unknown win32 error code: %i",
! 							 (int) e)));
  	errno = EINVAL;
  	return;
  }
--- 183,192 ----
  		}
  	}

! 	ereport(NOTICE,
! 			(errmsg_internal("Unknown win32 error code: %i. "
! 					"Please report to <pgsql-bugs@postgresql.org>.",
! 					(int) e)));
  	errno = EINVAL;
  	return;
  }
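[For readers without the PostgreSQL sources at hand, the table-driven
mapping that the patch extends can be sketched portably. This is a
simplified illustration, not the real error.c: the Windows error codes are
hard-coded with their winerror.h values (ERROR_SHARING_VIOLATION is 32) so
it compiles outside Windows, and the function name my_dosmaperr is
illustrative.]

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hard-coded winerror.h values so this sketch compiles anywhere;
 * on Windows these come from <windows.h>. */
#define ERROR_FILE_NOT_FOUND      2UL
#define ERROR_SHARING_VIOLATION  32UL
#define ERROR_LOCK_VIOLATION     33UL

static const struct
{
	unsigned long winerr;
	int			doserr;
}			doserrors[] =
{
	{ERROR_FILE_NOT_FOUND, ENOENT},
	{ERROR_SHARING_VIOLATION, EACCES},	/* the mapping the patch adds */
	{ERROR_LOCK_VIOLATION, EACCES},
};

/* Map a Win32 error code to errno; anything not in the table falls
 * back to EINVAL -- the case the patch makes noisy at NOTICE level. */
static void
my_dosmaperr(unsigned long e)
{
	for (size_t i = 0; i < sizeof(doserrors) / sizeof(doserrors[0]); i++)
	{
		if (doserrors[i].winerr == e)
		{
			errno = doserrors[i].doserr;
			return;
		}
	}
	fprintf(stderr, "Unknown win32 error code: %lu\n", e);
	errno = EINVAL;
}
```

[With the ERROR_SHARING_VIOLATION entry present, my_dosmaperr(32) sets
errno to EACCES instead of the misleading EINVAL ("invalid parameter").]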
> There were several reports of "unable to read/write" on
> the Pg 8.0.x win32 port:
>
> http://archives.postgresql.org/pgsql-bugs/2005-02/msg00181.php
>
> I encountered this several times and finally caught the
> GetLastError() number. It is
>
> 32, ERROR_SHARING_VIOLATION
> The process cannot access the file because it is being
> used by another process.
>
> But the PG server error message is "invalid parameter", which
> makes this error difficult to understand and track. After
> examining the win32 CRT's _dosmaperr() implementation, I found
> that it fails to translate ERROR_SHARING_VIOLATION, so the default
> errno is set to EINVAL. To solve this, we can call our own
> _dosmaperr(GetLastError()) again if read/write fails.
> Unfortunately our
> _dosmaperr() fails to translate it as well, so here is a patch for
> error.c. Also, I raised the error level to NOTICE for better
> bug reports. If this is acceptable, I will patch
> FileRead()/FileWrite() etc.

Seems reasonable.

> However, I am not sure why this could happen. That is, who
> uses the data file in a non-sharing mode? There are many
> possibilities; a common consensus is [anti-]virus software.
> Yes, I do have one installed. If we can confirm this, then we
> could at least print a hint message.

I would suspect either AV software and/or backup software not excluding
the pg data files.

I suggest you try using Process Explorer from www.sysinternals.com to
figure out who has the file open. Most of the time it should be able to
tell you exactly who has locked the file - at least as long as it's done
from userspace. I'm not 100% sure on how it deals with kernel level
locks.

//Magnus
""Magnus Hagander"" <mha@sollentuna.net> writes > > I suggest you try using Process Explorer from www.sysinternals.com to > figure out who has the file open. Most of the time it should be able to > tell you exactly who has locked the file - at least as long as it's done > from userspace. I'm not 100% sure on how it deals with kernel level > locks. > Yes, "handle" (also in sysinternal's site) might be an alternative. I've add a call to "handle" to catch all the open handles of DataDir when the error is trapped. Regards, Qingqing
Qingqing wrote:
> There were several reports of "unable to read/write" on the Pg 8.0.x
> win32 port:
>
> http://archives.postgresql.org/pgsql-bugs/2005-02/msg00181.php
>
> I encountered this several times and finally caught the GetLastError()
> number. It is
>
> 32, ERROR_SHARING_VIOLATION
> The process cannot access the file because it is being used by another
> process.
>
> But the PG server error message is "invalid parameter", which makes this
> error difficult to understand and track. After examining the win32 CRT's
> _dosmaperr() implementation, I found that it fails to translate
> ERROR_SHARING_VIOLATION, so the default errno is set to EINVAL. To solve
> this, we can call our own _dosmaperr(GetLastError()) again if read/write
> fails. Unfortunately our _dosmaperr() fails to translate it as well, so
> here is a patch for error.c. Also, I raised the error level to NOTICE
> for better bug reports. If this is acceptable, I will patch
> FileRead()/FileWrite() etc.
>
> However, I am not sure why this could happen. That is, who uses the data
> file in a non-sharing mode? There are many possibilities; a common
> consensus is [anti-]virus software. Yes, I do have one installed. If we
> can confirm this, then we could at least print a hint message.

I have had similar problems since the early days of the win32 port:
random restarts of the stats collector and other unexplainable things.
This only ever happened under heavy load (1000+/sec sustained query
processing) with statement-level stats on. It played havoc with my user
diagnostic tools because it randomly restarted the stats collector, so
I've had to keep row-level stats off.

Merlin
""Magnus Hagander"" <mha@sollentuna.net> writes > > I suggest you try using Process Explorer from www.sysinternals.com to > figure out who has the file open. Most of the time it should be able to > tell you exactly who has locked the file - at least as long as it's done > from userspace. I'm not 100% sure on how it deals with kernel level > locks. > After runing PG win32 (8.0.1) sever for a while and mix some heavy transactions like checkpoint, vacuum together, I encountered another problem should be in the same category. PG reports: "could not unlink 0000xxxx, continuing to try" at dirmod.c/pgunlink() and deadloops there. I use the PE tool you mentioned, I found there are only 3 processes hold the handle of the problematic xlog segment, all of them are postgres backends. Using the FileMon tool from the same website, I found that bgwriter tried to OPEN the xlog segment with ALL ACCESS but failed with result DELETE PEND. That is to say, under some conditions, even if I opened file with SHARED_DELETE flag, I may not remove the file when it is open? I did some tests, but every time I delete/rename an opened file, I could make it. Things could get worse because the whole database cluster may stop working and waiting for the buffer the bgwriter is working on, but bgwriter is waiting for (by the deadloop in pgunlink) those postgres'es to move on (so that they could close the problematic xlog segment), which is a deadlock. Regards, Qingqing
Interesting. Are you sure all those processes were using our standard
flags? It seems unusual, and you are right, it shouldn't be happening.

---------------------------------------------------------------------------

Qingqing Zhou wrote:
>
> "Magnus Hagander" <mha@sollentuna.net> writes:
> >
> > I suggest you try using Process Explorer from www.sysinternals.com to
> > figure out who has the file open. Most of the time it should be able to
> > tell you exactly who has locked the file - at least as long as it's done
> > from userspace. I'm not 100% sure on how it deals with kernel level
> > locks.
> >
>
> After running the PG win32 (8.0.1) server for a while with a mix of heavy
> operations like checkpoint and vacuum together, I encountered another
> problem that should be in the same category. PG reports:
>
> "could not unlink 0000xxxx, continuing to try"
>
> at dirmod.c/pgunlink() and loops there forever. Using the Process
> Explorer tool you mentioned, I found that only 3 processes hold a handle
> on the problematic xlog segment, and all of them are postgres backends.
> Using the FileMon tool from the same website, I found that the bgwriter
> tried to OPEN the xlog segment with ALL ACCESS but failed with result
> DELETE PEND.
>
> That is to say, under some conditions, even if I opened the file with the
> FILE_SHARE_DELETE flag, I may not be able to remove the file while it is
> open? I did some tests, but every time I deleted/renamed an opened file,
> it succeeded.
>
> Things could get worse: the whole database cluster may stop working,
> waiting for the buffer the bgwriter is working on, while the bgwriter is
> waiting (via the retry loop in pgunlink) for those postgres backends to
> move on (so that they can close the problematic xlog segment), which is
> a deadlock.
>
> Regards,
> Qingqing

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Qingqing Zhou wrote:
>> Things could get worse: the whole database cluster may stop working,
>> waiting for the buffer the bgwriter is working on, while the bgwriter
>> is waiting (via the retry loop in pgunlink) for those postgres backends
>> to move on (so that they can close the problematic xlog segment), which
>> is a deadlock.

I think that analysis is bogus. The bgwriter only tries to unlink xlog
segments during post-checkpoint cleanup, at which point it isn't holding
any buffer locks. Likewise, while backends might wait trying to remove a
table file because the bgwriter has the file open, in that state they
aren't blocking the bgwriter either. In the latter case, the backends
will have to wait till the bgwriter closes the file, which it'll do no
later than the next checkpoint. I wonder whether the complaints are
coming from people who don't know about that, and didn't wait long
enough?

There could be a deadlock if a backend is holding open an old xlog
segment while it executes a CHECKPOINT command, because then it'll wait
for the bgwriter, and the bgwriter might think it could remove the xlog
file during the checkpoint. Another form could only happen between two
backends: A is trying to unlink file F, which backend B has open, and
then for some unrelated reason B has to wait for a lock held by A. The
bgwriter neither takes nor waits for locks, so this doesn't apply to it.

But none of this should be happening, because we're supposedly always
opening all these files with the magic sharing flag.

			regards, tom lane