Thread: pg_listener entries deleted under heavy NOTIFY load only on Windows
Under a heavy load of NOTIFY events, entries in the pg_listener table for
some events are deleted, effectively acting as though UNLISTEN were called.
I have only been able to make this occur on a PostgreSQL server running on
Windows; I cannot reproduce it under Linux.

PostgreSQL version: 8.3.4
Operating system: Windows XP

I have provided two shell scripts that can reproduce the problem. It helps
to use three psql sessions.

First, start a session that listens for two events ("first" and "second"),
then periodically polls for notifications:

    check_for_events | psql -h windows_server -U postgres postgres

Second, start an interactive session and execute the command
"SELECT * FROM pg_listener". You should see the events "first" and
"second" listed.

Third, start a session to do repeated notification. The script produces
"NOTIFY first" and "NOTIFY second" in separate transactions, repeatedly:

    generate_events | psql -h windows_server -U postgres postgres

Continue to check the pg_listener table in the interactive session. Very
quickly, one of the two events will disappear from the table.

Strangely, when I execute the same scripts against a database server on a
CentOS Linux system, this problem does not occur.

check_for_events:

    #!/bin/sh
    poll_sec=15
    imax=`expr 10000 / ${poll_sec}`
    if [ $# -ge 1 ]; then
        imax=$1
    fi

    echo "LISTEN first;"
    echo "LISTEN second;"
    i=0
    while [ $i -lt $imax ]; do
        echo "SELECT pg_sleep(${poll_sec});"
        i=`expr $i + 1`
    done

generate_events:

    #!/bin/sh
    imax=100000
    autocommit=1
    if [ $# -ge 1 ]; then
        imax=$1
    fi

    echo "BEGIN;"
    i=0
    while [ $i -lt $imax ]; do
        #echo "NOTIFY \"a_fairly_long_event_name_followed_by_colon:INSERT\";"
        #echo "NOTIFY \"a_fairly_long_event_name_followed_by_colon:UPDATE\";"
        #echo "NOTIFY \"a_fairly_long_event_name_followed_by_colon:INSERT\";"
        #echo "NOTIFY \"a_very_different_event_name\";"
        echo "NOTIFY first;"
        echo "NOTIFY second;"
        if [ "${autocommit}" ]; then
            echo "COMMIT;"
            echo "BEGIN;"
        fi
        i=`expr $i + 1`
    done
    echo "COMMIT;"
"Marshall, Steve" <smarshall@wsi.com> writes: > Under a heavy load of NOTIFY events, entries in the pg_listener table > for some events are deleted, effectively acting as though UNLISTEN were > called. > I have only been able to make this occur on a PostgreSQL server running > on Windows. AFAICS the most likely explanation for this is that Send_Notify() gets an error from kill() and concludes that the listening process has died without removing its pg_listener entry; whereupon it removes it itself. Looking at pgkill(), that theory implies that CallNamedPipe() failed when under sufficient stress. I'm not sure what the "timeout" parameter we use with CallNamedPipe actually limits, but maybe it's too small? (Microsoft's doc suggests that the timeout only matters if the pipe doesn't already exist, so I'm not sure I believe this theory; though certainly the doc is vague enough that that reading could be wrong.) Theory B is that you've got some broken antivirus code on there that is arbitrarily interfering with the pipe access. The lack of any similar previous reports suggests that there's some local issue contributing ,,, regards, tom lane
"Marshall, Steve" <smarshall@wsi.com> writes: > Under a heavy load of NOTIFY events, entries in the pg_listener table=20 > for some events are deleted, effectively acting as though UNLISTEN=20 > were called. > I have only been able to make this occur on a PostgreSQL server=20 > running on Windows. "tom lane writes": AFAICS the most likely explanation for this is that Send_Notify() gets an error from kill() and concludes that the listening process has died without removing its pg_listener entry; whereupon it removes it itself. Looking at pgkill(), that theory implies that CallNamedPipe() failed when under sufficient stress. I'm not sure what the "timeout" parameter we use with CallNamedPipe actually limits, but maybe it's too small? (Microsoft's doc suggests that the timeout only matters if the pipe doesn't already exist, so I'm not sure I believe this theory; though certainly the doc is vague enough that that reading could be wrong.) Theory B is that you've got some broken antivirus code on there that is arbitrarily interfering with the pipe access. The lack of any similar previous reports suggests that there's some local issue contributing ,,, To explore Theory B, I'll turn off all non-essential services on the Windows server and rerun the test. I'll report back with what I find. Any thoughts on how to confirm or deny Theory A?
Tom Lane wrote:
> Theory B is that you've got some broken antivirus code on there that is
> arbitrarily interfering with the pipe access. The lack of any similar
> previous reports suggests that there's some local issue contributing ...

Steve Marshall wrote:
> To explore Theory B, I'll turn off all non-essential services on the
> Windows server and rerun the test.

I did this, including turning off Symantec AV and several other services
we don't use anyway. The test turned out exactly the same: one of the
events was dropped from pg_listener within a few seconds of starting the
test.
Marshall, Steve wrote:
> Any thoughts on how to confirm or deny Theory A?

Try changing the 1000 to NMPWAIT_WAIT_FOREVER.

-- 
Alvaro Herrera                        http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
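(Concretely, the suggested experiment is a one-argument change to the
CallNamedPipe call in src/port/kill.c, shown here as the call appears in
the patch context later in this thread; NMPWAIT_WAIT_FOREVER makes the
wait unbounded:)

    if (!CallNamedPipe(pipename, &sigData, 1, &sigRet, 1, &bytes,
                       NMPWAIT_WAIT_FOREVER))    /* was: 1000 (ms) */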
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Marshall, Steve wrote:
>> Any thoughts on how to confirm or deny Theory A?

> Try changing the 1000 to NMPWAIT_WAIT_FOREVER

As long as you're changing the source code, it'd be a good idea to verify
the supposition that kill() is failing, eg in
src/backend/commands/async.c:

         if (kill(listenerPID, SIGUSR2) < 0)
         {
+            elog(LOG, "kill(%d) failed: %m", listenerPID);

             /*
              * Get rid of pg_listener entry if it refers to a PID that no
              * longer exists.  Presumably, that backend crashed without
              * deleting its pg_listener entries.  This code used to only

If that's right, sprinkling a few debug printouts into src/port/kill.c
would be the next step.

			regards, tom lane
Marshall, Steve wrote:
> Any thoughts on how to confirm or deny Theory A?

Alvaro Herrera wrote:
> Try changing the 1000 to NMPWAIT_WAIT_FOREVER

I tried Alvaro's suggestion, and it fixed the problem in my test case.
With the original code, the test would consistently drop one of the events
from pg_listener within a few seconds. With the changed timeout, I've run
it for about 10 minutes, or 100,000 notifications of each event, and it
worked without error.

Are there any side effects from effectively disabling this timeout that
should be tested?
As I posted before, changing the timeout from 1000 to NMPWAIT_WAIT_FOREVER
fixed the problem, or at least moved it so that it no longer occurs
easily.

To better understand the problem, I added debugging as Tom suggested. I
restored the CallNamedPipe timeout to 1000 ms and reran my tests. Indeed,
kill is encountering an error:

    LOG:  kill(2168) failed: No such process

I instrumented pgkill to output the value of GetLastError() if
CallNamedPipe fails. It returned error code 2, which Windows identifies as
ERROR_FILE_NOT_FOUND. The logic in pgkill translates this Windows error
into an errno value of ESRCH.

The Windows error is a bit surprising, at least to me -- I expected
something indicating the pipe was full. Does anyone have a richer
interpretation of this error?

Thanks,
Steve

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

> As long as you're changing the source code, it'd be a good idea to
> verify the supposition that kill() is failing, eg in
> src/backend/commands/async.c [...]
>
> If that's right, sprinkling a few debug printouts into src/port/kill.c
> would be the next step.
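(The extra instrumentation described was along these lines -- a sketch,
with the message wording illustrative rather than the exact code used:)

    if (!CallNamedPipe(pipename, &sigData, 1, &sigRet, 1, &bytes, 1000))
    {
        DWORD   err = GetLastError();

        /* Log the raw Windows error before mapping it to an errno. */
        fprintf(stderr, "pgkill: CallNamedPipe for pid %d, sig %d failed: error %lu\n",
                pid, sig, (unsigned long) err);
        if (err == ERROR_FILE_NOT_FOUND)
            errno = ESRCH;
        else
            errno = EINVAL;
        return -1;
    }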
I've progressed a bit farther on the dropped notification bug, which amends my previous results.
I modified the error handling code for the kill() call in async.c to check
whether the listenerPID is associated with a running backend process. To
do that, I call IsBackendPid(listenerPID), and I only remove the
pg_listener entry if IsBackendPid returns false. If the process exists, I
execute the same pg_listener update logic as if kill had succeeded. I
found this eliminates the problem completely, without altering the timeout
in CallNamedPipe.
I also refined my debug statements in pgkill, and determined the system error was different from what I had previously reported. The GetLastError() value is 31 (ERROR_GEN_FAILURE), not 2 (ERROR_FILE_NOT_FOUND). ERROR_GEN_FAILURE is translated as "A device attached to the system is not functioning", which sounds at least a bit more like the "pipe is full" or "pipe not available" error I was expecting.
Essentially, the new error handling in async.c allows postgres to fail in its efforts to signal a process about the presence of a NOTIFY event without invalidating the subscription (LISTEN) for that event. The modified logic still records the presence of an event in pg_listener, if this event has not already been recorded. I believe this means the listener will pick up this event the next time it is signalled. AFAICS, this means the event notification won't be dropped, and assuming a heavy load of notifications to the listener, the notification won't even be seriously delayed.
I think the details of the problem are reasonably well known, but we have
some choices for implementing a fix. As I see it, the fix should be
localized in the error handling of the kill call in async.c. We could use
the technique I described, explicitly checking for the listener process
with IsBackendPid (a patch to do this is attached; see the sketch below).
Or we could revert to a previous technique used in this code: checking the
errno from kill, and only deleting the pg_listener record under certain
error conditions. In that case, we would need to ensure ERROR_GEN_FAILURE
is translated to an appropriate errno value.
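(Since the attachment is not reproduced here, a sketch of the first
option, built from the description above: variable names follow Tom's
earlier async.c excerpt, and the control flow is illustrative rather than
the actual patch.)

    if (kill(listenerPID, SIGUSR2) < 0)
    {
        /*
         * Only treat the listener as dead if it really is gone; a failed
         * kill() on Windows can be transient (see above).
         */
        if (!IsBackendPid(listenerPID))
        {
            simple_heap_delete(lRel, &lTuple->t_self);
            continue;           /* illustrative: skip the tuple update */
        }
        /* otherwise fall through and update the tuple as on success */
    }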
I think detecting whether or not the listener backend exists is really the
check we want. However, there may be performance considerations for the
different ways of checking this.
Any thoughts on this or other possible solutions?
Steve
"Marshall, Steve" <smarshall@wsi.com> writes: > Essentially, the new error handling in async.c allows postgres to fail > in its efforts to signal a process about the presence of a NOTIFY event > without invalidating the subscription (LISTEN) for that event. But why would that be a good idea? It seems to me to be a bandaid that guarantees misbehavior in the infrequent-notification case. The real question here, which this doesn't advance us towards solving, is why is the CallNamedPipe call sometimes failing (or at least taking longer than it seems it ought to). regards, tom lane
The reason preserving the subscription is good is that the current
behavior silently drops the subscription without informing the subscribing
process in any way. This means applications waiting for changes to
database tables are left with the mistaken impression that no changes have
occurred.

My reason for updating the pg_listener tuple even when kill fails may be
based on a mistaken understanding of asynchronous notification; we could
easily avoid it if it's not the right thing to do. My thought was that a
single byte is written to the named pipe to inform the listener of the
presence of new events, but that the knowledge of which event(s) are
pending is kept in the pg_listener table. Since this problem only occurs
under heavy notification load (i.e. when the pipe already has data in it),
the listener is going to get signalled about the presence of new events
even if one particular kill() call failed. Let me know if I misunderstand
the use of the named pipe and the pg_listener table in event notification.

Given my understanding, it seems like a good idea to ensure the state of
pg_listener is updated with all events that were received. If we don't do
that, we risk the case where the one kill that fails is for an event that
was only issued that one time. In that case, we would fail to alert the
application that the event occurred.

I don't think a check for process existence is a bad idea, or even a
band-aid. The comment in the code block in async.c says it is removing the
entry in pg_listener because the backend process does not exist. The code
assumes the process is dead, but does not check it. I don't think it is
unreasonable to check this assumption and behave accordingly.

If you have suggestions on how to get more information on the cause of the
error, I'd be willing to look into them. The Microsoft documentation only
describes a couple of error conditions for CallNamedPipe (error 31 is not
one of them), and the error code itself is not very descriptive ("A device
attached to the system is not functioning"). I'm open to ideas on how to
dig further.

Steve

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, January 28, 2009 12:44 PM

> But why would that be a good idea? It seems to me to be a band-aid that
> guarantees misbehavior in the infrequent-notification case.
>
> The real question here, which this doesn't advance us towards solving,
> is why the CallNamedPipe call is sometimes failing (or at least taking
> longer than it seems it ought to).
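(To make the understanding being described concrete: the flow sketched
below is my reading of the 8.3 mechanism. AtCommit_Notify and
ProcessIncomingNotify are the real async.c entry points, but
mark_notification_pending is a hypothetical stand-in for work that is
actually done inline:)

    /*
     * Sender side, in AtCommit_Notify, for each pg_listener row matching
     * the NOTIFY name:
     */
    mark_notification_pending(lTuple, MyProcPid);   /* update the tuple */
    kill(listenerPID, SIGUSR2);         /* one wake-up byte down the pipe */

    /*
     * Listener side, on SIGUSR2 (ProcessIncomingNotify): rescan
     * pg_listener for our own rows, send a NotificationResponse to the
     * client for each row with a pending notification, then clear the
     * flag.  One signal delivers all pending events, so a single lost
     * kill() need not lose an event -- provided the pg_listener row
     * itself survives.
     */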
"Marshall, Steve" <smarshall@wsi.com> writes: > I don't think a check for process existance is a bad idea, or even a > bandaid. The comment in the code block in async.c says it is removing > the entry in pg_listener because the backend process does not exist. In general, the way to see if a process exists is to try to kill() it; the fact that the kill failed is sufficient proof, at least in Unix-land. If it's possible for kill() to fail for transient reasons in our Windows implementation, that's a bug in the Windows emulation of kill. Another reason behind the async.c coding is that even if the process does still exist, there's no point in maintaining a pg_listener entry for it if we can't signal it. Thirdly, this is hardly the only place where we expect kill() to work reliably. You've managed to create a reproducible case illustrating that it's not being reliable, but the same bug might account for other failures much harder to reproduce and investigate. So my opinion is that the real issue here is why is the kill() implementation failing when it should not. We need to fix that, not put band-aids in async.c. As to how to fix it, I'll defer to other people more Windows-knowledgeable. Maybe taking out the timeout is really the best answer. regards, tom lane
Tom Lane wrote:
> So my opinion is that the real issue here is why the kill()
> implementation is failing when it should not. We need to fix that, not
> put band-aids in async.c.

I think Tom makes a valid point. To that end, I think the Windows kill
implementation needs to be changed to have these properties:

1. kill should generally return quickly and should never hang.
2. kill should never fail for a transient reason.
3. If kill fails, it should be a good indication that the process is dead
   or that we are permanently unable to communicate with it.

The current implementation has property #1, but not #2 and #3. However, I
think I've figured out a simple way to modify the Windows implementation
of pgkill to achieve all three properties. I will do some long-term
testing of the changes over the weekend, just to check that my solution
works properly over longer time scales. Here is what I've found so far:

* Contrary to my previous reports, the notification error is always the
  result of pgkill failing with error code 2 (ERROR_FILE_NOT_FOUND). I had
  previously thought it had issued error 31, but this was just an error in
  my debug message (the signal was 31, i.e. SIGUSR2).

* Also contrary to what I wrote previously, long or infinite timeouts do
  not fix the problem. With an infinite timeout, I've avoided the problem
  for as long as ten minutes, but it eventually happens. In some cases,
  the problem even occurred quickly with an infinite timeout.

The solution that seems to work is to call CallNamedPipe repeatedly in a
loop if it fails. Currently, I call the function up to a maximum of 5
times, although in all test cases so far, the code has never needed more
than 1 retry to succeed. Based on my testing, I may reduce the maximum
number of retries. The code sleeps for 20 ms between retries. I reduced
the timeout for CallNamedPipe from 1000 ms to 250 ms after the first call,
to reduce the total time for the signal if we hit a case that needs to
time out. A sketch of the loop appears below.

I also notice that signals that should fail do fail. For example, signal
30 seems to be regularly sent to pid 1, and this fails in both the
original code and my modified version.

Theoretically, I'm not entirely sure why CallNamedPipe fails occasionally
but will succeed when called with the same arguments a very short while
later. It's hard to know without being able to see the source code.
However, from the Windows documentation, it seems a single named pipe in
Windows can have several "instances", which seem to be access interfaces
to the pipe. I suspect there is some race condition where the code
erroneously decides it needs to create an "instance" of the pipe, rather
than waiting for an instance to become available. When the instance
creation fails, it generates the FILE_NOT_FOUND error.

I'll post back on Monday with more complete test results and, if all goes
well, a patch. If anyone has ideas on what else should be tested, please
let me know.

Steve
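(In code form, the experiment described above is roughly the following;
the polished version, with slightly different constants, is in the patch
posted later in this thread:)

    DWORD   timeout = 1000;     /* first attempt keeps the original 1 s */
    int     tries = 0;

    while (tries < 5 &&
           !CallNamedPipe(pipename, &sigData, 1, &sigRet, 1, &bytes, timeout))
    {
        tries++;
        timeout = 250;          /* shorter timeout on retries */
        pg_usleep(20000);       /* 20 ms between attempts */
    }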
"Marshall, Steve" <smarshall@wsi.com> writes: > The solution that seems to work is to call CallNamedPipe repeatedly in a > loop if it fails. Seems a bit nervous-making, since there's no principled way to know how many retries to try. > I also notice that signals that should fail, do fail. For example, > signal 30 seems to be regularly sent to pid 1, and this fails in both > the orignal code and my modified version. That's probably this bug: http://archives.postgresql.org/pgsql-bugs/2008-11/msg00077.php http://archives.postgresql.org/pgsql-committers/2008-11/msg00132.php It shouldn't be happening, but it does offer a convenient test case that your patch still fails when it should fail ;-) regards, tom lane
I've completed my test of changes to the Windows implementation of pgkill.
I ran two clients constantly doing a pair of notifications, with one
client subscribing to those events. I ran the test for 90 hours. I found
760 cases where the CallNamedPipe function initially failed but succeeded
on a retry. In all but one of those cases, only a single retry was needed;
the other case required two retries.

As Tom has pointed out, there is no principled way to determine the
correct number of times to retry. Therefore, I have set the maximum number
of retries to 2, which is the worst case found in a fairly long stress
test. This strikes me as a reasonable balance between failing quickly in
truly erroneous cases and still avoiding transient pseudo-failures.

A patch with this change is attached. It should be applied to the 8.3.5
version of src/port/kill.c.

Steve

*** kill.c	2009-02-03 14:28:21.753474644 -0500
--- kill.c.mod	2009-02-03 14:28:04.465829331 -0500
***************
*** 25,30 ****
--- 25,33 ----
  	BYTE		sigData = sig;
  	BYTE		sigRet = 0;
  	DWORD		bytes;
+ 	DWORD		timeout = 1000;	/* in ms; forever = NMPWAIT_WAIT_FOREVER */
+ 	const int	max_pipe_tries = 3;
+ 	int			pipe_tries = 0;
  
  	/* we allow signal 0 here, but it will be ignored in pg_queue_signal */
  	if (sig >= PG_SIGNAL_COUNT || sig < 0)
***************
*** 39,45 ****
  		return -1;
  	}
  	snprintf(pipename, sizeof(pipename), "\\\\.\\pipe\\pgsignal_%u", pid);
! 	if (!CallNamedPipe(pipename, &sigData, 1, &sigRet, 1, &bytes, 1000))
  	{
  		if (GetLastError() == ERROR_FILE_NOT_FOUND)
  			errno = ESRCH;
--- 42,63 ----
  		return -1;
  	}
  	snprintf(pipename, sizeof(pipename), "\\\\.\\pipe\\pgsignal_%u", pid);
! 
! 	/*
! 	 * Writing data to the named pipe can fail for transient reasons.
! 	 * Therefore, it is useful to retry if it fails.  The maximum number of
! 	 * calls to make was empirically determined from a 90-hour notification
! 	 * stress test.
! 	 */
! 	while (pipe_tries < max_pipe_tries &&
! 		   !CallNamedPipe(pipename, &sigData, 1, &sigRet, 1, &bytes, timeout))
! 	{
! 		pipe_tries++;
! 		timeout = 250;
! 		pg_usleep(10000);
! 	}
! 
! 	if (pipe_tries >= max_pipe_tries)
  	{
  		if (GetLastError() == ERROR_FILE_NOT_FOUND)
  			errno = ESRCH;
Any thoughts on whether or not this patch should be included in a future
PostgreSQL release? If so, any thoughts on additional testing, etc., that
needs to be done? Should the patch be added to a minor release of 8.3, or
only included going forward in 8.4? In my opinion, this is a pretty
serious error causing some major problems and workarounds in systems I am
working on, so I would vote for inclusion in minor releases.

The post below seems to have become disassociated from the rest of the
thread on this bug, perhaps because it was issued in February while the
rest of the thread was posted in January. Is there any way to make the
mailing list archives link these messages back to the rest of the thread?

-----Original Message-----

> I've completed my test of changes to the Windows implementation of
> pgkill. I ran two clients constantly doing a pair of notifications,
> with one client subscribing to those events. I ran the test for 90
> hours. I found 760 cases where the CallNamedPipe function initially
> failed but succeeded on a retry. [...]
>
> A patch with this change is attached. It should be applied to the 8.3.5
> version of src/port/kill.c.
Marshall, Steve wrote:
> Any thoughts on whether or not this patch should be included in a
> future PostgreSQL release? If so, any thoughts on additional testing,
> etc., that needs to be done? [...]

I am tracking this item and was waiting for someone to reply to it.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Marshall, Steve wrote:
> Any thoughts on whether or not this patch should be included in a
> future PostgreSQL release? If so, any thoughts on additional testing,
> etc., that needs to be done? [...]

Hi! Sorry, I've been unable to track this thread properly as it happened,
due to travelling, but I've tried to read back through it now.

What's the timing of the errors? Is there a chance that we are sending the
kill signal before the signal handling thread has actually started *and
created the named pipe*? We set up the signal handling stuff pretty early,
but we do seem to let the postmaster continue its work before it's up...

Going back to your tests, did it ever require more than one retry?

//Magnus
Magnus Hagander wrote:
> What's the timing of the errors? Is there a chance that we are sending
> the kill signal before the signal handling thread has actually started
> *and created the named pipe*?
>
> We set up the signal handling stuff pretty early, but we do seem to let
> the postmaster continue its work before it's up...

Under heavy load, a signal will typically be dropped within the first few
minutes. However, it can sometimes take a little while before the problem
happens. Thousands of the same signal to the same process may be properly
handled before one is mishandled. This is not consistent with a problem
with initial creation of the pipe.

> Going back to your tests, did it ever require more than one retry?

Yes, but rarely. In a 90-hour stress test with code that allowed up to 5
calls to CallNamedPipe, I found 760 signals that required a retry. Only
one required two retries. That is why I set the number of retries to 2.
The behavior might be different if the sleep interval between retries was
changed; I used a 20 ms sleep interval between retries in all my tests,
and in the patch I sent.
Marshall, Steve wrote:
> Under heavy load, a signal will typically be dropped within the first
> few minutes. [...] This is not consistent with a problem with initial
> creation of the pipe.
>
> Yes, but rarely. In a 90-hour stress test with code that allowed up to
> 5 calls to CallNamedPipe, I found 760 signals that required a retry.
> Only one required two retries. That is why I set the number of retries
> to 2.

I have applied a modified version (no functional changes, just stylistic)
of this patch to head and backpatched to 8.2, which is as far back as we
support win32.

//Magnus