Re: BUG #3504: Some listening sessions never return from writing, problems ensue
От | Peter Koczan |
---|---|
Тема | Re: BUG #3504: Some listening sessions never return from writing, problems ensue |
Дата | |
Msg-id | 4544e0330708061805wcafa72br623aad2e50fe250b@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: BUG #3504: Some listening sessions never return from writing, problems ensue (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-bugs |
Not quite, here's what happens every second when idling, for both good and bad connections (with a different values to each syscall). It is trying to read data, I am a little concerned about the EAGAIN return value, though. [koczan@ator] ~ $ strace -p 6987 poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, events=POLLIN}], 3, 869) = 0 gettimeofday({1186445557, 631779}, NULL) = 0 read(5, 0x9651540, 5) = -1 EAGAIN (Resource temporarily unavailable) gettimeofday({1186445557, 632197}, NULL) = 0 ioctl(3, FIONREAD, [0]) = 0 poll(<unfinished ...> Process 6987 detached However, looking at straces when an async notify actually happens, I notice something interesting. The number of syscalls over the same interrupt on the bad connection is nearly 5 times that of the good connection. By the time the good connection is finished, the bad connection is still mucking with data. I can send you the actual straces if you think it would help. Here's my theory (and feel free to tell me that I'm full of it)...somehow, a lot of notifies happened at once, or in a very short period of time, to the point where the app was still processing notifies when the timer clicked off another second. The connection (or app, or perl module) never marked those notifies as being processed, or never updated its timestamp of when it finished, so when the next notify came around, it tried to reprocess the old data (or data since the last time it finished), and yet again couldn't finish. Lather, rinse, repeat. In sum, it might be that trying to call pg_notifies while processing notifies tickles a race condition and tricks the connection into thinking its in a bad state. The holes in my theory are that the app doesn't try to read everything every second for bad connections (so notifies are being processed somehow), and it doesn't explain why the recv-q is filling up even for good connections. Hopefully this will help somehow, though. Peter P.S. Just in case its useful, the contents of pg_listener for bad connections (I verified that notification = 0 for good connections). csdb=# select * from pg_listener where notification > 0; relname | listenerpid | notification ---------+-------------+-------------- req | 3292 | 19141 req | 25848 | 16308 req | 17004 | 23815 req | 19583 | 1703 (4 rows) On 8/6/07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Peter Koczan" <pjkoczan@gmail.com> writes: > > So, it looks like the app polls for async notifies every second. > Hopefully > > you can glean some insight from that. > > My guess is that some condition causes that to stop happening. > > One way to monitor this from outside is to use strace to watch the app. > According to the above you should see a recv() attempt on the libpq > socket every second. I suspect that you won't see any more once the app > gets into the state where the recv-q is backing up. Looking at what > happens around the time of the last recv() might give some clues. > > regards, tom lane >
В списке pgsql-bugs по дате отправления: