Discussion: timeouts in libpq - can libpq requests block forever/a very long time?


timeouts in libpq- can libpq requests block forever/a very long time?

From:
Dirk Niggemann
Date:
Hi,
Looking at the source for libpq, it looks like any request using pqWait
could block for a potentially (very) long time (e.g. the 15-minute
default TCP/IP timeout). This includes some requests using the
asynchronous interface: in particular, PQflush will still block
indefinitely (where "indefinitely" means "a period of time that makes
the user think their app has hung"), and PQsendQuery will potentially
block during the send.

Would there be any problems with changing pqWait so it uses a timeout?
I would treat this timeout as equivalent to the socket having been
closed unexpectedly, i.e. I would close the socket myself and then do
the usual handling for an unexpectedly closed socket, with the
potential to return a different error...
I would also like to add a timeout for the connect operation on the
socket: on a sufficiently frazzled WAN this could block for a long
time. How many system variants don't support setting
O_NDELAY/O_NONBLOCK _before_ doing the connect? (I know Linux and
Solaris do; as for WinSock, I couldn't even guess...)
Is someone else already doing this work?

Dirk


Re: [INTERFACES] timeouts in libpq- can libpq requests block forever/a very long time?

From:
Tom Lane
Date:
Dirk Niggemann <dirk-n@dircon.co.uk> writes:
> Would there be any problems with changing pqWait so it uses a timeout?

I think this would be rather risky: if I issue a query that the backend
takes a long time to compute, I don't want libpq timing out on me,
*especially* not if its response is to close the connection rather than
some fail-soft recovery technique.  If the timeout is
application-settable and defaults to infinity, it might be OK.

In my own applications using the asynchronous query features, I have not
seen any actual hangups that could be improved by a timeout ... but I've
mostly been working across LAN rather than WAN connections...

> I would also like to add a timeout for the connect operation on the
> socket-
> on a sufficiently frazzled WAN this could block for a long time-

This might be useful, but again I wonder whether libpq ought to be
second-guessing the transport system about how long is reasonable to
wait for a response.
        regards, tom lane


Re: [INTERFACES] timeouts in libpq- can libpq requests block forever/a very long time?

From:
Dirk Niggemann
Date:
Tom Lane wrote:
> 
> Dirk Niggemann <dirk-n@dircon.co.uk> writes:
> > Would there be any problems with changing pqWait so it uses a timeout?
Actually, I've just gone away and implemented it- silly, silly me.
The timeout can be completely disabled and is settable from PQconnect
and from some new environment variables (PGTIMEOUT and PGCONNTIMEOUT).
It remains off by default.
> 
> I think this would be rather risky: if I issue a query that the backend
> takes a long time to compute, I don't want libpq timing out on me,
> *especially* not if its response is to close the connection rather than
> some fail-soft recovery technique.
What kind of fail-soft could be implemented here?
I suppose I could be nice and stick a PQrequestCancel in there, but
I'll have to make sure it doesn't frazzle any other bit of the system.
Hold that thought: PQrequestCancel is _designed_ to be called
asynchronously to all the other requests, so it's unlikely to cause
any major problems... (have I missed something?).
That doesn't, however, deal with the main concern about the timeout:
there is no 100% reliable way to tell the difference between "this
query is running a long time without returning results" and "the
network is down" or "the backend is hung".
The elegant solution would be to implement some sort of
backend-ping protocol. The problem with this is that the backend
needs to be prepared to service it, even if it's doing other things.
Not the sort of thing the backend was designed to do.
BTW- as libpq doesn't set SO_KEEPALIVE, queries that run for > 15
minutes w/o any data returned will disconnect anyway (provided things
aren't too confused).

From my POV, the purpose behind this is not to deal with
long-computation queries. It's to deal with unattended automatic
applications that run on remote sites over flaky networks and splat
relatively small sets of inserts (<150 rows, consisting of 30-40
integers max) into a table in a central DB.
Now if I really wanted to do this right, I would build some sort of
reliable queuing middleware... Except all that costs, well, real
money. And this is only an experimental project.
The applications do these uploads at regular intervals. The applications
have to recover and perform the inserts at a later stage if the network
is not working properly. 
I expect the network not to be working properly.
The applications run 'unattended' and require no user interaction.
>  If the timeout is
> application-settable and defaults to infinity, it might be OK.
Yes, after having a closer look at PQrequestCancel I sort of see where
the problem lies.
libpq makes some effort at not closing the backend socket. 
That's why I left the original behaviour in place.
Can you describe to me what the backend does on socket close?
I should really grovel through the code myself, but my main concerns
here are:
a) I guess the backend simply aborts the current query/transaction.
Nasty if you were running an enormous query that would take a long
time to run; otherwise not too catastrophic.
b) I'm concerned the backend might do some blocking somewhere without
a timeout - that is, can closing a socket at a bad time cause the
backend to get stuck?
c) Can a failed network connection cause the backend to get stuck in a
wait/blocking I/O? Having hung backends lying around could be a
serious problem, as that would reduce the total number of backends
available (max backends does not change dynamically).
d) Am I misunderstanding here, or is PostgreSQL 'single-threaded' with
respect to requests over a single client connection? libpq certainly
gives that impression. Shoving an empty query at the backend while it
is returning rows to you would confuse it very badly?
e) Just like PQrequestCancel, could you (me?) build a sort of
PQrequestPing to confirm that nothing is reachable or functioning at
the remote end before libpq closes the socket?

> 
> In my own applications using the asynchronous query features, I have not
> seen any actual hangups that could be improved by a timeout ... but I've
> mostly been working across LAN rather than WAN connections...
What hangs did you see?

The kind of problem I'm talking about is the sort where routers go and
disappear on you after you've opened the connection... or someone
chops a cable on a remote site... all fun stuff...
Try this sometime:
a) open a connection to a remote LAN over a single router (with no
backup/standby)...
b) start shoving data over it.
c) shoot the router by your preferred method (axe, fire, power plug,
admin typo, dripping aircon (my favourite)).
d) see how long your socket connection takes to work out the network's
gone away...

> 
> > I would also like to add a timeout for the connect operation on the
> > socket-
> > on a sufficiently frazzled WAN this could block for a long time-
> 
> This might be useful, but again I wonder whether libpq ought to be
> second-guessing the transport system about how long is reasonable to
> wait for a response.
Done as well. Second-guessing the transport system is a good idea
sometimes: the default TCP/IP timeouts are all rather long, and
adjusting them generally means fiddling with kernel parameters (might
break the kernel, might break some other apps).
I certainly have seen problems on Solaris connecting to DEC (Oracle)
Rdb over a flaky WAN. The application would occasionally 'stick' for
at least 8-10 minutes when we simulated the problem by pulling some
cables. In reality it would regularly stick for half an hour. (Routers
went up and down and let packets drip through on occasion, making both
ends think the network was just extremely slow rather than broken.)
Oracle's connection timeout solution is particularly insidious (I have
this by word of mouth only): it seems to fork children to
asynchronously handle the requests that time out. That crashed a
machine when 30 concurrent client apps all suddenly lost their
connection to the same database.

apologies for seeming a bit argumentative on this point...
BTW- just realized another problem: we can't know for certain if a
commit happened or not if we get a timeout... eek...

Example:

normal flow:
  client: COMMIT ->
  backend: executes COMMIT
  backend: <- RESPONSE
  client: receives RESPONSE

error flow:
  client: COMMIT ->
  backend: COMMIT fails
  backend: <- ERROR
  client: receives ERROR

timeout flow (ok):
  client: COMMIT ->        (lost in transit)
  client: timeout
  client: close            -> backend sees close, ABORTs
  (transaction assumed aborted, and really was)

timeout flow (bad):
  client: COMMIT ->
  backend: executes COMMIT
  backend: <- RESPONSE     (lost in transit)
  client: timeout
  client: close
  (transaction assumed aborted but has actually committed)

I can't seem to come up with the worst-case scenario (e.g. a false
positive on commit); maybe it's not possible...

Or do we have a three-way handshake for commits?

> 
>                         regards, tom lane

Thanks, 
Dirk