Обсуждение: Socket read fails

Поиск
Список
Период
Сортировка

Socket read fails

От
Colin Taylor
Дата:
Hi there, wondering if anyone could shed some light on selects intermittently not returning any data on our production server.

Symptoms are:   

- occurs in several different parts of our codebase but apparently its the same jdbc codepath [below]
- these codepaths execute fine the vast majority of the time
- thread blocks on read forever
- database doesnt show any query running.
- system has a couple of high load processes that run every day and we only see the errors there, though the majority of the systems' queries are done then.

Any suggestions for what I could do to fix or help diagnose the problem greatly appreciated. We saw these problems with v7, v8 and now seeming more frequently (ie every few days) with 8.1 jdbc and 8.1 server.

We run ES Linux 3.

cheers
Colin.

"Thread-571" prio=1 tid=0x085745b0 nid=0x124e runnable [759fe000..759ff87c]
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read (SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:183)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:222)
        at java.io.BufferedInputStream.read (BufferedInputStream.java:277)
        - locked <0x847ac890> (a java.io.BufferedInputStream)
        at org.postgresql.core.PGStream.Receive(PGStream.java:445)
        at org.postgresql.core.PGStream.Receive (PGStream.java:427)
        at org.postgresql.core.PGStream.ReceiveTupleV3(PGStream.java:369)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
        at org.postgresql.core.v3.QueryExecutorImpl.fetch (QueryExecutorImpl.java:1447)
        - locked <0x847aa0b0> (a org.postgresql.core.v3.QueryExecutorImpl)
        at org.postgresql.jdbc2.AbstractJdbc2ResultSet.next(AbstractJdbc2ResultSet.java:1840)
        at org.ofbiz.entity.util.EntityListIterator.next(EntityListIterator.java:269)

Re: Socket read fails

От
Colin Taylor
Дата:
For the benefit of anyone else who has this problem,  it seems to be alleviated, at least in our case by rebooting the box every night. Not ideal by any means but better than having the application hang. Which kind of suggests some resource issue, investing with linux sar command though doesnt reveal any process or file limits being reached.

On 1/16/06, Colin Taylor <colin.taylor@gmail.com> wrote:
Hi there, wondering if anyone could shed some light on selects intermittently not returning any data on our production server.

Symptoms are:   

- occurs in several different parts of our codebase but apparently its the same jdbc codepath [below]
- these codepaths execute fine the vast majority of the time
- thread blocks on read forever
- database doesnt show any query running.
- system has a couple of high load processes that run every day and we only see the errors there, though the majority of the systems' queries are done then.

Any suggestions for what I could do to fix or help diagnose the problem greatly appreciated. We saw these problems with v7, v8 and now seeming more frequently (ie every few days) with 8.1 jdbc and 8.1 server.

We run ES Linux 3.

cheers
Colin.

"Thread-571" prio=1 tid=0x085745b0 nid=0x124e runnable [759fe000..759ff87c]
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read (SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:183)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:222)
        at java.io.BufferedInputStream.read (BufferedInputStream.java:277)
        - locked <0x847ac890> (a java.io.BufferedInputStream)
        at org.postgresql.core.PGStream.Receive(PGStream.java:445)
        at org.postgresql.core.PGStream.Receive (PGStream.java:427)
        at org.postgresql.core.PGStream.ReceiveTupleV3(PGStream.java:369)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1283)
        at org.postgresql.core.v3.QueryExecutorImpl.fetch (QueryExecutorImpl.java:1447)
        - locked <0x847aa0b0> (a org.postgresql.core.v3.QueryExecutorImpl)
        at org.postgresql.jdbc2.AbstractJdbc2ResultSet.next(AbstractJdbc2ResultSet.java:1840)
        at org.ofbiz.entity.util.EntityListIterator.next(EntityListIterator.java:269)

Re: Socket read fails

От
David Hustace
Дата:
On Jan 25, 2006, at 1:03 AM, Colin Taylor wrote:


> For the benefit of anyone else who has this problem,  it seems to
> be alleviated, at least in our case by rebooting the box every
> night. Not ideal by any means but better than having the
> application hang. Which kind of suggests some resource issue,
> investing with linux sar command though doesnt reveal any process
> or file limits being reached.
>

We have been fighting this very same problem with our open source
project (OpenNMS) for a several months now but only on systems with
fairly large databases (~=2.5 GB).  We recently built 8.1.2 on one
customer's beefy Sun system (Sparc and Solaris 10) hoping to resolve
the issue only to find it happening now more frequently now as well.
8.1.2's performance, while running, is much better btw.

We have been through our database code over and over and we sure that
we are closing all connections, statements, etc. though that doesn't
seem to be the issue since our hangs on socketRead0 happen on a call
to DriverManger.getConnection() (creating new connection) which
appears to be different for your hang.  Here is a recent stack trace:

"OpenNMS.Poller.DefaultPollContext" prio=10 tid=0x006df638 nid=0xa5
runnable [0x3cffe000..0x3cfffbf0]
         at java.net.SocketInputStream.socketRead0(Native Method)
         at java.net.SocketInputStream.read(SocketInputStream.java:129)
         at java.io.BufferedInputStream.fill(BufferedInputStream.java:
218)
         at java.io.BufferedInputStream.read(BufferedInputStream.java:
235)
         - locked <0x60fa45e8> (a java.io.BufferedInputStream)
         at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:254)
         at org.postgresql.core.v2.QueryExecutorImpl.processResults
(QueryExecutorImpl.java:408)
         at org.postgresql.core.v2.QueryExecutorImpl.execute
(QueryExecutorImpl.java:364)
         at org.postgresql.core.v2.QueryExecutorImpl.execute
(QueryExecutorImpl.java:258)
         - locked <0x60fa4608> (a
org.postgresql.core.v2.QueryExecutorImpl)
         at org.postgresql.core.v2.ConnectionFactoryImpl.runSetupQuery
(ConnectionFactoryImpl.java:405)
         at
org.postgresql.core.v2.ConnectionFactoryImpl.runInitialQueries
(ConnectionFactoryImpl.java:423)
         at
org.postgresql.core.v2.ConnectionFactoryImpl.openConnectionImpl
(ConnectionFactoryImpl.java:83)
         at org.postgresql.core.ConnectionFactory.openConnection
(ConnectionFactory.java:65)
         at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>
(AbstractJdbc2Connection.java:116)
         at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>
(AbstractJdbc3Connection.java:30)
         at org.postgresql.jdbc3.Jdbc3Connection.<init>
(Jdbc3Connection.java:24)
         at org.postgresql.Driver.makeConnection(Driver.java:369)
         at org.postgresql.Driver.connect(Driver.java:245)
         at java.sql.DriverManager.getConnection(DriverManager.java:525)
         - locked <0x48cd17e8> (a java.lang.Class)
         at java.sql.DriverManager.getConnection(DriverManager.java:171)
         - locked <0x48cd17e8> (a java.lang.Class)
         at
org.opennms.netmgt.config.DatabaseConnectionFactory.getConnection
(DatabaseConnectionFactory.java:1125)
         at org.opennms.netmgt.utils.JDBCTemplate.doExecute
(JDBCTemplate.java:102)
         at org.opennms.netmgt.utils.JDBCTemplate.execute
(JDBCTemplate.java:82)
         at
org.opennms.netmgt.poller.DefaultQueryManager.resolveOutage
(DefaultQueryManager.java:531)
         at org.opennms.netmgt.poller.DefaultPollContext$2.run
(DefaultPollContext.java:139)
         at
org.opennms.netmgt.poller.pollables.PendingPollEvent.processPending
(PendingPollEvent.java:94)
         at org.opennms.netmgt.poller.DefaultPollContext.onEvent
(DefaultPollContext.java:184)
         - locked <0x5a373ed0> (a java.util.LinkedList)
         at org.opennms.netmgt.eventd.EventIpcManagerDefaultImpl
$ListenerThread.run(EventIpcManagerDefaultImpl.java:169)
         at java.lang.Thread.run(Thread.java:595)



This past summer, we had someone try to debug this on the system
level and here is what he reported:

----------- begin debugging ------------------
Thread Dump showing thread who has locked DatabaseConnectionFactory:

"EventQueueProcessor" prio=5 tid=0x008262d0 nid=0x3b runnable
[239ff000..239ffc28]
         at java.net.SocketInputStream.socketRead0(Native Method)
         at java.net.SocketInputStream.read(SocketInputStream.java:129)
         at java.io.BufferedInputStream.fill(BufferedInputStream.java:
183)
         at java.io.BufferedInputStream.read(BufferedInputStream.java:
201)
         - locked <0x8183b8a8> (a java.io.BufferedInputStream)
         at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:166)
         at org.postgresql.core.QueryExecutor.executeV3
(QueryExecutor.java:127)
         - locked <0x8183b8c8> (a org.postgresql.core.PGStream)
         at org.postgresql.core.QueryExecutor.execute
(QueryExecutor.java:100)
         at org.postgresql.core.QueryExecutor.execute
(QueryExecutor.java:43)
         at org.postgresql.jdbc1.AbstractJdbc1Connection.execSQL
(AbstractJdbc1Connection.java:887)
         at
org.postgresql.jdbc1.AbstractJdbc1Connection.openConnectionV3
(AbstractJdbc1Connection.java:505)
         at
org.postgresql.jdbc1.AbstractJdbc1Connection.openConnection
(AbstractJdbc1Connection.java:214)
         at org.postgresql.Driver.connect(Driver.java:139)
         at java.sql.DriverManager.getConnection(DriverManager.java:512)
         - locked <0xf1b8f260> (a java.lang.Class)
         at java.sql.DriverManager.getConnection(DriverManager.java:171)
         - locked <0xf1b8f260> (a java.lang.Class)
         at
org.opennms.netmgt.config.DatabaseConnectionFactory.getConnection
(DatabaseConnectionFactory.java:1120)
         - locked <0x328b2758> (a java.util.LinkedList)
         at org.opennms.netmgt.xmlrpcd.XmlRpcNotifier.getNodeLabel
(XmlRpcNotifier.java:355)

pstack showing the FD of java->pgsql socket read call:

-----------------  lwp# 59 / thread# 59  --------------------
ff33d2d8 read     (36, 939a18, 2000)
feccbc30 JVM_Read (36, 939a18, 2000, feccb5c8, ff364278, ff36c950) + 68
30edac88 Java_java_net_SocketInputStream_socketRead0 (826364,
239ff100, 239ff0fc, 239ff0f8, 0, 2000) + 1e8

pfiles showing port number of the java->psql socket read call:

   54: S_IFSOCK mode:0666 dev:304,0 ino:16062 uid:0 gid:0 size:0
       O_RDWR
         SOCK_STREAM
         SO_SNDBUF(49152),SO_RCVBUF(49152)
         sockname: AF_INET 127.0.0.1  port: 50912
         peername: AF_INET 127.0.0.1  port: 5432

...after greping through the pfiles output of 17 postmaster
processes, found
the one that has socket 50912 open:

    8: S_IFSOCK mode:0666 dev:304,0 ino:13881 uid:0 gid:0 size:0
       O_RDWR
         SOCK_STREAM
         SO_REUSEADDR,SO_KEEPALIVE,SO_SNDBUF(49152),SO_RCVBUF(49152)
         sockname: AF_INET 127.0.0.1  port: 5432
         peername: AF_INET 127.0.0.1  port: 50912

pstack of this postmaster process:

3029:   /usr/local/pgsql/bin/postmaster -D /usr/local/pgsql/data
fefbca1c recv     (8, 283df0, 2000, 0)
000f7420 pq_getbyte (285c00, 328598, 1, 6, 5cbd8, 5cc30) + 38
00156a38 SocketBackend (ffbff1d8, 49, 0, 49, 0, 2) + 4
00156d28 ReadCommand (ffbff1d8, 0, ffffffff, fffffff8, 0, 2ad800) + 20
00159c94 PostgresMain (4, 2c3850, 2c3830, 6, 0, 0) + a7c
0012e414 BackendRun (2d2038, 1084, 916e0, 0, ff1d2000, 1000) + 5b8
0012dc60 BackendStartup (2d2038, 2b3b0c, 0, 0, 2b3800, 5) + 150
0012bfdc ServerLoop (2, 2b3800, 2b3800, ffbffb98, ffbffc38, 286000) +
304
0012b7c8 PostmasterMain (0, 2bffe8, 2c7aa0, 2bf000, 2be400, 23f000) +
b10
000f89cc main     (3, 2bffe8, ffbffe2c, 2be6dc, ff1d0140, ff1d0180) +
220
00039d44 _start   (0, 0, 0, 0, 0, 0) + 5c


Conclusion java is doing a read on this socket, and postmaster is
doing a recv.
----------end debugging---------------------

We are considering the use of DataSource instead of the deprecated
DriverManager, since DriverManager is synchronized.  Any thoughts on
the system level debugging above or the use of the DataSource Interface?

Note that we use very many threads, however, our connection pool
keeps the number of connection down to around 12 with peaks for 20 or
25.

-David


David Hustace
The OpenNMS Group, Inc.
Main  : +1 919 545 2553   Fax:   +1 503-961-7746
Direct: +1 919 827 1201   Skype: dhustace
Key Fingerprint: 8EC1 F319 2C1D 3197 AAB9  580F FACD 1D19 175E D903

Re: Socket read fails

От
Oliver Jowett
Дата:
David Hustace wrote:

> Conclusion java is doing a read on this socket, and postmaster is  doing
> a recv.

Well, that's pretty weird.

Can you reproduce this with a recent driver and loglevel=2 (append it as
a URL parameter to your JDBC URL) and send me the resulting debug output
offlist? It sounds like it might be hard to trigger though :(

> We are considering the use of DataSource instead of the deprecated
> DriverManager, since DriverManager is synchronized.  Any thoughts on
> the system level debugging above or the use of the DataSource Interface?

I doubt that using DataSource will make any difference. From the stack
traces you showed, this does not look like a Java-level deadlock at all.

(I noticed that your two stack traces seem to be talking to different
server versions, though.. what's up with that?)

-O

Re: Socket read fails

От
David Hustace
Дата:
On Jan 25, 2006, at 5:12 PM, Oliver Jowett wrote:

> David Hustace wrote:
>
>> Conclusion java is doing a read on this socket, and postmaster is
>> doing a recv.
>
> Well, that's pretty weird.
>
> Can you reproduce this with a recent driver and loglevel=2 (append
> it as a URL parameter to your JDBC URL) and send me the resulting
> debug output offlist? It sounds like it might be hard to trigger
> though :(

Sure.  I already have that enabled.  Finding the output is tricky.
We're currently using:

   url="jdbc:postgresql://localhost:5432/opennms?
protocolVersion=2&loglevel=2"

Version=2 is the latest attempt in debugging.

> I doubt that using DataSource will make any difference. From the
> stack traces you showed, this does not look like a Java-level
> deadlock at all.

You are right, we've been trying to eliminate every possible
solution.  Also, for future reference, can you tell me if the call to
PGSimpleDataSource.getConnection() is thread safe?

> (I noticed that your two stack traces seem to be talking to
> different server versions, though.. what's up with that?)

Not quite sure what you mean by "different server versions".  If you
are referring to PostgreSQL server versions, yes, we've been
upgrading and trying different versions to try and fix the problem.


-David


David Hustace
The OpenNMS Group, Inc.
Main  : +1 919 545 2553   Fax:   +1 503-961-7746
Key Fingerprint: 8EC1 F319 2C1D 3197 AAB9  580F FACD 1D19 175E D903



Re: Socket read fails

От
Oliver Jowett
Дата:
David Hustace wrote:

>> Can you reproduce this with a recent driver and loglevel=2 (append  it
>> as a URL parameter to your JDBC URL) and send me the resulting  debug
>> output offlist? It sounds like it might be hard to trigger  though :(
>
>
> Sure.  I already have that enabled.  Finding the output is tricky.

The debug output should be going to the DriverManager log stream, or
stderr if that's not set.

> Also, for future reference, can you tell me if the call to
> PGSimpleDataSource.getConnection() is thread safe?

It does not change the state of the datasource and just delegates to
DriverManager to do the real work, so it should be threadsafe.

-O

Re: Socket read fails

От
David Hustace
Дата:
On Jan 25, 2006, at 6:30 PM, Oliver Jowett wrote:

> It does not change the state of the datasource and just delegates
> to DriverManager to do the real work, so it should be threadsafe.

Ah.  I just saw that. in BasicDataSource.java.  Thanks for your
help.  I guess I need improve my connection caching to try and remove
stale connections or switch to another implementation.  Anyone have
production experience with SmartPool and a PostgreSQL driver?  http://
smartpool.sourceforge.net/  It looks pretty good so far.  I guess I
should search the list before asking this question. (duck)

-David


David Hustace
The OpenNMS Group, Inc.
Main  : +1 919 545 2553   Fax:   +1 503-961-7746
Direct: +1 919 827 1201   Skype: dhustace
Key Fingerprint: 8EC1 F319 2C1D 3197 AAB9  580F FACD 1D19 175E D903



Re: Socket read fails

От
Dave Cramer
Дата:
David,

A much more robust pooling mechanism is dbcp from apache

http://jakarta.apache.org/commons/dbcp/

Dave
On 25-Jan-06, at 8:18 PM, David Hustace wrote:

>
> On Jan 25, 2006, at 6:30 PM, Oliver Jowett wrote:
>
>> It does not change the state of the datasource and just delegates
>> to DriverManager to do the real work, so it should be threadsafe.
>
> Ah.  I just saw that. in BasicDataSource.java.  Thanks for your
> help.  I guess I need improve my connection caching to try and
> remove stale connections or switch to another implementation.
> Anyone have production experience with SmartPool and a PostgreSQL
> driver?  http://smartpool.sourceforge.net/  It looks pretty good so
> far.  I guess I should search the list before asking this question.
> (duck)
>
> -David
>
>
> David Hustace
> The OpenNMS Group, Inc.
> Main  : +1 919 545 2553   Fax:   +1 503-961-7746
> Direct: +1 919 827 1201   Skype: dhustace
> Key Fingerprint: 8EC1 F319 2C1D 3197 AAB9  580F FACD 1D19 175E D903
>
>
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 6: explain analyze is your friend
>