Обсуждение: Idle in TX / Java process hang's in the vicinity of JDBC

Поиск
Список
Период
Сортировка

Idle in TX / Java process hang's in the vicinity of JDBC

От
David Kerr
Дата:
We've got an issue where an ETL process just sort of stalled on us last night.

It left a dreaded "Idle in Transaction" in the database. Although fortunatly
it didn't appear to have any tables locked.

No errors logged in the postgresql logfile.

(we're on JDBC driver v8.3-603)

We got a stack trace of the ETL process:
Name: Thread-957
State: RUNNABLE
Total blocked: 0  Total waited: 0

Stack trace:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:129)
org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:135)
org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:104)
org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:186)
org.postgresql.core.PGStream.Receive(PGStream.java:445)
org.postgresql.core.PGStream.ReceiveTupleV3(PGStream.java:350)
org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1306)
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:192)
   - locked org.postgresql.core.v3.QueryExecutorImpl@18493c9
org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:451)
org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:336)
org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:235)

talend_reporting_etl.load_class_roster_fact_0_1.Load_Class_Roster_Fact.tPostgresqlInput_1Process(Load_Class_Roster_Fact.java:5507)

talend_reporting_etl.load_class_roster_fact_0_1.Load_Class_Roster_Fact.Create_Audit_1_tJava_2Process(Load_Class_Roster_Fact.java:3360)

talend_reporting_etl.load_class_roster_fact_0_1.Load_Class_Roster_Fact.Create_Audit_1_tPostgresqlCommit_1Process(Load_Class_Roster_Fact.java:3274)

talend_reporting_etl.load_class_roster_fact_0_1.Load_Class_Roster_Fact.Create_Audit_1_tPostgresqlInput_1Process(Load_Class_Roster_Fact.java:2823)

talend_reporting_etl.load_class_roster_fact_0_1.Load_Class_Roster_Fact.tPostgresqlConnection_1Process(Load_Class_Roster_Fact.java:819)

talend_reporting_etl.load_class_roster_fact_0_1.Load_Class_Roster_Fact.tJava_4Process(Load_Class_Roster_Fact.java:12623)
talend_reporting_etl.load_class_roster_fact_0_1.Load_Class_Roster_Fact.runJobInTOS(Load_Class_Roster_Fact.java:14079)
talend_reporting_etl.load_class_roster_fact_0_1.Load_Class_Roster_Fact.runJob(Load_Class_Roster_Fact.java:13818)
talend_reporting_etl.load_facts_0_1.Load_Facts.tRunJob_2Process(Load_Facts.java:1722)
talend_reporting_etl.load_facts_0_1.Load_Facts.tRunJob_1Process(Load_Facts.java:1654)
talend_reporting_etl.load_facts_0_1.Load_Facts.tRunJob_6Process(Load_Facts.java:1465)
talend_reporting_etl.load_facts_0_1.Load_Facts.tRunJob_5Process(Load_Facts.java:1353)
talend_reporting_etl.load_facts_0_1.Load_Facts.tRunJob_4Process(Load_Facts.java:1164)
talend_reporting_etl.load_facts_0_1.Load_Facts.tRunJob_3Process(Load_Facts.java:1052)
talend_reporting_etl.load_facts_0_1.Load_Facts.tRunJob_8Process(Load_Facts.java:863)
talend_reporting_etl.load_facts_0_1.Load_Facts.tRunJob_7Process(Load_Facts.java:674)
talend_reporting_etl.load_facts_0_1.Load_Facts$1.run(Load_Facts.java:3722)


We were wondering what the:
org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:192)
   - locked org.postgresql.core.v3.QueryExecutorImpl@18493c9

was refering to.

Any direction for debugging would be useful. I'm not really that knowledgeable in the Java area
and my developers are a bit stumped.

Thanks

Dave

Re: Idle in TX / Java process hang's in the vicinity of JDBC

От
Kris Jurka
Дата:

On Thu, 29 Jul 2010, David Kerr wrote:

> We've got an issue where an ETL process just sort of stalled on us last
> night.
>
> It left a dreaded "Idle in Transaction" in the database. Although
> fortunatly it didn't appear to have any tables locked.
>
> Stack trace:
> java.net.SocketInputStream.socketRead0(Native Method)
> java.net.SocketInputStream.read(SocketInputStream.java:129)
> org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:135)
> org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:104)
> org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:186)
> org.postgresql.core.PGStream.Receive(PGStream.java:445)
> org.postgresql.core.PGStream.ReceiveTupleV3(PGStream.java:350)
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1306)
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:192)
>   - locked org.postgresql.core.v3.QueryExecutorImpl@18493c9
> org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:451)
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:336)
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:235)

The stacktrace indicates that Statement.executeQuery was called and the
JDBC driver started receiving the results from the server.  The driver is
in the middle of reading a particular row and is waiting for more data
from the server.  From what you're saying the server thinks it is done,
but the driver does not, indicating loss of protocol synchronization.
That's tough to debug after the fact, without a reproducible case or
network traffic dump.  If you still have this connection stuck in this
state you could try poking around with a debugger to say what values the
ReceiveTupleV3 routine has for l_size and already stored in answer.

> We were wondering what the:
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:192)
>   - locked org.postgresql.core.v3.QueryExecutorImpl@18493c9
>
> was refering to.
>

This is just indicating that this routine is synchronized.  This is done
to prevent multiple threads from trying to communicate with the server
simultaneously and does not indicate a problem.

Kris Jurka

Re: Idle in TX / Java process hang's in the vicinity of JDBC

От
David Kerr
Дата:
- >org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery(AbstractJdbc2Statement.java:235)
-
- The stacktrace indicates that Statement.executeQuery was called and the
- JDBC driver started receiving the results from the server.  The driver is
- in the middle of reading a particular row and is waiting for more data
- from the server.  From what you're saying the server thinks it is done,
- but the driver does not, indicating loss of protocol synchronization.
- That's tough to debug after the fact, without a reproducible case or
- network traffic dump.  If you still have this connection stuck in this
- state you could try poking around with a debugger to say what values the
- ReceiveTupleV3 routine has for l_size and already stored in answer.

Thanks, that's helpful! we don't have a debugger configured for that JVM
so we might have to try to catch it next time around.

Getting protocols out of sync sort of sounds to me like a network glitch
of some sort to me. does that sound possible?

- >We were wondering what the:
- >org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:192)
- >  - locked org.postgresql.core.v3.QueryExecutorImpl@18493c9
- >
- >was refering to.
- >
-
- This is just indicating that this routine is synchronized.  This is done
- to prevent multiple threads from trying to communicate with the server
- simultaneously and does not indicate a problem.

Awesome, thanks.


Dave

Re: Idle in TX / Java process hang's in the vicinity of JDBC

От
Matthew Wakeling
Дата:
On Thu, 29 Jul 2010, David Kerr wrote:
> Getting protocols out of sync sort of sounds to me like a network glitch
> of some sort to me. does that sound possible?

No. TCP has pretty robust mechanisms for preventing that sort of error.

Matthew

--
 As you approach the airport, you see a sign saying "Beware - low
 flying airplanes". There's not a lot you can do about that. Take
 your hat off?                                  -- Michael Flanders

Re: Idle in TX / Java process hang's in the vicinity of JDBC

От
David Kerr
Дата:
On 7/30/2010 1:44 AM, Matthew Wakeling wrote:
> On Thu, 29 Jul 2010, David Kerr wrote:
>> Getting protocols out of sync sort of sounds to me like a network glitch
>> of some sort to me. does that sound possible?
>
> No. TCP has pretty robust mechanisms for preventing that sort of error.
>
> Matthew
>

Ok, yes that's consistent with what we found yesterday where we were
able to reproduce it on system that was being used by other people
without problem.

we've got a working theory that the program opens up a connection to
DB1 and DB2.  In DB1 we're running a very long query (15 minutes) and
after some sort of timeout(or something) the program hangs leaving the
idle in TX in DB2.

Thanks

Dave

Re: Idle in TX / Java process hang's in the vicinity of JDBC

От
Oliver Jowett
Дата:
David Kerr wrote:

> we've got a working theory that the program opens up a connection to
> DB1 and DB2.  In DB1 we're running a very long query (15 minutes) and
> after some sort of timeout(or something) the program hangs leaving the
> idle in TX in DB2.

You may want to look at any firewalls you have between the client and
the DBs. Often, stateful firewalls will drop state for connections that
are idle for extended periods.

-O

Re: Idle in TX / Java process hang's in the vicinity of JDBC

От
David Kerr
Дата:
On Sat, Jul 31, 2010 at 10:28:22AM +1200, Oliver Jowett wrote:
- David Kerr wrote:
-
- > we've got a working theory that the program opens up a connection to
- > DB1 and DB2.  In DB1 we're running a very long query (15 minutes) and
- > after some sort of timeout(or something) the program hangs leaving the
- > idle in TX in DB2.
-
- You may want to look at any firewalls you have between the client and
- the DBs. Often, stateful firewalls will drop state for connections that
- are idle for extended periods.
-
- -O
-

Turns out, they were accessing a 2nd DB on a seprate server that (that the
developers neglected to mention) and that process was spinning due to bad
statistics on the tables.

so this process was just sitting there legitimately waiting.

Thanks for all the replies!

Dave