Обсуждение: segfault on rollback

Поиск
Список
Период
Сортировка

segfault on rollback

От
"Sergey E. Koposov"
Дата:
Hello -hackers,

I recently have seen the segfault with Postgres 8.1.4.

The situation in which the segfault occur is quite complicated (a lot of
commands in one transaction from JDBC), but Here I show the gdb core dump of
it. From looking in my application, it seems that the core dump occurs when
the application issue the ROLLBACK command in the end of large transaction.
(when I do Connection.rollback() from JDBC)

Also I must notice that the segfault only occur if
log_min_duration_statement is set to 0

if it is set to -1, the segfault do not occur.

Also I must say that I specially removed all the C functions calls from the
transaction, to be sure that I'm not corrupting the memory myself.


Program received signal SIGSEGV, Segmentation fault.
0xb7dbd6db in strlen () from /lib/tls/libc.so.6
(gdb) bt
#0  0xb7dbd6db in strlen () from /lib/tls/libc.so.6
#1  0xb7d91ef1 in vfprintf () from /lib/tls/libc.so.6
#2  0xb7db0700 in vsnprintf () from /lib/tls/libc.so.6
#3  0x08179c47 in appendStringInfoVA (str=0xbfd639f0,    fmt=0x83d43c0 "duration: %ld.%03ld ms  statement: %sEXECUTE %s
[PREPARE:  %s]", args=0xbfd63a24 "Y") at stringinfo.c:125 
#4  0x08290471 in errmsg (    fmt=0x834b0c0 "duration: %ld.%03ld ms  statement: %sEXECUTE %s  [PREPARE:  %s]") at
elog.c:647
#5  0x08202bd2 in exec_execute_message (portal_name=0x8421a6c "", max_rows=1)    at postgres.c:1825
#6  0x0820498a in PostgresMain (argc=4, argv=0x83d2754,    username=0x83d2658 "cas_user_tmp") at postgres.c:3268
#7  0x081d3528 in BackendRun (port=0x83eaed0) at postmaster.c:2856
#8  0x081d2b3f in BackendStartup (port=0x83eaed0) at postmaster.c:2500
#9  0x081d0bc9 in ServerLoop () at postmaster.c:1230
#10 0x081d0474 in PostmasterMain (argc=1, argv=0x83bb000) at postmaster.c:941
#11 0x081821d8 in main (argc=1, argv=0x83bb000) at main.c:265

(gdb) bt full
#0  0xb7dbd6db in strlen () from /lib/tls/libc.so.6
No symbol table info available.
#1  0xb7d91ef1 in vfprintf () from /lib/tls/libc.so.6
No symbol table info available.
#2  0xb7db0700 in vsnprintf () from /lib/tls/libc.so.6
No symbol table info available.
#3  0x08179c47 in appendStringInfoVA (str=0xbfd639f0,    fmt=0x83d43c0 "duration: %ld.%03ld ms  statement: %sEXECUTE %s
[PREPARE:  %s]", args=0xbfd63a24 "Y") at stringinfo.c:125        avail = 255        nprinted = 256 
#4  0x08290471 in errmsg (    fmt=0x834b0c0 "duration: %ld.%03ld ms  statement: %sEXECUTE %s  [PREPARE:  %s]") at
elog.c:647       args = 0xbfd63a24 "Y"        success = -65 '©'        fmtbuf = 0x83d43c0 "duration: %ld.%03ld ms
statement:%sEXECUTE %s  [PREPARE:  %s]"        buf = {  data = 0x83d42b4 "duration: 89.073 ms  statement: EXECUTE
<unnamed> [PREPARE:  ", '\177' <repeats 138 times>..., len = 0, maxlen = 256, cursor = 0}        edata = (ErrorData *)
0x83ab3a0       oldcontext = 0x83d1f00 
---Type <return> to continue, or q <return> to quit---
#5  0x08202bd2 in exec_execute_message (portal_name=0x8421a6c "", max_rows=1)    at postgres.c:1825        usecs =
89073       dest = DestRemoteExecute        receiver = (DestReceiver *) 0x8421b78        portal = 0x842c114
completed= 1 '\001'        completionTag = "ROLLBACK\000:ж©Ф\033\030\bэ:ж©m\032B\b\004\000\000\000%\n
\b╓\201?\b\000\001\000\000Х:ж©Ё\031\030\b\000\000\000\001э:ж©\004\000\000\000\224\035\030\b"       start_t = {tv_sec =
1155432847,tv_usec = 699143}        stop_t = {tv_sec = 1155432847, tv_usec = 788216}        save_log_duration = 0 '\0'
     save_log_min_duration_statement = 0        save_log_statement_stats = 0 '\0'        execute_is_fetch = 0 '\0' 
#6  0x0820498a in PostgresMain (argc=4, argv=0x83d2754,    username=0x83d2658 "cas_user_tmp") at postgres.c:3268
portal_name= 0x8421a6c ""        max_rows = 1        flag = -1        dbname = 0x83dd198 "cas"        userDoption = 0x0 
---Type <return> to continue, or q <return> to quit---        secure = 0 '\0'        errs = 0        debug_flag = -1
   guc_names = (List *) 0x0        guc_values = (List *) 0x0        ctx = PGC_USERSET        gucsource = PGC_S_CLIENT
    am_superuser = 0 '\0'        tmp = 0x83d232c ""        firstchar = 69        stack_base = 0 '\0'
input_message= {data = 0x8421a6c "", len = 5, maxlen = 256,  cursor = 5}        local_sigjmp_buf = {{__jmpbuf =
{138325712,137883360, 1, -1076478888,      -1076479248, 136332736}, __mask_was_saved = 1, __saved_mask = {__val = {
  1073347075, 4294967294, 0, 0, 0, 0, 0, 7004, 0, 8, 0, 0, 138225460,        138224272, 138223360, 3218488264,
4294967295,138223360, 3085359680,        3084618868, 138225440, 138224272, 3084378772, 138326168, 3085312352,
200,138249720, 138241528, 3085367328, 137883360, 1, 3218488344}}}}        send_rfq = 0 '\0' 
#7  0x081d3528 in BackendRun (port=0x83eaed0) at postmaster.c:2856        status = 0        remote_host =
"127.0.0.1\000ж╥\000ЮФ╥\004\aГ╥\004\aГ╥xAж©bЦщ╥\004\aГ╥\00---Type<return> to continue, or q <return> to quit--- 

0\000\000\000hAж©╠\201*\b<\032Г╥\000\000\000\000\004\aГ╥\002\024\000\000\000\024\000\000\000\000\000\000\000ЮФ╥╪\177=\bо]3\b\000ЮФ╥
ХФ╥\000шФ╥hAж©\b\212з╥
ХФ╥\005\000\000\000\000\000\000\000xEж©И╗ш╥\000ЮФ╥ЬAж©дБщ╥\210Aж©\000ЮФ╥\004\aГ╥\224Бв╥ЬAж©bЦщ╥\004\aГ╥\000\000\000\000\002\024\000\000\005\000\000\000<\032Г╥\000\000\000\000\004\aГ╥\003\024\000\000\000\024\000\000\000\000\000\000"...
      remote_port = "32989\000ж©╥\220У╥ГEш╥FD\a\bЭDж©И╗ш╥\000ЮФ"        remote_ps_data =
"127.0.0.1(32989)\000╚х\233pq©\233\034чФ╥p<ж©.фУ╥\000\000\000\000\000\000\000\000╦qЖ╥\000\000\000\000\000\000\000\000\003\000\000\000P<ж©ьoЖ╥\000\000\000\000\b©;\bL=ж©\"еУ╥\b©;\b\001\000\000\000дGж©ю!=\b╦\031ж╥Жiж╥╦qЖ╥╪tЖ╥\000\000\000\000дGж©\001\000\000\000\000ЮФ╥юlЖ╥ю!=\bл=ж©O\000Д╥ю!=\b\000\000\000\200\001\000\000\220\000\000\000\000
1\001\000Ю\236>\b\004\000\000\000\003\000\000\000ЬФ╥\000ЮФ╥ ЬФ╥юlЖ╥═=ж©"...        av = (char **) 0x83d2754
maxac= 10        ac = 4        protobuf = "-v196608\000<ж©\000\024\000\000ЪЪЪЪ\000\000\000\000ПТ$\234\000\000\000"
 i = 4 
#8  0x081d2b3f in BackendStartup (port=0x83eaed0) at postmaster.c:2500        bn = (Backend *) 0x83dd0c8        pid = 0
#9  0x081d0bc9 in ServerLoop () at postmaster.c:1230        timeout = {tv_sec = 59, tv_usec = 948000}
---Type <return> to continue, or q <return> to quit---        selres = 1        i = 0        port = (Port *) 0x83eaed0
     rmask = {fds_bits = {8, 0 <repeats 31 times>}}        readmask = {fds_bits = {24, 0 <repeats 31 times>}}
nSockets= 5        now = 1155432807        last_touch_time = 1155432686        earlier = {tv_sec = 1155432686, tv_usec
=49004}        later = {tv_sec = 1155432765, tv_usec = 301709} 
#10 0x081d0474 in PostmasterMain (argc=1, argv=0x83bb000) at postmaster.c:941        opt = -1        status = 0
userDoption= 0x0        i = 64 
#11 0x081821d8 in main (argc=1, argv=0x83bb000) at main.c:265        pw = (struct passwd *) 0xb7f66cc0
pw_name_persist= 0xb7e6e000 "\\}\021" 


Any ideas ? Or I should try to narrower the problem ?... (it is not very easy
for my application)...

Initially I thought that the problem can be due to JDBC, but it doesn't
seem so... In any case, since JDBC for PG is pure java and is using libpq,
Posgres shouldn't crash.

Thanks,
Regards,    Sergey

*******************************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Sternberg Astronomical Institute
Tel: +49-6221-528-349
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru

Re: segfault on rollback

От
Tom Lane
Дата:
"Sergey E. Koposov" <math@sai.msu.ru> writes:
> I recently have seen the segfault with Postgres 8.1.4.

I'm betting that portal->sourceText has already been deallocated when
exec_execute_message tries to print the log message.  Getting an actual
segfault from that would very probably be hard to reproduce, but if you
build with --enable-cassert it should not be too hard to reproduce
corruption of the log message, ie, display of garbage instead of
"ROLLBACK" as the command text.  Please try that and see if you can
generate a self-contained test case.
        regards, tom lane


Re: segfault on rollback

От
"Sergey E. Koposov"
Дата:
On Sat, 12 Aug 2006, Tom Lane wrote:

> "Sergey E. Koposov" <math@sai.msu.ru> writes:
>> I recently have seen the segfault with Postgres 8.1.4.
>
> I'm betting that portal->sourceText has already been deallocated when
> exec_execute_message tries to print the log message.  Getting an actual
> segfault from that would very probably be hard to reproduce, but if you
> build with --enable-cassert it should not be too hard to reproduce
> corruption of the log message, ie, display of garbage instead of
> "ROLLBACK" as the command text.  Please try that and see if you can
> generate a self-contained test case.


I succeeded to get the self contained case, but in java (it is just 15 
lines,  but it crashes the backend). (probably I should forward the mail 
pgsql-jdbc...)

import java.sql.*;
public class xx
{        public static void main(String args[]) throws Exception        {
Class.forName("org.postgresql.Driver");       Connection dbcon =
DriverManager.getConnection("jdbc:postgresql://localhost:5432/template1","postgres","");
dbcon.setAutoCommit(false);       Statement stmt = dbcon.createStatement();        stmt.execute("create table xx(a int,
bdouble precision)");        dbcon.rollback();        }
 
}

I'm not sure that it is possible to reproduce without java, since JDBC do 
its own query preparing and a lot of other stuff internally which 
should be non trivial to reproduce without jdbc...

Regards,    Sergey


*******************************************************************
Sergey E. Koposov
Max Planck Institute for Astronomy/Sternberg Astronomical Institute
Tel: +49-6221-528-349
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru


Re: segfault on rollback

От
Tom Lane
Дата:
"Sergey E. Koposov" <math@sai.msu.ru> writes:
> I'm also sending The output from strace (during the execution of the test 
> program which I have sent before) showing what jdbc is sending to the backend. 

Great, I was just going to ask for that.

I can reproduce it in HEAD pretty trivially using libpq:
   res = PQexecParams(conn,                      "BEGIN",                      0,        /* no params */
     NULL,    /* let the backend deduce param type */                      NULL,                      NULL,    /* don't
needparam lengths since text */                      NULL,    /* default to all text params */                      0);
      /* ask for text results */
 
   if (PQresultStatus(res) != PGRES_COMMAND_OK)   {       fprintf(stderr, "BEGIN failed: %s", PQerrorMessage(conn));
  PQclear(res);       exit_nicely(conn);   }
 
   printf("cmd status = '%s'\n", PQcmdStatus(res));
   PQclear(res);
   res = PQexecParams(conn,                      "ROLLBACK",                      0,        /* no params */
        NULL,    /* let the backend deduce param type */                      NULL,                      NULL,    /*
don'tneed param lengths since text */                      NULL,    /* default to all text params */
 0);        /* ask for text results */
 
   if (PQresultStatus(res) != PGRES_COMMAND_OK)   {       fprintf(stderr, "ROLLBACK failed: %s", PQerrorMessage(conn));
     PQclear(res);       exit_nicely(conn);   }
 
   printf("cmd status = '%s'\n", PQcmdStatus(res));
   PQclear(res);

It turns out that COMMIT dies just the same way.  The problem is that
any transaction-exiting command destroys portals at the
finish_xact_command step, but the log_duration code is expecting the
portal to still be there afterwards.

We could "fix" this by emitting the log message before calling
finish_xact_command, but I think that would result in seriously
underreporting the time required for a COMMIT.  Probably the right fix
is to copy the data we might need out of the Portal before committing.
Or, if people don't like the overhead for that, we could reduce the
amount of info included in the log message ... but probably that
would make it unhelpful :-(
        regards, tom lane


Re: segfault on rollback

От
Tom Lane
Дата:
I wrote:
> We could "fix" this by emitting the log message before calling
> finish_xact_command, but I think that would result in seriously
> underreporting the time required for a COMMIT.  Probably the right fix
> is to copy the data we might need out of the Portal before committing.

Patch applied.

This reminds me that even though the V3-protocol code has been in there
since 7.4, we don't have nearly complete test coverage on it.  It'd be a
good idea to have some basic tests included somewhere in the standard or
contrib regression tests.  Any thoughts about an easy way to do it?
        regards, tom lane


Re: segfault on rollback

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> I wrote:
> > We could "fix" this by emitting the log message before calling
> > finish_xact_command, but I think that would result in seriously
> > underreporting the time required for a COMMIT.  Probably the right fix
> > is to copy the data we might need out of the Portal before committing.
> 
> Patch applied.
> 
> This reminds me that even though the V3-protocol code has been in there
> since 7.4, we don't have nearly complete test coverage on it.  It'd be a
> good idea to have some basic tests included somewhere in the standard or
> contrib regression tests.  Any thoughts about an easy way to do it?

Do we have anything on core which speaks the v3 protocol?  Since libpq
doesn't expose functions to do all the stuff the v3 protocol can do, it
would be difficult.  Maybe the Python project on pgFoundry (I don't
think it uses libpq, does it?), or maybe JDBC, can help here.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: segfault on rollback

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> This reminds me that even though the V3-protocol code has been in there
>> since 7.4, we don't have nearly complete test coverage on it.  It'd be a
>> good idea to have some basic tests included somewhere in the standard or
>> contrib regression tests.  Any thoughts about an easy way to do it?

> Do we have anything on core which speaks the v3 protocol?  Since libpq
> doesn't expose functions to do all the stuff the v3 protocol can do, it
> would be difficult.

libpq doesn't expose everything in v3, but it exposes quite a lot ---
and will expose more once the Describe patch gets in.  I think the main
thing you still can't do is issue a Bind as a separate operation from an
Execute, and that's not something I feel is real important for testing
the backend.  The main problem is to have something that actually calls
the V3-protocol functions, which psql doesn't.

I was thinking of suggesting that even making psql use PQexecParams
instead of PQexec would help for testing purposes.  But from a raw
performance point of view that's probably a loss; also it would be the
final blow to using a modern psql to connect to a pre-7.4 server.
        regards, tom lane


Re: segfault on rollback

От
"Jim C. Nasby"
Дата:
On Sun, Aug 13, 2006 at 07:01:13PM -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Tom Lane wrote:
> >> This reminds me that even though the V3-protocol code has been in there
> >> since 7.4, we don't have nearly complete test coverage on it.  It'd be a
> >> good idea to have some basic tests included somewhere in the standard or
> >> contrib regression tests.  Any thoughts about an easy way to do it?
> 
> > Do we have anything on core which speaks the v3 protocol?  Since libpq
> > doesn't expose functions to do all the stuff the v3 protocol can do, it
> > would be difficult.
> 
> libpq doesn't expose everything in v3, but it exposes quite a lot ---
> and will expose more once the Describe patch gets in.  I think the main
> thing you still can't do is issue a Bind as a separate operation from an
> Execute, and that's not something I feel is real important for testing
> the backend.  The main problem is to have something that actually calls
> the V3-protocol functions, which psql doesn't.
> 
> I was thinking of suggesting that even making psql use PQexecParams
> instead of PQexec would help for testing purposes.  But from a raw
> performance point of view that's probably a loss; also it would be the
> final blow to using a modern psql to connect to a pre-7.4 server.

How much cruft would it add to psql to allow use of V3 as an option?
Presumably no one would use it in real life, but it would make it easier
to test things. If it's not too invasive to psql it'd be a win to go
that route over adding extra code just to test V3.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461