Обсуждение: snapper vs. HEAD
Buildfarm member snapper has been crashing in the core regression tests since commit 17a28b0364 (well, there's a bit of a range of uncertainty there, but 17a28b0364 looks to be the only such commit that could have affected code in gistget.c where the crash is). Curiously, its sibling skate is *not* failing, despite being on the same machine and compiler. I looked into this by dint of setting up a similar environment in a qemu VM. I might not have reproduced things exactly, but I managed to get the same kind of crash at approximately the same place, and what it looks like to me is a compiler bug. It's iterating gistindex_keytest's loop over search keys ScanKey key = scan->keyData; int keySize = scan->numberOfKeys; while (keySize > 0) { ... key++; keySize--; } one time too many, and accessing a garbage ScanKey value off the end of the keyData array, leading to a function call into never-never land. I'm no expert on Sparc assembly code, but it looks like the compiler forgot the ",a" (annul) modifier here: .loc 1 181 0 andcc %o7, 64, %g0 be,pt %icc, .L134 <---- addcc %l5, -1, %l5 .loc 1 183 0 lduh [%i4+16], %o7 add %i4, %o7, %o7 lduh [%o7+12], %o7 andcc %o7, 1, %g0 bne,pt %icc, .L141 ld [%fp-32], %g2 .loc 1 163 0 ba,pt %xcc, .L134 addcc %l5, -1, %l5 causing %l5 (which contains the keySize value) to be decremented an extra time in the case where that branch is not taken and we fall through as far as the "ba" at the end. Even that would not be fatal, perhaps, except that the compiler also decided to optimize the "keySize > 0" test to "keySize != 0", for ghod only knows what reason (surely it's not any faster on a RISC machine), so that arriving at .L134 with %l5 containing -1 allows the loop to be iterated again. Kaboom. It's unclear how 17a28b0364 would have affected this, but there is an elog() call elsewhere in the same function, so maybe the new coding for that changed register assignments or some other phase-of-the-moon effect. I doubt that anyone's going to take much interest in fixing this old compiler version, so my recommendation is to back off the optimization level on snapper to -O1, and probably on skate as well because there's no obvious reason why the same compiler bug might not bite skate at some point. I was able to get through the core regression tests on my qemu VM after recompiling gistget.c at -O1 (with other flags the same as snapper is using). regards, tom lane
Hi, On 2020-03-28 23:50:32 -0400, Tom Lane wrote: > Buildfarm member snapper has been crashing in the core regression tests > since commit 17a28b0364 (well, there's a bit of a range of uncertainty > there, but 17a28b0364 looks to be the only such commit that could have > affected code in gistget.c where the crash is). Curiously, its sibling > skate is *not* failing, despite being on the same machine and compiler. Hm. There's some difference in code-gen specific options. snapper has: 'CFLAGS' => '-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security ', 'CPPFLAGS' => '-D_FORTIFY_SOURCE=2', 'LDFLAGS' => '-Wl,-z,relro -Wl,-z,now' and specifies (among others) '--enable-thread-safety', '--with-gnu-ld', whereas skate has --enable-cassert. Not too hard to imagine that several of these could cause enough code-gen differences so that one exhibits the bug, and the other doesn't. The different commandlines for gistget end up being: snapper: ccache gcc-4.7 -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -g-O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -I../../../../src/include -D_FORTIFY_SOURCE=2-D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/mit-krb5 -c -o gistget.o gistget.c skate: ccache gcc-4.7 -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2-I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o gistget.o gistget.c > I looked into this by dint of setting up a similar environment in a > qemu VM. I might not have reproduced things exactly, but I managed > to get the same kind of crash at approximately the same place, and > what it looks like to me is a compiler bug. What options were you using? Reproducing snapper as exactly as possible? > It's unclear how 17a28b0364 would have affected this, but there is > an elog() call elsewhere in the same function, so maybe the new > coding for that changed register assignments or some other > phase-of-the-moon effect. Yea, wouldn't be too surprising. > I doubt that anyone's going to take much interest in fixing this > old compiler version, so my recommendation is to back off the > optimization level on snapper to -O1, and probably on skate as > well because there's no obvious reason why the same compiler bug > might not bite skate at some point. I was able to get through > the core regression tests on my qemu VM after recompiling > gistget.c at -O1 (with other flags the same as snapper is using). If you still have the environment it might make sense to check wether it's related to one of the other options. But otherwise I wouldn't be against the proposal. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2020-03-28 23:50:32 -0400, Tom Lane wrote: >> ... Curiously, its sibling >> skate is *not* failing, despite being on the same machine and compiler. > Hm. There's some difference in code-gen specific options. Yeah, I confirmed that on my VM install too: using our typical codegen options (just -O2 -g), the regression tests pass, matching skate's results. It fell over after I matched snapper's CFLAGS, CPPFLAGS, and LDFLAGS. I didn't try to break things down more finely as to which option(s) trigger the bad code, since it looks like it's probably some purely-internal compiler state ... > What options were you using? Reproducing snapper as exactly as possible? Yeah, see above. > If you still have the environment it might make sense to check wether > it's related to one of the other options. But otherwise I wouldn't be > against the proposal. I could, but it's mighty slow, so I don't especially want to try all 2^N combinations. Do you have any specific cases in mind? (I guess we can exclude LDFLAGS, since the assembly output is visibly bad.) regards, tom lane
Hi, On 2020-03-29 20:25:32 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > If you still have the environment it might make sense to check wether > > it's related to one of the other options. But otherwise I wouldn't be > > against the proposal. > > I could, but it's mighty slow, so I don't especially want to try all 2^N > combinations. Do you have any specific cases in mind? I'd be most suspicious of -fstack-protector --param=ssp-buffer-size=4 and -D_FORTIFY_SOURCE=2. The first two have direct codegen implications, the latter can lead to quite different headers being included and adds a lot of size tracking to the optimizer. > (I guess we can exclude LDFLAGS, since the assembly output is visibly > bad.) Seems likely. Is it visibly bad when looking at the .s of gistget.c "directly", or when disassembling the fiinal binary? Because I've seen linkers screw up on a larger scale than I'd have expected in the past. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Is it visibly bad when looking at the .s of gistget.c "directly", or > when disassembling the fiinal binary? Because I've seen linkers screw up > on a larger scale than I'd have expected in the past. Yes, the bogus asm that I showed was from gistget.s, not from disassembling. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > On 2020-03-29 20:25:32 -0400, Tom Lane wrote: >> I could, but it's mighty slow, so I don't especially want to try all 2^N >> combinations. Do you have any specific cases in mind? > I'd be most suspicious of -fstack-protector --param=ssp-buffer-size=4 > and -D_FORTIFY_SOURCE=2. The first two have direct codegen implications, > the latter can lead to quite different headers being included and adds a > lot of size tracking to the optimizer. It occurred to me that just recompiling gistget.c, rather than the whole backend, would be enough to prove the point. So after a few trials: * Removing "-fstack-protector --param=ssp-buffer-size=4" does nothing; the generated .o file is bitwise the same. * Removing -D_FORTIFY_SOURCE=2 does change the bits, but it still crashes. So that eliminates all of snapper's special compile options :-(. I'm forced to the conclusion that the important difference between snapper and skate is that the latter uses --enable-cassert and the former doesn't, because that's the only remaining difference between how I built a working version before and the not-working version I have right now. Which means that this really is pretty much a phase-of-the-moon compiler bug, and we've just been very lucky that we haven't tripped over it before. regards, tom lane
I wrote: > I'm forced to the conclusion that the important difference between > snapper and skate is that the latter uses --enable-cassert and the > former doesn't, because that's the only remaining difference between > how I built a working version before and the not-working version > I have right now. Confirmed: building gistget with --enable-cassert, and all of snapper's compile/link options, produces something that passes regression. The generated asm differs in a whole lot of details, but it looks like the compiler remembers to annul the branch delay slot in all the relevant places: .loc 1 163 0 addcc %l7, -1, %l7 .L186: be,pn %icc, .L80 add %l6, 48, %l6 ... .loc 1 189 0 be,a,pt %icc, .L186 addcc %l7, -1, %l7 ... .loc 1 183 0 lduh [%g4+12], %g4 andcc %g4, 1, %g0 be,a,pt %icc, .L186 addcc %l7, -1, %l7 andcc %o7, 0xff, %g0 bne,a,pt %icc, .L186 addcc %l7, -1, %l7 regards, tom lane
Hi, Tom Lane wrote: > Confirmed: building gistget with --enable-cassert, and all of snapper's > compile/link options, produces something that passes regression. Skate uses buildfarm default configuration, whereas snapper uses settings which are used when building postgresql debianpackages. Debian packages are built without --enable-cassert, but most buildfarm animals build with --enable-cassert.I specifically configured skate and snapper like that because I ran into such issues in the past (wheredebian packages would fail to build on sparc, but buildfarm animals on debian sparc did not highlight the issue). In the past, I've already switched from gcc 4.6 to gcc 4.7 as a workaround for a similar compiler bug, but I can't upgradeto a newer gcc without backporting it myself, so for the moment I've switched snapper to use -O1 instead of -O2, forHEAD only. Not sure whether wheezy on sparc 32-bit is very relevant today, but it's an exotic platform, so I try to keep those buildfarmanimals alive as long as it's possible. Best regards, Tom Turelinckx
"Tom Turelinckx" <pgbf@twiska.com> writes: > In the past, I've already switched from gcc 4.6 to gcc 4.7 as a workaround for a similar compiler bug, but I can't upgradeto a newer gcc without backporting it myself, so for the moment I've switched snapper to use -O1 instead of -O2, forHEAD only. Thanks! But it doesn't seem to have taken: snapper just did a new run that still failed, and it still seems to be using -O2. > Not sure whether wheezy on sparc 32-bit is very relevant today, but it's > an exotic platform, so I try to keep those buildfarm animals alive as > long as it's possible. Yeah, I've got a couple of those myself. But perhaps it'd be sensible to move to a newer Debian LTS release? Or have they dropped Sparc support altogether? (As of this weekend, it seemed to be impossible to find the wheezy sparc distribution images on-line anymore. Fortunately I still had a download of the dvd-1 image stashed away, or I would not have been able to recreate my qemu VM for the purpose. It's going to be very hard for any other PG hackers to investigate that platform in future.) regards, tom lane
Hi, Tom Lane wrote: > Thanks! But it doesn't seem to have taken: snapper just did a new run > that still failed, and it still seems to be using -O2. Snapper did build using -O1 a few hours ago, but it failed the check stage very early with a different error: FATAL: null value in column "classid" of relation "pg_depend" violates not-null constraint I then cleared out the ccache and forced a build of HEAD: same issue. Next I cleared out the ccache and forced a build of HEAD with -O2: this is the one you saw. Finally, I've cleared out both the ccache and the accache and forced a build of HEAD with -O1. It failed the check stageagain very early with the above error. > to move to a newer Debian LTS release? Or have they dropped Sparc > support altogether? Wheezy was the last stable release for Debian sparc. Sparc64 is a Debian ports architecture, but there are no stable releasesfor sparc64. I do maintain private sparc64 repositories for Stretch and Buster, and I could configure buildfarm animalsfor those (on faster hardware too), but those releases are not officially available. Best regards, Tom Turelinckx
Hi, On 2020-03-30 12:24:06 -0400, Tom Lane wrote: > "Tom Turelinckx" <pgbf@twiska.com> writes: > Yeah, I've got a couple of those myself. But perhaps it'd be sensible > to move to a newer Debian LTS release? Or have they dropped Sparc > support altogether? I think the 32bit sparc support has been phased out. Sparc64 isn't a "official port", but there's a port: https://wiki.debian.org/Sparc64 including seemingly regularly updated images. > (As of this weekend, it seemed to be impossible to find the wheezy sparc > distribution images on-line anymore. Fortunately I still had a download > of the dvd-1 image stashed away, or I would not have been able to recreate > my qemu VM for the purpose. It's going to be very hard for any other PG > hackers to investigate that platform in future.) They've been moved to archive.debian.org, but they should still be downloadable. Seems like the website hasn't been quite updated to that fact... The installer downloads are still available at: https://www.debian.org/releases/wheezy/debian-installer/ but sources.list would need to be pointed at something like deb http://archive.debian.org/debian/ wheezy contrib main non-free See also https://www.debian.org/distrib/archive Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2020-03-30 12:24:06 -0400, Tom Lane wrote: >> (As of this weekend, it seemed to be impossible to find the wheezy sparc >> distribution images on-line anymore. > The installer downloads are still available at: > https://www.debian.org/releases/wheezy/debian-installer/ Ah, I should have clarified. That's 7.11, which I'd tried last time I was interested in duplicating snapper, and I found that it does not work under qemu's sparc emulation (my notes don't have much detail, but some installer component was core-dumping). What did work, after some hair pulling, was 7.6 ... and that's the version I can no longer find on-line. But it has what seems to be the same gcc release as snapper is using, so I figured it was close enough. regards, tom lane
"Tom Turelinckx" <pgbf@twiska.com> writes: > Tom Lane wrote: >> Thanks! But it doesn't seem to have taken: snapper just did a new run >> that still failed, and it still seems to be using -O2. > Snapper did build using -O1 a few hours ago, but it failed the check stage very early with a different error: > FATAL: null value in column "classid" of relation "pg_depend" violates not-null constraint Ugh. Compiler bugs coming out of the woodwork? Not sure what to suggest here. It certainly is useful to us to have sparc buildfarm testing, but probably sparc64 would be more useful than sparc32. (It looks like FreeBSD and OpenBSD have dropped sparc32 altogether, and NetBSD has bumped it to tier II status.) One idea is to try -O0 ... regards, tom lane