Created attachment 28096 [details] Reduced test case Please see the attached cpp and h files. I used gcc-4.6.4 at 19788:190734 (as reported by svnversion -c). Configured it as follows: configure --disable-multilib --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/armv7a-cros-linux-gnueabi/gcc-bin/4.6.4 --datadir=/usr/share/gcc-data/armv7a-cros-linux-gnueabi/4.6.4 --mandir=/usr/share/gcc-data/armv7a-cros-linux-gnueabi/4.6.4/man --infodir=/usr/share/gcc-data/armv7a-cros-linux-gnueabi/4.6.4/info --includedir=/usr/lib/gcc/armv7a-cros-linux-gnueabi/4.6.4/include --with-gxx-include-dir=/usr/lib/gcc/armv7a-cros-linux-gnueabi/4.6.4/include/g++-v4 --host=x86_64-pc-linux-gnu --target=armv7a-cros-linux-gnueabi --build=x86_64-pc-linux-gnu --enable-languages=c,c++ --with-float=hard --with-mode=thumb --with-sysroot=/usr/armv7a-cros-linux-gnueabi --disable-libmudflap --disable-libssp --enable-libgomp --enable-__cxa_atexit --enable-checking=release --disable-libquadmath --with-arch=armv7-a --disable-esp --enable-linker-build-id Built my file as follows: g++ -fno-exceptions -Wno-unused-parameter -Wno-missing-field-initializers -D_FILE_OFFSET_BITS=64 -fvisibility=hidden -pipe -fPIC -fno-strict-aliasing -g -mthumb -Wa,-mimplicit-it=thumb -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=neon -O2 -fno-ident -fdata-sections -ffunction-sections -fno-rtti -fno-threadsafe-statics -fvisibility-inlines-hidden -Wsign-compare -Wno-invalid-offsetof -Wno-multichar -Wno-sign-compare -Wno-abi -O2 -pipe -march=armv7-a -mtune=cortex-a15 -mfpu=neon -mfloat-abi=hard -g -fno-omit-frame-pointer -o reduced.out reduced.cpp The final binary is called reduced.out, and produces different output at -O0 and -O2 with -fno-omit-frame-pointer. At -O0 (as well as for x86 backend): p1.x 0 p1.y 0 p2.x 5 p2.y 5 p3.x 10 p3.y 10 p1.x 4 p1.y 4 p2.x 7 p2.y 7 p3.x 10 p3.y 10 p1.x 0 p1.y 0 p2.x 2 p2.y 2 p3.x 4 p3.y 4 At -fno-omit-frame-pointer -O2: p1.x 0 p1.y 0 p2.x 5 p2.y 5 p3.x 10 p3.y 10 p1.x 4 p1.y 4 p2.x 777508 p2.y 7 p3.x 10 p3.y 10 p1.x 0 p1.y 0 p2.x 777492 p2.y 79193 p3.x 4 p3.y 4 In the assembly, I see: 7a6: 6139 str r1, [r7, #16] # Note non-0-offset from r7 7a8: f8c7 9014 str.w r9, [r7, #20] # Note non-0-offset from r7 7ac: e893 0003 ldmia.w r3, {r0, r1} 7b0: e884 0003 stmia.w r4, {r0, r1} 7b4: e897 0003 ldmia.w r7, {r0, r1} # Note 0-offset from r7 7b8: e88c 0003 stmia.w ip, {r0, r1} The two offsets should match otherwise we are using garbage values.
On 29 Aug 2012, at 01:21, asharif at gcc dot gnu.org <gcc-bugzilla@gcc.gnu.org> wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54398 > > Bug #: 54398 > Summary: Incorrect ARM assembly when building with > -fno-omit-frame-pointer -O2 > Classification: Unclassified > Product: gcc > Version: 4.6.4 > Status: UNCONFIRMED > Severity: normal > Priority: P3 > Component: other > AssignedTo: unassigned@gcc.gnu.org > ReportedBy: asharif@gcc.gnu.org > > > Created attachment 28096 [details] > --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=28096 > Reduced test case > > Please see the attached cpp and h files. > Please report bugs as per http://gcc.gnu.org/bugs/ paying special attention to http://gcc.gnu.org/bugs/#dontwant Thanks, Ramana
Created attachment 28131 [details] Repro case
Sorry about that. Attached is a preprocessed file that reproduces the bug. gcc version: gcc-4.6.4 at 19788:190734 (as reported by svnversion -c) system type: arm linux configured as: configure --disable-multilib --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/armv7a-cros-linux-gnueabi/gcc-bin/4.6.4 --datadir=/usr/share/gcc-data/armv7a-cros-linux-gnueabi/4.6.4 --mandir=/usr/share/gcc-data/armv7a-cros-linux-gnueabi/4.6.4/man --infodir=/usr/share/gcc-data/armv7a-cros-linux-gnueabi/4.6.4/info --includedir=/usr/lib/gcc/armv7a-cros-linux-gnueabi/4.6.4/include --with-gxx-include-dir=/usr/lib/gcc/armv7a-cros-linux-gnueabi/4.6.4/include/g++-v4 --host=x86_64-pc-linux-gnu --target=armv7a-cros-linux-gnueabi --build=x86_64-pc-linux-gnu --enable-languages=c,c++ --with-float=hard --with-mode=thumb --with-sysroot=/usr/armv7a-cros-linux-gnueabi --disable-libmudflap --disable-libssp --enable-libgomp --enable-__cxa_atexit --enable-checking=release --disable-libquadmath --with-arch=armv7-a --disable-esp --enable-linker-build-id command line: armv7a-cros-linux-gnueabi-g++ -fno-exceptions -Wno-unused-parameter -Wno-missing-field-initializers -D_FILE_OFFSET_BITS=64 -fvisibility=hidden -pipe -fPIC -fno-strict-aliasing -g -mthumb -Wa,-mimplicit-it=thumb -march=armv7-a -mtune=cortex-a8 -mfloat-abi=hard -mfpu=neon -O2 -fno-ident -fdata-sections -ffunction-sections -fno-rtti -fno-threadsafe-statics -fvisibility-inlines-hidden -Wsign-compare -Wno-invalid-offsetof -Wno-multichar -Wno-sign-compare -Wno-abi -O2 -pipe -march=armv7-a -mtune=cortex-a15 -mfpu=neon -mfloat-abi=hard -g -fno-omit-frame-pointer -o reduced.out reduced.ii # There is no compiler output # When I run this on an ARM box, I get this output from the binary: p1.x 0 p1.y 0 p2.x 5 p2.y 5 p3.x 10 p3.y 10 p1.x 4 p1.y 4 p2.x 762508 # This is 7 with -fomit-frame-pointer p2.y 7 p3.x 10 p3.y 10 p1.x 0 p1.y 0 p2.x 569140 # this is 2 with -fomit-frame-pointer p2.y 0 p3.x 4 p3.y 4
The code before the position Ahmad pointed out is already wrong. The fault instruction sequence is: asrs r5, r5, #1 asr ip, ip, #1 ; A, tmp1.x asrs r0, r0, #1 ; B, tmp1.y asrs r6, r6, #1 mov r4, r1 add r8, ip, r6 ; C, tmp3.x add r9, r0, r5 ; D, tmp3.y add r7, sp, #0 asr r1, r8, #1 add ip, r4, #8 ; E, asr r9, r9, #1 str r1, [r7, #16] str r9, [r7, #20] ldmia r3, {r0, r1} ; F, stmia r4, {r0, r1} Instruction A computes the result of tmp1.x, instruction C use it to compute tmp3.x, instruction E overwrite the value of tmp1.x. But in the source code, tmp1.x is still needed to execute "dst1->p2 = tmp1;", so at last dest1->p2.x gets garbage. Similarly instruction B computes tmp1.y, instruction D uses it to compute tmp3.y, instruction F overwrites it. After executing "dst1->p2 = tmp1;", dst1->p2.y gets another garbage value. For comparison, following is the correct version asrs r7, r7, #1 ; A, tmp1.x asrs r0, r0, #1 ; B, tmp1.y asrs r6, r6, #1 asrs r5, r5, #1 sub sp, sp, #28 mov r4, r1 add r8, r7, r6 ; C, tmp3.x add ip, r0, r5 ; D, tmp3.y str r7, [sp, #0] ; X, save tmp1.x str r0, [sp, #4] ; Y, save tmp1.y asr r1, ip, #1 add r7, r4, #8 ; E asr r8, r8, #1 str r1, [sp, #20] str r8, [sp, #16] ldmia r3, {r0, r1} ; F stmia r4, {r0, r1} The obvious difference is the extra instructions X and Y, they save the value of tmp1 to stack before reusing the register. The simplified preprocessed source code is struct A { int x; int y; void f(const A &a, const A &b) { x = (a.x + b.x)>>1; y = (a.y + b.y)>>1; } }; class C { public: A p1; A p2; A p3; bool b; void g(C *, C *) const; }; void C::g(C *dst1, C *dst2) const { A tmp1, tmp2, tmp3; tmp1.f(p2,p1); tmp2.f(p2,p3); tmp3.f(tmp1, tmp2); dst1->p1 = p1; dst1->p2 = tmp1; dst1->p3 = dst2->p1 = tmp3; dst2->p2 = tmp2; dst2->p3 = p3; } The simplified command line is: ./cc1plus -fpreprocessed t.ii -quiet -dumpbase t.cpp -mthumb "-march=armv7-a" "-mtune=cortex-a15" -auxbase t -O2 -fno-omit-frame-pointer -o t.s It looks like the dse2 pass did wrong transformation. The gcc4.7 and trunk generate correct code.
It's the bug in local dse sub step in dse.c. 66 (insn/f 70 69 71 2 (set (reg/f:SI 7 r7) 67 (plus:SI (reg/f:SI 13 sp) 68 (const_int 0 [0]))) t.ii:24 -1 69 (nil)) This insn setup the hfp, r7 197 (insn 12 30 17 2 (set (mem/s/c:SI (reg/f:SI 7 r7) [4 tmp1.x+0 S4 A64]) 198 (reg:SI 12 ip [orig:137 D.1799 ] [137])) t.ii:8 694 {*thumb2_movsi_insn} 199 (nil)) This is the store instruction, the memory base address is r7, the hfp register, dse think hfp is constant inside the function, so give it a store group 221 (insn 37 36 34 2 (set (reg/f:SI 8 r8 [170]) 222 (reg/f:SI 7 r7)) t.ii:32 694 {*thumb2_movsi_insn} 223 (expr_list:REG_EQUIV (plus:SI (reg/f:SI 7 r7) 224 (const_int 0 [0])) 225 (nil))) This insn move r7 to r8, it also equals to the value of sp 245 (insn 38 35 39 2 (parallel [ 246 (set (reg:SI 0 r0) 247 (mem/s/c:SI (reg/f:SI 8 r8 [170]) [3 tmp1+0 S4 A64])) 248 (set (reg:SI 1 r1) 249 (mem/s/c:SI (plus:SI (reg/f:SI 8 r8 [170]) 250 (const_int 4 [0x4])) [3 tmp1+4 S4 A32])) 251 ]) t.ii:32 369 {*ldm2_ia} 252 (nil)) This is the load instruction, the memory base address is r8, const_or_frame_p returns false for r8, after using cselib_expand_value_rtx to r8, we get a base address sp, const_or_frame_p still return false for it. So the corresponding group id is -1 (no corresponding store group), then it can't match the store insn 12. So dse consider the memory stored in insn 12 is never used, thus a dead store, and can be eliminated. The problem is the hfp based address is considered constant base address, sp and derived addresses are considered varied base address, they will not be matched when detecting interfering memory access. But in many cases sp and hfp can be same. Even worse, addresses copied from or derived from hfp could be recognized as derived from sp, like in this case, and causes memory access mismatch.
cselib.c has for this the various spots that special case HARD_FRAME_POINTER_REGNUM (or STACK_POINTER_REGNUM or FRAME_POINTER_REGNUM). Please see why that doesn't work in this case.
On 09/11/12 07:09, jakub at gcc dot gnu.org wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54398 > > Jakub Jelinek <jakub at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |jakub at gcc dot gnu.org > > --- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-09-11 06:09:13 UTC --- > cselib.c has for this the various spots that special case > HARD_FRAME_POINTER_REGNUM (or STACK_POINTER_REGNUM or FRAME_POINTER_REGNUM). > Please see why that doesn't work in this case. > This rings a bell. Maybe the patch mentioned below needs backporting given Carrot is reporting this against the 4.6 branch. What's not clear if this is reproducible on anything later though. http://old.nabble.com/-PATCH--Prevent-cselib-substitution-of-FP,-SP,-SFP-td33080657.html Full disclaimer that I've not investigated whether the same problem occurs on trunk. HTH, Ramana
(In reply to comment #7) > > This rings a bell. > > Maybe the patch mentioned below needs backporting given Carrot is > reporting this against the 4.6 branch. What's not clear if this is > reproducible on anything later though. > > http://old.nabble.com/-PATCH--Prevent-cselib-substitution-of-FP,-SP,-SFP-td33080657.html > The patch can fix this bug.
I don't know precisely when this was fixed, but the testcase produces correct output on (at least) gcc-7 and later. So presume fixed.
Guess it would be nice to add the testcase into the testsuite in that case.