Hello, The ASM code created by GCC 4.2.1 for 68k/Coldfire platform is not optimal. Comparing ASM output created by GCC 2.9 with GCC 4.2, the generated code got partially much worse with GCC 4. One problem that was visible in very many places was that GCC created unnecessary TST instructions. Please see the below example for details. The TST.L instruction at address 36 is in the loop and unneeded. The lsrl at address (10) and the subql #1,%d0 at address (34) do both set the condition codes already, there is no need for using an extra TST instruction at all. Example: C-source Code: void * copy_32x4a(void *destparam, const void *srcparam, size_t size) { int *dest = destparam; const int *src = srcparam; int size32; size32 = size / 16; for (; size32; size32--) { *dest++ = *src++; *dest++ = *src++; *dest++ = *src++; *dest++ = *src++; } } Compile option: m68k-linux-gnu-gcc -mcpu=54455 -msoft-float -o example -Os -fomit-frame-pointer example.c Code generated by GCC 4.2: 04: 202f 000c movel %sp@(12),%d0 08: 226f 0004 moveal %sp@(4),%a1 0c: 206f 0008 moveal %sp@(8),%a0 10: e888 lsrl #4,%d0 12: 6022 bras 36 14: 2290 movel %a0@,%a1@ 16: 2368 0004 0004 movel %a0@(4),%a1@(4) 1c: 2368 0008 0008 movel %a0@(8),%a1@(8) 22: 2368 000c 000c movel %a0@(12),%a1@(12) 28: d3fc 0000 0010 addal #16,%a1 2e: d1fc 0000 0010 addal #16,%a0 34: 5380 subql #1,%d0 36: 4a80 tstl %d0 38: 66da bnes 14 3a: 4e75 rts For comparison here is code that you would expect: 04: 202f 000c movel %sp@(12),%d0 08: 226f 0004 moveal %sp@(4),%a1 0c: 206f 0008 moveal %sp@(8),%a0 10: e888 lsrl #4,%d0 12: 6022 beq 20 14: 20d9 movel %a1@+,%a0@+ 16: 20d9 movel %a1@+,%a0@+ 18: 20d9 movel %a1@+,%a0@+ 1a: 20d9 movel %a1@+,%a0@+ 1c: 5380 subql #1,%d0 1e: 66da bnes 14 20: 4e75 rts Compiler used: m68k-linux-gnu-gcc -v Using built-in specs. Target: m68k-linux-gnu Configured with: /scratch/shinwell/cf-fall-linux-lite/src/gcc-4.2/configure --build=i686-pc-linux-gnu --host=i686-pc-linux-gnu --target=m68k-linux-gnu --enable-threads --disable-libmudflap --disable-libssp --disable-libgomp --disable-libstdcxx-pch --with-arch=cf --with-gnu-as --with-gnu-ld --enable-languages=c,c++ --enable-shared --enable-symvers=gnu --enable-__cxa_atexit --with-pkgversion=Sourcery G++ Lite 4.2-47 --with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls --prefix=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux --with-sysroot=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux/m68k-linux-gnu/libc --with-build-sysroot=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/libc --enable-poison-system-directories --with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin --with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin Thread model: posix gcc version 4.2.1 (Sourcery G++ Lite 4.2-47) I hope that this report help you to improve the quality of GCC. Kind regards Gunnar von Boehn
It would have been nice to check at least gcc 4.3 (or better current trunk).
This could be a duplicate of PR20211.
(In reply to comment #1) > It would have been nice to check at least gcc 4.3 (or better current trunk). > I've verified with latest source gcc source "version 4.4.0 20080523 (experimental) (GCC)" The problem that GCC used totally unneeded TST instructions is still in the current source. Regards Gunnar
I want to add that this wrong behavior is partly related to the compile option "-Os". There are two causes where GCC generates unneeded TST instructions. A) General arithmetic lsr.l #1,D0 tst.l d0 jbne ... This tst instruction is unneeded as the LSR is setting the flags correctly already. B) subq.l #1,D1 tst.l d1 jbne ... This unneeded TST is related to the compile option used. If you compile the source with "-O2" then the second unneeded TST instructions are not included in the source. It seems to me that a general important optimizations step - which used to be in "Os" in GCC 2.9 was removed from "Os" causing GCC to generate worse code now. Can you please be so kind and correct this? I believe that this issue is quite serious for the performance of the generated code. 1st The unneeded TST instructions are increasing code size, which is important in embedded environments. 2nd There are case were the instruction which really did set the condition codes correctly in the first place is far enough away from the conditional branch and no CC trashing instruction in between them - so that the instruction fetcher can 100% correctly predict the branch and fold it away completely. The unneeded TST instruction makes branch folding impossible and requires the CPU to guess the branch instead. This will cause a serious performance impact in case of mispredicting the branch. It should be clear that the unneeded TST instruction doas not only bloat the code but the above mentioned conditions can serious degrade the performance as well, depending on your used CPU of course. In the light of this, wouldn't it might sense to increase the Severity of this issue? Regards Gunnar
Please find below a proposed patch. The patch will making GCC aware that shift does set the CC already and the TST is not needed in this case. The same example could be used to used to make GCC aware of the CC set by other instructions. Index: gcc/config/m68k/m68k.md =================================================================== *** gcc/config/m68k/m68k.md.orig 2008-05-30 10:00:55.000000000 +0200 --- gcc/config/m68k/m68k.md 2008-06-04 17:01:11.000000000 +0200 *************** *** 5198,5203 **** --- 5198,5215 ---- [(set_attr "type" "shift") (set_attr "opy" "2")]) + (define_insn "*lshrsi3_cc" + [(set (cc0) + (lshiftrt:SI (match_operand:SI 1 "register_operand" "0") + (match_operand:SI 2 "general_operand" "dI"))) + (set (match_operand:SI 0 "register_operand" "=d") + (lshiftrt:SI (match_dup 1) + (match_dup 2)))] + "" + "lsr%.l %2,%0" + [(set_attr "type" "shift") + (set_attr "opy" "2")]) + (define_insn "lshrhi3" [(set (match_operand:HI 0 "register_operand" "=d") (lshiftrt:HI (match_operand:HI 1 "register_operand" "0")
Andreas, Could you have a look at this? Cheers Gunnar
(In reply to comment #4) > There are two causes where GCC generates unneeded TST instructions. > A) General arithmetic > lsr.l #1,D0 > tst.l d0 > jbne ... > > This tst instruction is unneeded as the LSR is setting the flags correctly > already. This is NOT correct. LSL does write to the condition codes, but not all of it. In particular, the bit involved in the not-equal test is not set. This TST *is* required. > B) subq.l #1,D1 > tst.l d1 > jbne ... > > This unneeded TST is related to the compile option used. > If you compile the source with "-O2" then the second unneeded TST instructions > are not included in the source. Please check the code mode carefully. This TST is not related to the SUBQ instruction. This TST is related to the LSL instruction - the TST is a branch target. The compiler has combined the initial and subsequent loop condition tests into one test, thus saving space, as per the -Os requirements. In this case there is an alternative, more speed efficient solution, but that is a separate problem, and not a misuse of the TST instruction. As you say, the compiler gets it right at -O2, so the machine description is clearly correct, in this respect.
(In reply to comment #7) > (In reply to comment #4) > > There are two causes where GCC generates unneeded TST instructions. > > A) General arithmetic > > lsr.l #1,D0 > > tst.l d0 > > jbne ... > > > > This tst instruction is unneeded as the LSR is setting the flags correctly > > already. > > This is NOT correct. LSL does write to the condition codes, but not all of it. > In particular, the bit involved in the not-equal test is not set. > > This TST *is* required. What you is say is not correct. The bit involved in the not-eval test is the "Z-Bit" Both the LSR and TST do set the Z-Bit, 100% equally. The TST instruction in the example is 100% unneeded and NOT required. Please check the official 68K documentation for the which flags the conditinal branch instruction BCC tests (in this case BNE), and verify the behavior of TST and LSR in regards of setting these bits. Best Regards Gunnar von Boehn
> > This tst instruction is unneeded as the LSR is setting the flags correctly > > already. > > This is NOT correct. LSL does write to the condition codes, but not all of it. > In particular, the bit involved in the not-equal test is not set. Hmm, on re-reading the manual, I seem to have misunderstood this. The description of the instruction states certain bits are set, but then, on the next page the table shows all the bits being updated. I'm going to need to think about this one some more. Sorry for the noise :( Andrew
Subject: Bug 36133 Author: ams Date: Wed Nov 19 11:23:28 2008 New Revision: 141999 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=141999 Log: 2008-11-19 Andrew Stubbs <ams@codesourcery.com> gcc/ PR target/36133 * config/m68k/m68k.h (CC_OVERFLOW_UNUSABLE, CC_NO_CARRY): New defines. * config/m68k/m68k.c (notice_update_cc): Set cc_status properly for shift instructions. * config/m68k/m68k.md: Adjust all conditional branches that use the carry and overflow flags so they understand CC_OVERFLOW_UNUSABLE. gcc/testsuite/ PR target/36133 * gcc.target/m68k/pr36133.c: New test. Added: trunk/gcc/testsuite/gcc.target/m68k/pr36133.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/m68k/m68k.c trunk/gcc/config/m68k/m68k.h trunk/gcc/config/m68k/m68k.md trunk/gcc/testsuite/ChangeLog
The patch just committed should fix this issue. The patch discussion is here: http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00641.html