Bug 36133 - GCC creates suboptimal ASM : Code includes unneeded TST instructions
Summary: GCC creates suboptimal ASM : Code includes unneeded TST instructions
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.2.1
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2008-05-05 13:40 UTC by Gunnar von Boehn
Modified: 2008-11-19 12:03 UTC (History)
5 users (show)

See Also:
Host: i686-pc-linux-gnu
Target: m68k-linux-gnu
Build: i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Gunnar von Boehn 2008-05-05 13:40:05 UTC
Hello,

The ASM code created by GCC 4.2.1 for 68k/Coldfire platform is not optimal. Comparing ASM output created by GCC 2.9 with GCC 4.2,
the generated code got partially much worse with GCC 4.


One problem that was visible in very many places was
that GCC created unnecessary TST instructions.

Please see the below example for details.
The TST.L instruction at address 36 is in the loop and unneeded.
The lsrl at address (10) and the subql #1,%d0 at address (34) do both set the condition codes already, there is no need for using an extra TST instruction at all.



Example: C-source
Code:
void * copy_32x4a(void *destparam, const void *srcparam, size_t size)
{
        int *dest = destparam;
        const int *src = srcparam;
        int size32;
        size32 = size / 16;
        for (; size32; size32--) {
                *dest++ = *src++;
                *dest++ = *src++;
                *dest++ = *src++;
                *dest++ = *src++;
        }
}

Compile option: m68k-linux-gnu-gcc -mcpu=54455 -msoft-float -o example -Os -fomit-frame-pointer example.c

Code generated by GCC 4.2:
04:       202f 000c       movel %sp@(12),%d0
08:       226f 0004       moveal %sp@(4),%a1
0c:       206f 0008       moveal %sp@(8),%a0
10:       e888            lsrl #4,%d0
12:       6022            bras 36
14:       2290            movel %a0@,%a1@
16:       2368 0004 0004  movel %a0@(4),%a1@(4)
1c:       2368 0008 0008  movel %a0@(8),%a1@(8)
22:       2368 000c 000c  movel %a0@(12),%a1@(12)
28:       d3fc 0000 0010  addal #16,%a1
2e:       d1fc 0000 0010  addal #16,%a0
34:       5380            subql #1,%d0
36:       4a80            tstl %d0
38:       66da            bnes 14
3a:       4e75            rts    

For comparison here is code that you would expect:
04:       202f 000c       movel %sp@(12),%d0
08:       226f 0004       moveal %sp@(4),%a1
0c:       206f 0008       moveal %sp@(8),%a0
10:       e888            lsrl #4,%d0
12:       6022            beq 20
14:       20d9            movel %a1@+,%a0@+
16:       20d9            movel %a1@+,%a0@+
18:       20d9            movel %a1@+,%a0@+
1a:       20d9            movel %a1@+,%a0@+
1c:       5380            subql #1,%d0
1e:       66da            bnes 14
20:       4e75            rts 

Compiler used:
m68k-linux-gnu-gcc -v
Using built-in specs.
Target: m68k-linux-gnu
Configured with: /scratch/shinwell/cf-fall-linux-lite/src/gcc-4.2/configure --build=i686-pc-linux-gnu --host=i686-pc-linux-gnu --target=m68k-linux-gnu --enable-threads --disable-libmudflap --disable-libssp --disable-libgomp --disable-libstdcxx-pch --with-arch=cf --with-gnu-as --with-gnu-ld --enable-languages=c,c++ --enable-shared --enable-symvers=gnu --enable-__cxa_atexit --with-pkgversion=Sourcery G++ Lite 4.2-47 --with-bugurl=https://support.codesourcery.com/GNUToolchain/ --disable-nls --prefix=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux --with-sysroot=/opt/freescale/usr/local/gcc-4.2.47-eglibc-2.5.47/m68k-linux/m68k-linux-gnu/libc --with-build-sysroot=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/libc --enable-poison-system-directories --with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin --with-build-time-tools=/scratch/shinwell/cf-fall-linux-lite/install/m68k-linux-gnu/bin
Thread model: posix
gcc version 4.2.1 (Sourcery G++ Lite 4.2-47)


I hope that this report help you to improve the quality of GCC.

Kind regards

Gunnar von Boehn
Comment 1 Richard Biener 2008-05-07 19:33:47 UTC
It would have been nice to check at least gcc 4.3 (or better current trunk).
Comment 2 Hans-Peter Nilsson 2008-05-23 23:04:50 UTC
This could be a duplicate of PR20211.
Comment 3 Gunnar von Boehn 2008-05-28 16:14:42 UTC
(In reply to comment #1)
> It would have been nice to check at least gcc 4.3 (or better current trunk).
> 

I've verified with latest source gcc source "version 4.4.0 20080523 (experimental) (GCC)" 
The problem that GCC used totally unneeded TST instructions is still in the current source.

Regards
Gunnar
Comment 4 Gunnar von Boehn 2008-06-04 09:29:16 UTC
I want to add that this wrong behavior is partly related to the compile option "-Os".

There are two causes where GCC generates unneeded TST instructions.
A) General arithmetic 
 lsr.l #1,D0 
 tst.l d0
 jbne ...

This tst instruction is unneeded as the LSR is setting the flags correctly already.

B) subq.l #1,D1
  tst.l d1
  jbne ...

This unneeded TST is related to the compile option used.
If you compile the source with "-O2" then the second unneeded TST instructions are not included in the source.

It seems to me that a general important optimizations step - which used to be in  "Os" in GCC 2.9 was removed from "Os" causing GCC to generate worse code now.

Can you please be so kind and correct this?

I believe that this issue is quite serious for the performance of the generated code.

1st The unneeded TST instructions are increasing code size, which is important in embedded environments.

2nd
There are case were the instruction which really did set the condition codes correctly in the first place is far enough away from the conditional branch and no CC trashing instruction in between them - so that the instruction fetcher can 100% correctly predict the branch and fold it away completely. The unneeded TST instruction makes branch folding impossible and requires the CPU to guess the branch instead. This will cause a serious performance impact in case of mispredicting the branch.
It should be clear that the unneeded TST instruction doas not only bloat the code but the above mentioned conditions can serious degrade the performance as well, depending on your used CPU of course.

In the light of this, wouldn't it might sense to increase the Severity of this issue?


Regards
Gunnar
Comment 5 Gunnar von Boehn 2008-06-05 12:07:34 UTC
Please find below a proposed patch.
The patch will making GCC aware that shift does set the CC already
and the TST is not needed in this case.
The same example could be used to used to make GCC aware of the CC set by other instructions.

Index: gcc/config/m68k/m68k.md

===================================================================

*** gcc/config/m68k/m68k.md.orig        2008-05-30 10:00:55.000000000 +0200

--- gcc/config/m68k/m68k.md     2008-06-04 17:01:11.000000000 +0200

***************

*** 5198,5203 ****

--- 5198,5215 ----

    [(set_attr "type" "shift")

     (set_attr "opy" "2")])



+ (define_insn "*lshrsi3_cc"

+   [(set (cc0)

+       (lshiftrt:SI (match_operand:SI 1 "register_operand" "0")

+                    (match_operand:SI 2 "general_operand" "dI")))

+    (set (match_operand:SI 0 "register_operand" "=d")

+       (lshiftrt:SI (match_dup 1)

+                    (match_dup 2)))]

+   ""

+   "lsr%.l %2,%0"

+   [(set_attr "type" "shift")

+    (set_attr "opy" "2")])

+

  (define_insn "lshrhi3"

    [(set (match_operand:HI 0 "register_operand" "=d")

        (lshiftrt:HI (match_operand:HI 1 "register_operand" "0")
Comment 6 Gunnar von Boehn 2008-06-12 14:26:23 UTC
Andreas,

Could you have a look at this?

Cheers
Gunnar
Comment 7 Andrew Stubbs 2008-11-10 12:29:53 UTC
(In reply to comment #4)
> There are two causes where GCC generates unneeded TST instructions.
> A) General arithmetic 
>  lsr.l #1,D0 
>  tst.l d0
>  jbne ...
>
> This tst instruction is unneeded as the LSR is setting the flags correctly
> already.

This is NOT correct. LSL does write to the condition codes, but not all of it. In particular, the bit involved in the not-equal test is not set.

This TST *is* required.
 
> B) subq.l #1,D1
>   tst.l d1
>   jbne ...
> 
> This unneeded TST is related to the compile option used.
> If you compile the source with "-O2" then the second unneeded TST instructions
> are not included in the source.

Please check the code mode carefully. This TST is not related to the SUBQ instruction. This TST is related to the LSL instruction - the TST is a branch target.

The compiler has combined the initial and subsequent loop condition tests into one test, thus saving space, as per the -Os requirements. In this case there is an alternative, more speed efficient solution, but that is a separate problem, and not a misuse of the TST instruction.

As you say, the compiler gets it right at -O2, so the machine description is clearly correct, in this respect.
Comment 8 Gunnar von Boehn 2008-11-10 12:54:18 UTC
(In reply to comment #7)
> (In reply to comment #4)
> > There are two causes where GCC generates unneeded TST instructions.
> > A) General arithmetic 
> >  lsr.l #1,D0 
> >  tst.l d0
> >  jbne ...
> >
> > This tst instruction is unneeded as the LSR is setting the flags correctly
> > already.
> 
> This is NOT correct. LSL does write to the condition codes, but not all of it.
> In particular, the bit involved in the not-equal test is not set.
> 
> This TST *is* required.

What you is say is not correct.

The bit involved in the not-eval test is the "Z-Bit"
Both the LSR and TST do set the Z-Bit, 100% equally.

The TST instruction in the example is 100% unneeded and NOT required.

Please check the official 68K documentation for the which flags the conditinal branch instruction BCC tests (in this case BNE), and verify the behavior of TST and LSR in regards of setting these bits.

Best Regards
Gunnar von Boehn
Comment 9 Andrew Stubbs 2008-11-10 13:01:41 UTC
> > This tst instruction is unneeded as the LSR is setting the flags correctly
> > already.
> 
> This is NOT correct. LSL does write to the condition codes, but not all of it.
> In particular, the bit involved in the not-equal test is not set.

Hmm, on re-reading the manual, I seem to have misunderstood this. The description of the instruction states certain bits are set, but then, on the next page the table shows all the bits being updated.

I'm going to need to think about this one some more.

Sorry for the noise :(

Andrew
Comment 10 Andrew Stubbs 2008-11-19 11:25:02 UTC
Subject: Bug 36133

Author: ams
Date: Wed Nov 19 11:23:28 2008
New Revision: 141999

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=141999
Log:
2008-11-19  Andrew Stubbs  <ams@codesourcery.com>

	gcc/
	PR target/36133
	* config/m68k/m68k.h (CC_OVERFLOW_UNUSABLE, CC_NO_CARRY): New defines.
	* config/m68k/m68k.c (notice_update_cc): Set cc_status properly for
	shift instructions.
	* config/m68k/m68k.md: Adjust all conditional branches that use the
	carry and overflow flags so they understand CC_OVERFLOW_UNUSABLE.

	gcc/testsuite/
	PR target/36133
	* gcc.target/m68k/pr36133.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/m68k/pr36133.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/m68k/m68k.c
    trunk/gcc/config/m68k/m68k.h
    trunk/gcc/config/m68k/m68k.md
    trunk/gcc/testsuite/ChangeLog

Comment 11 Andrew Stubbs 2008-11-19 12:03:09 UTC
The patch just committed should fix this issue.

The patch discussion is here: http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00641.html