Bug 28946 - assembler shifts set the flag ZF, no need to re-test to zero
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.1.2
Importance: P2 minor
Target Milestone: 4.0.4
Assignee: Uroš Bizjak
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization, patch
Depends on:
Blocks:
 
Reported: 2006-09-04 12:02 UTC by etienne_lorrain
Modified: 2006-09-19 11:31 UTC
CC List: 5 users

See Also:
Host:
Target: i486-linux-gnu
Build:
Known to work: 2.95.3 4.2.0 4.1.2 4.0.4
Known to fail: 3.0.4 3.2.3 3.3.3
Last reconfirmed: 2006-09-05 11:45:14


Description etienne_lorrain 2006-09-04 12:02:03 UTC
etienne@cygne:~$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
 --enable-nls --program-suffix=-4.1 --enable-__cxa_atexit --enable-clocale=gnu --enable-libstdcxx-debug
 --enable-mpfr --with-tune=i686 --enable-checking=release i486-linux-gnu
Thread model: posix
gcc version 4.1.2 20060729 (prerelease) (Debian 4.1.1-10)
etienne@cygne:~$ cat tmp1.c
int fct1(void);
int fct2(void);

int fct (unsigned nb)
  {
  if ((nb >> 5) != 0)
    return fct1();
  else
    return fct2();
  }
etienne@cygne:~$ gcc -O3 -fomit-frame-pointer -S tmp1.c -o tmp1.s
etienne@cygne:~$ cat tmp1.s
        .file   "tmp1.c"
        .text
        .p2align 4,,15
.globl fct
        .type   fct, @function
fct:
        movl    4(%esp), %eax
        shrl    $5, %eax
        testl   %eax, %eax
        je      .L2
        jmp     fct1
        .p2align 4,,7
.L2:
        jmp     fct2
        .size   fct, .-fct
        .ident  "GCC: (GNU) 4.1.2 20060729 (prerelease) (Debian 4.1.1-10)"
        .section        .note.GNU-stack,"",@progbits
etienne@cygne:~$

The assembly instruction "testl %eax, %eax" is not needed. According to the
Intel documentation of "SAL/SAR/SHL/SHR", section "Flags Affected":
The SF, ZF, and PF flags are set according to the result.
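
As a rough illustration of the quoted rule, the ZF behaviour of a 32-bit SHR can be modelled in C. This is only a sketch: the names shr_result and model_shr32 are hypothetical, and only ZF is modelled. It shows that, for a non-zero shift count such as the constant 5 in the testcase, ZF after the shift already equals "result == 0", which is all that the following "testl %eax, %eax" would compute. (For a count of zero the flags are left unchanged, which the model reflects through the old_zf parameter.)

```c
#include <stdint.h>

/* Hypothetical model of the value and ZF produced by "shrl $count, reg". */
struct shr_result {
    uint32_t value; /* register contents after the shift */
    int zf;         /* zero flag after the shift */
};

static struct shr_result model_shr32(uint32_t x, unsigned count, int old_zf)
{
    struct shr_result r;

    count &= 31u;            /* 32-bit shifts use only the low 5 count bits */
    r.value = x >> count;
    /* A zero count leaves the flags untouched; otherwise ZF follows the result. */
    r.zf = (count == 0) ? old_zf : (r.value == 0);
    return r;
}
```

With the testcase's count of 5, model_shr32(nb, 5, ...).zf is set exactly when (nb >> 5) == 0, so "je .L2" can consume the flags from the shift directly.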
Comment 1 Andrew Pinski 2006-09-04 16:50:06 UTC
Confirmed, a regression from 2.95.3, which almost certainly means the new ia32 back-end caused this.
Comment 2 H.J. Lu 2006-09-04 17:49:01 UTC
It is entirely coincidental. For some processors, it is an optimization to avoid
partial flag register stall. When it is fixed, it should be reenabled with a
new flag, something like TARGET_PARTIAL_FLAG_REG_STALL.
Comment 3 Dan Nicolaescu 2006-09-04 17:56:44 UTC
This specific case can probably be solved at the tree level by changing the test:

(nb >> 5) != 0
to 
nb >= 32
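
A small self-check of this rewrite, as a sketch with hypothetical helper names. Note that the comparison equivalent to the shift test is inclusive, nb >= 32, because nb == 32 already has bit 5 set and so (32 >> 5) is 1:

```c
#include <assert.h>
#include <limits.h>

/* The test as written in the bug report. */
static int shifted_nonzero(unsigned nb)
{
    return (nb >> 5) != 0;
}

/* The comparison form; the bound is inclusive. */
static int at_least_32(unsigned nb)
{
    return nb >= 32u;
}

/* Check that both forms agree around the boundary and at the extremes. */
static void check_equivalence(void)
{
    static const unsigned samples[] = {
        0u, 1u, 31u, 32u, 33u, 63u, 64u, UINT_MAX
    };
    unsigned i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(shifted_nonzero(samples[i]) == at_least_32(samples[i]));
}
```

The equivalence holds only for unsigned operands, as in the testcase; for a signed nb the right shift of a negative value would need separate treatment.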

Comment 4 Uroš Bizjak 2006-09-05 06:20:29 UTC
(In reply to comment #2)
> It is entirely coincidental. For some processors, it is an optimization to avoid
> partial flag register stall. When it is fixed, it should be reenabled with a
> new flag, something like TARGET_PARTIAL_FLAG_REG_STALL.

There is a TARGET_USE_INCDEC flag that already implements your suggestion.

From predicates.md:
  /* On Pentium4, the inc and dec operations causes extra dependency on flag
     registers, since carry flag is not set.  */
  if (!TARGET_USE_INCDEC && !optimize_size)

If used elsewhere, this flag should perhaps be renamed to the proposed TARGET_PARTIAL_FLAG_REG_STALL.
Comment 5 Uroš Bizjak 2006-09-05 09:35:54 UTC
The problem here is the following:

We already have a pattern (*lshrsi3_cmp) that would match the combined instruction in the above testcase. However, the combiner rejects the combined instruction because the register that holds the shifted result is unused!

The problematic part is in combine.c, around line 2236 (please read the comment, which describes exactly the situation we have here). This part of the code is activated only when the register that holds the result of the arithmetic operation is kept alive. This is quite strange: even if the result is unused, the resulting code will still be smaller, as we avoid an extra CC-setting instruction.

The patch below (currently under testing, but so far OK) forces generation of the combined instruction even if the arithmetic result is unused.

Index: combine.c
===================================================================
--- combine.c   (revision 116691)
+++ combine.c   (working copy)
@@ -2244,7 +2244,7 @@
      needed, and make the PARALLEL by just replacing I2DEST in I3SRC with
      I2SRC.  Later we will make the PARALLEL that contains I2.  */
 
-  if (i1 == 0 && added_sets_2 && GET_CODE (PATTERN (i3)) == SET
+  if (i1 == 0 && GET_CODE (PATTERN (i3)) == SET
       && GET_CODE (SET_SRC (PATTERN (i3))) == COMPARE
       && XEXP (SET_SRC (PATTERN (i3)), 1) == const0_rtx
       && rtx_equal_p (XEXP (SET_SRC (PATTERN (i3)), 0), i2dest))
@@ -2254,6 +2254,13 @@
       enum machine_mode compare_mode;
 #endif
 
+      /* To force generation of the combined comparison and arithmetic
+        operation PARALLEL, pretend that the set in I2 is to be used,
+        even if it is dead after I2. This results in better generated
+        code, as only CC setting arithmetic instruction will be
+        emitted in conditionals.  */
+      added_sets_2 = 1;
+
       newpat = PATTERN (i3);
       SUBST (XEXP (SET_SRC (newpat), 0), i2src);
 

Compiling the testcase with this patch results in the following code:

fct:
        movl 4(%esp), %eax
        shrl $5, %eax
        je  .L2
        jmp fct1
        .p2align 4,,7
.L2:
        jmp fct2
Comment 6 Uroš Bizjak 2006-09-05 11:45:14 UTC
Patch at http://gcc.gnu.org/ml/gcc-patches/2006-09/msg00137.html

BTW: This patch eliminates 869 "test" instructions in a povray-3.6.1 compile.
(And my test raytraced pictures are still correct.)
Comment 7 Uroš Bizjak 2006-09-05 13:43:32 UTC
Hm, the proposed patch now generates worse code for the following test:

extern int fnc1(void);
extern int fnc2(void);

int test(int x)
{
        if (x & 0x02)
         return fnc1();
        else if (x & 0x01)
         return fnc2();
        else
         return 0;
}

It generates:

test:
        movl 4(%esp), %edx
        movl %edx, %eax
        andl $2, %eax
        jne .L10
        andl $1, %edx
        jne .L11
        xorl %eax, %eax
        ret
        .p2align 4,,7
.L11:
        .p2align 4,,8
        jmp fnc2
        .p2align 4,,7
.L10:
        .p2align 4,,7
        jmp fnc1

Because %eax is marked live in the first comparison, "and" is used instead of "test", and a regmove is emitted before the comparison. Ideally, gcc should generate:

test:
        movl 4(%esp), %eax
        testl  $2, %eax
        jne .L6
        andl $1, %eax
        jne .L7
        xorl %eax, %eax
        ret
        .p2align 2,,3
.L7:
        jmp fnc2
        .p2align 2,,3
.L6:
        jmp fnc1
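
The difference between the two sequences can be sketched in C. This is a hypothetical model, not GCC code: the helper names are made up, and the dispatch functions return 1, 2 or 0 in place of calling fnc1, fnc2 or returning 0. "test" only sets the flags and leaves its register intact, whereas "and" overwrites the register, so the "and" variant needs a copy of x to keep the original value for the second comparison:

```c
#include <stdint.h>

/* ZF as produced by "testl $mask, reg": reg is left unchanged. */
static int zf_after_test(uint32_t reg, uint32_t mask)
{
    return (reg & mask) == 0;
}

/* ZF as produced by "andl $mask, reg": reg is overwritten with the result. */
static int zf_after_and(uint32_t *reg, uint32_t mask)
{
    *reg &= mask;
    return *reg == 0;
}

/* Models the worse sequence: the first "and" clobbers a copy of x. */
static int dispatch_with_and(uint32_t x)
{
    uint32_t copy = x;                 /* the extra "movl %edx, %eax" */

    if (!zf_after_and(&copy, 2u))
        return 1;                      /* would jump to fnc1 */
    if (!zf_after_and(&x, 1u))
        return 2;                      /* would jump to fnc2 */
    return 0;
}

/* Models the ideal sequence: "test" first, so no copy is needed. */
static int dispatch_with_test(uint32_t x)
{
    if (!zf_after_test(x, 2u))
        return 1;                      /* would jump to fnc1 */
    if (!zf_after_and(&x, 1u))         /* x is dead afterwards, so "and" is fine */
        return 2;                      /* would jump to fnc2 */
    return 0;
}
```

Both variants take the same branches for every input; the "test" form simply avoids the register copy.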
Comment 8 H.J. Lu 2006-09-05 14:54:20 UTC
TARGET_PARTIAL_FLAG_REG_STALL and TARGET_USE_INCDEC are totally different.
TARGET_USE_INCDEC favors inc/dec over add/sub, while TARGET_PARTIAL_FLAG_REG_STALL
adds a test after a shift.
Comment 9 Uroš Bizjak 2006-09-06 11:33:20 UTC
The patch at http://gcc.gnu.org/ml/gcc-patches/2006-09/msg00162.html implements the missing i386.md RTL patterns. This is an i386 target-specific fix for this bug.

The patch was bootstrapped on i686-pc-linux-gnu and x86_64-pc-linux-gnu, and regtested for c, c++ and fortran.
Comment 10 H.J. Lu 2006-09-06 14:26:46 UTC
The proposed patch will slow down Core and Core 2 by 70-100% in some testcases
due to partial flag register stalls. I have a follow-up patch to implement
TARGET_PARTIAL_FLAG_REG_STALL.
Comment 11 uros 2006-09-07 17:45:56 UTC
Subject: Bug 28946

Author: uros
Date: Thu Sep  7 17:45:48 2006
New Revision: 116756

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=116756
Log:
        PR target/28946
        * config/i386/i386.md ("*ashldi3_cconly_rex64", "*ashlsi3_cconly",
        "*ashlhi3_cconly", "*ashlqi3_cconly", "*ashrdi3_one_bit_cconly_rex64",
        "*ashrdi3_cconly_rex64", "*ashrsi3_one_bit_cconly", "*ashrsi3_cconly",
        "*ashrhi3_one_bit_cconly", "*ashrhi3_cconly",
        "*ashrqi3_one_bit_cconly", "*ashrqi3_cconly",
        "*lshrdi3_cconly_one_bit_rex64", "*lshrdi3_cconly_rex64",
        "*lshrsi3_one_bit_cconly", "*lshrsi3_cconly",
        "*lshrhi3_one_bit_cconly", "*lshrhi3_cconly",
        "*lshrqi2_one_bit_cconly", "*lshrqi2_cconly": New patterns to
        implement only CC setting effects of shift instructions.

testsuite/ChangeLog:

       PR target/28946
       * gcc.target/i386/pr28946.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr28946.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog

Comment 12 uros 2006-09-15 17:42:50 UTC
Subject: Bug 28946

Author: uros
Date: Fri Sep 15 17:42:40 2006
New Revision: 116979

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=116979
Log:
        PR target/28946
        * config/i386/i386.md ("*ashldi3_cconly_rex64", "*ashlsi3_cconly",
        "*ashlhi3_cconly", "*ashlqi3_cconly", "*ashrdi3_one_bit_cconly_rex64",
        "*ashrdi3_cconly_rex64", "*ashrsi3_one_bit_cconly", "*ashrsi3_cconly",
        "*ashrhi3_one_bit_cconly", "*ashrhi3_cconly",
        "*ashrqi3_one_bit_cconly", "*ashrqi3_cconly",
        "*lshrdi3_cconly_one_bit_rex64", "*lshrdi3_cconly_rex64",
        "*lshrsi3_one_bit_cconly", "*lshrsi3_cconly",
        "*lshrhi3_one_bit_cconly", "*lshrhi3_cconly",
        "*lshrqi2_one_bit_cconly", "*lshrqi2_cconly": New patterns to
        implement only CC setting effects of shift instructions.

testsuite/ChangeLog:

        PR target/28946
        * gcc.target/i386/pr28946.c: New test.


Added:
    branches/gcc-4_1-branch/gcc/testsuite/gcc.target/i386/pr28946.c
Modified:
    branches/gcc-4_1-branch/gcc/ChangeLog
    branches/gcc-4_1-branch/gcc/config/i386/i386.md
    branches/gcc-4_1-branch/gcc/testsuite/ChangeLog

Comment 13 uros 2006-09-18 10:15:01 UTC
Subject: Bug 28946

Author: uros
Date: Mon Sep 18 10:14:53 2006
New Revision: 117022

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=117022
Log:
	PR target/28946
	* config/i386/i386.md ("*ashldi3_cconly_rex64", "*ashlsi3_cconly",
	"*ashlhi3_cconly", "*ashlqi3_cconly", "*ashrdi3_one_bit_cconly_rex64",
	"*ashrdi3_cconly_rex64", "*ashrsi3_one_bit_cconly", "*ashrsi3_cconly",
	"*ashrhi3_one_bit_cconly", "*ashrhi3_cconly",
	"*ashrqi3_one_bit_cconly", "*ashrqi3_cconly",
	"*lshrdi3_cconly_one_bit_rex64", "*lshrdi3_cconly_rex64",
	"*lshrsi3_one_bit_cconly", "*lshrsi3_cconly",
	"*lshrhi3_one_bit_cconly", "*lshrhi3_cconly",
	"*lshrqi2_one_bit_cconly", "*lshrqi2_cconly": New patterns to
	implement only CC setting effects of shift instructions.

testsuite/ChangeLog:

	PR target/28946
	* gcc.target/i386/pr28946.c: New test.


Added:
    branches/gcc-4_0-branch/gcc/testsuite/gcc.target/i386/pr28946.c
Modified:
    branches/gcc-4_0-branch/gcc/ChangeLog
    branches/gcc-4_0-branch/gcc/config/i386/i386.md
    branches/gcc-4_0-branch/gcc/testsuite/ChangeLog

Comment 14 Uroš Bizjak 2006-09-19 11:31:42 UTC
Fixed everywhere.