This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

[PATCH] x86 peephole2s to optimize "1LL << x"

From: Roger Sayle <roger at eyesopen dot com>
To: gcc-patches at gcc dot gnu dot org
Date: Sat, 11 Sep 2004 12:04:44 -0600 (MDT)
Subject: [PATCH] x86 peephole2s to optimize "1LL << x"

I'm hoping the following performance improving patch to the i386
backend will be considered acceptable in early stage3.  This issue
might be considered serious enough to be a bug, though I couldn't
find a matching PR.

The issue is that GCC's x86 backend currently generates particularly
inefficient code for the relatively common idiom, "1LL << x".  Current
mainline generates the following code with "-O2 -fomit-frame-pointer".

long long foo(int x)
{
  return 1LL << x;
}

foo:	movl    4(%esp), %ecx
        movl    $1, %eax
        xorl    %edx, %edx
        shldl   %eax, %edx   <---
        sall    %cl, %eax
        andl    $32, %ecx
        je      .L3
        movl    %eax, %edx
        xorl    %eax, %eax
.L3:	ret


The cause of concern is the "shldl %eax, %edx" instruction in the
middle of this sequence.  This is a special instruction designed
to simplify double word shifts by lshifting the contents of %edx
(the high word) and filling the bottom bits with the top bits of
%eax (the low word).  The amount shifted being 0 to 31 bits.

When the LHS operand of the DImode shift is 1 or -1, however, this
operation is redundant, as the top 0 to 31 bits of %eax always have
the same value as all of the bits in %edx.

One most x86 processors this redundant instruction is an inconvenience,
on the Intel Pentium4 however it is crippling.  This instruction isn't
implemented directly in hardware, and is decomposed to several SImode
shift operations which are already hideously slow on P4.  The result
is that "shldl" and "shrdl" cost 14(!) cycles, or are approximately 28
times slower than an addition or logical integer operation.  When
tuning for pentium4, the additional overhead of this surplus insn
more than doubles the cost of these DImode shifts.

The use of "(long long)1 << x" is particularly common in many C codes,
including compilers and chess programs, such as crafty.  Analysis of
the DImode left shifts generated by GCC during a three stage bootstrap
including runtimes, revealed that 20% had a immediate constant as the
first operand, of these 73% where the value 1, 23% were the value -1
and other constant values making up the remaining 4%.


The following patch adds three new peephole2 patterns, that catch these
problematic cases.  The first catches "1", the second catches "-1" and
the third catches "-1" when compiled with "-Os".  The same optimiation
applies to "0" which should never occur, and "-2" which appears too
rarely to worry about.  It might be possible to combine the second and
third, using clever predicates but alas that was beyond my .md skills.
These optimizations are implemented as peepholes, as the i386 backend
doesn't lower ashldi3 until after cse/gcse/combine, and doesn't know
whether op0 is an immediate constant at the point the shift is split.


The following patch has been tested on i686-pc-linux-gnu with a full
"make bootstrap", all default languages, and regression tested with a
top-level "make -k check" with no new failures.

Ok for mainline?



2004-09-11  Roger Sayle  <roger@eyesopen.com>

	* config/i386/i386.md: Add new peephole2's to eliminate unnecessary
	uses of the x86's shld instruction in "1LL << x" and "-1LL << x".


Index: config/i386/i386.md
===================================================================
RCS file: /cvs/gcc/gcc/gcc/config/i386/i386.md,v
retrieving revision 1.558
diff -c -3 -p -r1.558 i386.md
*** config/i386/i386.md	8 Sep 2004 05:08:26 -0000	1.558
--- config/i386/i386.md	11 Sep 2004 14:56:51 -0000
***************
*** 10793,10798 ****
--- 10793,10850 ----
     (set_attr "pent_pair" "np")
     (set_attr "athlon_decode" "vector")])

+ ;; The x86_shld_1 pattern can be eliminated if the lhs of the shift is
+ ;; either 1 or -1.  The use of xorl (which clobbers flags) and the -Os
+ ;; optimization of loading -1 requires the three peephole patterns below.
+
+ (define_peephole2
+   [(set (match_operand:SI 0 "register_operand" "")
+ 	(const_int 1))
+    (parallel [(set (match_operand:SI 1 "register_operand" "")
+ 		   (const_int 0))
+ 	      (clobber (reg:CC FLAGS_REG))])
+    (parallel [(set (match_dup 1)
+ 		   (ior:SI (ashift:SI (match_dup 1)
+ 			      (match_operand:QI 2 "nonmemory_operand" ""))
+ 			   (lshiftrt:SI (match_dup 0)
+ 			      (minus:QI (const_int 32) (match_dup 2)))))
+ 	      (clobber (reg:CC FLAGS_REG))])]
+   ""
+   [(set (match_dup 0) (const_int 1))
+    (parallel [(set (match_dup 1) (const_int 0))
+ 	      (clobber (reg:CC FLAGS_REG))])])
+
+ (define_peephole2
+   [(set (match_operand:SI 0 "register_operand" "")
+ 	(const_int -1))
+    (set (match_operand:SI 1 "register_operand" "")
+ 	(const_int -1))
+    (parallel [(set (match_dup 1)
+ 		   (ior:SI (ashift:SI (match_dup 1)
+ 			      (match_operand:QI 2 "nonmemory_operand" ""))
+ 			   (lshiftrt:SI (match_dup 0)
+ 			      (minus:QI (const_int 32) (match_dup 2)))))
+ 	      (clobber (reg:CC FLAGS_REG))])]
+   ""
+   [(set (match_dup 0) (const_int -1))
+    (set (match_dup 1) (const_int -1))])
+
+ (define_peephole2
+   [(set (match_operand:SI 0 "register_operand" "")
+ 	(const_int -1))
+    (set (match_operand:SI 1 "register_operand" "")
+ 	(match_dup 0))
+    (parallel [(set (match_dup 1)
+ 		   (ior:SI (ashift:SI (match_dup 1)
+ 			      (match_operand:QI 2 "nonmemory_operand" ""))
+ 			   (lshiftrt:SI (match_dup 0)
+ 			      (minus:QI (const_int 32) (match_dup 2)))))
+ 	      (clobber (reg:CC FLAGS_REG))])]
+   ""
+   [(set (match_dup 0) (const_int -1))
+    (set (match_dup 1) (match_dup 0))])
+
+
  (define_expand "x86_shift_adj_1"
    [(set (reg:CCZ FLAGS_REG)
  	(compare:CCZ (and:QI (match_operand:QI 2 "register_operand" "")


Roger
--
Roger Sayle,                         E-mail: roger@eyesopen.com
OpenEye Scientific Software,         WWW: http://www.eyesopen.com/
Suite 1107, 3600 Cerrillos Road,     Tel: (+1) 505-473-7385
Santa Fe, New Mexico, 87507.         Fax: (+1) 505-473-0833

Follow-Ups:
- Re: [PATCH] x86 peephole2s to optimize "1LL << x"
  - From: Richard Henderson
- Re: [PATCH] x86 peephole2s to optimize "1LL << x"
  - From: Richard Henderson

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]