Re: [PATCH] x86 peephole2s to optimize "1LL << x"


On Sat, 11 Sep 2004, Richard Henderson wrote:
> Stage3 aside, could you do some other P4 testing for me?
> I understand that P4 also is embarrassingly slow with cmov.
> The question is: is cmov slow enough to outweigh the slowness
> of the shifter?
>
> So we're selecting between
>
> 	(A) 2 setcc, 2 shifts
> 	(B) 1 shift, 2 cmov
> 	(C) 1 shift, branch


Ok.  I've now done a number of timings using different "1LL << x"
implementations, and the short answer to your question is "yes":
the large cost of "cmov" outweighs the slowness of the shifter,
and (A) is faster than (B), which is faster than (C).
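
For concreteness, here is roughly what the three contenders look like
on i386.  These are sketches, not the exact sequences the compiler
emits; they assume the shift count has been loaded into %ecx and the
64-bit result is returned in %edx:%eax.

        # (A) 2 setcc, 2 shifts
        xorl    %eax, %eax
        xorl    %edx, %edx
        testb   $32, %cl        # does the count reach the high word?
        sete    %al             # low word gets the 1 if x < 32
        setne   %dl             # high word gets the 1 if x >= 32
        shll    %cl, %eax       # shll masks its count to 5 bits,
        shll    %cl, %edx       # so both halves shift by x & 31

        # (B) 1 shift, 2 cmov (needs %ebx as a zero register)
        movl    $1, %eax
        xorl    %edx, %edx
        xorl    %ebx, %ebx
        shll    %cl, %eax       # eax = 1 << (x & 31)
        testb   $32, %cl
        cmovne  %eax, %edx      # x >= 32: the bit belongs in the high word
        cmovne  %ebx, %eax      # x >= 32: zero the low word

        # (C) 1 shift, branch
        movl    $1, %eax
        xorl    %edx, %edx
        shll    %cl, %eax
        testb   $32, %cl
        je      1f
        movl    %eax, %edx      # x >= 32: move the bit to the high word
        xorl    %eax, %eax
1: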


Now the details.  The experiment was run on a Genuine Intel Pentium 4
at 2.80GHz using the following driver program:

#include <stdio.h>

// 1024 shift counts in the range 0..63.
static unsigned int table[1024] = {
  0, 58,  6, 59, 24, 45, 46, 63, 44, 36, 43, 27, 42, 18, 29, 32,
 55, 23,  3, 47, 56, 22, 32, 47,  1, 30, 47, 24,  3, 31, 40, 11,
 23, 49, 40, 42, 32, 42,  0,  3, 35,  6, 43, 10,  3, 51, 25,  2,
 62, 54, 11, 29, 37, 11, 10, 44, 48, 29, 11,  9, 59, 17,  3,  6,
 ... };

extern long long foo(int x);  // return 1LL << x;

int main()
{
  long long x = 0;
  int i, j;

  for (i=0; i<4000000; i++)
  {
    for (j=0; j<1024; j++)
      x ^= foo(table[j]);
  }
  // Print the checksum so the loops can't be optimized away.
  printf("%016llu\n", (unsigned long long)x);
  return 0;
}

using different implementations of "foo", each provided as a separate
assembly file at link time.  The metric was user time reported by
"time a.out" running on Red Hat 9.  Each test was assigned a
single-letter code.  The three sequences described above are "a", "b"
and "c".  Additionally, current mainline is designated "o", mainline
with "-march=pentium4" (which uses cmov) is designated "p", and a
baseline run where foo simply returns zero is designated "z".

Two additional tests were also run.  Tests "b" and "p" have a
perhaps unfair disadvantage in that they require an extra register,
"%ebx", which is callee-saved, hence "b" and "p" carry "pushl %ebx"
and "popl %ebx" overhead.  Run "d" is a duplicate of "a" where we
also save/restore %ebx.  Finally, run "e" is a variation of "c"
where I tried an "xchgl %eax, %edx" after the branch, instead of
the "movl %eax, %edx ; xorl %eax, %eax" sequence.
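
For reference, "c" and "e" differ only in the tail after the branch.
A sketch, using the same register assignments as above:

        # "c": fix-up when the branch is taken (x >= 32)
        movl    %eax, %edx
        xorl    %eax, %eax

        # "e": the same fix-up as a single exchange, valid only
        # because %edx is already known to be zero at this point
        xchgl   %eax, %edx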

Results (median of three executions):

        Time    Time-Baseline   % of mainline "o"   Instructions

a       26.98s   2.98s           13.1%              2*setcc+2*shift
b       30.03s   6.03s           26.6%              push+shift+2*cmov
c       43.23s  19.23s           84.8%              shift+branch
d       29.96s   5.96s           26.3%              push+2*setcc+2*shift
e       44.36s  20.36s           89.7%              shift+branch+xchg
o       46.69s  22.69s          100.0%              shld+shift+branch
p       39.67s  15.67s           69.1%              push+shld+shift+2*cmov
z       24.00s   0.00s            0.0%              (returns zero)


To summarize, method "a" is clearly the fastest.  Even when factoring
in the save/restore overhead of an extra register (run "d"), it's
still faster than method "b".  Clearly "cmov" on the Pentium 4 is
relatively expensive.

Compared to branch sequences like "c", "e" and "o", however,
conditional moves are clearly a win.  Interestingly, the cost of a
branch is so high that it hides the large "shld" overhead: notice
that "o" is not much worse than "c", but "p" is far worse than "b",
where the cost of "shld" can't be hidden behind the "cmov"s.
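
For comparison, the mainline sequence "o" corresponds to a DImode
left shift along these lines: shld, a shift, and then the same
branchy fix-up as "c".  Again a sketch, not the literal output:

        movl    $1, %eax
        xorl    %edx, %edx
        shldl   %cl, %eax, %edx # shift bits from the low word into the high
        shll    %cl, %eax
        testb   $32, %cl
        je      1f
        movl    %eax, %edx      # x >= 32: fix up the two halves
        xorl    %eax, %eax
1: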

My idea of using "xchg" also didn't work out ("e" is slower than "c")!



But the real bottom line (in my mind) is that implementing your method
(A), 2*setcc + 2*shift, in the x86 backend results in a "1LL << x"
implementation an astounding seven times faster than current mainline.


Roger
--

