- From: Roger Sayle <roger at eyesopen dot com>
- To: Richard Henderson <rth at redhat dot com>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Sat, 11 Sep 2004 21:14:08 -0600 (MDT)
- Subject: Re: [PATCH] x86 peephole2s to optimize "1LL << x"
On Sat, 11 Sep 2004, Richard Henderson wrote:
> Stage3 aside, could you do some other P4 testing for me?
> I understand that P4 also is embarrassingly slow with cmove.
> The question is: is cmov slow enough to outweigh the slowness
> of the shifter?
>
> So we're selecting between
>
> (A) 2 setcc, 2 shifts
> (B) 1 shift, 2 cmov
> (C) 1 shift, branch
Ok. I've now done a number of timings using different "1LL << x"
implementations, and the short answer to your question is "yes",
the large cost of the "cmov" outweighs the speed of the shifter,
and (A) is faster than (B) is faster than (C).
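For reference, here is roughly how I read the three candidate
sequences, assuming the shift count x is in %cl and the 64-bit result
is returned in %edx:%eax (the exact register choices are my own guess;
both (A) and (B) rely on 32-bit shifts masking their count to 5 bits):

	# (A) 2 setcc, 2 shifts
	xorl	%eax, %eax
	xorl	%edx, %edx
	testb	$32, %cl
	sete	%al		# low word gets the 1 when x < 32
	setne	%dl		# high word gets the 1 when x >= 32
	sall	%cl, %eax	# shift count is masked to x & 31
	sall	%cl, %edx

	# (B) 1 shift, 2 cmov (cmov can't take an immediate,
	# hence the extra zero register %ebx)
	movl	$1, %eax
	xorl	%ebx, %ebx
	xorl	%edx, %edx
	sall	%cl, %eax	# %eax = 1 << (x & 31)
	testb	$32, %cl
	cmovne	%eax, %edx	# x >= 32: the bit belongs in the high word
	cmovne	%ebx, %eax	# x >= 32: low word becomes zero

	# (C) 1 shift, branch
	movl	$1, %eax
	xorl	%edx, %edx
	sall	%cl, %eax
	testb	$32, %cl
	je	1f
	movl	%eax, %edx	# x >= 32: move the bit to the high word
	xorl	%eax, %eax
1: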
Now the details. The experiment was run on a Genuine Intel Pentium 4
at 2.80GHz using the following driver program:
#include <stdio.h>

static unsigned int table[1024] = {
   0, 58,  6, 59, 24, 45, 46, 63, 44, 36, 43, 27, 42, 18, 29, 32,
  55, 23,  3, 47, 56, 22, 32, 47,  1, 30, 47, 24,  3, 31, 40, 11,
  23, 49, 40, 42, 32, 42,  0,  3, 35,  6, 43, 10,  3, 51, 25,  2,
  62, 54, 11, 29, 37, 11, 10, 44, 48, 29, 11,  9, 59, 17,  3,  6,
  ... };

extern long long foo(int x);  /* return 1LL << x; */

int main()
{
  unsigned long long x = 0;
  int i, j;

  /* Roughly four billion calls to foo, with shift counts from table.  */
  for (i=0; i<4000000; i++)
    {
      for (j=0; j<1024; j++)
        x ^= foo(table[j]);
    }
  printf("%016llu\n", x);
  return 0;
}
using different implementations of "foo" assembled at link time.
The metric was user time reported by "time a.out" running on Red Hat 9.
Each test is designated by a single-letter code. The three tests
described above are "a", "b" and "c". Additionally, current mainline
is designated "o", mainline with "-march=pentium4" (which uses cmov)
is designated "p", and a baseline run where foo simply returns zero
is designated "z".
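For concreteness, each variant lives in its own foo.s; the baseline
"z", for instance, is just the following (the file layout is mine,
with the 64-bit result returned in %edx:%eax per the i386 convention):

	.text
	.globl	foo
foo:				# baseline "z": return 0LL
	xorl	%eax, %eax	# low word
	xorl	%edx, %edx	# high word
	ret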
Two additional tests were run. Tests "b" and "p" are at a perhaps
unfair disadvantage in that they require an extra register, "%ebx",
which is callee-saved, hence "b" and "p" incur "pushl %ebx" and
"popl %ebx" overhead. Run "d" is a duplicate of "a" where we also
save/restore %ebx. Finally, run "e" is a variation of "c" where I
tried using "xchgl %eax, %edx" after the branch, instead of the
"movl %eax, %edx ; xorl %eax, %eax" sequence.
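For reference, the "e" fixup replaces those two instructions with a
single swap, which works because %edx is still zero at that point:

	testb	$32, %cl
	je	1f
	xchgl	%eax, %edx	# x >= 32: swap the shifted bit into the high word
1: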
Results (median of three executions):

         Time   Time-Baseline   Percent of       Instructions
                                Mainline ("o")
    a   26.98s       2.98s          13.1%        2*setcc+2*shift
    b   30.03s       6.03s          26.6%        push+shift+2*cmov
    c   43.23s      19.23s          84.8%        shift+branch
    d   29.96s       5.96s          26.3%        push+2*setcc+2*shift
    e   44.36s      20.36s          89.7%        shift+branch+xchg
    o   46.69s      22.69s         100.0%        shld+shift+branch
    p   39.67s      15.67s          69.1%        push+shld+shift+2*cmov
    z   24.00s       0.00s           0.0%
To summarize, method "a" is clearly the fastest. Even when factoring
in the save/restore overhead of an extra register ("d"), it is still
faster than method "b". Clearly, "cmov" on the Pentium 4 is relatively
expensive. Compared to the branch sequences "c", "e" and "o", however,
conditional moves are clearly a win over a conditional branch.
Interestingly, the cost of a branch is so high that it hides the large
"shld" overhead: notice that "o" is not much worse than "c", but "p"
is far worse than "b" (where the cost of the "shld" can't be hidden
by the "cmov"s).
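For comparison, my reading of what the mainline "o" sequence
(shld+shift+branch) amounts to, under the same register assumptions
as above:

	movl	$1, %eax
	xorl	%edx, %edx
	shldl	%cl, %eax, %edx	# high word picks up bits shifted out of %eax
	sall	%cl, %eax
	testb	$32, %cl
	je	1f
	movl	%eax, %edx	# x >= 32: same fixup as "c"
	xorl	%eax, %eax
1: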
My idea of using "xchg" also didn't work out ("e" is slower than "c")!
But the real bottom line (in my mind) is that implementing your method
(A), 2*setcc + 2*shift, in the x86 backend results in a "1LL << x"
implementation an astounding seven times faster than current mainline.
Roger