This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH, i386]: Fix PR target/13958:Conversion from unsigned to double is painfully slow on P4
- From: "Richard Guenther" <richard dot guenther at gmail dot com>
- To: "Uros Bizjak" <ubizjak at gmail dot com>
- Cc: "GCC Patches" <gcc-patches at gcc dot gnu dot org>, "H. J. Lu" <hjl at lucon dot org>
- Date: Sat, 22 Mar 2008 13:30:10 +0100
- Subject: Re: [PATCH, i386]: Fix PR target/13958:Conversion from unsigned to double is painfully slow on P4
- References: <47E40BF2.4010303@gmail.com>
On Fri, Mar 21, 2008 at 8:26 PM, Uros Bizjak <ubizjak@gmail.com> wrote:
> Hello!
>
> Due to a store forwarding penalty (as partial memory access is
> called nowadays), the code from the PR runs "painfully" slow:
>
> --cut here--
> unsigned a[2]={1,2};
>
> inline unsigned foo1(int i) { return a[i]; }
>
> int main()
> {
>   double x=0;
>   int i;
>
>   for ( i=0; i<100000000; ++i )
>     x+=foo1(i%2);
>
>   return (int)x;
> }
> --cut here--
>
> The inner loop is compiled (-O2 -march=pentium4 -malign-double) to:
>
> .L4:
> movl %ecx, %eax
> andl $1, %eax
> movl a(,%eax,4), %eax
> xorl %edx, %edx
> (*) pushl %edx
> (*) pushl %eax
> (*) fildll (%esp)
> addl $8, %esp
> faddp %st, %st(1)
> addl $1, %ecx
> cmpl $100000000, %ecx
> jne .L4
>
> The instructions marked with (*) form a partial memory access: two
> 32-bit stores followed by a single 64-bit load of the same location.
>
> Runtime:
>
> time ./a.out
>
> real 0m0.794s
> user 0m0.724s
> sys 0m0.000s
>
> Patched gcc creates:
>
> .L4:
> movl %edx, %eax
> andl $1, %eax
> movd a(,%eax,4), %xmm0
> movq %xmm0, -16(%ebp)
> fildll -16(%ebp)
> faddp %st, %st(1)
> addl $1, %edx
> cmpl $100000000, %edx
> jne .L4
>
> time ./a.out
>
> real 0m0.123s
> user 0m0.124s
> sys 0m0.000s
>
> This represents a more than 5.8x speedup (user time) on what the PR describes as:
>
> --quote--
>
> Btw, such conversions are quite common in numerical codes that deal
> with uniform grids: the array index can be used as a coordinate (usually
> after some trivial scaling). Given that the indices used in libstdc++
> are usually of the type size_t the slow conversion can have quite a
> negative performance impact.
>
> --unquote--
>
> I guess that such a speedup comes in quite handy. This code prefers DImode
> aligned to 8, since we are dealing with real DImode values. H.J. -
> should we align DImode values to 8 for TARGET_MMX/TARGET_SSE ?
>
> 2008-03-21 Uros Bizjak <ubizjak@gmail.com>
>
> PR target/13958
> * config/i386/i386.md ("*floatunssi<mode>2_1"): New pattern with
> corresponding post-reload splitters.
> ("floatunssi<mode>2"): Expand to unsigned_float x87 insn pattern
> when x87 FP math is selected.
> * config/i386/i386-protos.h (ix86_expand_convert_uns_sixf_sse):
> New function prototype.
> * config/i386/i386.c (ix86_expand_convert_uns_sixf_sse): New
> unreachable function to ease macroization of insn patterns.
>
> The patch was bootstrapped and regression tested on x86_64-pc-linux-gnu
> {,-m32}, and is committed to SVN.
>
> RMs, do we want this patch in 4.3.1, although it isn't strictly a
> regression?
Does this only affect P4 as the PR states? Does this have a measurable positive
impact on SPEC?
Otherwise in general no, not without overwhelming benefit.
Thanks,
Richard.