This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [AVX] PATCH: Add vzeroall/vzeroupper patterns
- From: "H.J. Lu" <hjl dot tools at gmail dot com>
- To: Uros Bizjak <ubizjak at gmail dot com>
- Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>
- Date: Sat, 12 Apr 2008 07:49:21 -0700
- Subject: Re: [AVX] PATCH: Add vzeroall/vzeroupper patterns
- References: <4800B0C2.4090000@gmail.com>
On Sat, Apr 12, 2008 at 02:53:22PM +0200, Uros Bizjak wrote:
> Hello!
>
>> +(define_insn "avx_vzeroall"
>> + [(unspec_volatile [(const_int 0)] UNSPECV_VZEROALL)
>> + (clobber (reg:V8SI XMM0_REG))
>> + (clobber (reg:V8SI XMM1_REG))
>> + (clobber (reg:V8SI XMM2_REG))
>> + (clobber (reg:V8SI XMM3_REG))
>> + (clobber (reg:V8SI XMM4_REG))
>> + (clobber (reg:V8SI XMM5_REG))
>> + (clobber (reg:V8SI XMM6_REG))
>> + (clobber (reg:V8SI XMM7_REG))]
>
> This is not a good approach: by using unspecs, you are hiding from the
> compiler what the pattern actually does. I would recommend using a parallel
> of (set (reg:V8SI XMMx_REG) (const_vector:V8SI [(const_int:SI 0)...])),
> something like the attached patch (modeled as a V4SI version, so I was able
> to "test" it on SSE2):
>
> Also, once the pattern defines exactly what vzeroall does, IMO it does not
> need to be declared volatile.
>
> Using the attached patch (please note that these patterns handle REX SSE
> registers as well), I was able to generate vzeroall with the following test:
>
> --cut here--
> typedef float __m128f __attribute__ ((__vector_size__ (16)));
>
> __m128f _mm_vsetzero (int i)
> {
>   register __m128f x __asm__ ("xmm5");
>
>   __builtin_ia32_vzeroall ();
>
>   // x = (__m128f) { 0.0f, 0.0f, 0.0f, 0.0f };
>   return __builtin_ia32_cvtsi2ss (x, i);
> }
> --cut here--
>
> gcc -O2:
>
> .LFB2:
>         vzeroall
>         movaps   %xmm5, %xmm0
>         cvtsi2ss %edi, %xmm0
>         ret
>
> When the third line was uncommented, gcc figured out that the dependency
> chain between vzeroall and cvtsi2ss was broken and, since all other xmm
> registers were unused, it correctly removed vzeroall.
>
> Uros.
I tried it. It doesn't work as expected:
bash-3.2$ cat bad.c
#include <gmmintrin.h>
extern __m256i bar (__m256i);
extern __m128i bar (__m128i);
__m128i
foo2 (__m128i x, __m256i y)
{
  _mm256_zeroall ();
  return bar (x);
}
bash-3.2$ /export/build/gnu/gcc-avx/build-x86_64-linux/gcc/xgcc \
    -B/export/build/gnu/gcc-avx/build-x86_64-linux/gcc/ -mavx -Wall -S \
    bad.c -O2
bash-3.2$ cat bad.s
        .file   "bad.c"
        .text
        .p2align 4,,15
.globl foo2
        .type   foo2, @function
foo2:
.LFB686:
        jmp     bar
.LFE686:
        .size   foo2, .-foo2
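Neither a save/restore of x nor the vzeroall itself appears in the
output. Since we generate

  (set (reg/v:V4SI 60 [ x ]) (reg:V4SI 21 xmm0 [ x ]))
  (parallel [
      (set (reg:V8SI 21 xmm0)
      ...
  (set (reg:V4SI 21 xmm0) (reg/v:V4SI 60 [ x ]))

xmm0 is overwritten right after the parallel and none of the other
registers it sets is used, so vzeroall is optimized out as dead. Do you
have any suggestions?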
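
To make the failure mode concrete, here is a toy backward-liveness model
in plain C. It is only a sketch and is not GCC's dead-code elimination:
the register numbering, struct insn, eliminate_dead and vzeroall_survives
are all made up for illustration. It assumes the simplest rule, that an
insn is deleted when it is not volatile and none of the registers it sets
is live afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical register numbering: bits 0..7 are xmm0..xmm7,
   bit 8 stands for pseudo 60, which holds x.  */
enum { XMM0 = 0, NUM_XMM = 8, PSEUDO60 = 8 };

#define BIT(r)  (1u << (r))
#define ALL_XMM (BIT (NUM_XMM) - 1u)

struct insn
{
  const char *name;
  unsigned defs;      /* registers written */
  unsigned uses;      /* registers read */
  bool is_volatile;   /* must never be deleted */
  bool dead;          /* filled in by eliminate_dead */
};

/* Walk the sequence backwards, tracking live registers.  A non-volatile
   insn whose results are all dead is marked dead and contributes no
   uses.  Returns the number of insns marked dead.  */
static int
eliminate_dead (struct insn *seq, int n)
{
  unsigned live = 0;  /* nothing is live after the last insn */
  int removed = 0;

  assert (seq != NULL && n >= 0);
  for (int i = n - 1; i >= 0; i--)
    {
      seq[i].dead = !seq[i].is_volatile && (seq[i].defs & live) == 0;
      if (seq[i].dead)
        {
          removed++;
          continue;
        }
      live = (live & ~seq[i].defs) | seq[i].uses;
    }
  return removed;
}

/* Model of foo2: copy x out of xmm0, vzeroall, copy x back, tail call.  */
static bool
vzeroall_survives (bool vzeroall_is_volatile)
{
  struct insn seq[] = {
    { "pseudo60 = xmm0", BIT (PSEUDO60), BIT (XMM0), false, false },
    { "vzeroall", ALL_XMM, 0, vzeroall_is_volatile, false },
    { "xmm0 = pseudo60", BIT (XMM0), BIT (PSEUDO60), false, false },
    { "jmp bar", 0, BIT (XMM0), true, false },
  };

  (void) eliminate_dead (seq, 4);
  return !seq[1].dead;
}
```

In this model vzeroall_survives (false) returns false, because the copy
back into xmm0 kills the only register of the parallel that is live
afterwards, while vzeroall_survives (true) returns true, which
corresponds to the unspec_volatile version of the pattern.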
Thanks.
H.J.