[PATCH] AVX2 vec_widen_[su]mult_{hi,lo}*, sdot_prod* and udot_prod*

Richard Henderson rth@redhat.com
Fri Oct 14 16:32:00 GMT 2011


On 10/14/2011 07:18 AM, Jakub Jelinek wrote:
> +  /* This would be 2 insns shorter if
> +     rperm[i] = GEN_INT (((~i & 1) << 2) + i / 2);
> +     had been used instead (both vpsrlq insns wouldn't be needed),
> +     but vec_widen_*mult_hi_* is usually used together with
> +     vec_widen_*mult_lo_* and by writing it this way the load
> +     of the constant and the two vpermd instructions (cross-lane)
> +     can be CSEd together.  */
> +  for (i = 0; i < 8; ++i)
> +    rperm[i] = GEN_INT (((i & 1) << 2) + i / 2);
> +  vperm = gen_rtx_CONST_VECTOR (V8SImode, gen_rtvec_v (8, rperm));
> +  vperm = force_reg (V8SImode, vperm);
> +  emit_insn (gen_avx2_permvarv8si (t1, vperm, operands[1]));
> +  emit_insn (gen_avx2_permvarv8si (t2, vperm, operands[2]));
> +  emit_insn (gen_lshrv4di3 (gen_lowpart (V4DImode, t3),
> +			    gen_lowpart (V4DImode, t1), GEN_INT (32)));
> +  emit_insn (gen_lshrv4di3 (gen_lowpart (V4DImode, t4),
> +			    gen_lowpart (V4DImode, t2), GEN_INT (32)));
> +  emit_insn (gen_avx2_<u>mulv4siv4di3 (operands[0], t3, t4));

So what you're doing here is the low-part permutation:

	0 4 1 5 2 6 3 7

followed by a shift to get

	4 . 5 . 6 . 7 .

But you need to load a 256-bit constant from memory to get it.
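
For reference, roughly the same sequence written with AVX2 intrinsics (a
sketch only, with an invented helper name and assuming the signed case;
the patch itself emits this through gen_avx2_permvarv8si, gen_lshrv4di3
and gen_avx2_<u>mulv4siv4di3 rather than intrinsics):

#include <immintrin.h>

/* Sketch (not the actual expander): the vpermd + vpsrlq + vpmuldq
   sequence for the hi half.  The lo half would feed t1/t2 to vpmuldq
   directly, which is the CSE opportunity the comment above refers to.  */
static __m256i
widen_smult_hi_v8si_sketch (__m256i a, __m256i b)
{
  /* rperm[i] = ((i & 1) << 2) + i / 2  ->  0 4 1 5 2 6 3 7.  */
  const __m256i perm = _mm256_setr_epi32 (0, 4, 1, 5, 2, 6, 3, 7);
  __m256i t1 = _mm256_permutevar8x32_epi32 (a, perm);  /* vpermd  */
  __m256i t2 = _mm256_permutevar8x32_epi32 (b, perm);  /* vpermd  */
  __m256i t3 = _mm256_srli_epi64 (t1, 32);             /* vpsrlq  */
  __m256i t4 = _mm256_srli_epi64 (t2, 32);             /* vpsrlq  */
  return _mm256_mul_epi32 (t3, t4);                    /* vpmuldq */
}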

I wonder if it wouldn't be better to use VPERMQ to handle the lane change:

	0   2   1   3
	0 1 4 5 2 3 6 7

shared between the hi/lo, and a VPSHUFD to handle the in-lane ordering:

	0 0 1 1 2 2 3 3
	4 4 5 5 6 6 7 7

In the end we get 2 vpermq + (2+2) vpshufd = 6 insns of setup prior to the
VPMULDQs, as compared to your 1 constant load + 2 vpermd + (0+2) vpsrlq = 5
insns, but with no need to wait for the constant load.  Of course, if the
constant load gets hoisted out of the loop, yours will likely win on
throughput.
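
Concretely, the alternative might look like this in intrinsics (again only
a sketch with an invented helper, signed case assumed):

#include <immintrin.h>

/* Sketch of the vpermq + vpshufd variant: the two vpermq results are
   shared between the hi and lo products, so no memory constant is needed.  */
static void
widen_smult_hilo_v8si_sketch (__m256i a, __m256i b,
			      __m256i *lo, __m256i *hi)
{
  /* vpermq with 0 2 1 3: dword order becomes 0 1 4 5 | 2 3 6 7.  */
  __m256i pa = _mm256_permute4x64_epi64 (a, _MM_SHUFFLE (3, 1, 2, 0));
  __m256i pb = _mm256_permute4x64_epi64 (b, _MM_SHUFFLE (3, 1, 2, 0));

  /* vpshufd, in-lane: 0 0 1 1 2 2 3 3.  */
  __m256i la = _mm256_shuffle_epi32 (pa, _MM_SHUFFLE (1, 1, 0, 0));
  __m256i lb = _mm256_shuffle_epi32 (pb, _MM_SHUFFLE (1, 1, 0, 0));
  /* vpshufd, in-lane: 4 4 5 5 6 6 7 7.  */
  __m256i ha = _mm256_shuffle_epi32 (pa, _MM_SHUFFLE (3, 3, 2, 2));
  __m256i hb = _mm256_shuffle_epi32 (pb, _MM_SHUFFLE (3, 3, 2, 2));

  *lo = _mm256_mul_epi32 (la, lb);  /* vpmuldq reads the low dwords */
  *hi = _mm256_mul_epi32 (ha, hb);  /* vpmuldq */
}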

Thoughts, Uros and those looking in from Intel?

Otherwise it looks ok.


r~


