This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Wed, 01 Aug 2018 05:51:23 +0000
Subject: [Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using
Auto-submitted: auto-generated
References: <bug-82459-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
The VPAND instructions in the 256-bit version are a missed-optimization.

I had another look at this with current trunk.  Code-gen is similar to before
with -march=skylake-avx512 -mprefer-vector-width=512.  (If we improve code-gen
for that choice, it will make it a win in more cases.)

https://godbolt.org/g/2dfkNV

Loads are folding into the shifts now, unlike with gcc7.3.  (But they can't
micro-fuse because of the indexed addressing mode.  A pointer increment might
save 1 front-end uop even in the non-unrolled loop)

The separate integer loop counter is gone, replaced with a compare against an
end-index.

But we're still doing 2x vpmovwb + vinserti64x4 instead of vpackuswb + vpermq. 
Fewer instructions and (more importantly) 1/3 the shuffle uops.  GCC knows how
to do this for the 256-bit version, so it's apparently a failure of the
cost-model that it doesn't for the 512-bit version.  (Maybe requiring a
shuffle-control vector instead of immediate puts it off?  Or maybe it's
counting the cost of the useless vpand instructions for the pack / permq
option, even though they're not part of the shuffle-throughput bottleneck?)

----

We do use vpackuswb + vpermq for 256-bit, but we have redundant AND
instructions with set1_epi16(0x00FF) after a right shift already leaves the
high byte zero.

---

Even if vmovdqu8 is not slower, it's larger than AVX vmovdqu.  GCC should be
using the VEX encoding of an instruction whenever it does exactly the same
thing.  At least we didn't use vpandd or vpandq EVEX instructions.

(I haven't found any confirmation about vmovdqu8 costing an extra ALU uop as a
store with no masking.  Hopefully it's efficient.)

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]