Summary: | Failure to optimize vectorized conversion to `int` with AVX | ||
---|---|---|---|
Product: | gcc | Reporter: | Gabriel Ravier <gabravier> |
Component: | tree-optimization | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | NEW --- | ||
Severity: | normal | CC: | crazylht, hjl.tools, rguenth, rsandifo |
Priority: | P3 | Keywords: | missed-optimization |
Version: | 11.0 | ||
Target Milestone: | --- | ||
Host: | | Target: | x86_64-* i?86-*-* |
Build: | | Known to work: | |
Known to fail: | | Last reconfirmed: | 2020-08-17 00:00:00 |
Bug Depends on: | 36844 | ||
Bug Blocks: | 53947 |
Description
Gabriel Ravier
2020-08-17 11:57:33 UTC
The relevant pattern is present in sse.md:

    (define_insn "fix_truncv4dfv4si2<mask_name>"
      [(set (match_operand:V4SI 0 "register_operand" "=v")
            (fix:V4SI (match_operand:V4DF 1 "nonimmediate_operand" "vm")))]
      "TARGET_AVX || (TARGET_AVX512VL && TARGET_AVX512F)"
      "vcvttpd2dq{y}\t{%1, %0<mask_operand2>|%0<mask_operand2>, %1}"

but for some reason it is not exercised by the target-independent part of the compiler.

Confirmed as a tree optimization problem. GCC doesn't seem very fond of using two different vector bit sizes at the same time, so VEC_PACK_FIX_TRUNC_EXPR takes two vectors of 2 doubles and gives one vector of 4 ints. At the RTL level, we have a vec_concat:V4DF of two V2DF vectors adjacent in memory, but nothing knows to turn that into a single load. (The conversion itself of 4 doubles to int is fine.)

The pattern is exercised directly by BB vectorization only; loop vectorization still uses a fixed vector size. Still, the assembly shows basically the same code when doing BB vectorization only:

    f:
    .LFB0:
            .cfi_startproc
            vmovupd (%rdi), %xmm1
            vinsertf128     $0x1, 16(%rdi), %ymm1, %ymm0
            vcvttpd2dqy     %ymm0, %xmm0
            vmovdqu %xmm0, (%rsi)
            vzeroupper
            ret

This is probably because of some tuning (split unaligned loads, not using a memory operand for vcvttpd2dqy). With -O3 -fno-tree-loop-vectorize -march=core-avx2 I get:

    f:
    .LFB0:
            .cfi_startproc
            vcvttpd2dqy     (%rdi), %xmm0
            vmovdqu %xmm0, (%rsi)
            ret
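The test case from the original report is not reproduced in the text above, so the following is only a sketch of the kind of code under discussion. It assumes a function that reads 4 doubles through its first argument and writes 4 ints through its second, which matches how `%rdi` and `%rsi` are used in the assembly. The names `convert4_scalar` and `convert4_avx` are hypothetical. The intrinsics version is compiled only when AVX is enabled and shows the single 256-bit load plus one `vcvttpd2dq` that the optimizer would ideally produce from the scalar loop.

```c
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical reduced test case (an assumption, not taken from the
   report): truncating conversion of 4 doubles to 4 ints. */
void convert4_scalar(const double *in, int *out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = (int)in[i];    /* rounds toward zero */
}

#if defined(__AVX__)
/* Hand-written equivalent: one unaligned 256-bit load, one
   vcvttpd2dq (V4DF -> V4SI), one 128-bit store. */
void convert4_avx(const double *in, int *out)
{
    __m256d v = _mm256_loadu_pd(in);        /* 4 doubles */
    __m128i r = _mm256_cvttpd_epi32(v);     /* 4 ints, truncated */
    _mm_storeu_si128((__m128i *)out, r);
}
#endif
```

Compiling the scalar function with the flags quoted above (for example `-O3 -fno-tree-loop-vectorize -march=core-avx2`) and inspecting the output with `-S` is one way to compare the generated code against the intrinsics version.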