[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics
crazylht at gmail dot com
gcc-bugzilla@gcc.gnu.org
Tue Oct 20 07:59:28 GMT 2020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366
--- Comment #6 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Alexander Monakov from comment #5)
> afaict LRA is just following IRA decisions, and IRA allocates that pseudo to
> memory due to costs.
>
> Not sure where strange cost is coming from, but it depends on x86 tuning
> options: with -mtune=skylake we get the expected code, with -mtune=haswell
> we get 128-bit vectors right and extra load for 256-bit, with -mtune=generic
> both cases have extra loads.
in
----
/* If this insn loads a parameter from its stack slot, then it
   represents a savings, rather than a cost, if the parameter is
   stored in memory.  Record this fact.

   Similarly if we're loading other constants from memory (constant
   pool, TOC references, small data areas, etc) and this is the only
   assignment to the destination pseudo.

   Don't do this if SET_SRC (set) isn't a general operand, if it is
   a memory requiring special instructions to load it, decreasing
   mem_cost might result in it being loaded using the specialized
   instruction into a register, then stored into stack and loaded
   again from the stack.  See PR52208.

   Don't do this if SET_SRC (set) has side effect.  See PR56124.  */
if (set != 0 && REG_P (SET_DEST (set)) && MEM_P (SET_SRC (set))
    && (note = find_reg_note (insn, REG_EQUIV, NULL_RTX)) != NULL_RTX
    && ((MEM_P (XEXP (note, 0))
         && !side_effects_p (SET_SRC (set)))
        || (CONSTANT_P (XEXP (note, 0))
            && targetm.legitimate_constant_p (GET_MODE (SET_DEST (set)),
                                              XEXP (note, 0))
            && REG_N_SETS (REGNO (SET_DEST (set))) == 1))
    && general_operand (SET_SRC (set), GET_MODE (SET_SRC (set)))
    /* LRA does not use equiv with a symbol for PIC code.  */
    && (! ira_use_lra_p || ! pic_offset_table_rtx
        || ! contains_symbol_ref_p (XEXP (note, 0))))
  {
    enum reg_class cl = GENERAL_REGS;
    rtx reg = SET_DEST (set);
    int num = COST_INDEX (REGNO (reg));

    COSTS (costs, num)->mem_cost
      -= ira_memory_move_cost[GET_MODE (reg)][cl][1] * frequency;
    record_address_regs (GET_MODE (SET_SRC (set)),
                         MEM_ADDR_SPACE (SET_SRC (set)),
                         XEXP (SET_SRC (set), 0), 0, MEM, SCRATCH,
                         frequency * 2);
    counted_mem = true;
  }
---
for
(insn 9 8 11 3 (set (reg:V2DI 88 [ _16 ])
(mem:V2DI (plus:DI (reg/v/f:DI 91 [ input ])
(reg:DI 89 [ ivtmp.11 ])) [0 MEM[(const __m128i *
{ref-all})input_7(D) + ivtmp.11_40 * 1]+0 S16 A128]))
"/export/users2/liuhongt/tools-build/build_gcc11_master_debug/gcc/include/emmintrin.h":697:10
1405 {movv2di_internal}
the mem_cost for r88 is reduced by
ira_memory_move_cost[V2DImode][GENERAL_REGS][1], which gives -11808 as its
initial value. Since r88 is a vector pseudo, it should really be reduced by
ira_memory_move_cost[V2DImode][SSE_REGS][1], which would give -5905 as the
initial value. So the code above adds too much preference for memory here.
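
To make the arithmetic concrete, here is a small stand-alone sketch of that
initial adjustment. It is not GCC code: the enum, the cost table and the
frequency are hypothetical, chosen only so that the GENERAL_REGS load cost is
about twice the SSE_REGS one, which is roughly the ratio between -11808 and
-5905.

```c
/* Hypothetical stand-ins for GCC's register classes; not the real enum.  */
enum toy_class { TOY_GENERAL_REGS, TOY_SSE_REGS, TOY_N_CLASSES };

/* Hypothetical cost of loading one V2DI value from memory into a
   register of each class.  The real values come from the target's
   tuning tables and are not reproduced here.  */
static const int toy_load_cost[TOY_N_CLASSES] = { 12, 6 };

/* Initial mem_cost of a pseudo whose only set is a load from memory:
   the load cost for class CL, scaled by the insn frequency, is
   subtracted from a starting cost of zero.  */
static int
toy_initial_mem_cost (enum toy_class cl, int frequency)
{
  return -toy_load_cost[cl] * frequency;
}
```

With a hypothetical frequency of 1000, toy_initial_mem_cost returns -12000
for TOY_GENERAL_REGS and -6000 for TOY_SSE_REGS, so hardcoding the general
class doubles the initial bias towards memory in this toy model.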
Then, in the later record_operand_costs, when IRA finds that r88 is also used
in shift and ior instructions, the mem_cost for r88 increases, but it stays
below the cost of SSE_REGS because of the excessive memory preference added
above. In the end IRA chooses memory for r88 because it has the lowest cost,
which is suboptimal.
a10(r88,l1) costs: SSE_FIRST_REG:0,0 NO_REX_SSE_REGS:0,0 SSE_REGS:0,0
MEM:-984,-984
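
As a rough check of the numbers above, the following sketch models the final
choice as "pick the cheapest of the register classes and memory". This is a
simplification and not how ira-costs.c is written; the function names are
hypothetical, and toy_rebias assumes that the increments added later by
record_operand_costs stay the same when the initial bias changes.

```c
#include <stddef.h>

/* Return the index of the first register class that is strictly
   cheaper than memory and than all earlier classes, or -1 if no
   class beats MEM_COST.  */
static int
toy_pick_allocation (const int *class_costs, size_t n_classes, int mem_cost)
{
  int best = -1;
  int best_cost = mem_cost;

  for (size_t i = 0; i < n_classes; i++)
    if (class_costs[i] < best_cost)
      {
        best = (int) i;
        best_cost = class_costs[i];
      }
  return best;
}

/* Final MEM cost if the initial bias OLD_BIAS were replaced by
   NEW_BIAS, assuming all later increments are unchanged.  */
static int
toy_rebias (int final_mem_cost, int old_bias, int new_bias)
{
  return final_mem_cost - old_bias + new_bias;
}
```

With the three SSE class costs at 0 and MEM at -984, toy_pick_allocation
returns -1, i.e. memory. Under the stated assumption,
toy_rebias (-984, -11808, -5905) gives 4919, and with MEM at 4919 the first
register class wins instead.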