https://godbolt.org/z/v5arbjh3n This case ARM-clang can vectorize but ARM-GCC failed. Can anyone fix it? Or give me some guideline to fix it? code: typedef float real_t; #define iterations 100000 #define LEN_1D 32000 #define LEN_2D 256 real_t flat_2d_array[LEN_2D*LEN_2D]; real_t x[LEN_1D]; real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D], bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D]; int indx[LEN_1D]; real_t* __restrict__ xx; real_t* yy; real_t s243(void) { for (int nl = 0; nl < iterations; nl++) { for (int i = 0; i < LEN_1D-1; i++) { a[i] = b[i] + c[i ] * d[i]; b[i] = a[i] + d[i ] * e[i]; a[i] = b[i] + a[i+1] * d[i]; } } }
> real_t* __restrict__ xx; > real_t* yy; > real_t s243(void) > { > for (int nl = 0; nl < iterations; nl++) { > for (int i = 0; i < LEN_1D-1; i++) { > a[i] = b[i] + c[i ] * d[i]; > b[i] = a[i] + d[i ] * e[i]; > a[i] = b[i] + a[i+1] * d[i]; > } > } > } Manually change the code to below, gcc can vectorize the loop. real_t s243(void) { for (int nl = 0; nl < iterations; nl++) { for (int i = 0; i < LEN_1D-1; i++) { // a[i] = b[i] + c[i ] * d[i]; propagate it into next line. b[i] = b[i] + c[i ] * d[i] + d[i ] * e[i]; a[i] = b[i] + a[i+1] * d[i]; } } }
/app/example.cpp:20:25: note: Detected interleaving store a[i_27] and a[i_27] /app/example.cpp:20:25: note: Queuing group with duplicate access for fixup /app/example.cpp:20:25: note: zero step in outer loop. /app/example.cpp:20:25: note: zero step in outer loop. /app/example.cpp:20:25: missed: not vectorized: complicated access pattern. /app/example.cpp:22:18: missed: not vectorized: complicated access pattern. /app/example.cpp:20:25: missed: bad data access. ... /app/example.cpp:21:27: note: dependence distance = 0. /app/example.cpp:21:27: note: dependence distance == 0 between b[i_27] and b[i_27] /app/example.cpp:21:27: note: dependence distance = 1. /app/example.cpp:22:18: missed: not vectorized, possible dependence between data-refs a[i_27] and a[_9] /app/example.cpp:21:27: missed: bad data dependence. /app/example.cpp:21:27: note: ***** Analysis failed with vector mode V4SF There is a missing DSE before hand: # VUSE <.MEM_28> _1 = bD.3768[i_27]; # VUSE <.MEM_28> _2 = cD.3769[i_27]; # VUSE <.MEM_28> _3 = dD.3770[i_27]; _4 = _2 * _3; _5 = _1 + _4; # .MEM_19 = VDEF <.MEM_28> aD.3767[i_27] = _5; # VUSE <.MEM_19> _6 = eD.3771[i_27]; _7 = _3 * _6; _8 = _5 + _7; # .MEM_20 = VDEF <.MEM_19> bD.3768[i_27] = _8; # RANGE [irange] int [1, 31999] NONZERO 0x7fff _9 = i_27 + 1; # VUSE <.MEM_20> _10 = aD.3767[_9]; _11 = _3 * _10; _12 = _8 + _11; # .MEM_21 = VDEF <.MEM_20> aD.3767[i_27] = _12; DSE does not notice the store defining MEM_19 does touch the load: # VUSE <.MEM_20> _10 = aD.3767[_9]; And that it is redudent with the store defining MEM_21.
The DSE happens but only at the RTL level ....
(In reply to Andrew Pinski from comment #3) > The DSE happens but only at the RTL level .... Is it a good idea to do data-ref in DSE and remove the first redundant store?
(In reply to JuzheZhong from comment #4) > (In reply to Andrew Pinski from comment #3) > > The DSE happens but only at the RTL level .... > > Is it a good idea to do data-ref in DSE and remove the first redundant store? Probably - of course that makes things more expensive. The case at hand can probably be handled by what LIMs mem_refs_may_alias_p does, using get_inner_reference_aff (mem1->mem.ref, &off1, &size1); get_inner_reference_aff (mem2->mem.ref, &off2, &size2); aff_combination_expand (&off1, ttae_cache); aff_combination_expand (&off2, ttae_cache); aff_combination_scale (&off1, -1); aff_combination_add (&off2, &off1); if (aff_comb_cannot_overlap_p (&off2, size1, size2)) return false; but I'm not sure we should add more code doing things like that ... If we think that firing up dataref analysis at the point we discover the possible use is too expensive we could also optimistically queue them and only when we find a killing def (and thus the store would be dead), process the queued uses, checking them if they are really important. But well, maybe just try the simplest approach and measure the compile-time effect. That is, in /* If the statement is a use the store is not dead. */ else if (ref_maybe_used_by_stmt_p (use_stmt, ref)) { /* Handle common cases where we can easily build an ao_ref structure for USE_STMT and in doing so we find that the references hit non-live bytes and thus can be ignored. for a gimple assignment, check dr_may_alias_p after analyzing both stmts (we can of course at least cache the DR for 'stmt'). Guarded with flag_expensive_optimizations (thus -O2+). You also need to then initialize loops and SCEV.
Exact dup of bug 99407. *** This bug has been marked as a duplicate of bug 99407 ***