Bug 106989 - GCC fail to vectorize and clang succeed
Summary: GCC fail to vectorize and clang succeed
Status: RESOLVED DUPLICATE of bug 99407
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 13.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2022-09-20 23:05 UTC by JuzheZhong
Modified: 2022-09-22 15:56 UTC (History)
2 users (show)

See Also:
Host:
Target: Aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-09-21 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description JuzheZhong 2022-09-20 23:05:36 UTC
https://godbolt.org/z/v5arbjh3n

This case ARM-clang can vectorize but ARM-GCC failed.
Can anyone fix it? Or give me some guideline to fix it?


code:
typedef float real_t;

#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256
real_t flat_2d_array[LEN_2D*LEN_2D];

real_t x[LEN_1D];

real_t a[LEN_1D],b[LEN_1D],c[LEN_1D],d[LEN_1D],e[LEN_1D],
bb[LEN_2D][LEN_2D],cc[LEN_2D][LEN_2D],tt[LEN_2D][LEN_2D];

int indx[LEN_1D];

real_t* __restrict__ xx;
real_t* yy;
real_t s243(void)
{
    for (int nl = 0; nl < iterations; nl++) {
        for (int i = 0; i < LEN_1D-1; i++) {
            a[i] = b[i] + c[i  ] * d[i];
            b[i] = a[i] + d[i  ] * e[i];
            a[i] = b[i] + a[i+1] * d[i];
        }
    }
}
Comment 1 Hongtao.liu 2022-09-20 23:46:32 UTC
> real_t* __restrict__ xx;
> real_t* yy;
> real_t s243(void)
> {
>     for (int nl = 0; nl < iterations; nl++) {
>         for (int i = 0; i < LEN_1D-1; i++) {
>             a[i] = b[i] + c[i  ] * d[i];
>             b[i] = a[i] + d[i  ] * e[i];
>             a[i] = b[i] + a[i+1] * d[i];
>         }
>     }
> }

Manually change the code to below, gcc can vectorize the loop.
real_t s243(void)
{
    for (int nl = 0; nl < iterations; nl++) {
        for (int i = 0; i < LEN_1D-1; i++) {
//          a[i] = b[i] + c[i  ] * d[i]; propagate it into next line.
            b[i] = b[i] + c[i  ] * d[i] + d[i  ] * e[i];
            a[i] = b[i] + a[i+1] * d[i];
        }
    }
}
Comment 2 Andrew Pinski 2022-09-21 00:42:48 UTC
/app/example.cpp:20:25: note:   Detected interleaving store a[i_27] and a[i_27]
/app/example.cpp:20:25: note:   Queuing group with duplicate access for fixup
/app/example.cpp:20:25: note:   zero step in outer loop.
/app/example.cpp:20:25: note:   zero step in outer loop.
/app/example.cpp:20:25: missed:   not vectorized: complicated access pattern.
/app/example.cpp:22:18: missed:   not vectorized: complicated access pattern.
/app/example.cpp:20:25: missed:  bad data access.
...

/app/example.cpp:21:27: note:   dependence distance  = 0.
/app/example.cpp:21:27: note:   dependence distance == 0 between b[i_27] and b[i_27]
/app/example.cpp:21:27: note:   dependence distance  = 1.
/app/example.cpp:22:18: missed:   not vectorized, possible dependence between data-refs a[i_27] and a[_9]
/app/example.cpp:21:27: missed:  bad data dependence.
/app/example.cpp:21:27: note:  ***** Analysis  failed with vector mode V4SF

There is a missing DSE before hand:
  # VUSE <.MEM_28>
  _1 = bD.3768[i_27];
  # VUSE <.MEM_28>
  _2 = cD.3769[i_27];
  # VUSE <.MEM_28>
  _3 = dD.3770[i_27];
  _4 = _2 * _3;
  _5 = _1 + _4;
  # .MEM_19 = VDEF <.MEM_28>
  aD.3767[i_27] = _5;
  # VUSE <.MEM_19>
  _6 = eD.3771[i_27];
  _7 = _3 * _6;
  _8 = _5 + _7;
  # .MEM_20 = VDEF <.MEM_19>
  bD.3768[i_27] = _8;
  # RANGE [irange] int [1, 31999] NONZERO 0x7fff
  _9 = i_27 + 1;
  # VUSE <.MEM_20>
  _10 = aD.3767[_9];
  _11 = _3 * _10;
  _12 = _8 + _11;
  # .MEM_21 = VDEF <.MEM_20>
  aD.3767[i_27] = _12;

DSE does not notice the store defining MEM_19 does touch the load:
  # VUSE <.MEM_20>
  _10 = aD.3767[_9];

And that it is redudent with the store defining MEM_21.
Comment 3 Andrew Pinski 2022-09-21 00:45:49 UTC
The DSE happens but only at the RTL level ....
Comment 4 JuzheZhong 2022-09-21 00:50:53 UTC
(In reply to Andrew Pinski from comment #3)
> The DSE happens but only at the RTL level ....

Is it a good idea to do data-ref in DSE and remove the first redundant store?
Comment 5 Richard Biener 2022-09-21 08:11:51 UTC
(In reply to JuzheZhong from comment #4)
> (In reply to Andrew Pinski from comment #3)
> > The DSE happens but only at the RTL level ....
> 
> Is it a good idea to do data-ref in DSE and remove the first redundant store?

Probably - of course that makes things more expensive.  The case at hand
can probably be handled by what LIMs mem_refs_may_alias_p does, using

  get_inner_reference_aff (mem1->mem.ref, &off1, &size1);
  get_inner_reference_aff (mem2->mem.ref, &off2, &size2);
  aff_combination_expand (&off1, ttae_cache);
  aff_combination_expand (&off2, ttae_cache);
  aff_combination_scale (&off1, -1);
  aff_combination_add (&off2, &off1);

  if (aff_comb_cannot_overlap_p (&off2, size1, size2))
    return false;

but I'm not sure we should add more code doing things like that ...

If we think that firing up dataref analysis at the point we discover the
possible use is too expensive we could also optimistically queue them
and only when we find a killing def (and thus the store would be dead),
process the queued uses, checking them if they are really important.

But well, maybe just try the simplest approach and measure the compile-time
effect.  That is, in

          /* If the statement is a use the store is not dead.  */
          else if (ref_maybe_used_by_stmt_p (use_stmt, ref))
            {
              /* Handle common cases where we can easily build an ao_ref
                 structure for USE_STMT and in doing so we find that the
                 references hit non-live bytes and thus can be ignored.

for a gimple assignment, check dr_may_alias_p after analyzing both
stmts (we can of course at least cache the DR for 'stmt').  Guarded
with flag_expensive_optimizations (thus -O2+).  You also need to then
initialize loops and SCEV.
Comment 6 Andrew Pinski 2022-09-22 15:56:14 UTC
Exact dup of bug 99407.

*** This bug has been marked as a duplicate of bug 99407 ***