Bug 46006

Summary: vectorization outside of loops starting from loads
Product: gcc Reporter: Jakub Jelinek <jakub>
Component: tree-optimization Assignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: enhancement CC: irar, irar
Priority: P3 Keywords: missed-optimization
Version: 4.6.0   
Target Milestone: ---   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2023-06-21 00:00:00
Bug Depends on:    
Bug Blocks: 53947    

Description Jakub Jelinek 2010-10-13 14:23:45 UTC
Are there any plans to try to vectorize parts of code like:
struct A
{
  double x, y, z;
};

struct B
{
  struct A a, b;
};

struct C
{
  struct A c;
  double d;
};

__attribute__((noinline, noclone)) int
foo (const struct C *u, struct B v)
{
  double a, b, c, d;

  a = v.b.x * v.b.x + v.b.y * v.b.y + v.b.z * v.b.z;
  b = 2.0 * v.b.x * (v.a.x - u->c.x)
      + 2.0 * v.b.y * (v.a.y - u->c.y) + 2.0 * v.b.z * (v.a.z - u->c.z);
  c = u->c.x * u->c.x + u->c.y * u->c.y + u->c.z * u->c.z
      + v.a.x * v.a.x + v.a.y * v.a.y + v.a.z * v.a.z
      + 2.0 * (-u->c.x * v.a.x - u->c.y * v.a.y - u->c.z * v.a.z)
      - u->d * u->d;
  if ((d = b * b - 4.0 * a * c) < 0.0)
    return 0;
  return d;
}

int
main (void)
{
  int i, j;
  struct C c = { { 1.0, 1.0, 1.0 }, 1.0 };
  struct B b = { { 1.0, 1.0, 1.0 }, { 1.0, 1.0, 1.0 } };
  for (i = 0; i < 100000000; i++)
    {
      asm volatile ("" : : "r" (&c), "r" (&b) : "memory");
      j = foo (&c, b);
      asm volatile ("" : : "r" (j));
    }
  return 0;
}
(This is the hot spot from the c-ray benchmark. The function is actually larger, but at least according to callgrind the early return on < 0.0 happens in most cases;
as the function is large and called from multiple spots, it isn't inlined.)
I'd say (though I haven't tried to code it by hand using intrinsics) that
doing many of the multiplications/additions in parallel (especially with AVX) could give significant speedups (-O3 -ffast-math).
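
For illustration only, here is a rough sketch of what a hand-vectorized version of the `a' and `b' computations might look like. It is not taken from c-ray or from GCC; it assumes the GCC/Clang generic vector extension (vector_size) rather than AVX intrinsics, and the function name ab_vector and the zero-padded fourth lane are assumptions made for the example. Summing the lanes in a different order than the scalar code is the kind of reassociation that -ffast-math permits.

```c
/* Same layout as struct A in the testcase.  */
struct A
{
  double x, y, z;
};

/* Four doubles, i.e. one AVX register; GCC lowers this if AVX is off.  */
typedef double v4df __attribute__ ((vector_size (32)));

/* Hypothetical helper: computes the `a' and `b' values of foo with the
   x/y/z lanes processed in parallel.  Lane 3 is padded with zero so it
   does not contribute to the sums.  */
static void
ab_vector (const struct A *va, const struct A *vb, const struct A *uc,
           double *a, double *b)
{
  v4df pa = { va->x, va->y, va->z, 0.0 };
  v4df pb = { vb->x, vb->y, vb->z, 0.0 };
  v4df pc = { uc->x, uc->y, uc->z, 0.0 };
  v4df two = { 2.0, 2.0, 2.0, 2.0 };

  /* Element-wise: v.b.* * v.b.*  */
  v4df sq = pb * pb;
  /* Element-wise: 2.0 * v.b.* * (v.a.* - u->c.*)  */
  v4df cr = two * pb * (pa - pc);

  /* Horizontal reduction: extract the lanes and add them up.  */
  *a = (sq[0] + sq[1]) + (sq[2] + sq[3]);
  *b = (cr[0] + cr[1]) + (cr[2] + cr[3]);
}
```

A caller would pass &v.a, &v.b and &u->c. The three-lane groups waste one lane of a four-element vector here, which is one reason the profitability of such a transformation is not obvious.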
Comment 1 Ira Rosen 2010-10-17 13:22:18 UTC
This code requires SLP to originate from loads, which seems to be a bit more complicated than the currently implemented use-def scan (it will also need to reduce/extract scalars from the vectors at the end of the vector computation). I don't see any major obstacles to this; however, I don't currently plan to work on it.

Another required feature is to work on groups bigger than the vectorization factor, i.e., combining 2 statements in this example and leaving the 3rd one scalar.

Ira
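
As a sketch of the transformation described in this comment (not actual vectorizer output), the x*x + y*y + z*z group could be split so that two statements share a two-element vector and the third stays scalar, with the scalars extracted from the vector at the end. The function name norm2_split is hypothetical, and the code assumes the GCC/Clang generic vector extension.

```c
/* Same layout as struct A in the testcase.  */
struct A
{
  double x, y, z;
};

/* Two doubles, matching "vector(2) double" on a 128-bit target.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical helper: x*x + y*y + z*z with a group of three statements
   split into one two-lane vector operation plus one scalar operation.  */
static double
norm2_split (const struct A *p)
{
  /* The computation starts from the loads: x and y fill one vector.  */
  v2df xy = { p->x, p->y };
  v2df sq = xy * xy;

  /* The third statement of the group is left scalar.  */
  double zz = p->z * p->z;

  /* Extract the scalars from the vector and finish the sum.  */
  return sq[0] + sq[1] + zz;
}
```

Because the additions happen in the same order as in the scalar expression, this particular split does not depend on -ffast-math reassociation.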
Comment 2 Andrew Pinski 2012-03-13 22:59:24 UTC
Confirmed.
Comment 3 Richard Biener 2016-11-07 14:58:03 UTC
So currently we indeed miss the "sinks":

t.i:29:6: note: === vect_analyze_data_ref_accesses ===
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 8B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 16B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 24B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 32B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 40B]
t.i:29:6: note: Detected interleaving load u_43(D)->c.x and u_43(D)->c.y
t.i:29:6: note: Detected interleaving load u_43(D)->c.x and u_43(D)->c.z
t.i:29:6: note: Detected interleaving load u_43(D)->c.x and u_43(D)->d
t.i:29:6: note: Detected interleaving load of size 6 starting with v$a$x_48 = MEM[(struct B *)&v];
t.i:29:6: note: Detected interleaving load of size 4 starting with _5 = u_43(D)->c.x;
t.i:29:6: note: not vectorized: no grouped stores in basic block.

Two classes of sinks are currently missing: reductions and vector CONSTRUCTORs.

Parts of the testcase might be handled with reduction support.
Comment 4 Richard Biener 2023-06-21 13:17:07 UTC
We're almost there:

t2.c:22:5: note:   Starting SLP discovery for
t2.c:22:5: note:     powmult_4 = v$b$z_53 * v$b$z_53;
t2.c:22:5: note:     powmult_1 = v$b$x_51 * v$b$x_51;
t2.c:22:5: note:     powmult_2 = v$b$y_52 * v$b$y_52;

but:

t2.c:22:5: note:   vectype: vector(2) double
t2.c:22:5: note:   nunits = 2
t2.c:22:5: missed:   Build SLP failed: unrolling required in basic block SLP

and for reductions we do not try to split the group.