Bug 46006 - vectorization outside of loops
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.6.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2010-10-13 14:23 UTC by Jakub Jelinek
Modified: 2016-11-07 14:58 UTC
CC List: 2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-03-13 00:00:00


Description Jakub Jelinek 2010-10-13 14:23:45 UTC
Are there any plans to try to vectorize parts of code like:
struct A
{
  double x, y, z;
};

struct B
{
  struct A a, b;
};

struct C
{
  struct A c;
  double d;
};

__attribute__((noinline, noclone)) int
foo (const struct C *u, struct B v)
{
  double a, b, c, d;

  a = v.b.x * v.b.x + v.b.y * v.b.y + v.b.z * v.b.z;
  b = 2.0 * v.b.x * (v.a.x - u->c.x)
      + 2.0 * v.b.y * (v.a.y - u->c.y) + 2.0 * v.b.z * (v.a.z - u->c.z);
  c = u->c.x * u->c.x + u->c.y * u->c.y + u->c.z * u->c.z
      + v.a.x * v.a.x + v.a.y * v.a.y + v.a.z * v.a.z
      + 2.0 * (-u->c.x * v.a.x - u->c.y * v.a.y - u->c.z * v.a.z)
      - u->d * u->d;
  if ((d = b * b - 4.0 * a * c) < 0.0)
    return 0;
  return d;
}

int
main (void)
{
  int i, j;
  struct C c = { { 1.0, 1.0, 1.0 }, 1.0 };
  struct B b = { { 1.0, 1.0, 1.0 }, { 1.0, 1.0, 1.0 } };
  for (i = 0; i < 100000000; i++)
    {
      asm volatile ("" : : "r" (&c), "r" (&b) : "memory");
      j = foo (&c, b);
      asm volatile ("" : : "r" (j));
    }
  return 0;
}
(This is the hot spot from the c-ray benchmark. The real function is larger, but at least according to callgrind the early return on d < 0.0 is taken in most cases. Because the function is large and called from multiple places, it isn't inlined.)
I haven't tried to code it by hand using intrinsics, but I'd say that doing many of the multiplications/additions in parallel (especially with AVX) could give significant speedups at -O3 -ffast-math.
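For illustration only, here is a rough hand-vectorized sketch of the early-return part of foo, written with GCC's generic vector extensions rather than target intrinsics. The name foo_vec, the v4df typedef and the zero-padded fourth lane are assumptions made for this sketch; they are not taken from c-ray or from GCC. It regroups the additions lane by lane, which is only a valid transformation of the original under -ffast-math.

```c
/* Sketch only: one possible hand-vectorized form of foo, not compiler output. */
struct A { double x, y, z; };
struct B { struct A a, b; };
struct C { struct A c; double d; };

/* Four doubles per vector; lane 3 is zero padding (assumption of this sketch). */
typedef double v4df __attribute__ ((vector_size (32)));

int
foo_vec (const struct C *u, struct B v)
{
  /* Gather the three 3-component operands into vectors. */
  v4df vb = { v.b.x, v.b.y, v.b.z, 0.0 };
  v4df va = { v.a.x, v.a.y, v.a.z, 0.0 };
  v4df uc = { u->c.x, u->c.y, u->c.z, 0.0 };

  /* Lane-wise terms of a, b and c from the scalar version. */
  v4df pa = vb * vb;
  v4df pb = 2.0 * vb * (va - uc);
  v4df pc = uc * uc + va * va - 2.0 * (uc * va);

  /* Reduce each vector back to a scalar. */
  double a = pa[0] + pa[1] + pa[2];
  double b = pb[0] + pb[1] + pb[2];
  double c = pc[0] + pc[1] + pc[2] - u->d * u->d;

  double d = b * b - 4.0 * a * c;
  if (d < 0.0)
    return 0;
  return (int) d;   /* same truncation as the implicit conversion in foo */
}
```

Called with the values from main above, this returns the same result as the scalar foo. Whether it is actually faster depends on the target and on how cheaply the vectors can be built and reduced, which this sketch does not measure.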
Comment 1 Ira Rosen 2010-10-17 13:22:18 UTC
This code requires SLP to originate from loads, which seems to be a bit more complicated than the currently implemented use-def scan (it will also need to reduce/extract scalars from the vectors at the end of the vector computation). I don't see any major obstacles to this, but I currently don't plan to work on it.

Another required feature is support for groups bigger than the vectorization factor, i.e., combining 2 statements in this example and leaving the 3rd one scalar.

Ira
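To make the two points in comment 1 concrete, a minimal sketch of a 3-element group handled with a 2-lane vector might look like the following. The function name dot3_partial and the v2df typedef are hypothetical, chosen for this example; this is a hand-written illustration of the shape, not what the SLP vectorizer emits.

```c
/* Sketch only: a group of 3 with a vectorization factor of 2. */
struct A { double x, y, z; };

/* Two doubles per vector, e.g. one SSE2 register on x86-64. */
typedef double v2df __attribute__ ((vector_size (16)));

double
dot3_partial (const struct A *p, const struct A *q)
{
  /* The computation starts from loads: x and y are combined into vectors. */
  v2df pxy = { p->x, p->y };
  v2df qxy = { q->x, q->y };
  v2df prod = pxy * qxy;

  /* The third statement of the group stays scalar. */
  double tail = p->z * q->z;

  /* Scalars are extracted from the vector at the end of the computation. */
  return prod[0] + prod[1] + tail;
}
```

For p = (1, 2, 3) and q = (4, 5, 6) this computes 4 + 10 + 18 = 32, the same as the plain scalar dot product.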
Comment 2 Andrew Pinski 2012-03-13 22:59:24 UTC
Confirmed.
Comment 3 Richard Biener 2016-11-07 14:58:03 UTC
So currently we indeed miss the "sinks":

t.i:29:6: note: === vect_analyze_data_ref_accesses ===
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 8B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 16B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 24B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 32B]
t.i:29:6: note: Detected interleaving load MEM[(struct B *)&v] and MEM[(struct B *)&v + 40B]
t.i:29:6: note: Detected interleaving load u_43(D)->c.x and u_43(D)->c.y
t.i:29:6: note: Detected interleaving load u_43(D)->c.x and u_43(D)->c.z
t.i:29:6: note: Detected interleaving load u_43(D)->c.x and u_43(D)->d
t.i:29:6: note: Detected interleaving load of size 6 starting with v$a$x_48 = MEM[(struct B *)&v];
t.i:29:6: note: Detected interleaving load of size 4 starting with _5 = u_43(D)->c.x;
t.i:29:6: note: not vectorized: no grouped stores in basic block.

Two classes of sinks are currently missing: reductions and vector CONSTRUCTORs.

Parts of the testcase might be handled with reduction support.
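As a rough illustration of the two sink shapes named above, the following straight-line functions contain no stores at all, so a basic-block vectorizer that starts only from grouped stores has nothing to seed from. The function names and the v2df typedef are made up for this sketch, and whether a given GCC release vectorizes either one is not something the sketch claims.

```c
/* Sketch only: store-free basic blocks with a reduction or CONSTRUCTOR sink. */
typedef double v2df __attribute__ ((vector_size (16)));

/* Reduction sink: the lane-wise products all feed one scalar sum. */
double
sum_of_products (const double *p, const double *q)
{
  return p[0] * q[0] + p[1] * q[1] + p[2] * q[2] + p[3] * q[3];
}

/* CONSTRUCTOR sink: independent scalar results are gathered into a vector. */
v2df
pair_sums (const double *p, const double *q)
{
  v2df r = { p[0] + q[0], p[1] + q[1] };
  return r;
}
```

In foo the values a, b and c are each sums of three products, which is the reduction shape rather than the CONSTRUCTOR one.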