This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



The "right way" to handle alignment of pointer targets in the compiler?


Hi,

I have been playing with the GCC vectorizer and examining the assembly code that it produces for dot products whose number of elements is not fixed at compile time. (This comes up surprisingly often in scientific codes.) So far, the generated code is no faster than non-vectorized code, and I think this is because I can't find a way to tell the compiler that the target of a double* is 16-byte aligned.

I have three questions:

1. First, is there (unofficially, or in someone's head) a planned solution to this problem? If so, is it based on types with special attributes, or on something else?

The reason that I ask is that the current approaches seem to be based on types -- and I am able to hack this information into the compiler using the following 2-line approach:

  typedef double aligned_double __attribute__((aligned(16)));
  typedef aligned_double* SSE_PTR;

This works in simple examples. (But it doesn't work in my actual code - see the attached file test.C)

The more obvious approach of the 1-line

typedef double __attribute__((aligned(16))) *SSE_PTR;

does not work. I believe this form is documented as (a) being correct and (b) not being implemented. (I think this is the real cause of PR38011.)
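
As a small sanity check on what the two-line form records, the sketch below asks the compiler for the alignment attached to each type. It assumes GCC (or a compatible compiler) for `__attribute__((aligned))` and `__alignof__`; the helper names `promised_alignment` and `plain_alignment` are made up for illustration. It only shows what the type claims, not whether the vectorizer makes use of that claim.

```cpp
#include <cstddef>

// Two-line form from the message: the attribute sits on the pointed-to type.
typedef double aligned_double __attribute__((aligned(16)));
typedef aligned_double* SSE_PTR;

// Hypothetical helpers (not part of GCC): report the alignment each type carries.
inline std::size_t promised_alignment() { return __alignof__(aligned_double); }
inline std::size_t plain_alignment()    { return __alignof__(double); }
```

The typedef changes only the advertised alignment; the size of the element type stays that of a plain double, so indexing through an SSE_PTR still steps one double at a time.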

So, I am wondering whether people have held off implementing this because it REALLY should be implemented using VRP (since 16-byte alignment is in fact information about the value of a pointer), or whether the method of types with special attributes is regarded as the way forward and just hasn't been done yet. (Also, are there any ideas for how one could specify that, say, a pointer target deviates from 16-byte alignment by 8 bytes?)
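
To make the "alignment is information about the pointer's value" view concrete, here is a minimal sketch that computes how far an address sits past the previous 16-byte boundary. The helper name `misalignment` is hypothetical, and the code assumes a flat address space in which converting a pointer to `std::uintptr_t` yields its numeric address. A target that deviates from 16-byte alignment by 8 bytes is then simply one for which the helper returns 8.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes by which p lies past the previous
// multiple of `alignment`. A result of 0 means p is aligned.
inline std::size_t misalignment(const void* p, std::size_t alignment = 16) {
  return static_cast<std::size_t>(
      reinterpret_cast<std::uintptr_t>(p) % alignment);
}
```

This is a run-time check only; it does not by itself give the compiler any static knowledge about the pointer.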

I don't suppose the vectorizer can use information from __builtin_expect ... ?

2. Second, there should be a PR for this, at least according to Dorit Nuzman, in this thread:
http://gcc.gnu.org/ml/gcc/2006-11/msg00084.html


I can't find one, so I'm planning to file one (or several.) Unless someone minds, I am planning to file a general "alignment of pointer targets" PR, and then file several more detailed bugs as blocking the general alignment PR. Then, these detailed bugs can have smaller, more specific testcases.

3. Casts to aligned pointer types do not seem to work, in general - am I missing something?

For example, the vectorizer incorrectly sees unaligned accesses in the following code:

// This function uses UNALIGNED accesses
real f3(const double* p_, const double* q_,int n)
{
  SSE_PTR __restrict__ p = SSE_PTR(p_+1); // +2 leads to an aligned access
  SSE_PTR __restrict__ q = SSE_PTR(q_+1); // +2 leads to an aligned access
  real sum = 0;
  for(int i=0; i<n;i++)
    sum += p[i] * q[i];

  return sum;
}
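
The "+1" versus "+2" comments follow from simple arithmetic, sketched below under the assumptions that the incoming pointer is itself 16-byte aligned and that `sizeof(double)` is 8. The function name `residue_after_offset` is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: byte residue modulo 16 of (base + k), assuming that
// base is 16-byte aligned. Only the element offset k matters.
inline std::size_t residue_after_offset(std::size_t k) {
  return (k * sizeof(double)) % 16;
}
```

With 8-byte doubles, an offset of 1 lands 8 bytes past a boundary, while an offset of 2 lands on a boundary again.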


This and several other testcases are included in the attached test.C file.


Thanks for any information!

-BenRI

P.S. Here is some background information that I came across:

These messages/threads seem to indicate that the machinery is not in an optimal state, and could be improved:
- "Specifying alignment of pointer targets" [http://gcc.gnu.org/ml/gcc/2005-03/msg00952.html]
- "16 byte alignment hint for sse vectorization" [http://gcc.gnu.org/ml/gcc/2006-11/msg00084.html]


From PR 27827 - http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827 :
"I just quickly glanced at the code, and I see that it never uses "movapd" from memory, which is a key to getting decent performance."

// File: test.C
// gcc-4.4 -c test.C -O3 -ffast-math -msse3 -ftree-vectorizer-verbose=4 
// gcc-4.5 -c test.C -O3 -ffast-math -msse3 -ftree-vectorizer-verbose=4 

typedef double real;

// These typedefs work (together)
typedef real aligned_real __attribute__((aligned(16)));
typedef const aligned_real* SSE_PTR;
typedef       aligned_real* SSE_PTR_non_const;

//typedef real real_array[] __attribute__((aligned(16)));
//typedef real_array* SSE_PTR;


// These three lines do NOT work -- there is no effect.
//typedef const real __attribute__((aligned(16))) *SSE_PTR;
//typedef const real *SSE_PTR __attribute__((aligned(16)));
//typedef const __attribute__((aligned(16))) real *SSE_PTR;


// This function uses ALIGNED accesses
real f(SSE_PTR p, SSE_PTR q,int n)
{
  real sum = 0;
  for(int i=0; i<n;i++)
    sum += p[i] * q[i];

  return sum;
}

// This function uses ALIGNED accesses
real f2a(const double* p_, const double* q_,int n)
{
  SSE_PTR __restrict__ p = p_;
  SSE_PTR __restrict__ q = q_;
  real sum = 0;
  for(int i=0; i<n;i++)
    sum += p[i] * q[i];

  return sum;
}

// This function uses 1 ALIGNED and 1 UNALIGNED access
real f2b(const double* p_, const double* q_,int n)
{
  SSE_PTR __restrict__ p = p_;  // this one is aligned   (?)
  SSE_PTR q = q_;               // this one is unaligned (?)
  real sum = 0;
  for(int i=0; i<n;i++)
    sum += p[i] * q[i];

  return sum;
}

// This function uses UNALIGNED accesses
real f2c(const double* p_, const double* q_,int n)
{
  SSE_PTR p = p_;
  SSE_PTR q = q_;
  real sum = 0;
  for(int i=0; i<n;i++)
    sum += p[i] * q[i];

  return sum;
}

// This function uses UNALIGNED accesses
real f3(const double* p_, const double* q_,int n)
{
  SSE_PTR __restrict__ p = SSE_PTR(p_+1); // +2 leads to an aligned access, again.
  SSE_PTR __restrict__ q = SSE_PTR(q_+1); // +2 leads to an aligned access, again.
  real sum = 0;
  for(int i=0; i<n;i++)
    sum += p[i] * q[i];

  return sum;
}

// This function uses UNALIGNED accesses
real f3b(const double* p_, const double* q_,int n)
{
  SSE_PTR __restrict__ p = p_+2; // +2 leads to an aligned access, again.
  SSE_PTR __restrict__ q = q_+2; // +2 leads to an aligned access, again.
  real sum = 0;
  for(int i=0; i<n;i++)
    sum += p[i] * q[i];

  return sum;
}

// This function uses UNALIGNED accesses
real f4(const double* p_, const double* q_,int m, int n)
{
  SSE_PTR p = p_+m;
  SSE_PTR q = q_+m;
  real sum = 0;
  for(int i=0; i<n;i++)
    sum += p[i] * q[i];

  return sum;
}

template <typename RealType>
RealType f5(const RealType* p_, const RealType* q_, int n)
{
  // This works in 4.4 (both in templates, and outside)
  // This only partially works in 4.5 (outside templates, but not inside them)
  typedef RealType AlignedReal __attribute__((aligned(16)));
  typedef const AlignedReal* TEMPLATE_SSE_PTR;

  TEMPLATE_SSE_PTR __restrict__ p = p_;
  TEMPLATE_SSE_PTR __restrict__ q = q_;

  RealType sum = 0;
  for(int i=0; i<n;i++)
    sum += p[i] * q[i];

  return sum;
}

double f5a(const double* p, const double* q, int n)
{
  return f5<double>(p,q,n);
}

double f5b(const float* p, const float* q, int n)
{
  return f5<float>(p,q,n);
}

// This is aligned
void g(SSE_PTR_non_const __restrict__ a,
       SSE_PTR __restrict__ b,
       double coef, 
       unsigned count) 
{
  for(unsigned i=0; i<count; i++)
    a[i] = b[i]*coef;
}

// This is partially aligned
void g2(double* __restrict__ a,
       SSE_PTR __restrict__ b,
       double coef, 
       unsigned count) 
{
  for(unsigned i=0; i<count; i++)
    a[i] = b[i]*coef;
}

// This is partially aligned
void g3(SSE_PTR_non_const __restrict__ a,
       const double* __restrict__ b,
       double coef, 
       unsigned count) 
{
  for(unsigned i=0; i<count; i++)
    a[i] = b[i]*coef;
}

// This is NOT aligned
void g4(double* __restrict__ a,
       const double* __restrict__ b,
       double coef, 
       unsigned count) 
{
  for(unsigned i=0; i<count; i++)
    a[i] = b[i]*coef;
}


// Alignment of access forced using versioning x 2
// This is NOT aligned
// Um... we only force alignment of the store...
void g5(double* __restrict__ a,
        const double* __restrict__ b,
        double* __restrict__ c,
        const double* __restrict__ d,
        double coef,
        unsigned count)
{
  for(unsigned i=0; i<count; i++)
  {
    a[i] = b[i]*coef;
    c[i] = d[i]*coef;
  }
}
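
For comparison, the idea behind loop versioning for alignment can be written out by hand: test the addresses at run time and branch to a copy of the loop that may assume alignment. The sketch below only illustrates that idea and is not what GCC emits; the names `is_aligned16`, `scale_aligned`, `scale_generic`, and `scale_versioned` are hypothetical, and it relies on the GCC `aligned` typedef trick used at the top of this file.

```cpp
#include <cstdint>

typedef double aligned_real __attribute__((aligned(16)));

// True if p is a multiple of 16 (assumes a flat address space).
static bool is_aligned16(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Version 1: both pointer targets are declared 16-byte aligned.
static void scale_aligned(aligned_real* __restrict__ a,
                          const aligned_real* __restrict__ b,
                          double coef, unsigned count) {
  for (unsigned i = 0; i < count; i++)
    a[i] = b[i] * coef;
}

// Version 2: no alignment promise.
static void scale_generic(double* __restrict__ a,
                          const double* __restrict__ b,
                          double coef, unsigned count) {
  for (unsigned i = 0; i < count; i++)
    a[i] = b[i] * coef;
}

// Run-time dispatch; returns true when the aligned version was used.
bool scale_versioned(double* a, const double* b, double coef, unsigned count) {
  if (is_aligned16(a) && is_aligned16(b)) {
    scale_aligned(a, b, coef, count);
    return true;
  }
  scale_generic(a, b, coef, count);
  return false;
}
```

Unlike g5 above, this hand-written check covers the load as well as the store before taking the aligned path.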
