This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
The "right way" to handle alignment of pointer targets in the compiler?
- From: Benjamin Redelings I <benjamin_redelings at ncsu dot edu>
- To: gcc at gcc dot gnu dot org
- Date: Fri, 01 Jan 2010 19:12:24 -0500
- Subject: The "right way" to handle alignment of pointer targets in the compiler?
Hi,
I have been playing with the GCC vectorizer and examining assembly code
that is produced for dot products that are not for a fixed number of
elements. (This comes up surprisingly often in scientific codes.) So
far, the generated code is not faster than non-vectorized code, and I
think that it is because I can't find a way to tell the compiler that
the target of a double* is 16-byte aligned.
I have three questions:
1. First, is there (unofficially, or in someone's head) a planned
solution to this problem? If so, is it based on types with special
attributes, or on something else?
The reason that I ask is that the current approaches seem to be based on
types -- and I am able to hack this information into the compiler using
the following 2-line approach:
typedef double aligned_double __attribute__((aligned(16)));
typedef aligned_double* SSE_PTR;
This works in simple examples. (But it doesn't work in my actual code -
see the attached file test.C)
The more obvious approach of the 1-line
typedef double __attribute__((aligned(16))) *SSE_PTR;
does not work. I believe this is documented as (a) being correct and
(b) not being implemented. (I think this is the real cause of PR38011)
So, I am wondering if people have avoided implementing this yet because
it REALLY should be implemented using VRP (since 16-byte alignment is in
fact information about the value of a ptr) or if, in fact, the method of
types with special attributes is regarded as the way forward, and just
hasn't been done yet. (Also, are there any ideas for how one could
specify that (say) a pointer target deviates from 16-byte alignment by 8
bytes?)
I don't suppose the vectorizer can use information from __builtin_expect
... ?
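On the question of recording alignment as a property of the pointer value rather than of its type: GCC 4.7 and later (released after this message was written) provide __builtin_assume_aligned, which takes a pointer, an alignment, and an optional misalignment offset. The sketch below assumes such a compiler. The function names dot_aligned and dot_offset8 are hypothetical and not part of test.C, and whether the vectorizer then emits aligned loads still depends on the compiler version and flags.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example; requires GCC >= 4.7 or a compatible compiler.
// The caller promises that p_ and q_ are 16-byte aligned.
double dot_aligned(const double* p_, const double* q_, int n)
{
    // Check the promise in debug builds; a false promise is undefined behavior.
    assert(reinterpret_cast<std::uintptr_t>(p_) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(q_) % 16 == 0);

    const double* p = static_cast<const double*>(__builtin_assume_aligned(p_, 16));
    const double* q = static_cast<const double*>(__builtin_assume_aligned(q_, 16));

    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += p[i] * q[i];
    return sum;
}

// The caller promises that p_ and q_ lie 8 bytes past a 16-byte boundary,
// i.e. (address - 8) is a multiple of 16.
double dot_offset8(const double* p_, const double* q_, int n)
{
    assert(reinterpret_cast<std::uintptr_t>(p_) % 16 == 8);
    assert(reinterpret_cast<std::uintptr_t>(q_) % 16 == 8);

    // The third argument is the misalignment offset in bytes.
    const double* p = static_cast<const double*>(__builtin_assume_aligned(p_, 16, 8));
    const double* q = static_cast<const double*>(__builtin_assume_aligned(q_, 16, 8));

    double sum = 0;
    for (int i = 0; i < n; i++)
        sum += p[i] * q[i];
    return sum;
}
```

Because the promise is attached to the value at the point of use, it does not rely on the alignment attribute surviving typedefs, casts, or template instantiation.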
2. Second, there should be a PR for this, at least according to Dorit
Nuzman, in this thread:
http://gcc.gnu.org/ml/gcc/2006-11/msg00084.html
I can't find one, so I'm planning to file one (or several.) Unless
someone minds, I am planning to file a general "alignment of pointer
targets" PR, and then file several more detailed bugs as blocking the
general alignment PR. Then, these detailed bugs can have smaller, more
specific testcases.
3. Casts to aligned pointer types do not seem to work, in general - am I
missing something?
For example, the vectorizer incorrectly sees unaligned accesses in the
following code:
// This function uses UNALIGNED accesses
real f3(const double* p_, const double* q_,int n)
{
SSE_PTR __restrict__ p = SSE_PTR(p_+1); // +2 leads to an aligned access
SSE_PTR __restrict__ q = SSE_PTR(q_+1); // +2 leads to an aligned access
real sum = 0;
for(int i=0; i<n;i++)
sum += p[i] * q[i];
return sum;
}
This and several other testcases are included in the attached test.C file.
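To make the "+1" versus "+2" comments concrete: with 8-byte doubles, stepping one element past a 16-byte aligned address lands 8 bytes past the boundary, while stepping two elements lands on the next boundary. A minimal sketch of that arithmetic, assuming sizeof(double) == 8; the helper misalignment and the array buffer are hypothetical and not part of test.C:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many bytes past the previous `alignment`-byte
// boundary the address p lies. A result of 0 means p is aligned.
inline std::size_t misalignment(const void* p, std::size_t alignment)
{
    // The mask trick below is only valid for a power-of-two alignment.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return reinterpret_cast<std::uintptr_t>(p) & (alignment - 1);
}

// A 16-byte aligned array of doubles, standing in for the caller's data.
static double buffer[8] __attribute__((aligned(16)));
```

Here misalignment(buffer, 16) and misalignment(buffer + 2, 16) are both 0, while misalignment(buffer + 1, 16) is 8, so a cast of p_+1 to SSE_PTR asserts an alignment that the address does not have.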
Thanks for any information!
-BenRI
P.S. Here is some background information that I came across:
These messages/threads seem to indicate that the machinery is not in an
optimal state, and could be improved:
- "Specifying alignment of pointer targets"
[http://gcc.gnu.org/ml/gcc/2005-03/msg00952.html]
- "16 byte alignment hint for sse vectorization"
[http://gcc.gnu.org/ml/gcc/2006-11/msg00084.html]
From PR 27827 - http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827 :
"I just quickly glanced at the code, and I see that it never uses
"movapd" from memory, which is a key to getting decent performance."
// File: test.C
// gcc-4.4 -c test.C -O3 -ffast-math -msse3 -ftree-vectorizer-verbose=4
// gcc-4.5 -c test.C -O3 -ffast-math -msse3 -ftree-vectorizer-verbose=4
typedef double real;
// These two lines work (together)
typedef real aligned_real __attribute__((aligned(16)));
typedef const aligned_real* SSE_PTR;
typedef aligned_real* SSE_PTR_non_const;
//typedef real real_array[] __attribute__((aligned(16)));
//typedef real_array* SSE_PTR;
// These three lines do NOT work -- there is no effect.
//typedef const real __attribute__((aligned(16))) *SSE_PTR;
//typedef const real *SSE_PTR __attribute__((aligned(16)));
//typedef const __attribute__((aligned(16))) real *SSE_PTR;
// This function uses ALIGNED accesses
real f(SSE_PTR p, SSE_PTR q,int n)
{
real sum = 0;
for(int i=0; i<n;i++)
sum += p[i] * q[i];
return sum;
}
// This function uses ALIGNED accesses
real f2a(const double* p_, const double* q_,int n)
{
SSE_PTR __restrict__ p = p_;
SSE_PTR __restrict__ q = q_;
real sum = 0;
for(int i=0; i<n;i++)
sum += p[i] * q[i];
return sum;
}
// This function uses 1 ALIGNED and 1 UNALIGNED access
real f2b(const double* p_, const double* q_,int n)
{
SSE_PTR __restrict__ p = p_; // this one is aligned (?)
SSE_PTR q = q_; // this one is unaligned (?)
real sum = 0;
for(int i=0; i<n;i++)
sum += p[i] * q[i];
return sum;
}
// This function uses UNALIGNED accesses
real f2c(const double* p_, const double* q_,int n)
{
SSE_PTR p = p_;
SSE_PTR q = q_;
real sum = 0;
for(int i=0; i<n;i++)
sum += p[i] * q[i];
return sum;
}
// This function uses UNALIGNED accesses
real f3(const double* p_, const double* q_,int n)
{
SSE_PTR __restrict__ p = SSE_PTR(p_+1); // +2 leads to an aligned access, again.
SSE_PTR __restrict__ q = SSE_PTR(q_+1); // +2 leads to an aligned access, again.
real sum = 0;
for(int i=0; i<n;i++)
sum += p[i] * q[i];
return sum;
}
// This function uses UNALIGNED accesses
real f3b(const double* p_, const double* q_,int n)
{
SSE_PTR __restrict__ p = p_+2; // +2 leads to an aligned access, again.
SSE_PTR __restrict__ q = q_+2; // +2 leads to an aligned access, again.
real sum = 0;
for(int i=0; i<n;i++)
sum += p[i] * q[i];
return sum;
}
// This function uses UNALIGNED accesses
real f4(const double* p_, const double* q_,int m, int n)
{
SSE_PTR p = p_+m;
SSE_PTR q = q_+m;
real sum = 0;
for(int i=0; i<n;i++)
sum += p[i] * q[i];
return sum;
}
template <typename RealType>
RealType f5(const RealType* p_, const RealType* q_, int n)
{
// This works in 4.4 (both in templates, and outside)
// This only partially works in 4.5 (outside templates, but not inside them)
typedef RealType AlignedReal __attribute__((aligned(16)));
typedef const AlignedReal* TEMPLATE_SSE_PTR;
TEMPLATE_SSE_PTR __restrict__ p = p_;
TEMPLATE_SSE_PTR __restrict__ q = q_;
RealType sum = 0;
for(int i=0; i<n;i++)
sum += p[i] * q[i];
return sum;
}
double f5a(const double* p, const double* q, int n)
{
return f5<double>(p,q,n);
}
double f5b(const float* p, const float* q, int n)
{
return f5<float>(p,q,n);
}
// This is aligned
void g(SSE_PTR_non_const __restrict__ a,
SSE_PTR __restrict__ b,
double coef,
unsigned count)
{
for(unsigned i=0; i<count; i++)
a[i] = b[i]*coef;
}
// This is partially aligned
void g2(double* __restrict__ a,
SSE_PTR __restrict__ b,
double coef,
unsigned count)
{
for(unsigned i=0; i<count; i++)
a[i] = b[i]*coef;
}
// This is partially aligned
void g3(SSE_PTR_non_const __restrict__ a,
const double* __restrict__ b,
double coef,
unsigned count)
{
for(unsigned i=0; i<count; i++)
a[i] = b[i]*coef;
}
// This is NOT aligned
void g4(double* __restrict__ a,
const double* __restrict__ b,
double coef,
unsigned count)
{
for(unsigned i=0; i<count; i++)
a[i] = b[i]*coef;
}
// Alignment of access forced using versioning x 2
// This is NOT aligned
// Um... we only force alignment of the store...
void g5(double* __restrict__ a,
const double* __restrict__ b,
double* __restrict__ c,
const double* __restrict__ d,
double coef,
unsigned count)
{
for(unsigned i=0; i<count; i++)
{
a[i] = b[i]*coef;
c[i] = d[i]*coef;
}
}
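The testcases above only promise alignment to the compiler; the caller still has to pass buffers that honor the promise. A minimal driver sketch, assuming a POSIX system that provides posix_memalign. The helpers alloc_aligned_reals and run_f are hypothetical and not part of test.C; f is repeated from the file so that the example is self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free (POSIX)

typedef double real;
typedef real aligned_real __attribute__((aligned(16)));
typedef const aligned_real* SSE_PTR;

// Same as f() in test.C: both pointer targets are declared 16-byte aligned.
real f(SSE_PTR p, SSE_PTR q, int n)
{
    real sum = 0;
    for (int i = 0; i < n; i++)
        sum += p[i] * q[i];
    return sum;
}

// Hypothetical helper: allocate n reals on a 16-byte boundary, or return 0.
real* alloc_aligned_reals(int n)
{
    void* mem = 0;
    if (posix_memalign(&mem, 16, n * sizeof(real)) != 0)
        return 0;
    return static_cast<real*>(mem);
}

// Hypothetical driver: fills two aligned buffers and calls f() on them.
real run_f(int n)
{
    real* p = alloc_aligned_reals(n);
    real* q = alloc_aligned_reals(n);
    assert(p && q);
    // Confirm that the buffers honor the alignment that SSE_PTR promises.
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(q) % 16 == 0);

    for (int i = 0; i < n; i++) {
        p[i] = i + 1;   // 1, 2, 3, ...
        q[i] = 2;
    }
    real result = f(p, q, n);
    free(p);
    free(q);
    return result;
}
```

Passing a buffer that is not actually 16-byte aligned to f would make an aligned load fault at run time, which is why the driver checks the addresses before the call.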