void foo(int nr, int nc, int ldim,
         double *__restrict a, double *__restrict b)
{
    a = __builtin_assume_aligned(a, 32);
    b = __builtin_assume_aligned(b, 32);
    ldim = (ldim >> 5) << 5;  /* round ldim down to a multiple of 32 elements */
    for (int i = 0; i < nr; i++)
        for (int j = 0; j < nc; j++)
            a[i*ldim + j] += b[i*ldim + j];
}
Both GCC 4.7 and 4.8, on an AVX-capable system with -march=native and -O3, vectorize the inner loop but utilise unaligned loads and stores. Since "a" and "b" are 32-byte aligned and ldim is a multiple of 32 elements (so every row stride is a multiple of 32 bytes), it should be possible to deduce that "a + i*ldim" and "b + i*ldim" are also 32-byte aligned. This would permit the inner loop to be vectorized with aligned loads and stores.
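To make the alignment argument concrete, here is a minimal sketch that checks the property at run time. The helper name rows_aligned32 is hypothetical and not part of the test case; it also assumes the usual flat address model, where converting a pointer to uintptr_t yields its byte address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the start of every row, a + i*ldim,
   is 32-byte aligned, and 0 otherwise. */
int rows_aligned32(const double *a, int nr, int ldim)
{
    for (int i = 0; i < nr; i++) {
        /* Byte address of the first element of row i. */
        uintptr_t addr = (uintptr_t)(a + (size_t)i * (size_t)ldim);
        if (addr % 32 != 0)
            return 0;
    }
    return 1;
}
```

With a 32-byte aligned base, a stride of 32 doubles (256 bytes) or 4 doubles (32 bytes) keeps every row aligned, whereas a stride of 3 doubles (24 bytes) already misaligns the second row.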
We still don't preserve VRP info (patches exist) and furthermore, don't record non-zero bits bitmask there either. Only after we do that we could perhaps handle it IMHO.
(In reply to Jakub Jelinek from comment #1)
> We still don't preserve VRP info (patches exist) and furthermore, don't
> record non-zero bits bitmask there either. Only after we do that we could
> perhaps handle it IMHO.
CCP computes bitmasks but remembers them only for pointers in the form
of alignment/misalignment info.
Would it be any easier --- from an implementation standpoint --- to adopt something similar to the "__assume(predicate)" directive in ICC? This would allow one to state explicitly:
__assume(ldim % 32 == 0);
(Although in the case of 256-bit AVX, ldim % 4 == 0 would be sufficient at double precision.) It is also worth noting that the current version of ICC doesn't spot the alignment either, whether with the bit-shifting trick or with an explicit assumption.
We already have a way to state an assumption:
if (ldim % 32 != 0) __builtin_unreachable ();
and you can define an __assume macro using it.
But, as we don't record the non-zero bitmasks from it anywhere, it can't be used by later optimization passes yet.
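For illustration, a minimal sketch of such a macro applied to the test case. The names ASSUME and foo_assume are hypothetical, and, as noted above, the vectorizer may not be able to use the information yet.

```c
/* ASSUME is a hypothetical macro name, not a GCC built-in. If cond is
   false at run time, the behaviour is undefined. */
#define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)

void foo_assume(int nr, int nc, int ldim,
                double *__restrict a, double *__restrict b)
{
    ASSUME(ldim % 32 == 0);  /* replaces the (ldim >> 5) << 5 trick */
    a = __builtin_assume_aligned(a, 32);
    b = __builtin_assume_aligned(b, 32);
    for (int i = 0; i < nr; i++)
        for (int j = 0; j < nc; j++)
            a[i*ldim + j] += b[i*ldim + j];
}
```

Unlike the shift, this does not change ldim; the caller is responsible for passing a value that really is a multiple of 32.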
Thank you for this information. As an alternative would it be worth considering a pragma along the lines of:
#pragma gcc aligned(32)
which would convey that "in the first iteration of the loop which follows, all relevant variables can be taken as having 32-byte alignment." This would provide quite a nice way of allowing loops like the above to be fully vectorized, and would further avoid the need for explicit calls to __builtin_assume_aligned.
ICC has a similar directive but it only applies to the base pointers. So it would assume that "a" is aligned but not "a + i*ldim".
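For comparison, a sketch of what the explicit route looks like today if the alignment is restated for each row pointer rather than only for the base pointers. The function name foo_rowwise is hypothetical, and this assumes the caller guarantees that "a" and "b" are 32-byte aligned and that ldim is a multiple of 4 doubles. Whether a given GCC version then emits aligned loads and stores would need checking in the generated assembly.

```c
#include <stddef.h>

void foo_rowwise(int nr, int nc, int ldim,
                 double *__restrict a, double *__restrict b)
{
    for (int i = 0; i < nr; i++) {
        /* Precondition (not checked): a and b are 32-byte aligned and
           ldim % 4 == 0, so each row start is 32-byte aligned too. */
        double *ai = __builtin_assume_aligned(a + (ptrdiff_t)i * ldim, 32);
        double *bi = __builtin_assume_aligned(b + (ptrdiff_t)i * ldim, 32);
        for (int j = 0; j < nc; j++)
            ai[j] += bi[j];
    }
}
```

This is exactly the kind of per-pointer annotation that a loop-level pragma would make unnecessary.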
Related to PR63202 too.