Bug 58280 - Missed Opportunity for Aligned Vectorized Load
Summary: Missed Opportunity for Aligned Vectorized Load
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2013-08-30 12:17 UTC by Freddie Witherden
Modified: 2021-07-21 03:23 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Description Freddie Witherden 2013-08-30 12:17:16 UTC
Consider

void foo(int nr, int nc, int ldim,
         double *__restrict a, double *__restrict b)
{
    a = __builtin_assume_aligned(a, 32);
    b = __builtin_assume_aligned(b, 32);

    ldim = (ldim >> 5) << 5;
    
    for (int i = 0; i < nr; i++)
        for (int j = 0; j < nc; j++)
            a[i*ldim + j] += b[i*ldim + j];
}

Both GCC 4.7 and 4.8, on an AVX-capable system with -march=native and -O3, vectorize the inner loop but use unaligned loads and stores.  Since "a" and "b" are 32-byte aligned and ldim is a multiple of 32 elements (256 bytes for doubles), it should be possible to deduce that "a + i*ldim" and "b + i*ldim" are also 32-byte aligned.  This would permit the inner loop to be vectorized with aligned loads.
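The alignment argument in the description can be checked at run time with a small sketch. The helper names is_aligned and rows_aligned are hypothetical and not part of the report; they only confirm the arithmetic and say nothing about what the vectorizer emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when p is aligned to `align` bytes. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Hypothetical helper: rounds ldim down to a multiple of 32 elements,
   as the report does, then checks that every row start a + i*ldim is
   32-byte aligned.  The caller must ensure nr*ldim elements fit in a. */
static int rows_aligned(const double *a, int nr, int ldim)
{
    ldim = (ldim >> 5) << 5;

    for (int i = 0; i < nr; i++)
        if (!is_aligned(a + (size_t)i * ldim, 32))
            return 0;
    return 1;
}
```

If "a" itself is 32-byte aligned, rows_aligned should return nonzero for any ldim, because each row offset is a multiple of 32 * sizeof(double) = 256 bytes.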
Comment 1 Jakub Jelinek 2013-08-30 12:25:34 UTC
We still don't preserve VRP info (patches exist) and furthermore, don't record non-zero bits bitmask there either.  Only after we do that we could perhaps handle it IMHO.
Comment 2 Richard Biener 2013-08-30 12:38:24 UTC
(In reply to Jakub Jelinek from comment #1)
> We still don't preserve VRP info (patches exist) and furthermore, don't
> record non-zero bits bitmask there either.  Only after we do that we could
> perhaps handle it IMHO.

CCP computes bitmasks but remembers them only for pointers in the form
of alignment/misalignment info.
Comment 3 Freddie Witherden 2013-08-30 12:43:45 UTC
Would it be any easier, from an implementation standpoint, to adopt something similar to the "__assume(predicate)" directive in ICC?  This would allow one to state explicitly:

__assume(ldim % 32 == 0);

(Although in the case of 256-bit AVX, ldim % 4 == 0 would be sufficient at double precision.)  It is also worth noting that the current version of ICC doesn't spot the alignment either, whether with the bit-shifting trick or with an explicit assumption.
Comment 4 Jakub Jelinek 2013-08-30 12:52:37 UTC
We already have a way to state an assumption:
if (ldim % 32 != 0) __builtin_unreachable ();
and you can define a __assume macro using it.
But, as we don't record the non-zero bitmasks from it anywhere, later optimization passes can't use it yet.
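A minimal sketch of the macro described in this comment, applied to the loop from the description. The names ASSUME and foo_assume are hypothetical. As the comment notes, GCC at the time of this report did not carry the information through to later passes, so this should not be expected to change the generated loads by itself.

```c
/* Hypothetical macro: tells the compiler that cond always holds.
   Behaviour is undefined if cond is false at run time. */
#define ASSUME(cond) \
    do { if (!(cond)) __builtin_unreachable(); } while (0)

void foo_assume(int nr, int nc, int ldim,
                double *__restrict a, double *__restrict b)
{
    a = __builtin_assume_aligned(a, 32);
    b = __builtin_assume_aligned(b, 32);

    /* Replaces the shift trick: the caller promises ldim % 32 == 0. */
    ASSUME(ldim % 32 == 0);

    for (int i = 0; i < nr; i++)
        for (int j = 0; j < nc; j++)
            a[i*ldim + j] += b[i*ldim + j];
}
```

Unlike the shift in the original code, which silently rounds ldim down, this version leaves ldim unchanged and makes a violated assumption undefined behaviour, so callers must guarantee it.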
Comment 5 Freddie Witherden 2013-08-30 13:15:31 UTC
Thank you for this information.  As an alternative would it be worth considering a pragma along the lines of:

#pragma gcc aligned(32)

which would state that "in the first iteration of the loop which follows, all relevant variables can be taken as having 32-byte alignment."  This would provide quite a nice way of allowing loops like the above to be fully vectorized, and would also avoid the need for explicit calls to __builtin_assume_aligned.

ICC has a similar directive but it only applies to the base pointers.  So it would assume that "a" is aligned but not "a + i*ldim".
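For comparison, the explicit calls that such a pragma would replace might look like the sketch below, which asserts the alignment of each row pointer inside the outer loop. The function name foo_rows and the local names ai and bi are hypothetical, and the thread does not confirm whether any particular GCC version then emits aligned loads for the inner loop.

```c
void foo_rows(int nr, int nc, int ldim,
              double *__restrict a, double *__restrict b)
{
    /* Round ldim down to a multiple of 32 elements, as in the report. */
    ldim = (ldim >> 5) << 5;

    for (int i = 0; i < nr; i++) {
        /* Assert alignment per row instead of only on the base pointers.
           This is valid only if a and b are themselves 32-byte aligned. */
        double *ai = __builtin_assume_aligned(a + i*ldim, 32);
        double *bi = __builtin_assume_aligned(b + i*ldim, 32);

        for (int j = 0; j < nc; j++)
            ai[j] += bi[j];
    }
}
```

This states directly what the description hopes the compiler could deduce, at the cost of two extra builtin calls per loop nest.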
Comment 6 Andrew Pinski 2021-07-21 03:23:21 UTC
Related to PR63202 too.