Bug 114995 - C++23 Assume keyword not being used for vectorization
Summary: C++23 Assume keyword not being used for vectorization
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 15.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: VRP
  Show dependency treegraph
 
Reported: 2024-05-08 20:14 UTC by Pratik Chowdhury
Modified: 2024-05-15 10:34 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
testcase (381 bytes, text/plain)
2024-05-08 20:18 UTC, Andrew Pinski
Details
proof of concept implementing a range-op entry for builtin_assume_aligned (746 bytes, patch)
2024-05-14 09:04 UTC, Aldy Hernandez
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Pratik Chowdhury 2024-05-08 20:14:54 UTC
I would like to share a [simple example](https://gcc.godbolt.org/z/dbTsb3YMG)

In this, as can be seen in the _second function_

1. GCC is able to take advantage of builtin_unreachable (because removing it would change the line count)
2. GCC is able to take advantage of __builtin_assume_aligned. (Aligned Loads and Stores)

Both of these seem fair

However, in the _first function_:-

1. [Assume Keyword](https://en.cppreference.com/w/cpp/language/attributes/assume) is used
2. Unaligned Load Store is generated by GCC
3. Modulo operation advantage is not taken by GCC.  
   Clang seems to be taking advantage of the same though.  

It seems that thanks to the assume keyword there is scope for further optimization gains.

The following syntax also seems to work in Clang

```cpp
    [[assume(__builtin_assume_aligned(x_array, 32))]];
    [[assume(__builtin_assume_aligned(mul_array, 32))]];
    [[assume(__builtin_assume_aligned(add_array, 32))]];
```

Why I wanted to use the assume keyword?

1. There is scope for increased const correctness
2. Looks nicer

```cpp
/// @note This prototype works
void MulAddLoop(const float*  const __restrict mul_array,
                const float*const  __restrict add_array, const ::std::size_t size,
                float* const __restrict x_array) {
```

PS:-

1. This is my first time filing a GCC Issue Report (I wouldn't really call this a bug)  
2. I hope I linked to the correct component as I am not as well versed as I hope I could be with the internals of GCC.
Comment 1 Andrew Pinski 2024-05-08 20:18:32 UTC
Created attachment 58135 [details]
testcase
Comment 2 Andrew Pinski 2024-05-08 20:20:06 UTC
I suspect the syntax you want instead is:
[[assume(uintptr_t(x_array) & (32-1) == 0]];

Becuase __builtin_assume_aligned takes a pointer and returns a pointer that has the assumed alignment
Comment 3 Pratik Chowdhury 2024-05-08 20:26:32 UTC
Yeah definitely.

My bad.

Sorry.

@Andrew Pinski [however even that change does not seem to change the results for GCC with Aligned Loads not being used](https://gcc.godbolt.org/z/9WbMbePc1)

Added the more corrected 2 Power Trick function at the very end 

Any other mistake I could have made?
Comment 4 Andrew Pinski 2024-05-08 20:29:22 UTC
Oh  this is the more correct syntax:
    [[assume((uintptr_t(x_array) & (32-1)) == 0)]];
    [[assume((uintptr_t(mul_array) & (32-1)) == 0)]];
    [[assume((uintptr_t(add_array) & (32-1)) == 0)]];
Comment 5 Andrew Pinski 2024-05-08 20:33:28 UTC
Right now we don't always prop back what information we get from the assume attributes.
Maybe with the recent prange addition, it can for pointers ...
Comment 6 Pratik Chowdhury 2024-05-08 20:53:36 UTC
> [[assume((uintptr_t(x_array) & (32-1)) == 0)]];

The Parans in the & have definitely given someone sleepless nights LOL. I myself was saved by the warnings.

> Right now we don't always prop back what information we get from the assume attributes.
Maybe with the recent prange addition, it can for pointers ...

Aah

Guess we will switch to assume in the future.


I tried [something else just about now](https://gcc.godbolt.org/z/d8aPjzMhq)

I think its a bit wrong. Clang seems to be able to handle it.

Is this syntax even valid?

```cpp
    if(reinterpret_cast<::std::uintptr_t>(__builtin_assume_aligned((void*)(mul_array), 32)) != reinterpret_cast<::std::uintptr_t>(mul_array))
    {
        __builtin_unreachable();
    }
    if(reinterpret_cast<::std::uintptr_t>(__builtin_assume_aligned((void*)(add_array), 32)) != reinterpret_cast<::std::uintptr_t>(add_array))
    {
        __builtin_unreachable();
    }
    if(reinterpret_cast<::std::uintptr_t>(__builtin_assume_aligned((void*)(x_array), 32)) != reinterpret_cast<::std::uintptr_t>(x_array))
    {
        __builtin_unreachable();
    }
```

I have my doubts on the previous one

But this should ideally be valid

```cpp
    if((reinterpret_cast<::std::uintptr_t>(__builtin_assume_aligned((void*)(mul_array), 32)) & (32-1)) != 0)
    {
        __builtin_unreachable();
    }
    if((reinterpret_cast<::std::uintptr_t>(__builtin_assume_aligned((void*)(add_array), 32)) & (32-1)) != 0)
    {
        __builtin_unreachable();
    }
    if((reinterpret_cast<::std::uintptr_t>(__builtin_assume_aligned((void*)(x_array), 32)) & (32-1)) != 0)
    {
        __builtin_unreachable();
    }
```

But either of them are unable to change the Load Stores from Unaligned to Aligned.

Maybe victims of aggressive Dead Code elimination here? GCC intrinsics believe that there can be no case its false and the code is deleted for the same? Because __builtin_assume_aligned should always be a multiple of 32 in the above cases.

Or __builtin_assume_aligned does not support usage like this in GCC and in Clang it does? And due to that difference, we have a difference in behavior.

Its pretty interesting either way.
Comment 7 Jakub Jelinek 2024-05-08 21:52:06 UTC
The above examples just show misunderstanding what __builtin_assume_aligned is and what it is not.  You need to use the result of the built-in function in the accesses to be able to use the alignment information, if you just try to compare __builtin_assume_aligned (x, 32) == x, it will just fold as always true.  The design of the builtin is to attach the alignment information to the result of the builtin function only.

CCing Aldy/Andrew for whether prange can or could be taught to handle the assume cases with uintptr_t and bitwise and + comparison.
Comment 8 Pratik Chowdhury 2024-05-09 19:59:00 UTC
> if you just try to compare __builtin_assume_aligned (x, 32) == x, it will just fold as always true

Aah. Dead code elimination.

> CCing Aldy/Andrew for whether prange can or could be taught to handle the assume cases with uintptr_t and bitwise and + comparison.

Yeah this could be very helpful in cases such as this.

@Jakub @Andrew I think [this](https://gcc.godbolt.org/z/MEre8hr71) also has scope for taking advantage of the same.

```cpp
void MulAddLoopWorksWithBuitInUnreachableAndConst(const float* const __restrict mul_array,
                     const float* const __restrict add_array,
                     const ::std::size_t size, float* const __restrict x_array) {
    if((reinterpret_cast<::std::uintptr_t>(mul_array) & (32-1)) != 0)
    {
        __builtin_unreachable();
    }
    if((reinterpret_cast<::std::uintptr_t>(add_array) & (32-1)) != 0)
    {
        __builtin_unreachable();
    }
    if((reinterpret_cast<::std::uintptr_t>(x_array) & (32-1)) != 0)
    {
        __builtin_unreachable();
    }
    if ((size & (32 - 1)) != 0) __builtin_unreachable();
    for (::std::size_t i = 0; i != size; i++) [[likely]] {
        const auto mul = *(mul_array + i);
        const auto add = *(add_array + i);
        x_array[i] = x_array[i] * mul + add;
        // x_array[i] *= mul;
        // x_array[i] += add;
    }
}
```

Here we are working under the assumption that the memory addresses themselves are multiples of 32 if aligned for AVX2.

Clang seems to be able to take advantage of the same here.

If the __builtin_assume_aligned is kinda not supported due to dead code elimination, then this looks like a nice enough alternative.

It also retains const correctness for me.
Comment 9 Aldy Hernandez 2024-05-14 09:00:01 UTC
(In reply to Jakub Jelinek from comment #7)
> The above examples just show misunderstanding what __builtin_assume_aligned
> is and what it is not.  You need to use the result of the built-in function
> in the accesses to be able to use the alignment information, if you just try
> to compare __builtin_assume_aligned (x, 32) == x, it will just fold as
> always true.  The design of the builtin is to attach the alignment
> information to the result of the builtin function only.
> 
> CCing Aldy/Andrew for whether prange can or could be taught to handle the
> assume cases with uintptr_t and bitwise and + comparison.

All the pieces are there to make it work, both with the assume aligned and with the uintptr_t case.  And we could probably get it all without prange.

For example:

#include <cstdint>

void foo (const float *);

void bar1 (const float *array)
{
  [[assume(array != nullptr)]];
  const float *aligned = (const float *) __builtin_assume_aligned (array, 32);
  foo (aligned);
}

The __builtin_assume_aligned hasn't been expanded by evrp, so we should be able to add a range-op entry for it.  This is what evrp sees:

void bar1 (const float * array)
{
  const float * aligned;

  <bb 2> :
  aligned_2 = __builtin_assume_aligned (array_1(D), 32);
  foo (aligned_2);
  return;

}

All we need is a range-op implementation for builtin_assume_aligned.  The attached crude implementation does it.

=========== BB 2 ============
    <bb 2> :
    aligned_2 = __builtin_assume_aligned (array_1(D), 32);
    foo (aligned_2);
    return;

aligned_2 : [prange] const float * [0, +INF] MASK 0xffffffff00000000 VALUE 0x0

That is, the bottom 32 bits are cleared.

Andrew will have to comment on the uintptr_t idiom, because it gets expanded into an .ASSUME() function which I'm unfamiliar with.

For this small function:

void bar2 (const float *array)
{
  [[assume((uintptr_t (array) & (32 - 1)) == 0)]];
  foo (array);
}

evrp expands to:

=========== BB 2 ============
Partial equiv (array.0_3 pe64 array_2(D))
    <bb 2> :
    array.0_3 = (long unsigned int) array_2(D);
    _4 = array.0_3 & 31;
    _5 = _4 == 0;
    return _5;

_4 : [irange] long unsigned int [0, 31] MASK 0x1f VALUE 0x0

I don't see any reason why we couldn't get that array.0_3 and array_2 are aligned to 32-bits.  Maybe we don't set the value/mask pair for the bitwise_and::op1_range?  The value/mask stuff is not very fleshed out, especially for the op1_range operators.
Comment 10 Aldy Hernandez 2024-05-14 09:04:55 UTC
Created attachment 58202 [details]
proof of concept implementing a range-op entry for builtin_assume_aligned

Something like this (properly coded and enhanced) would work.

The pointers_handled_p() cruft is temporary boilerplate to indicate that we only support a fold-range() of PRANGE = IRANGE op PRANGE for this operation.  This is going away when the dust settles.
Comment 11 Aldy Hernandez 2024-05-14 09:15:25 UTC
Just to clarify.  prange as well as irange keep a value/mask pair where we can store alignment info, so every calculation (range-op) along the way can track this properly.

Now, for pointers we loose this information across passes because of the union in SSA_NAME_RANGE_INFO (struct tree_ssa_name) which keeps the range info in an ptr_info_def, not a proper vrange_storage.  It's in my long term TODO list to propose we properly track pointer ranges once prange comes live.

Note that IPA keeps alignment info on the side, so part of this info is kept across passes.  But I assume they're doing this, because originally ranges didn't have alignment (value/mask) info associated with them.
Comment 12 GCC Commits 2024-05-15 10:33:10 UTC
The master branch has been updated by Aldy Hernandez <aldyh@gcc.gnu.org>:

https://gcc.gnu.org/g:c400b2100719d0a9e5989c63e0827b9e98919df3

commit r15-504-gc400b2100719d0a9e5989c63e0827b9e98919df3
Author: Aldy Hernandez <aldyh@redhat.com>
Date:   Tue May 14 16:21:50 2024 +0200

    [prange] Default pointers_handled_p() to false.
    
    The pointers_handled_p() method is an internal range-op helper to help
    catch dispatch type mismatches for pointer operands.  This is what
    caught the IPA mismatch in PR114985.
    
    This method is only a temporary measure to catch any incompatibilities
    in the current pointer range-op entries.  This patch returns true for
    any *new* entries in the range-op table, as the current ones are
    already fleshed out.  This keeps us from having to implement this
    boilerplate function for any new range-op entries.
    
            PR tree-optimization/114995
            * range-op-ptr.cc (range_operator::pointers_handled_p): Default to true.
Comment 13 Aldy Hernandez 2024-05-15 10:34:03 UTC
(In reply to Aldy Hernandez from comment #10)
> Created attachment 58202 [details]
> proof of concept implementing a range-op entry for builtin_assume_aligned
> 
> Something like this (properly coded and enhanced) would work.
> 
> The pointers_handled_p() cruft is temporary boilerplate to indicate that we
> only support a fold-range() of PRANGE = IRANGE op PRANGE for this operation.
> This is going away when the dust settles.

FYI, I've removed the pointers_handled_p requirement for new range-op entries.