Bug 117016 - Alignment requirements of std::simd too large
Status: RESOLVED INVALID
Product: gcc
Classification: Unclassified
Component: libstdc++
Version: 14.2.0
Importance: P3 normal
Assignee: Not yet assigned to anyone
Reported: 2024-10-08 10:58 UTC by Pieter P
Modified: 2024-10-09 06:45 UTC

Description Pieter P 2024-10-08 10:58:56 UTC
The value of std::experimental::memory_alignment appears to be too large for fixed-size vectors that are larger than the native vector length.

For example, on both x86 AVX512 and ARM NEON, the alignment requirement for a 12-element double vector reported by libstdc++ is 128 bytes, even though there are no instructions that require such a huge alignment (the maximum is 512 bits/64 bytes for AVX512).

https://godbolt.org/z/45hEWG14b

    #include <experimental/simd>
    namespace stdx = std::experimental;
    using simd12 = stdx::simd<double, stdx::simd_abi::fixed_size<12>>;
    static_assert(stdx::memory_alignment_v<simd12> <= 64);
    // static assertion failed: the comparison reduces to '(128 <= 64)'

Looking at bits/simd.h, it appears that the alignment for fixed_size_simd<T, N> is computed simply as bit_ceil(sizeof(T) * N), which is very conservative.
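
To make the arithmetic concrete, here is a minimal sketch of the formula quoted above. It is a model only, not the libstdc++ source: `next_pow2` and `fixed_size_alignment_model` are hypothetical names, and `next_pow2` is a hand-rolled stand-in for C++20's `std::bit_ceil`. It assumes `sizeof(double) == 8` and `sizeof(float) == 4`.

```cpp
#include <cstddef>

// Hypothetical helper: smallest power of two >= x
// (the same rounding that std::bit_ceil performs in C++20).
constexpr std::size_t next_pow2(std::size_t x) {
    std::size_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

// Hypothetical model of the formula described in the report:
// alignment = bit_ceil(sizeof(T) * N). Not taken from bits/simd.h.
template <class T, std::size_t N>
constexpr std::size_t fixed_size_alignment_model = next_pow2(sizeof(T) * N);

// 12 doubles occupy 96 bytes, which rounds up to 128.
static_assert(fixed_size_alignment_model<double, 12> == 128, "96 -> 128");
// 8 doubles occupy exactly 64 bytes, so no rounding happens.
static_assert(fixed_size_alignment_model<double, 8> == 64, "64 -> 64");
// 3 floats occupy 12 bytes, which rounds up to 16.
static_assert(fixed_size_alignment_model<float, 3> == 16, "12 -> 16");
```

Under this model the alignment grows with the element count rather than with the widest hardware register, which is why the 12-element case exceeds the 64 bytes of an AVX512 register.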

Having different alignment requirements for large SIMD vectors makes allocating buffers that are reused with different vector lengths quite tricky.
Comment 1 Andrew Pinski 2024-10-08 11:24:09 UTC
> Having different alignment requirements for large SIMD vectors makes allocating buffers that are reused with different vector lengths quite tricky.


Not really. Considering operator new gets passed the alignment since c++14.
Comment 2 Andrew Pinski 2024-10-08 11:58:04 UTC
(In reply to Andrew Pinski from comment #1)
> Not really. Considering operator new gets passed the alignment since c++14.

s/c++14/c++17/
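
For reference, a minimal sketch of the C++17 aligned allocation mentioned above. The helper names `allocate_aligned_doubles`, `free_aligned_doubles`, and `is_aligned` are hypothetical; only `::operator new`/`::operator delete` with `std::align_val_t` are standard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical helper: allocate room for `count` doubles at an address that
// is a multiple of `alignment` bytes, using the C++17 aligned operator new.
inline double* allocate_aligned_doubles(std::size_t count, std::size_t alignment) {
    // The alignment must be a power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    void* p = ::operator new(count * sizeof(double), std::align_val_t{alignment});
    return static_cast<double*>(p);
}

// Hypothetical helper: release memory obtained from allocate_aligned_doubles.
// The same alignment must be passed to the matching operator delete.
inline void free_aligned_doubles(double* p, std::size_t alignment) noexcept {
    ::operator delete(p, std::align_val_t{alignment});
}

// Hypothetical helper: check whether a pointer is a multiple of `alignment`.
inline bool is_aligned(const void* p, std::size_t alignment) noexcept {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With these helpers, `allocate_aligned_doubles(12, 128)` would return storage suitable for the 128-byte requirement reported for the 12-element vector.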
Comment 3 Matthias Kretz (Vir) 2024-10-08 12:28:05 UTC
Thank you for taking the time to report this inefficiency. But this is working as intended. The design goal of the fixed_size ABI was an ABI-stable "this is never going to break on ABI boundaries" type. The only way to guarantee that without an oracle that can predict the future is to use the most conservative alignment.

FWIW, I always thought that fixed_size is an interesting experiment and potentially an ABI tag someone at some point might want. But I don't think it should have been the prominent "I want a fixed number of elements" ABI that it is in the TS. This is going to be better with C++26 ... if we make it. :)
Comment 4 Pieter P 2024-10-08 12:46:42 UTC
Thank you, I appreciate the quick responses.

> The design goal of the fixed_size ABI was an ABI-stable "this is never going to break on ABI boundaries" type.

You're right, ABI stability is not something I considered.

> Not really. Considering operator new gets passed the alignment since c++17.

Indeed, but allocation is not the issue I was struggling with: I wanted to be able to have the simple requirement that "all array arguments should be aligned to the native SIMD size" for the user-facing APIs, and then decide on the optimal SIMD size in the underlying algorithm, as an implementation detail.

The algorithm may encounter remainders that are not multiples of the native SIMD sizes, and I was using the fixed_size_simd type to handle those. As a work-around, I'll just have to implement the remainders manually using multiple smaller native SIMD types and/or masks, rather than using fixed_size_simd.
Comment 5 Matthias Kretz (Vir) 2024-10-09 06:45:29 UTC
Wrt. working on a larger data set you might be interested in:
https://github.com/mattkretz/vir-simd?tab=readme-ov-file#simd-execution-policy-p0350

For the problem you seem to describe, I like to have a native_simd-aligned array of scalars and then iterate over it using native_simd. If your algorithm allows, the simplest epilogue is allocation of some extra values (this allocation is free, because of alignment and how allocators work) and then simply process a few more inputs and ignore the outputs from the padding.
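
A minimal sketch of that padded approach, assuming GCC's `<experimental/simd>` is available and that the operation tolerates being applied to padding elements. `padded_size` and `scale_padded` are hypothetical names chosen for illustration, and scaling by a constant is just a placeholder kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <experimental/simd>
#include <new>

namespace stdx = std::experimental;
using V = stdx::native_simd<double>;

// Hypothetical helper: round n up to a whole number of native vectors.
inline std::size_t padded_size(std::size_t n) {
    return (n + V::size() - 1) / V::size() * V::size();
}

// Hypothetical kernel: multiply every element by `factor`, in place.
// Preconditions (assumed, not checked): `data` is aligned to
// stdx::memory_alignment_v<V> and holds at least padded_size(n) doubles.
// The padding elements are processed too; the caller ignores their results.
inline void scale_padded(double* data, std::size_t n, double factor) {
    const std::size_t end = padded_size(n);
    for (std::size_t i = 0; i < end; i += V::size()) {
        V v(data + i, stdx::vector_aligned);   // aligned load of one native vector
        v = v * V(factor);                     // broadcast the factor and multiply
        v.copy_to(data + i, stdx::vector_aligned);  // aligned store
    }
}
```

Because every iteration handles a full native vector, no remainder loop and no `fixed_size_simd` type is needed; the only alignment requirement on the buffer is that of `native_simd`.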