This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
On 11/15/2013 2:26 PM, OndÅej BÃlka wrote:
Where gcc or gfortran choose to split sse2 or sse4 loads, I found a marked advantage in that choice on my Westmere (which I seldom power on nowadays). You are correct that this finding is in disagreement with Intel documentation, and it has the effect that Intel option -xHost is not the optimum one. I suspect the Westmere was less well performing than Nehalem on unaligned loads. Another poorly documented feature of Nehalem and Westmere was a preference for 32-byte aligned data, more so than Sandy Bridge. Intel documentation encourages use of unaligned AVX-256 loads on Ivy Bridge and Haswell, but Intel compilers don't implement them (except for intrinsics) until AVX2. Still, on my own Haswell tests, the splitting of unaligned loads by use of AVX compile option comes out ahead. Supposedly, the preference of Windows intrinsics programmers for the relative simplicity of unaligned moves was taken into account in the more recent hardware designs, as it was disastrous for Sandy Bridge. I have only remote access to Haswell although I plan to buy a laptop soon. I'm skeptical about whether useful findings on these points may be obtained on a Windows laptop. In case you didn't notice it, Intel compilers introduced #pragma vector unaligned as a means to specify handling of unaligned access without peeling. I guess it is expected to be useful on Ivy Bridge or Haswell for cases where the loop count is moderate but expected to match unrolled AVX-256, or if the case where peeling can improve alignment is rare. In addition, Intel compilers learned from gcc the trick of using AVX-128 for situations where frequent unaligned accesses are expected and peeling is clearly undesirable. The new facility for vectorizing OpenMP parallel loops (e.g. #pragma omp parallel for simd) uses AVX-128, consistent with the fact that OpenMP chunks are more frequently unaligned. In fact, parallel for simd seems to perform nearly the same with gcc-4.9 as with icc. Many decisions on compiler defaults still are based on an unscientific choice of benchmarks, with gcc evidently more responsive to input from the community.On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:Also keep in mind that usually costs go up significantly if misalignment causes cache line splits (processor will fetch 2 lines). There are non-linear costs of filling up the store queue in modern out-of-order processors (x86). Bottom line is that it's much better to peel e.g. for AVX2/AVX3 if the loop would cause loads that cross cache line boundaries otherwise. The solution is to either actually always peel for alignment, or insert an additional check for cache line boundaries (for high trip count loops).That is quite bold claim do you have a benchmark to support that? Since nehalem there is no overhead of unaligned sse loads except of fetching cache lines. As haswell avx2 loads behave in similar way.
-- Tim Prince
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |