Created attachment 59292 [details]
ii file

Overview:

Found a sequence of code using bitset where compiling with -march=native and -O2 is 25% slower than with just -O2, on an Intel i7-9750H. Repeatable; it also occurs on an Intel i7-3770, but with a much smaller decrease in performance (around ~5%). At -O2 the runtime is ~15 seconds; at -march=native;-O2 it's ~20 seconds.

Have used a PCG-based rand() in the code, since the regular rand() slows the program down by 5x (the difference in runtime between -march=native;-O2 and -O2 stays the same, but the percentage change is dramatically influenced by rand() taking all the CPU time). The code adds values to a total to prevent the loop being optimized out. And yes, .count() would've probably been more typical code.

GCC version: 13.2.0

System type: x86_64-w64-mingw32, windows, x64

Configured with: ../src/configure --enable-languages=c,c++ --build=x86_64-w64-mingw32 --host=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --disable-multilib --prefix=/e/temp/gcc/dest --with-sysroot=/e/temp/gcc/dest --disable-libstdcxx-pch --disable-libstdcxx-verbose --disable-nls --disable-shared --disable-win32-registry --enable-threads=posix --enable-libgomp --with-zstd=/c/mingw

The complete command line that triggers the bug, for -O2 only:

C:/programming/libraries/nuwen/bin/g++.exe -c "C:/programming/workspaces/march_pessimisation_demo/march_pessimisation_demo.cpp" -O2 -std=c++23 -s -save-temps -DNDEBUG -o build-Release/march_pessimisation_demo.cpp.o -I. -I.
C:/programming/libraries/nuwen/bin/g++.exe -o build-Release\bin\march_pessimisation_demo.exe @build-Release/ObjectsList.txt -L. -O2 -s

Or for -march=native:

C:/programming/libraries/nuwen/bin/g++.exe -c "C:/programming/workspaces/march_pessimisation_demo/march_pessimisation_demo.cpp" -O2 -march=native -std=c++23 -s -save-temps -DNDEBUG -o build-Release/march_pessimisation_demo.cpp.o -I. -I.
C:/programming/libraries/nuwen/bin/g++.exe -o build-Release\bin\march_pessimisation_demo.exe @build-Release/ObjectsList.txt -L. -s

The compiler output (error messages, warnings, etc.): no errors or warnings.
Can you provide the output of invoking g++ with -march=native and when compiling?
Looks like the reduction loop is vectorized and that is causing the slowdown.

Semi-reduced (unincluded) testcase:

```
#include <bitset>

void g(std::bitset<12800000> &);

int f()
{
  unsigned int total = 0;
  std::bitset<12800000> values;
  g(values);

  for (unsigned int index = 0; index != 12800000; ++index)
    total += values[index];

  return total;
}
```

For Linux, you need `-m32 -O2 -mavx2` (-m32 because the bitset word type is `long`, which is 32 bits for mingw but 64 bits for Linux, and the 64-bit version does not get vectorized).
Note your `total += values[index];` loop could be reduced down to just `total += values.count();`, and that will be over 10x faster, because count() uses popcount directly. I am not sure this is a useful benchmark either. Maybe GCC could detect the popcount idiom here, but I am not sure. LLVM does a slightly better job at vectorizing the loop but still messes it up. Plus, once you add other code around values[index], the vectorizer will no longer kick in, so the slowdown only affects this bad micro-benchmark.
Yeah, I know; I mentioned that in the report. It's not a bad benchmark: it's benchmarking access to individual consecutive bits, not summing. The counting is merely there to prevent the compiler from optimizing out the loop. I could equally make it benchmark random indices, and I imagine the problem would remain, though I haven't checked. Still, your point is valid in that most non-benchmark code would likely have more code around the access. It could potentially lead to misleading benchmark results in other scenarios, though. I haven't tested whether vector/array indexing triggers the same bad vectorisation.
(In reply to Andrew Pinski from comment #1)
> Can you provide the output of invoking g++ with -march=native and when
> compiling?

The .ii files were identical, so did you mean the .o files?
Btw, GCC 14.2 doesn't vectorize this for me anymore, likely because the use of gather has been nerfed (for Intel). With AVX2 we see

  <bb 3> [local count: 139586405]:
  # vect_total_21.15_24 = PHI <vect_total_11.24_41(3), { 0, 0, 0, 0, 0, 0, 0, 0 }(2)>
  # vect_vec_iv_.16_20 = PHI <_18(3), { 0, 1, 2, 3, 4, 5, 6, 7 }(2)>
  # ivtmp.27_27 = PHI <ivtmp.27_23(3), 0(2)>
  _18 = vect_vec_iv_.16_20 + { 8, 8, 8, 8, 8, 8, 8, 8 };
  vect__17.17_5 = vect_vec_iv_.16_20 >> 5;
  vect__19.18_3 = vect_vec_iv_.16_20 & { 31, 31, 31, 31, 31, 31, 31, 31 };
  vect_30 = VIEW_CONVERT_EXPR<vector(8) int>(vect__17.17_5);
  vect_31 = __builtin_ia32_gathersiv8si ({ 0, 0, 0, 0, 0, 0, 0, 0 }, &values, vect_30, { -1, -1, -1, -1, -1, -1, -1, -1 }, 4);
  vect__13.19_32 = VIEW_CONVERT_EXPR<vector(8) long unsigned int>(vect_31);
  vect__14.20_34 = { 1, 1, 1, 1, 1, 1, 1, 1 } << vect__19.18_3;
  vect__15.21_35 = vect__13.19_32 & vect__14.20_34;
  mask__16.22_37 = vect__15.21_35 != { 0, 0, 0, 0, 0, 0, 0, 0 };
  _51 = VIEW_CONVERT_EXPR<vector(8) unsigned int>(mask__16.22_37);
  vect_total_11.24_41 = vect_total_21.15_24 - _51;
  ivtmp.27_23 = ivtmp.27_27 + 1;
  if (ivtmp.27_23 != 1600000)

so the first point is we are not able to analyze the memory access pattern in a very good way, and then of course cost modeling breaks down here as well. The scalar IL is

  <bb 3> [local count: 1063004409]:
  # total_21 = PHI <total_11(5), 0(2)>
  # index_23 = PHI <index_12(5), 0(2)>
  # ivtmp_27 = PHI <ivtmp_26(5), 12800000(2)>
  _17 = index_23 >> 5;
  _19 = index_23 & 31;
  _13 = MEM <struct _Base_bitset> [(_WordT *)&values]._M_w[_17];
  _14 = 1 << _19;
  _15 = _13 & _14;
  _16 = _15 != 0;
  _1 = (unsigned int) _16;
  total_11 = _1 + total_21;
  index_12 = index_23 + 1;
  ivtmp_26 = ivtmp_27 - 1;
  if (ivtmp_26 != 0)

I think for this kind of access pattern it might be nice to unroll the loop as many times as the same memory location is accessed (1 << 5, aka 32 times). But as Andrew says - it's just a very bad testcase ;)
(In reply to Matt Bentley from comment #4)
> I haven't tested whether vector/array
> indexing triggers the same bad vectorisation.

I'm certain it won't be, because (apart from vector<bool>) they access memory locations that are at least one byte, not repeating very similar bitwise ops for every bit in a word.
(In reply to Matt Bentley from comment #5)
> (In reply to Andrew Pinski from comment #1)
> > Can you provide the output of invoking g++ with -march=native and when
> > compiling?
>
> The .ii files were identical, so did you mean the .o files?

I'm guessing Andrew meant the output when you add -v to the compilation command, with and without -march=native. That would show exactly the instruction sets being enabled by -march=native.
(In reply to Jonathan Wakely from comment #7)
> I'm certain it won't be, because (apart from vector<bool>) they access
> memory locations that are at least one byte, not repeating very similar
> bitwise ops for every bit in a word.

Have tested 4 additional scenarios: random access, sequential iteration with an if statement, the vector equivalent, and sequential iteration with an if statement and a more complicated action. Can confirm that the issue only occurs in the second of those, i.e.:

{
	std::bitset<12800000> values;

	for (unsigned int counter = 0; counter != 500; ++counter)
	{
		for (unsigned int index = 0; index != 12800000; ++index)
		{
			values.set(index, plf::rand() & 1);
		}

		for (unsigned int index = 0; index != 12800000; ++index)
		{
			if (values[index])
			{
				total += values[index] * 2;
			}
		}
	}
}

While this statement is closer to what you might see in real-world code (i.e. "if 1, do this"), I'm assuming it gets optimized into a conditional move or something, as the same 25% slowdown occurs. Once you add a more complicated action, i.e.:

for (unsigned int index = 0; index != 12800000; ++index)
{
	if (values[index])
	{
		total += values[index] * 2;
		total ^= values[index];
	}
}

the problem goes away. So essentially Andrew's correct, unless the user's chosen action can be trivially optimized. Will generate the -v output for the second of the scenarios now.
Created attachment 59293 [details]
-v build output for -O2;-march=native for the second "if" variant
Created attachment 59294 [details]
-v build output for -O2 only for the second "if" variant

As requested
Thanks!
Bool C-arrays also display the same problem, but only if a warm-up loop is present before the main loop (to 'warm up' the cache). I discovered this by accident the other day. char arrays and int arrays did not display the same slowdown.

A pre-main-loop warm-up is quite common for microbenchmarks, to get the data into cache. I hadn't done that with the previous additional tests mentioned above, but once I did, I found that vector<char> had the same -march=native pessimisation. I'm not assuming that the cause of this is the same, but the result certainly is (between 10% and 20% slowdown under -march=native).

Attaching the .ii files for the 2 new cases.
Created attachment 59355 [details]
.ii file for vector<char> code causing same issue
Created attachment 59356 [details]
.ii file for bool[array] causing same issue