Bug 117008 - -march=native pessimization of 25% with bitset [] and 20% with bool array []
Summary: -march=native pessimization of 25% with bitset [] and 20% with bool array []
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 13.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2024-10-08 00:00 UTC by Matt Bentley
Modified: 2024-10-15 22:41 UTC
CC List: 0 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-10-08 00:00:00


Attachments
ii file (292.95 KB, application/x-zip-compressed)
2024-10-08 00:00 UTC, Matt Bentley
Details
-v build output for -O2;-march=native for second "if" variant (1.75 KB, text/plain)
2024-10-08 08:51 UTC, Matt Bentley
Details
-v build output for -O2 only in second "if" variant (1.28 KB, text/plain)
2024-10-08 08:51 UTC, Matt Bentley
Details
.ii file for vector<char> code causing same issue (288.41 KB, application/x-zip-compressed)
2024-10-15 22:40 UTC, Matt Bentley
Details
.ii file for bool[array] causing same issue (292.96 KB, application/x-zip-compressed)
2024-10-15 22:41 UTC, Matt Bentley
Details

Description Matt Bentley 2024-10-08 00:00:33 UTC
Created attachment 59292 [details]
ii file

Overview: Found a sequence of code using bitset where compiling with -march=native and -O2 is 25% slower than with -O2 alone, on an Intel i7-9750H. This is repeatable; it also occurs on an Intel i7-3770, but with a much smaller performance decrease (around 5%).

At -O2 the runtime is ~15 seconds; with -march=native;-O2 it's ~20 seconds.

I have used a PCG-based rand() in the code since the regular rand() slows down the program by 5x (the absolute difference in runtime between -march=native;-O2 and -O2 is still the same, but the percentage change is dramatically skewed by rand() taking all the CPU time).

The code adds the values to a total to prevent the loop from being optimized out. And yes, .count() would probably have been more typical code.


GCC version: 13.2.0


System type: x86_64-w64-mingw32, Windows, x64


Configured with:

 ../src/configure --enable-languages=c,c++ --build=x86_64-w64-mingw32 --host=x86_64-w64-mingw32 --target=x86_64-w64-mingw32 --disable-multilib --prefix=/e/temp/gcc/dest --with-sysroot=/e/temp/gcc/dest --disable-libstdcxx-pch --disable-libstdcxx-verbose --disable-nls --disable-shared --disable-win32-registry --enable-threads=posix --enable-libgomp --with-zstd=/c/mingw


The complete command line that triggers the bug, for -O2 only: 

C:/programming/libraries/nuwen/bin/g++.exe  -c  "C:/programming/workspaces/march_pessimisation_demo/march_pessimisation_demo.cpp" -O2 -std=c++23 -s -save-temps -DNDEBUG  -o build-Release/march_pessimisation_demo.cpp.o -I. -I.
C:/programming/libraries/nuwen/bin/g++.exe -o build-Release\bin\march_pessimisation_demo.exe @build-Release/ObjectsList.txt -L.   -O2 -s

Or for -march=native:

C:/programming/libraries/nuwen/bin/g++.exe  -c  "C:/programming/workspaces/march_pessimisation_demo/march_pessimisation_demo.cpp" -O2 -march=native -std=c++23 -s -save-temps -DNDEBUG  -o build-Release/march_pessimisation_demo.cpp.o -I. -I.
C:/programming/libraries/nuwen/bin/g++.exe -o build-Release\bin\march_pessimisation_demo.exe @build-Release/ObjectsList.txt -L. -s


The compiler output (error messages, warnings, etc.):

No errors or warnings.
Comment 1 Andrew Pinski 2024-10-08 00:07:23 UTC
Can you provide the output of invoking g++ with -march=native when compiling?
Comment 2 Andrew Pinski 2024-10-08 00:36:57 UTC
Looks like the reduction loop is vectorized and that is causing the slowdown.

Semi reduced (unincluded) testcase:
```
#include <bitset>
void g(std::bitset<12800000> &);

int f()
{
  unsigned int total = 0;
  std::bitset<12800000> values;
  g(values);
  for (unsigned int index = 0; index != 12800000; ++index)
    total += values[index];
  return total ;
}
```

On Linux, you need `-m32 -O2 -mavx2` (-m32 because the bitset word type is long, which is 32 bits for mingw but 64 bits for Linux, and the 64-bit version does not get vectorized).
Comment 3 Andrew Pinski 2024-10-08 00:44:24 UTC
Note your `total+=values[index];` loop could be reduced down to just `total += values.count();` and that will be over 10x faster.
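
(For illustration only, a minimal sketch of that rewrite against the reduced testcase from comment #2, not the reporter's full benchmark:)

```
#include <bitset>
void g(std::bitset<12800000> &);   // same external filler as in the reduced testcase

int f()
{
  std::bitset<12800000> values;
  g(values);
  // count() popcounts the underlying words one word at a time instead of
  // testing and summing a single bit per loop iteration.
  return static_cast<int>(values.count());
}
```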


I am not sure if this is a useful benchmark either, because count() uses popcount directly. Maybe GCC could detect the popcount here, but I am not sure. LLVM does a slightly better job at vectorizing the loop but still messes it up.

Plus, once you add other code around values[index], the vectorizer will no longer kick in, so the slowdown is only for this bad micro-benchmark.
Comment 4 Matt Bentley 2024-10-08 04:03:58 UTC
Yeah, I know, I mentioned that in the report.

It's not a bad benchmark; it's benchmarking access of individual consecutive bits, not summing. The counting is merely there to prevent the compiler from optimizing out the loop.

I could equally make it benchmark random indices and I imagine the problem would remain, though I haven't checked.

Still, your point is valid in that most non-benchmark code would likely have more code around the access. It could potentially lead to misleading benchmark results in other scenarios, though. I haven't tested whether vector/array indexing triggers the same bad vectorisation.
Comment 5 Matt Bentley 2024-10-08 04:04:50 UTC
(In reply to Andrew Pinski from comment #1)
> Can you provide the output of invoking g++ with -march=native and when
> compiling?

The .ii files were identical, so did you mean .o files?
Comment 6 Richard Biener 2024-10-08 07:57:08 UTC
Btw, GCC 14.2 doesn't vectorize for me anymore, likely because the use of gather has been nerfed (for Intel).

With AVX2 we see

  <bb 3> [local count: 139586405]:
  # vect_total_21.15_24 = PHI <vect_total_11.24_41(3), { 0, 0, 0, 0, 0, 0, 0, 0 }(2)>
  # vect_vec_iv_.16_20 = PHI <_18(3), { 0, 1, 2, 3, 4, 5, 6, 7 }(2)>
  # ivtmp.27_27 = PHI <ivtmp.27_23(3), 0(2)>
  _18 = vect_vec_iv_.16_20 + { 8, 8, 8, 8, 8, 8, 8, 8 };
  vect__17.17_5 = vect_vec_iv_.16_20 >> 5;
  vect__19.18_3 = vect_vec_iv_.16_20 & { 31, 31, 31, 31, 31, 31, 31, 31 };
  vect_30 = VIEW_CONVERT_EXPR<vector(8) int>(vect__17.17_5);
  vect_31 = __builtin_ia32_gathersiv8si ({ 0, 0, 0, 0, 0, 0, 0, 0 }, &values, vect_30, { -1, -1, -1, -1, -1, -1, -1, -1 }, 4);
  vect__13.19_32 = VIEW_CONVERT_EXPR<vector(8) long unsigned int>(vect_31);
  vect__14.20_34 = { 1, 1, 1, 1, 1, 1, 1, 1 } << vect__19.18_3;
  vect__15.21_35 = vect__13.19_32 & vect__14.20_34;
  mask__16.22_37 = vect__15.21_35 != { 0, 0, 0, 0, 0, 0, 0, 0 };
  _51 = VIEW_CONVERT_EXPR<vector(8) unsigned int>(mask__16.22_37);
  vect_total_11.24_41 = vect_total_21.15_24 - _51;
  ivtmp.27_23 = ivtmp.27_27 + 1;
  if (ivtmp.27_23 != 1600000)

so the first point is that we are not able to analyze the memory access pattern
in a very good way, and then of course cost modeling breaks down here as well.

The scalar IL is

  <bb 3> [local count: 1063004409]:
  # total_21 = PHI <total_11(5), 0(2)>
  # index_23 = PHI <index_12(5), 0(2)>
  # ivtmp_27 = PHI <ivtmp_26(5), 12800000(2)>
  _17 = index_23 >> 5;
  _19 = index_23 & 31;
  _13 = MEM <struct _Base_bitset> [(_WordT *)&values]._M_w[_17];
  _14 = 1 << _19;
  _15 = _13 & _14;
  _16 = _15 != 0;
  _1 = (unsigned int) _16;
  total_11 = _1 + total_21;
  index_12 = index_23 + 1;
  ivtmp_26 = ivtmp_27 - 1;
  if (ivtmp_26 != 0)

I think for this kind of access pattern it might be nice to unroll the
loop as many times as the same memory location is accessed (1 << 5, aka
32 times).
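
(At the source level, the word-at-a-time shape such unrolling would aim for looks roughly like the following hand-written sketch; the `words`/`nwords` parameters are stand-ins for the bitset's internal word array, and this is not what GCC actually emits:)

```
#include <cstddef>

// Hand-written sketch of the unrolling idea: load each 32-bit word once and
// consume all 32 of its bits, instead of recomputing index >> 5 and reloading
// the same word on every iteration.  'words'/'nwords' stand in for the
// bitset's internal _M_w storage.
unsigned int count_bits(const unsigned int *words, std::size_t nwords)
{
  unsigned int total = 0;
  for (std::size_t w = 0; w != nwords; ++w)
  {
    unsigned int word = words[w];         // one load per 32 bits
    for (unsigned int bit = 0; bit != 32; ++bit)
      total += (word >> bit) & 1;         // inner loop is trivially a popcount
  }
  return total;
}
```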

But as Andrew says - it's just a very bad testcase ;)
Comment 7 Jonathan Wakely 2024-10-08 08:04:33 UTC
(In reply to Matt Bentley from comment #4)
> I haven't tested whether vector/array
> indexing triggers the same bad vectorisation.

I'm certain it won't, because (apart from vector<bool>) they access memory locations that are at least one byte wide, rather than repeating very similar bitwise ops for every bit in a word.
Comment 8 Jonathan Wakely 2024-10-08 08:07:06 UTC
(In reply to Matt Bentley from comment #5)
> (In reply to Andrew Pinski from comment #1)
> > Can you provide the output of invoking g++ with -march=native and when
> > compiling?
> 
> The .ii files were identical, so did you you mean .o files?

I'm guessing Andrew meant the output when you add -v to the compilation command, with and without -march=native.

That would show exactly the instruction sets being enabled by -march=native.
Comment 9 Matt Bentley 2024-10-08 08:45:43 UTC
(In reply to Jonathan Wakely from comment #7)

> I'm certain it won't be, because (apart from vector<bool>) they access
> memory locations that are at least one byte, not repeating very similar
> bitwise ops for every bit in a word.

I have tested four additional scenarios: random access, sequential iteration with an if statement, the vector equivalent, and sequential iteration with an if statement plus a more complicated action.

I can confirm that the issue only occurs in the second of those, i.e.:
	{
		std::bitset<12800000> values;

		for (unsigned int counter = 0; counter != 500; ++counter)
		{
			for (unsigned int index = 0; index != 12800000; ++index)
			{
				values.set(index, plf::rand() & 1);
			}


			for (unsigned int index = 0; index != 12800000; ++index)
			{
				if (values[index])
				{
					total += values[index] * 2;
				}
			}
		}
	}


While this statement is closer to what you might see in real-world code (i.e. if the bit is set, do this), I'm assuming it gets optimized into a conditional move or something similar, as the same 25% slowdown occurs.
Once you add in a more complicated action, i.e.:

			for (unsigned int index = 0; index != 12800000; ++index)
			{
				if (values[index])
				{
					total += values[index] * 2;
					total ^= values[index];
				}
			}


the problem goes away. So essentially Andrew is correct, unless the user's chosen action can be trivially optimized.

I will generate the -v output for the second of those scenarios now.
Comment 10 Matt Bentley 2024-10-08 08:51:00 UTC
Created attachment 59293 [details]
-v build output for -O2;-march=native for second "if" variant
Comment 11 Matt Bentley 2024-10-08 08:51:45 UTC
Created attachment 59294 [details]
-v build output for -O2 only in second "if" variant

 As requested
Comment 12 Jonathan Wakely 2024-10-08 09:00:53 UTC
Thanks!
Comment 13 Matt Bentley 2024-10-15 22:39:30 UTC
Bool C-arrays also display the same problem, but only if a warm-up loop is present before the main loop (to 'warm up' the cache). I discovered this by accident the other day. char arrays and int arrays did not display the same slowdown.

A pre-main-loop warm-up is quite common for microbenchmarks, to get the data into cache. I hadn't done that with the previous additional tests mentioned above, but once I did, I found that vector<char> showed the same -march=native pessimisation.
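
(For clarity, the warm-up pattern in question is roughly the following sketch; `values` and `total` mirror the earlier tests, while the timing hooks and the fill step are illustrative only:)

```
// Sketch of the warm-up pattern: one untimed pass over the data so the
// measured loop starts with everything in cache.  In the real test, 'values'
// is filled (e.g. with plf::rand() & 1) before the passes; timing code omitted.
static bool values[12800000];

unsigned int run()
{
  unsigned int total = 0;

  for (unsigned int index = 0; index != 12800000; ++index)   // warm-up pass (untimed)
    total += values[index];

  // ... start timer ...
  for (unsigned int index = 0; index != 12800000; ++index)   // measured pass
    total += values[index];
  // ... stop timer ...

  return total;
}
```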

I'm not assuming the cause is the same, but the result certainly is (a 10-20% slowdown under -march=native).

Attaching the .ii files for the two new cases.
Comment 14 Matt Bentley 2024-10-15 22:40:27 UTC
Created attachment 59355 [details]
.ii file for vector<char> code causing same issue
Comment 15 Matt Bentley 2024-10-15 22:41:07 UTC
Created attachment 59356 [details]
.ii file for bool[array] causing same issue