Hello, I noticed today what looks like quite a large performance regression with Eigen (3.3.4) matrix multiplication. It only seems to occur on non-AVX2 code paths: if I compile with -march=native on my Core i7 with AVX2, it's blazingly fast with both g++ versions, but it's slow on an older Core i5 that only has AVX, or when I use -march=core2.

Here are some example timings, but the same applies to all matrix sizes that the benchmark tests (see the end of this message for the code):

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gcc5_gemm_test
1124 1215 1465
elapsed_ms: 1970
--------
1730 1235 1758
elapsed_ms: 3505

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gcc7_gemm_test
1124 1215 1465
elapsed_ms: 2998
--------
1730 1235 1758
elapsed_ms: 4628

It's even worse if I test this on an i5-3550, which has AVX but not AVX2:

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc5_gemm_test
1124 1215 1465
elapsed_ms: 941
--------
1730 1235 1758
elapsed_ms: 1780

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test
1124 1215 1465
elapsed_ms: 1988
--------
1730 1235 1758
elapsed_ms: 3740

I tried the same with -O2 and it gave the same results. That's a drop to nearly half the speed of matrix multiplication on AVX CPUs. Or maybe I've done something wrong. :-)

I realise the benchmark might be a bit crude (better to use Google Benchmark or something like that...), but the results I'm getting are pretty consistent across various CPUs, compilers, and flags.

=== Benchmark code:

// gemm_test.cpp
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <string>
#include <vector>

#include <Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;

    float checksum = 0.0f; // to prevent the compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (std::size_t i = 0; i < 10; ++i)
    {
        Mat a_rm(s1, s2);
        Mat b_rm(s2, s3);
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(0, 0);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

int main()
{
    //std::random_device rd;
    //std::mt19937 gen(0);
    //std::uniform_int_distribution<> dis(1, 2048);
    std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116, 1736, 868, 1278, 1323, 788 };
    for (std::size_t i = 0; i < 12; ++i)
    {
        int s1 = vals[i++]; //dis(gen);
        int s2 = vals[i++]; //dis(gen);
        int s3 = vals[i];   //dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "--------" << std::endl;
    }
    return 0;
}
===
Can you attach preprocessed source please?
The difference seems to be between gcc-5 and gcc-6.
@Richard: I'm not 100% sure what you mean by "preprocessed source", but I googled it and you probably mean the output of compiling with "-c -save-temps". Please see attached.
Created attachment 43363 [details] gcc5_gemm_test.s
Created attachment 43364 [details] gcc7_gemm_test.s
I could also upload you the .ii files but they are 5 MB, which the bugtracker doesn't allow (1 MB limit).
Created attachment 43365 [details] gemm_test.cpp
Created attachment 43366 [details] full_log.txt
(In reply to Patrik Huber from comment #6)
> I could also upload you the .ii files but they are 5 MB, which the
> bugtracker doesn't allow (1 MB limit).

Preprocessed sources are the .ii files (you can use compression).
Created attachment 43367 [details] gcc5_gemm_test.ii
Created attachment 43368 [details] gcc7_gemm_test.ii
Hmm, the preprocessed source(s) are hard to work with, given that the Eigen headers seem to have conditional code depending on the enabled ISAs. From a quick look it seems to be inlining-related? My past experience says that compute kernels in C++ should have the flatten attribute attached to them... Did you try with FDO? (-fprofile-generate, run, -fprofile-use)
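For illustration only (this wrapper function is hypothetical and not part of the attached benchmark), the flatten attribute on a hot compute kernel would look roughly like this with GCC:

__attribute__((flatten))
void multiply(const Eigen::MatrixXf& a, const Eigen::MatrixXf& b, Eigen::MatrixXf& c)
{
    // flatten asks GCC to inline, where possible, every call made inside
    // this function body, including the Eigen expression-template code.
    c.noalias() = a * b;
}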
>> Did you try with FDO? (-fprofile-generate, run, -fprofile-use)

I just tried this with g++-7. It didn't help; the final executable has the same slower run time as in the attached log without FDO.
It even seems a few percent slower after the FDO stuff. But the ` -fprofile-use` is a bit weird. If there is no .gcda file, it doesn't complain. If you give it a file that doesn't exist (e.g. -fprofile-use=foo), then it doesn't complain either. So how can I check whether it really ran the FDO?
I can confirm that on my Haswell machine:

model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

I see the regression with -march=core2 -mtune=core2 -O3 starting from r226901 (first time in GCC 5.x). The time difference is 0:00:14.975390 vs. 0:00:11.889274.
Created attachment 43653 [details] optimized dump before the revision
Created attachment 43654 [details] optimized dump after the revision
(In reply to Patrik Huber from comment #14)
> It even seems a few percent slower after the FDO stuff. But the
> `-fprofile-use` is a bit weird. If there is no .gcda file, it doesn't
> complain. If you give it a file that doesn't exist (e.g. -fprofile-use=foo),
> then it doesn't complain either. So how can I check whether it really ran
> the FDO?

Yep, maybe having an option that will cause a failure would be a good idea. Anyway, you can use -fdump-ipa-profile and check the *.065i.profile file, where you should see something like:

...
Read edge from 0 to 2, count:1
1 edge counts read
...

Note that -fprofile-use=foo tells the compiler to search in *folder* foo for corresponding gcda files.
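A minimal sketch of the full cycle for this test case (the exact .gcda and dump file names can vary with the GCC version and working directory, so treat the file names below as illustrative):

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -fprofile-generate -o gcc7_gemm_test
./gcc7_gemm_test                 # the instrumented run writes a .gcda profile file (typically gemm_test.gcda)
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -fprofile-use -fdump-ipa-profile -o gcc7_gemm_test
grep "edge counts read" gemm_test.cpp.*i.profile   # non-zero counts here mean the profile was actually consumed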
Ok, so this means it is coalescing related. We still don't know which coalescing is good/bad though.
GCC 6 branch is being closed
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.