Hello, I noticed today what looks like quite a large performance regression with Eigen (3.3.4) matrix multiplication. It only seems to occur on non-AVX2 code paths: if I compile with -march=native on my Core i7 with AVX2, it's blazingly fast with both g++ versions, but it's slow on an older Core i5 that only has AVX, or when I use -march=core2.

Here are some example timings, but the same applies to all matrix sizes that the benchmark tests (see the end of this message for the code):

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gcc5_gemm_test
1124 1215 1465
elapsed_ms: 1970
--------
1730 1235 1758
elapsed_ms: 3505

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -o gcc7_gemm_test
1124 1215 1465
elapsed_ms: 2998
--------
1730 1235 1758
elapsed_ms: 4628

It's even worse if I test this on an i5-3550, which has AVX but not AVX2:

g++-5 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc5_gemm_test
1124 1215 1465
elapsed_ms: 941
--------
1730 1235 1758
elapsed_ms: 1780

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=native -O3 -o gcc7_gemm_test
1124 1215 1465
elapsed_ms: 1988
--------
1730 1235 1758
elapsed_ms: 3740

I tried the same with -O2 and it gave the same results. That's a drop to nearly half the speed of matrix multiplication on AVX CPUs. Or maybe I've done something wrong. :-)

I realise the benchmark might be a bit crude (better to use Google Benchmark or something like that...), but the results I'm getting are pretty consistent across various CPUs, compilers, and flags.

=== Benchmark code:

// gemm_test.cpp
#include <array>
#include <chrono>
#include <iostream>
#include <random>
#include <string>
#include <vector>

#include <Eigen/Dense>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
using ColMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::ColMajor>;

template <typename Mat>
void run_test(const std::string& name, int s1, int s2, int s3)
{
    using namespace std::chrono;

    float checksum = 0.0f; // to prevent the compiler from optimizing everything away
    const auto start_time_ns = high_resolution_clock::now().time_since_epoch().count();
    for (std::size_t i = 0; i < 10; ++i)
    {
        Mat a_rm(s1, s2);
        Mat b_rm(s2, s3);
        const auto c_rm = a_rm * b_rm;
        checksum += c_rm(0, 0);
    }
    const auto end_time_ns = high_resolution_clock::now().time_since_epoch().count();
    const auto elapsed_ms = (end_time_ns - start_time_ns) / 1000000;
    std::cout << name << " (checksum: " << checksum << ") elapsed_ms: " << elapsed_ms << std::endl;
}

int main()
{
    //std::random_device rd;
    //std::mt19937 gen(0);
    //std::uniform_int_distribution<> dis(1, 2048);
    std::vector<int> vals = { 1124, 1215, 1465, 1730, 1235, 1758, 1116, 1736, 868, 1278, 1323, 788 };
    for (std::size_t i = 0; i < 12; ++i)
    {
        int s1 = vals[i++]; //dis(gen);
        int s2 = vals[i++]; //dis(gen);
        int s3 = vals[i];   //dis(gen);
        std::cout << s1 << " " << s2 << " " << s3 << std::endl;
        run_test<ColMajorMatrixXf>("col major", s1, s2, s3);
        run_test<RowMajorMatrixXf>("row major", s1, s2, s3);
        std::cout << "--------" << std::endl;
    }
    return 0;
}
===
Can you attach preprocessed source please?
The difference seems to be between gcc-5 and gcc-6.
@Richard: I'm not 100% sure what you mean by "preprocessed source", but I googled it and you probably mean the output of compiling with "-c -save-temps". Please see attached.
Created attachment 43363 [details] gcc5_gemm_test.s
Created attachment 43364 [details] gcc7_gemm_test.s
I could also upload you the .ii files but they are 5 MB, which the bugtracker doesn't allow (1 MB limit).
Created attachment 43365 [details] gemm_test.cpp
Created attachment 43366 [details] full_log.txt
(In reply to Patrik Huber from comment #6)
> I could also upload you the .ii files but they are 5 MB, which the
> bugtracker doesn't allow (1 MB limit).

Preprocessed sources are the .ii files (you can use compression).
Created attachment 43367 [details] gcc5_gemm_test.ii
Created attachment 43368 [details] gcc7_gemm_test.ii
Hmm, the preprocessed source(s) are hard to work with, given that the Eigen headers seem to have conditional code depending on the enabled ISAs. From a quick look it seems to be inlining-related? My past experience says that compute kernels in C++ should have the flatten attribute attached to them... Did you try with FDO? (-fprofile-generate, run, -fprofile-use)
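For illustration only (this wrapper function is hypothetical and not part of the attached benchmark), the flatten attribute on a hot compute kernel would look roughly like this with GCC:

__attribute__((flatten))
void multiply(const Eigen::MatrixXf& a, const Eigen::MatrixXf& b, Eigen::MatrixXf& c)
{
    // flatten asks GCC to inline, where possible, every call made inside
    // this function body, including the Eigen expression-template code.
    c.noalias() = a * b;
}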
>> Did you try with FDO? (-fprofile-generate, run, -fprofile-use)

I just tried this with g++-7. It didn't help; the final executable has the same slower run time as in the attached log without FDO.
It even seems a few percent slower after the FDO stuff. But the ` -fprofile-use` is a bit weird. If there is no .gcda file, it doesn't complain. If you give it a file that doesn't exist (e.g. -fprofile-use=foo), then it doesn't complain either. So how can I check whether it really ran the FDO?
I can confirm that on my Haswell machine:

model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

I see the regression with -march=core2 -mtune=core2 -O3 starting from r226901 (first time in GCC 5.x). The time difference is 0:00:14.975390 vs. 0:00:11.889274.
Created attachment 43653 [details] optimized dump before the revision
Created attachment 43654 [details] optimized dump after the revision
(In reply to Patrik Huber from comment #14)
> It even seems a few percent slower after the FDO stuff. But the
> `-fprofile-use` is a bit weird. If there is no .gcda file, it doesn't
> complain. If you give it a file that doesn't exist (e.g. -fprofile-use=foo),
> then it doesn't complain either. So how can I check whether it really ran
> the FDO?

Yep, maybe having an option that will cause a failure would be a good idea. Anyway, you can use -fdump-ipa-profile and check the *.065i.profile file, where you should see something like:

...
Read edge from 0 to 2, count:1
1 edge counts read
...

Note that -fprofile-use=foo tells the compiler to search in *folder* foo for corresponding gcda files.
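A minimal sketch of the full cycle for this test case (the exact .gcda and dump file names can vary with the GCC version and working directory, so treat the file names below as illustrative):

g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -fprofile-generate -o gcc7_gemm_test
./gcc7_gemm_test                 # the instrumented run writes a .gcda profile file (typically gemm_test.gcda)
g++-7 gemm_test.cpp -std=c++17 -I 3rdparty/eigen/ -march=core2 -O3 -fprofile-use -fdump-ipa-profile -o gcc7_gemm_test
grep "edge counts read" gemm_test.cpp.*i.profile   # non-zero counts here mean the profile was actually consumed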
Ok, so this means it is coalescing related. We still don't know which coalescing is good/bad though.
GCC 6 branch is being closed
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.