This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

[Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64

From: "burnus at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Thu, 26 Sep 2013 07:26:41 +0000
Subject: [Bug target/58529] GCC -funroll-loops 150% slower with -march=native on x86-64
Auto-submitted: auto-generated
References: <bug-58529-4 at http dot gcc dot gnu dot org/bugzilla/>

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58529

Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|middle-end                  |target
            Summary|Loop 30% faster with Intel  |GCC -funroll-loops 150%
                   |than with GCC               |slower with -march=native
                   |                            |on x86-64

--- Comment #9 from Tobias Burnus <burnus at gcc dot gnu.org> ---
(In reply to Tobias Burnus from comment #8)
> I have to re-check why unrolling made it slower on that Xeon E5-2630
> (comment 0) but faster on the i5.

This seems to be a tuning problem. All timings below were taken on the Xeon
E5-2630; the only difference is whether the program was compiled with the flags
that -march=native expands to on the i5 or with those it expands to on the
Xeon E5:

real 1.530s  user 1.528s  sys 0.000s i5,   no unrolling
real 1.483s  user 1.481s  sys 0.000s Xeon, no unrolling
real 0.937s  user 0.934s  sys 0.002s i5,   -funroll-loops
real 2.480s  user 2.478s  sys 0.000s Xeon, -funroll-loops
real 0.935s  user 0.934s  sys 0.000s Xeon, -funroll-loops --param max-unroll-times=7
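
To make the comparison easier to read, the "real" timings above can be expressed relative to the Xeon build without unrolling. The following is only a small sketch: the dictionary keys and the helper name `relative_to` are made up for illustration, and the numbers are copied from the table.

```python
# "real" wall-clock times in seconds, copied from the table above.
# All runs were executed on the Xeon E5-2630; the labels name the flag set used.
REAL_SECONDS = {
    "i5 flags, no unrolling": 1.530,
    "Xeon flags, no unrolling": 1.483,
    "i5 flags, -funroll-loops": 0.937,
    "Xeon flags, -funroll-loops": 2.480,
    "Xeon flags, -funroll-loops, max-unroll-times=7": 0.935,
}


def relative_to(baseline_label, timings):
    """Return each timing divided by the timing of the baseline entry."""
    baseline = timings[baseline_label]
    return {label: seconds / baseline for label, seconds in timings.items()}


ratios = relative_to("Xeon flags, no unrolling", REAL_SECONDS)

for label, ratio in ratios.items():
    print(f"{label:48s} {REAL_SECONDS[label]:.3f}s  {ratio:.2f}x")
```

A ratio above 1 means slower than the Xeon build without unrolling. With these numbers, only the Xeon flag set with plain -funroll-loops ends up above the baseline by a large margin (about 1.67x), while the other two unrolled builds come in at roughly 0.63x.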

The i5's -march=native expands into:
-march=core-avx-i -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed
-mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core-avx-i

The Xeon's -march=native expands into:
-march=corei7-avx -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase
-mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f
-mno-avx512er -mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=corei7-avx

Namely, the two expansions differ only in:
i5:   -march=core-avx-i -mrdrnd    -mf16c    -mfsgsbase
      --param l2-cache-size=6144  -mtune=core-avx-i
Xeon: -march=corei7-avx -mno-rdrnd -mno-f16c -mno-fsgsbase
      --param l2-cache-size=15360 -mtune=corei7-avx
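
The list of differences above can be cross-checked mechanically. The sketch below is one possible way to do it, not necessarily how the list was produced: it splits both option strings (copied from the expansions quoted earlier), keeps each `--param name=value` pair together, and reports the options that appear on only one side. The function names `parse_options` and `differing_options` are hypothetical.

```python
I5_FLAGS = """
-march=core-avx-i -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed
-mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er
-mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=6144 -mtune=core-avx-i
"""

XEON_FLAGS = """
-march=corei7-avx -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a
-mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma
-mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2
-msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase
-mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f
-mno-avx512er -mno-avx512cd -mno-avx512pf --param l1-cache-size=32 --param
l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=corei7-avx
"""


def parse_options(text):
    """Split an option string on whitespace, joining '--param' with its argument."""
    tokens = text.split()
    options = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "--param" and i + 1 < len(tokens):
            options.append("--param " + tokens[i + 1])
            i += 2
        else:
            options.append(tokens[i])
            i += 1
    return options


def differing_options(first, second):
    """Return (only in first, only in second), preserving the original order."""
    first_set, second_set = set(first), set(second)
    only_first = [opt for opt in first if opt not in second_set]
    only_second = [opt for opt in second if opt not in first_set]
    return only_first, only_second


only_i5, only_xeon = differing_options(
    parse_options(I5_FLAGS), parse_options(XEON_FLAGS)
)

print("i5:  ", " ".join(only_i5))
print("Xeon:", " ".join(only_xeon))
```

Joining `--param` with its argument matters here because the L1 parameters are identical on both machines; treating the bare `--param` token as an option of its own would hide nothing, but splitting the pairs would make the L2 difference harder to read.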

References:
- [Bug middle-end/58529] New: Loop 30% faster with Intel than with GCC
  - From: burnus at gcc dot gnu.org