This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug rtl-optimization/67072] Slow code generated for getting each byte of a 64bit register as a LUT index.

From: "peter at cordes dot ca" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Fri, 31 Jul 2015 01:18:56 +0000
Subject: [Bug rtl-optimization/67072] Slow code generated for getting each byte of a 64bit register as a LUT index.
Auto-submitted: auto-generated
References: <bug-67072-4 at http dot gcc dot gnu dot org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67072

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
I just timed with Linux perf:

time taskset 0x04 perf stat \
  -e task-clock,cycles,instructions,r1b1,r10e,r2c2,r1c2,stalled-cycles-frontend,stalled-cycles-backend \
  ./rs-asmbench

My code averages 3.57 fused-domain uops / cycle (3x 1000 iters over a 1MiB
buffer).

gcc's code averages 3.10 fused-domain uops / cycle (3x 1000 iters over a 1MiB
buffer).

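So it's not just the extra mov uops slowing things down.  If they were the
only problem, uops per cycle would stay about the same and only the total uop
count would go up.  Either gcc's code isn't scheduled as well, or the extra
mov uops are taking up execution units and preventing the CPU from running
enough load/store uops to go beyond 3 uops per cycle.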
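
For anyone reading this comment without the rest of the thread, here is a
minimal C sketch of the kind of pattern the bug title describes: using each
byte of a 64-bit value as an index into a lookup table.  This is not the
rs-asmbench source (which isn't shown in this message); the function name
lut_xor_bytes and the XOR accumulation are assumptions made purely for
illustration.

```c
#include <stdint.h>

/* Hypothetical illustration only: XOR together one table entry per byte
 * of x.  Each iteration uses the current low byte as the LUT index and
 * then shifts the next byte into place. */
static uint8_t lut_xor_bytes(const uint8_t table[256], uint64_t x)
{
    uint8_t acc = 0;
    for (int i = 0; i < 8; i++) {
        acc ^= table[x & 0xff];  /* low byte of x selects the table entry */
        x >>= 8;                 /* bring the next byte down to bits 0..7 */
    }
    return acc;
}
```

The eight loads can be independent of each other, so how many mov/shift uops
the compiler spends isolating each byte is likely what matters for throughput
in a loop like this.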

References:
- [Bug rtl-optimization/67072] New: Slow code generated for getting each byte of a 64bit register as a LUT index.
  - From: peter at cordes dot ca
