This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug c/67435] Large performance drop on apparently unrelated changes (potential cause : critical loop instruction alignment)
- From: "trippels at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 03 Sep 2015 07:18:55 +0000
- Auto-submitted: auto-generated
- References: <bug-67435-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435
--- Comment #7 from Markus Trippelsdorf <trippels at gcc dot gnu.org> ---
(In reply to Yann Collet from comment #6)
> The issue seems in fact related to _instruction alignment_.
> More precisely, to alignment of some critical loop.
>
> That's basically why adding some code in the file just "pushes" some
> other code into another position, potentially into a less favorable path
> (hence the appearance of "random impact").
>
>
> The following GCC option saved the day:
> -falign-loops=32
>
> Note that -falign-loops=16 doesn't work.
> I suspect it might be the default value, but can't be sure.
> I also suspect that -falign-loops=32 is primarily useful for Broadwell
> CPUs.
Here are the default values (from gcc/config/i386/i386.c):
/* Processor target table, indexed by processor number */
struct ptt
{
  const char *const name;                 /* processor name */
  const struct processor_costs *cost;     /* Processor costs */
  const int align_loop;                   /* Default alignments.  */
  const int align_loop_max_skip;
  const int align_jump;
  const int align_jump_max_skip;
  const int align_func;
};

/* This table must be in sync with enum processor_type in i386.h.  */
static const struct ptt processor_target_table[PROCESSOR_max] =
{
  {"generic", &generic_cost, 16, 10, 16, 10, 16},
  {"i386", &i386_cost, 4, 3, 4, 3, 4},
  {"i486", &i486_cost, 16, 15, 16, 15, 16},
  {"pentium", &pentium_cost, 16, 7, 16, 7, 16},
  {"iamcu", &iamcu_cost, 16, 7, 16, 7, 16},
  {"pentiumpro", &pentiumpro_cost, 16, 15, 16, 10, 16},
  {"pentium4", &pentium4_cost, 0, 0, 0, 0, 0},
  {"nocona", &nocona_cost, 0, 0, 0, 0, 0},
  {"core2", &core_cost, 16, 10, 16, 10, 16},
  {"nehalem", &core_cost, 16, 10, 16, 10, 16},
  {"sandybridge", &core_cost, 16, 10, 16, 10, 16},
  {"haswell", &core_cost, 16, 10, 16, 10, 16},
  {"bonnell", &atom_cost, 16, 15, 16, 7, 16},
  {"silvermont", &slm_cost, 16, 15, 16, 7, 16},
  {"knl", &slm_cost, 16, 15, 16, 7, 16},
  {"intel", &intel_cost, 16, 15, 16, 7, 16},
  {"geode", &geode_cost, 0, 0, 0, 0, 0},
  {"k6", &k6_cost, 32, 7, 32, 7, 32},
  {"athlon", &athlon_cost, 16, 7, 16, 7, 16},
  {"k8", &k8_cost, 16, 7, 16, 7, 16},
  {"amdfam10", &amdfam10_cost, 32, 24, 32, 7, 32},
  {"bdver1", &bdver1_cost, 16, 10, 16, 7, 11},
  {"bdver2", &bdver2_cost, 16, 10, 16, 7, 11},
  {"bdver3", &bdver3_cost, 16, 10, 16, 7, 11},
  {"bdver4", &bdver4_cost, 16, 10, 16, 7, 11},
  {"btver1", &btver1_cost, 16, 10, 16, 7, 11},
  {"btver2", &btver2_cost, 16, 10, 16, 7, 11}
};
As you can see, only AMD's k6 and amdfam10 default to align_loop=32.
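
To make the align_loop and align_loop_max_skip columns concrete, here is a
minimal sketch of how such a pair could decide whether a loop head gets padded.
It assumes that max_skip is an inclusive limit on the number of padding bytes,
as with the assembler's .p2align directive; the exact rule GCC applies is not
stated in this thread. The function names are hypothetical and nothing below is
taken from GCC's sources. Table entries of 0 are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bytes of padding needed to move addr up to the next multiple of align.
   align must be a nonzero power of two. */
static unsigned padding_needed(uintptr_t addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (unsigned)((align - (addr & (align - 1))) & (align - 1));
}

/* ASSUMPTION: padding is inserted only when it does not exceed max_skip
   (inclusive), modelled on the assembler's .p2align directive.
   Returns true if the loop head ends up on an align-byte boundary. */
static bool loop_start_is_aligned(uintptr_t addr, unsigned align,
                                  unsigned max_skip)
{
    return padding_needed(addr, align) <= max_skip;
}
```

Under these assumptions, with the "generic" defaults (16, 10) a loop head at
0x1005 needs 11 bytes of padding and is left unaligned, while one at 0x1006
needs 10 bytes and is aligned.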
> Now, the problem is, `-falign-loops=32` is a gcc-only command line parameter.
> It does not seem possible to apply this optimization from within the source
> file, for example with:
> #pragma GCC optimize ("align-loops=32")
> or on the targeted function:
> __attribute__((optimize("align-loops=32")))
>
> Neither of these alternatives works.
I don't think this makes much sense for a binary that should run on
any x86 processor anyway. Optimizing for just one specific model will
negatively affect performance on others.
If you want maximal performance, you need to offer different binaries for
different CPUs.
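
For reference, the two source-level forms quoted above would look roughly like
this in a complete file. This is only a sketch of the syntax: sum_bytes and the
ALIGN_LOOPS_32 macro are hypothetical names, and according to the report
neither form changed the loop alignment, so the command-line option remains the
only approach known to work in this thread.

```c
#include <stddef.h>

#if defined(__GNUC__) && !defined(__clang__)
/* Form 1: pragma, meant to apply to the functions defined after it. */
#pragma GCC optimize ("align-loops=32")
/* Form 2: attribute, meant to apply to a single function. */
#define ALIGN_LOOPS_32 __attribute__((optimize("align-loops=32")))
#else
/* Other compilers: leave the request out entirely. */
#define ALIGN_LOOPS_32
#endif

/* Hypothetical hot function; the for loop is the one whose alignment
   the reporter wanted to control. */
ALIGN_LOOPS_32
unsigned long sum_bytes(const unsigned char *p, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}
```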
See also (for a similar issue):
http://pzemtsov.github.io/2014/05/12/mystery-of-unstable-performance.html