| Summary: | Feature request: Implement align-loops attribute | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Yann Collet <yann.collet.73> |
| Component: | c | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | NEW | | |
| Severity: | normal | Keywords: | missed-optimization |
| Priority: | P3 | | |
| Version: | 4.8.4 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2015-09-04 00:00:00 |
Description
Yann Collet
2015-09-02 13:24:37 UTC
First of all, version 4.8.4 is not supported anymore. Do you see similar effects with 4.9.3 or 5.2? And secondly, if you could come up with a relatively small testcase that shows the issue, it would help analysis very much. You can use -fdump-ipa-inline to look at gcc's inline decisions in detail. Gcc also tries to limit code growth for the unit, which might be something you are seeing. You can also try -Winline. Also, -flto is becoming more prevalent and more popular, so you might want to give that a try as well, with 4.9 and above.

> Gcc also tries to limit code growth for the unit, which might be something you are seeing.

Yes, that could be the case. Is there some information available somewhere on such a unit-level limit? Specifically, I'm wondering if splitting the file into 2 would help. But since it's a fairly large and difficult task, I'm really looking for hints that it's the right solution before starting in that direction.

> you can use -fdump-ipa-inline to look at gcc's inline decisions in detail.
> You can also try -Winline

Sure, I will try them.

> Do you see similar effects with 4.9.3 or 5.2?

I have difficulties installing multiple versions of gcc on the same dev system. I will try again when I've got time. But anyway, that's not the sole issue: my users have the compiler they have, meaning I can't target only the latest version, since >90% of users won't have it. I don't intend to support gcc 1.2 either, but there is a middle ground to find. If I can have a solution which works with gcc 4.6 / 4.8, without relying on new features from 5.x, then it's a better solution.

Complementary information:
- -Winline: does not output anything (is that normal?)
- -fdump-ipa-inline: produces several large files, the interesting one being 1.5 MB long. That's a huge dump to analyze. Nonetheless, I had a deeper look directly at the function whose speed is affected. Looking at both slow and fast versions, I could spot *no difference* regarding inline decisions.
From what I can tell, the dump file seems strictly identical. (Note: there could be some differences somewhere else that I did not spot.)

Since then, it has also been suggested to me that this effect could be related to something else: instruction cache line alignment.

The issue seems in fact related to _instruction alignment_. More precisely, to the alignment of some critical loop. That's basically why adding some code to the file just "pushes" some other code into another position, potentially into a less favorable one (hence the appearance of "random impact").

The following GCC command-line option saved the day: -falign-loops=32

Note that -falign-loops=16 doesn't work. I suspect it might be the default value, but can't be sure. I also suspect that -falign-loops=32 is primarily useful for Broadwell CPUs.

Now, the problem is, `-falign-loops=32` is a gcc-only command-line parameter. It seems not possible to apply this optimization from within the source file, such as by using:

#pragma GCC optimize ("align-loops=32")

or, on the targeted function:

__attribute__((optimize("align-loops=32")))

Neither of these alternatives works.

(In reply to Yann Collet from comment #6)
> The issue seems in fact related to _instruction alignment_.
> More precisely, to alignment of some critical loop.
>
> That's basically why adding some code in the file would just "pushes" some
> other code into another position, potentially into a less favorable path
> (hence the appearance of "random impact").
>
>
> The following GCC command saved the day :
> -falign-loops=32
>
> Note that -falign-loops=16 doesn't work.
> I'm suspecting it might be the default value, but can't be sure.
> I'm also suspecting that -falign-loops=32 is primarily useful for Broadwell
> cpu.
Here are the default values (from gcc/config/i386/i386.c):

```c
/* Processor target table, indexed by processor number */
struct ptt
{
  const char *const name;               /* processor name */
  const struct processor_costs *cost;   /* Processor costs */
  const int align_loop;                 /* Default alignments. */
  const int align_loop_max_skip;
  const int align_jump;
  const int align_jump_max_skip;
  const int align_func;
};

/* This table must be in sync with enum processor_type in i386.h. */
static const struct ptt processor_target_table[PROCESSOR_max] =
{
  {"generic", &generic_cost, 16, 10, 16, 10, 16},
  {"i386", &i386_cost, 4, 3, 4, 3, 4},
  {"i486", &i486_cost, 16, 15, 16, 15, 16},
  {"pentium", &pentium_cost, 16, 7, 16, 7, 16},
  {"iamcu", &iamcu_cost, 16, 7, 16, 7, 16},
  {"pentiumpro", &pentiumpro_cost, 16, 15, 16, 10, 16},
  {"pentium4", &pentium4_cost, 0, 0, 0, 0, 0},
  {"nocona", &nocona_cost, 0, 0, 0, 0, 0},
  {"core2", &core_cost, 16, 10, 16, 10, 16},
  {"nehalem", &core_cost, 16, 10, 16, 10, 16},
  {"sandybridge", &core_cost, 16, 10, 16, 10, 16},
  {"haswell", &core_cost, 16, 10, 16, 10, 16},
  {"bonnell", &atom_cost, 16, 15, 16, 7, 16},
  {"silvermont", &slm_cost, 16, 15, 16, 7, 16},
  {"knl", &slm_cost, 16, 15, 16, 7, 16},
  {"intel", &intel_cost, 16, 15, 16, 7, 16},
  {"geode", &geode_cost, 0, 0, 0, 0, 0},
  {"k6", &k6_cost, 32, 7, 32, 7, 32},
  {"athlon", &athlon_cost, 16, 7, 16, 7, 16},
  {"k8", &k8_cost, 16, 7, 16, 7, 16},
  {"amdfam10", &amdfam10_cost, 32, 24, 32, 7, 32},
  {"bdver1", &bdver1_cost, 16, 10, 16, 7, 11},
  {"bdver2", &bdver2_cost, 16, 10, 16, 7, 11},
  {"bdver3", &bdver3_cost, 16, 10, 16, 7, 11},
  {"bdver4", &bdver4_cost, 16, 10, 16, 7, 11},
  {"btver1", &btver1_cost, 16, 10, 16, 7, 11},
  {"btver2", &btver2_cost, 16, 10, 16, 7, 11}
};
```

As you can see, only AMD's k6 and amdfam10 default to align_loop=32.
> Now, the problem is, `-falign-loops=32` is a gcc-only command line parameter.
> It seems not possible to apply this optimization from within the source file,
> such as using :
> #pragma GCC optimize ("align-loops=32")
> or the function targeted :
> __attribute__((optimize("align-loops=32")))
>
> None of these alternatives does work.

I don't think this makes much sense for a binary that should run on any X86 processor anyway. Optimizing for just one specific model will negatively affect performance on another. If you want maximal performance, you need to offer different binaries for different CPUs.

See also (for a similar issue): http://pzemtsov.github.io/2014/05/12/mystery-of-unstable-performance.html

Thanks for the link. It's a very good read, and indeed completely in line with my recent experience. The recommended solution seems to be the same: "-falign-loops=32".

The article also mentions that the issue is valid for Sandy Bridge CPUs. This broadens the scope: it's not just about Broadwell, but also Haswell, Ivy Bridge and Sandy Bridge, i.e. all new CPUs from Intel since 2011. It looks like a large enough installed base to care about.

However, for some reason, in the table provided, both Sandy Bridge and Haswell get a default loop alignment value of 16, not 32. Is there a reason for that choice?

> Optimizing for just one specific model will negatively affect performance on another.

Well, this issue is apparently important for more than one architecture. Moreover, being aligned on 32 implies being aligned on 16 too, so it doesn't introduce a drawback for older siblings.

Since then, I could find a few other complaints about the same issue.
One example here: https://software.intel.com/en-us/forums/topic/479392
and a close cousin here: http://stackoverflow.com/questions/9881002/is-this-a-gcc-bug-when-using-falign-loops-option

This last one introduces a good question: while it's possible to use "-falign-loops=32" to set the preference for the whole program, it seems not possible to set it precisely for a single loop.

It looks like a good feature request, as this loop-alignment issue can have a pretty large impact on performance (~20%), but only matters for a few selected critical loops. The programmer is typically in a good position to know which loops matter the most. Hence, we don't necessarily need *all* loops to be 32-byte aligned, just a handful of them.

Less precise but still great: having the ability to set this optimization parameter for a function or a code section would be welcome. But my experiments seem to show that using #pragma or __attribute__ with align-loops does not work, as if the optimization setting was simply ignored.

(In reply to Yann Collet from comment #8)
> However, for some reason, in the table provided, both Sandy Bridge and
> Haswell get a default loop alignment value of 16. not 32.
>
> Is there a reason for that choice ?

These values are normally straight out of the vendors' manuals. And there are also drawbacks to high alignment values.

> Less precise but still great, having the ability to set this optimization
> parameter for a function or a section code would be great. But my experiment
> seem to show that using #pragma or __attribute__ with align-loops does not
> work, as if the optimization setting was simply ignored.

Well, there already is an aligned attribute for functions, variables and fields; see: http://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes

> there already is an aligned attribute for functions, variables and fields,

Sure, but none of them is related to aligning the start of a hot instruction loop.
Aligning the function instead looks like a poor proxy.

> there are also drawbacks to high alignment values

Yes. I could verify that using -falign-loops=32 on a larger code base produces drawbacks: not just larger code size, but also worse speed. This makes it all the more relevant to have the ability to select which loops should be aligned, instead of relying on a single program-wide compilation flag.