| Summary: | Feature request: Implement align-loops attribute | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Yann Collet <yann.collet.73> |
| Component: | c | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | NEW | | |
| Severity: | normal | Keywords: | missed-optimization |
| Priority: | P3 | | |
| Version: | 4.8.4 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2015-09-04 00:00:00 |
Description
Yann Collet
2015-09-02 13:24:37 UTC
First of all, version 4.8.4 is not supported anymore. Do you see similar effects with 4.9.3 or 5.2? And secondly, if you could come up with a relatively small testcase that shows the issue, it would help analysis very much. You can use -fdump-ipa-inline to look at gcc's inline decisions in detail. Gcc also tries to limit code growth for the unit, which might be something you are seeing. You can also try -Winline. Also, -flto is becoming more prevalent and more popular, so you might want to give that a try as well, with 4.9 and above.

> Gcc also tries to limit code growth for the unit, which might be something you are seeing.

Yes, that could be the case. Is there some information available somewhere on such a unit-level limit? Specifically, I'm wondering if splitting the file into 2 would help. But since it's a fairly large and difficult task, I'm really looking for hints that it's the right solution before starting in that direction.

> you can use -fdump-ipa-inline to look at gcc's inline decisions in detail.
> You can also try -Winline

Sure, I will try them.

> Do you see similar effects with 4.9.3 or 5.2?

I have difficulties installing multiple versions of gcc on the same dev system. I will try again when I've got time. But anyway, that's not the sole issue: my users have the compiler they have, meaning I can't target only the latest version, since >90% of users won't have it. I don't intend to support gcc 1.2 either, but there is a middle ground to find. If I can have a solution which works with gcc 4.6 / 4.8, without relying on new features from 5.x, then it's a better solution.

Complementary information:
- -Winline: does not output anything (is that normal?)
- -fdump-ipa-inline: produces several large files, the interesting one being 1.5 MB long. That's a huge dump to analyze. Nonetheless, I had a deeper look directly at the function whose speed is affected. Looking at both slow and fast versions, I could spot *no difference* regarding inline decisions.
From what I can tell, the dump file seems strictly identical. (Note: there could be some differences somewhere else that I did not spot.)

Since then, it has also been suggested to me that this effect could be related to something else: instruction cache line alignment.

The issue seems in fact related to _instruction alignment_. More precisely, to the alignment of some critical loop. That's basically why adding some code to the file just "pushes" some other code into another position, potentially into a less favorable one (hence the appearance of "random impact").

The following GCC command-line option saved the day: -falign-loops=32

Note that -falign-loops=16 doesn't work. I suspect it might be the default value, but can't be sure. I also suspect that -falign-loops=32 is primarily useful for Broadwell CPUs.

Now, the problem is, `-falign-loops=32` is a gcc-only command-line parameter. It seems not possible to apply this optimization from within the source file, such as by using:

#pragma GCC optimize ("align-loops=32")

or, on the targeted function:

__attribute__((optimize("align-loops=32")))

Neither of these alternatives works.

(In reply to Yann Collet from comment #6)
> The issue seems in fact related to _instruction alignment_.
> More precisely, to alignment of some critical loop.
>
> That's basically why adding some code in the file would just "pushes" some
> other code into another position, potentially into a less favorable path
> (hence the appearance of "random impact").
>
>
> The following GCC command saved the day :
> -falign-loops=32
>
> Note that -falign-loops=16 doesn't work.
> I'm suspecting it might be the default value, but can't be sure.
> I'm also suspecting that -falign-loops=32 is primarily useful for Broadwell
> cpu.
Here are the default values (from gcc/config/i386/i386.c):

```c
/* Processor target table, indexed by processor number */
struct ptt
{
  const char *const name;               /* processor name */
  const struct processor_costs *cost;   /* Processor costs */
  const int align_loop;                 /* Default alignments. */
  const int align_loop_max_skip;
  const int align_jump;
  const int align_jump_max_skip;
  const int align_func;
};

/* This table must be in sync with enum processor_type in i386.h. */
static const struct ptt processor_target_table[PROCESSOR_max] =
{
  {"generic", &generic_cost, 16, 10, 16, 10, 16},
  {"i386", &i386_cost, 4, 3, 4, 3, 4},
  {"i486", &i486_cost, 16, 15, 16, 15, 16},
  {"pentium", &pentium_cost, 16, 7, 16, 7, 16},
  {"iamcu", &iamcu_cost, 16, 7, 16, 7, 16},
  {"pentiumpro", &pentiumpro_cost, 16, 15, 16, 10, 16},
  {"pentium4", &pentium4_cost, 0, 0, 0, 0, 0},
  {"nocona", &nocona_cost, 0, 0, 0, 0, 0},
  {"core2", &core_cost, 16, 10, 16, 10, 16},
  {"nehalem", &core_cost, 16, 10, 16, 10, 16},
  {"sandybridge", &core_cost, 16, 10, 16, 10, 16},
  {"haswell", &core_cost, 16, 10, 16, 10, 16},
  {"bonnell", &atom_cost, 16, 15, 16, 7, 16},
  {"silvermont", &slm_cost, 16, 15, 16, 7, 16},
  {"knl", &slm_cost, 16, 15, 16, 7, 16},
  {"intel", &intel_cost, 16, 15, 16, 7, 16},
  {"geode", &geode_cost, 0, 0, 0, 0, 0},
  {"k6", &k6_cost, 32, 7, 32, 7, 32},
  {"athlon", &athlon_cost, 16, 7, 16, 7, 16},
  {"k8", &k8_cost, 16, 7, 16, 7, 16},
  {"amdfam10", &amdfam10_cost, 32, 24, 32, 7, 32},
  {"bdver1", &bdver1_cost, 16, 10, 16, 7, 11},
  {"bdver2", &bdver2_cost, 16, 10, 16, 7, 11},
  {"bdver3", &bdver3_cost, 16, 10, 16, 7, 11},
  {"bdver4", &bdver4_cost, 16, 10, 16, 7, 11},
  {"btver1", &btver1_cost, 16, 10, 16, 7, 11},
  {"btver2", &btver2_cost, 16, 10, 16, 7, 11}
};
```

As you can see, only AMD's k6 and amdfam10 default to align_loop=32.
> Now, the problem is, `-falign-loops=32` is a gcc-only command line parameter.
> It seems not possible to apply this optimization from within the source file,
> such as using :
> #pragma GCC optimize ("align-loops=32")
> or the function targeted :
> __attribute__((optimize("align-loops=32")))
>
> None of these alternatives does work.

I don't think this makes much sense for a binary that should run on any X86 processor anyway. Optimizing for just one specific model will negatively affect performance on another. If you want maximal performance, you need to offer different binaries for different CPUs.

See also (for a similar issue): http://pzemtsov.github.io/2014/05/12/mystery-of-unstable-performance.html

Thanks for the link. It's a very good read, and indeed completely in line with my recent experience. The recommended solution seems to be the same: "-falign-loops=32".

The article also mentions that the issue is valid for Sandy Bridge CPUs. This broadens the scope: it's not just about Broadwell, but also Haswell, Ivy Bridge and Sandy Bridge, i.e. all new CPUs from Intel since 2011. It looks like a large enough installed base to care about.

However, for some reason, in the table provided, both Sandy Bridge and Haswell get a default loop alignment value of 16, not 32. Is there a reason for that choice?

> Optimizing for just one specific model will negatively affect performance on another.

Well, this issue is apparently important for more than one architecture. Moreover, being aligned on 32 implies being aligned on 16 too, so it doesn't introduce a drawback for older siblings.

Since then, I could find a few other complaints about the same issue.
One example here: https://software.intel.com/en-us/forums/topic/479392
and a close cousin here: http://stackoverflow.com/questions/9881002/is-this-a-gcc-bug-when-using-falign-loops-option

This last one introduces a good question: while it's possible to use "-falign-loops=32" to set the preference for the whole program, it seems not possible to set it precisely for a single loop.

It looks like a good feature request, as this loop-alignment issue can have a pretty large impact on performance (~20%), but only matters for a few selected critical loops. The programmer is typically in a good position to know which loops matter the most. Hence, we don't necessarily need *all* loops to be 32-byte aligned, just a handful of them.

Less precise but still great: having the ability to set this optimization parameter for a function or a code section would be welcome. But my experiments seem to show that using #pragma or __attribute__ with align-loops does not work, as if the optimization setting was simply ignored.

(In reply to Yann Collet from comment #8)
> However, for some reason, in the table provided, both Sandy Bridge and
> Haswell get a default loop alignment value of 16. not 32.
>
> Is there a reason for that choice ?

These values are normally straight out of the vendors' manuals. And there are also drawbacks to high alignment values.

> Less precise but still great, having the ability to set this optimization
> parameter for a function or a section code would be great. But my experiment
> seem to show that using #pragma or __attribute__ with align-loops does not
> work, as if the optimization setting was simply ignored.

Well, there already is an aligned attribute for functions, variables and fields; see: http://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes

> there already is an aligned attribute for functions, variables and fields,

Sure, but none of them is related to aligning the start of a hot instruction loop.
Aligning the function instead looks like a poor proxy.

> there are also drawbacks to high alignment values

Yes. I could verify that using -falign-loops=32 on a larger code base produces drawbacks: not just larger code size, but also worse speed. This makes it all the more relevant to have the ability to select which loops should be aligned, instead of relying on a single program-wide compilation flag.