Bug 114532 - gcc -fno-common option causes performance degradation on certain architectures
Summary: gcc -fno-common option causes performance degradation on certain architectures
Status: WAITING
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 10.3.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2024-03-30 09:47 UTC by huyubiao
Modified: 2024-06-05 09:30 UTC
CC List: 8 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-03-30 00:00:00


Description huyubiao 2024-03-30 09:47:07 UTC
gcc 10.3 changed the default from -fcommon to -fno-common, which led to a performance decrease of about 7% on the UnixBench dhry2reg subtest (CPU: Intel Sapphire Rapids).

We found that L1-dcache-load-misses increased under -fno-common compared with -fcommon, which we suspect is the cause of the performance decrease.
The L1 dcache is 48 KB per core on Intel Sapphire Rapids (32 KB instruction + 48 KB data).

We compared the binary layout of the executables generated under -fcommon and -fno-common and found that the offsets of the XXX_Time and XXX_Glob global variables in the .bss section increased by about 10 KB under -fno-common.
Perhaps this larger gap in .bss affects the probability of L1-dcache load misses.

I am not sure about this conclusion; is there any way to verify our conjecture?
Comment 1 Andrew Pinski 2024-03-30 18:48:19 UTC
```
Rec_Pointer     Ptr_Glob,
                Next_Ptr_Glob;
int             Int_Glob;
Boolean         Bool_Glob;
char            Ch_1_Glob,
                Ch_2_Glob;
int             Arr_1_Glob [50];
int             Arr_2_Glob [50] [50];
```

Maybe the order of these changed in the layout of the final executable.
In the case of -fcommon, the layout of these is handled by the linker, while with -fno-common the compiler places them into specific sections in the assembly (and the sections are then combined/laid out by the linker).

So maybe look at the linker map and compare it to what GCC does with -fno-common in the .s file.
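
As a minimal sketch of that difference (illustrative only, not taken from the benchmark sources; the file and function names are made up):

```
/* tentative.c - compile with: gcc -O2 -S tentative.c [-fcommon | -fno-common] */

int Int_Glob;           /* tentative definition: with -fcommon this becomes a
                           .comm symbol whose final position is chosen by the
                           linker; with -fno-common the compiler allocates it
                           in this object's .bss                              */
int Arr_1_Glob[50];     /* likewise                                           */
int Initialised = 42;   /* initialised definition: always .data, never common */

int use(void)           /* reads the globals so the addressing code shows up  */
{
    return Int_Glob + Arr_1_Glob[0] + Initialised;
}
```

Comparing the two .s outputs against `nm -n` (symbols sorted by address) on the final executable should show which placement decisions come from the compiler and which from the linker.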
Comment 2 Zhaohaifeng 2024-06-04 01:38:39 UTC
(In reply to Andrew Pinski from comment #1)
> ```
> Rec_Pointer     Ptr_Glob,
>                 Next_Ptr_Glob;
> int             Int_Glob;
> Boolean         Bool_Glob;
> char            Ch_1_Glob,
>                 Ch_2_Glob;
> int             Arr_1_Glob [50];
> int             Arr_2_Glob [50] [50];
> ```
> 
> Maybe the order of these changed in the layout of the final executable.
> In the case of -fcommon, the layout of these is handled by the linker, while
> with -fno-common the compiler places them into specific sections in the
> assembly (and the sections are then combined/laid out by the linker).
> 
> So maybe look at the linker map and compare it to what GCC does with
> -fno-common in the .s file.

Some test results:
1. With gcc 10.3, the variables are arranged in reverse source order, from the last (Dhrystones_Per_Second) to the first (Ptr_Glob), both in the .s file and in the final binary. If we change the order of the variables in the source code, the order in the assembly and the binary changes accordingly.

2. With gcc 8.5, the variables are arranged in a particular order of their own, both in the assembly and in the final binary. If the variable order is changed in the source code, the order in the assembly and the binary is NOT changed.

Do we expect the -fcommon option to do some performance optimization? How does -fcommon arrange the variables?
Comment 3 Sam James 2024-06-04 01:41:32 UTC
As David said at https://inbox.sourceware.org/gcc/d92b8262-5835-a7b1-363d-709724a8dc5b@hesbynett.no/, I'd expect it to improve performance in that we're no longer pessimising by forcing them to be common, but AFAIK there's no specific optimisation otherwise.

http://kristerw.blogspot.com/2016/11/tentative-variable-definitions-and-fno.html is a nice writeup about it.
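
The kind of error being avoided is roughly this (a hypothetical two-file sketch, not real code from anywhere):

```
/* file1.c */
int counter;        /* tentative definition */

/* file2.c - a separate translation unit */
int counter;        /* same name, intended as an unrelated variable */

/* With -fcommon the linker silently merges both into one common symbol, so
   the two files overwrite each other's data; with -fno-common (the default
   since gcc 10) the link fails with a multiple-definition error and the
   clash is caught at build time.                                            */
```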
Comment 4 David Brown 2024-06-04 11:19:02 UTC
I'm not personally particularly interested in performance on x86 systems - my work is in embedded microcontroller programming.  But I did push for "-fno-common" to be the default in gcc because "-fno-common" greatly reduces the risk of some kinds of errors in code.

I've tried fiddling around a bit with different gcc targets and options on godbolt.org :

<https://godbolt.org/z/KqxKqeKbK>

It's easy to see the difference between common symbols and non-common symbols by using "-fcommon" and comparing non-initialised externally linked objects with initialised ones (since these are never common).  It seems that for some targets (like x86-64), there is no "-fsection-anchors" support at all.  For some (like mips), you can choose it explicitly.  And for some (like ARM 32-bit and 64-bit), it is automatic when optimising.  I assume section anchors can be a gain for some targets, but not so much for others.

So certainly "-fsection-anchors" will not be a help for x86-64, since that target does not support section anchors.  (And for targets that /do/ support them, such as ARM, it's important not to enable -fdata-sections since that blocks the anchors.)
Comment 5 Zhaohaifeng 2024-06-05 04:12:44 UTC
(In reply to David Brown from comment #4)
> I'm not personally particularly interested in performance on x86 systems -
> my work is in embedded microcontroller programming.  But I did push for
> "-fno-common" to be the default in gcc because "-fcommon" greatly reduces
> the risk of some kinds of errors in code.
> 
> I've tried fiddling around a bit with different gcc targets and options on
> godbolt.org :
> 
> <https://godbolt.org/z/KqxKqeKbK>
> 
> It's easy to see the difference between common symbols and non-common
> symbols by using "-fcommon" and comparing non-initialised externally linked
> objects with initialised ones (since these are never common).  It seems that
> for some targets (like x86-64), there is no "-fsection-anchors" support at
> all.  For some (like mips), you can choose it explicitly.  And for some
> (like ARM 32-bit and 64-bit), it is automatic when optimising.  I assume
> section anchors can be a gain for some targets, but not so much for others.
> 
> So certainly "-fsection-anchors" will not be a help for x86-64, since that
> target does not support section anchors.  (And for targets that /do/ support
> them, such as ARM, it's important not to enable -fdata-sections since that
> blocks the anchors.)

Does gcc implement -fsection-anchors-like functionality in the -fcommon option for x86? In general, gcc should have some similar feature for x86 and ARM.
Comment 6 Xi Ruoyao 2024-06-05 05:46:34 UTC
(In reply to Zhaohaifeng from comment #5)

> Does gcc implement -fsection-anchors-like functionality in the -fcommon
> option for x86? In general, gcc should have some similar feature for x86 and ARM.

AFAIK it's not very useful for CISC architectures supporting variable-length fancy memory operands.
Comment 7 David Brown 2024-06-05 08:24:50 UTC
(In reply to Xi Ruoyao from comment #6)
> (In reply to Zhaohaifeng from comment #5)
> 
> > Does gcc implement -fsection-anchors-like functionality in the -fcommon
> > option for x86? In general, gcc should have some similar feature for x86 and ARM.
> 

AFAIK, -fsection-anchors and -fcommon / -fno-common are completely independent.  But section anchors cannot work with "common" symbols, no matter what architecture, because at compile time the compiler does not know the order of allocation of the common symbols.  It /does/ know the order of allocation of symbols defined in the current translation unit, such as initialised data, -fno-common zero initialised data, and static data.  This information can be used with section anchors and also with other optimisations based on the relative positions of objects.

> AFAIK it's not very useful for CISC architectures supporting variable-length
> fancy memory operands.

That seems strange to me.  But I know very little about how targets such as x86-64 work for global data that might be complicated with load-time or run-time linking - my experience and understanding is all with statically linked binaries.

It seems, from my brief testing, that for the x86-64 target, the compiler does not do any optimisations based on the relative positions of data defined in a unit (whether initialised, non-common bss, or static).  For targets such as the ARM, gcc can optimise as though the individual variables were fields in a struct where it knows the relative positions.  I don't see any reason why x86-64 should not benefit from some of these, though I realise that scheduling and out-of-order execution will mean some apparent optimisations would be counter-productive.  Maybe there is some kind of address space layout randomisation that is playing a role here?


Anyway, I cannot see any reason why -fno-common should result in the slower run-times the OP saw (though I have only looked at current gcc versions).  I haven't seen any differences in the code generated for -fcommon and -fno-common on the x86-64.  And my experience on other targets is that -fno-common allows optimisations that cannot be done with -fcommon, thus giving faster code.

I have not, however, seen the OP's real code - I've just made small tests.
Comment 8 Zhaohaifeng 2024-06-05 08:46:07 UTC
(In reply to David Brown from comment #7)
> (In reply to Xi Ruoyao from comment #6)
> > (In reply to Zhaohaifeng from comment #5)
> > 
> > > Does gcc implement -fsection-anchors-like functionality in the -fcommon
> > > option for x86? In general, gcc should have some similar feature for x86 and ARM.
> > 
> 
> AFAIK, -fsection-anchors and -fcommon / -fno-common are completely
> independent.  But section anchors cannot work with "common" symbols, no
> matter what architecture, because at compile time the compiler does not know
> the order of allocation of the common symbols.  It /does/ know the order of
> allocation of symbols defined in the current translation unit, such as
> initialised data, -fno-common zero initialised data, and static data.  This
> information can be used with section anchors and also with other
> optimisations based on the relative positions of objects.
> 
> > AFAIK it's not very useful for CISC architectures supporting variable-length
> > fancy memory operands.
> 
> That seems strange to me.  But I know very little about how targets such as
> x86-64 work for global data that might be complicated with load-time or
> run-time linking - my experience and understanding is all with statically
> linked binaries.
> 
> It seems, from my brief testing, that for the x86-64 target, the compiler
> does not do any optimisations based on the relative positions of data
> defined in a unit (whether initialised, non-common bss, or static).  For
> targets such as the ARM, gcc can optimise as though the individual variables
> were fields in a struct where it knows the relative positions.  I don't see
> any reason why x86-64 should not benefit from some of these, though I
> realise that scheduling and out-of-order execution will mean some apparent
> optimisations would be counter-productive.  Maybe there is some kind of
> address space layout randomisation that is playing a role here?
> 
> 
> Anyway, I cannot see any reason why -fno-common should result in the
> slower run-times the OP saw (though I have only looked at current gcc
> versions).  I haven't seen any differences in the code generated for
> -fcommon and -fno-common on the x86-64.  And my experience on other targets
> is that -fno-common allows optimisations that cannot be done with -fcommon,
> thus giving faster code.
> 
> I have not, however, seen the OP's real code - I've just made small tests.

The difference in the generated output between -fcommon and -fno-common is just the order of the global variables in memory.

-fcommon gives the following (a particular order of its own):
stderr@GLIBC_2.2.5
completed.0
Begin_Time
Arr_2_Glob
Ch_2_Glob
Run_Index
Microseconds
Ptr_Glob
Dhrystones_Per_Second
End_Time
Int_Glob
Bool_Glob
User_Time
Next_Ptr_Glob
Arr_1_Glob
Ch_1_Glob

-fno-common gives the following (the reverse of the source-code order):
stderr@GLIBC_2.2.5
completed.0
Dhrystones_Per_Second
Microseconds
User_Time
End_Time
Begin_Time
Reg
Arr_2_Glob
Arr_1_Glob
Ch_2_Glob
Ch_1_Glob
Bool_Glob
Int_Glob
Next_Ptr_Glob
Ptr_Glob
Run_Index
Comment 9 Xi Ruoyao 2024-06-05 09:26:58 UTC
Then will -fno-toplevel-reorder help?
Comment 10 Xi Ruoyao 2024-06-05 09:29:35 UTC
Anyway, if you really require a specific order for some data, you need to either use -fno-toplevel-reorder or group the data explicitly with a struct or a linker script.

Relying on any implicit behavior like that of -fcommon is fragile; it may "break" if the compiler or the linker changes.
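
For example (a sketch only - the struct name and the choice of members are arbitrary, not a proposed change to the benchmark):

```
/* Either compile the original file with -fno-toplevel-reorder to keep the
   source order of the top-level definitions, or fix the layout explicitly: */

struct hot_globals {            /* relative layout fixed by the struct itself */
    int  Int_Glob;
    char Ch_1_Glob;
    char Ch_2_Glob;
    int  Arr_1_Glob[50];
    int  Arr_2_Glob[50][50];
};

struct hot_globals G;           /* one definition controls order and adjacency,
                                   independently of -fcommon / -fno-common     */
```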
Comment 11 David Brown 2024-06-05 09:30:50 UTC
(In reply to Zhaohaifeng from comment #8)
> (In reply to David Brown from comment #7)
> > (In reply to Xi Ruoyao from comment #6)

> > Anyway, I cannot see any reason why -fno-common should result in the
> > slower run-times the OP saw (though I have only looked at current gcc
> > versions).  I haven't seen any differences in the code generated for
> > -fcommon and -fno-common on the x86-64.  And my experience on other targets
> > is that -fno-common allows optimisations that cannot be done with -fcommon,
> > thus giving faster code.
> > 
> > I have not, however, seen the OP's real code - I've just made small tests.
> 
> The difference in the generated output between -fcommon and -fno-common is
> just the order of the global variables in memory.
> 
> -fcommon gives the following (a particular order of its own):
> stderr@GLIBC_2.2.5
> completed.0
> Begin_Time
...
> -fno-common gives the following (the reverse of the source-code order):
> stderr@GLIBC_2.2.5
> completed.0
> Dhrystones_Per_Second
> Microseconds
> User_Time
...

A change in the order is not unexpected.  But it is hard to believe that this would make as big a difference to the speed of the code as you describe - it would have to involve particularly unlucky cache issues.

On the x86-64, defined variables appear to be allocated in the reverse order from the source code unless there are overriding reasons to change that.  I don't know why that is the case.  You can avoid this by using the "-fno-toplevel-reorder" switch.  I don't know how common variables are allocated - that may depend on ordering in the code, or linker scripts, or declarations in headers.

I have no idea about your program, but one situation where the details of memory layout can have a big effect is if you have multiple threads and nominally independent data used by different threads happens to share a cache line (false sharing).  Access patterns to arrays and structs can also have different effects depending on the alignment of the data to cache lines.

So you might try "-fno-toplevel-reorder" to have tighter control of the ordering.  It may also be worth adding cacheline-sized _Alignas specifiers to some objects, particularly bigger or critical structs or arrays.  (If you are using a C standard prior to C11, gcc's __attribute__((aligned(XXX))) can be used.)
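
For example (a sketch assuming a 64-byte cache line, which is typical for current x86-64 parts but should be confirmed for the actual CPU):

```
#include <stdalign.h>                  /* C11 alignas                          */

#define CACHELINE 64                   /* assumed line size - check the target */

/* C11 or later: */
alignas(CACHELINE) int Arr_2_Glob[50][50];

/* Pre-C11, with the GCC extension: */
int Arr_1_Glob[50] __attribute__((aligned(CACHELINE)));
```

That guarantees accesses to each array start on a cache-line boundary, regardless of where the linker ends up placing the neighbouring symbols.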