This is the mail archive of the gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Suboptimal bb ordering with -Os on arm
- From: Nicolai Stange <nicstange at gmail dot com>
- To: Segher Boessenkool <segher at kernel dot crashing dot org>
- Cc: Nicolai Stange <nicstange at gmail dot com>, Andi Kleen <andi at firstfloor dot org>, gcc at gcc dot gnu dot org
- Date: Fri, 11 Nov 2016 02:16:18 +0100
- Subject: Re: Suboptimal bb ordering with -Os on arm
- Authentication-results: sourceware.org; auth=none
- References: <email@example.com> <20161110233405.GB17570@gate.crashing.org>
Thanks for your prompt reply!
Segher Boessenkool <firstname.lastname@example.org> writes:
> On Fri, Nov 11, 2016 at 12:03:44AM +0100, Nicolai Stange wrote:
>> in the course of doing some benchmarks on arm with -Os, I noticed that
>> some list traversal code became significantly slower since gcc 5.3 when
>> instruction caches are cold.
> But is it smaller? This tiny example function is not, but on average?
The Linux kernel's .text for the config I have at hand is smaller by ~0.1%
with simple than with stc.
I gave this tiny example only to demonstrate the bb ordering issue I was
talking about. Of course, it's made up. So in particular it was not
meant to show anything related to code size.
> If you care about speed instead of size, you should not use -Os.
>> That being said, I could certainly go and submit a patch to the Linux
>> kernel setting -freorder-blocks-algorithm=stc for the -Os case.
> Or do not set CONFIG_CC_OPTIMIZE_FOR_SIZE in your kernel config.
Yes, of course.
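For reference, the kernel-side patch I had in mind would be on the order of this Makefile fragment (a sketch only; the cc-option guard is there because the flag only exists from GCC 5 on):

```makefile
# Hypothetical hunk for the top-level Makefile's
# CONFIG_CC_OPTIMIZE_FOR_SIZE branch: keep -Os but ask for the STC
# block layout, guarded so older compilers still build.
ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE
KBUILD_CFLAGS += -Os $(call cc-option,-freorder-blocks-algorithm=stc)
endif
```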
>> From the discussion on gcc-patches of what is now the aforementioned
>> r228318 ("bb-reorder: Add -freorder-blocks-algorithm= and wire it up"),
>> it is not clear to me whether this change can actually reduce code size
>> beyond those 0.1% given there for -Os.
> There is r228692 as well.
Ok, summarizing, that changelog says that the simple algorithm
potentially produced even bigger code with -Os than stc did. From that
commit on, this remains true only on x86 and mn10300. Right?
>> So first question:
>> Do you guys know of any code where there are more significant code size
>> savings achieved?
> For -O2 it is ~15%, which matters a lot for targets where STC isn't faster
> at all (targets without cache / with tiny cache / with only cache memory).
If I understand you correctly, this means that there is a use case for
having -O2 -freorder-blocks-algorithm=simple, right?
My question is about whether switching the default algorithm for -Os
might make sense, cf. below.
>> And second question:
>> If that isn't the case, would it possibly make sense to partly revert
>> gcc's behaviour and set -freorder-blocks-algorithm=stc at -Os?
> -Os does many other things that are slower but smaller as well.
Sure. Let me restate my original question: assume for a moment that it
is true that -Os with simple never produces code smaller than 0.1% of
what is created by -Os with stc. I have no idea what the "other
things" are able to achieve w.r.t. code size savings, but to me, 0.1%
doesn't appear to be that huge. Don't get me wrong: I *really* can't
judge on whether 0.1% is a significant improvement or not. I'm just
assuming that it's not. With this assumption, the question of whether
those saved 0.1% are really worth the significantly decreased
performance encountered in some situations seemed just natural...
> There is no way to ask for somewhat fast and somewhat small at the
> same time, which seems to be what you want?
No, I want small, possibly at the cost of performance to the extent of
what's sensible. What "sensible" actually means is what my question is about.
Example: A (hypothetical) code size saving of 0.00000000001% at the cost
of 10000000000x slower code certainly isn't. But 0.1% at the cost of
some additional 0.5us here and there -- no clue.
So, summarizing, I'm not asking whether I should use -O2 or -Os or
whatever, but whether the current behaviour I'm seeing with -Os is
intended.