Quantitative analysis of -Os vs -O3

Andrew Pinski pinskia@gmail.com
Sat Aug 26 08:56:00 GMT 2017

On Sat, Aug 26, 2017 at 1:23 AM, Michael Clark <michaeljclark@mac.com> wrote:
> Dear GCC folk,
> I have to say that GCC’s -Os caught me by surprise after several years using Apple GCC and more recently LLVM/Clang in Xcode. Over the last year and a half I have been working on RISC-V development and have been exclusively using GCC for RISC-V builds, and initially I was using -Os. After performing a qualitative/quantitative assessment I don’t believe GCC’s current -Os is particularly useful, at least for my needs, as it doesn’t provide a commensurate saving in size given the sometimes quite huge drop in performance.
> I’m quoting an extract from Eric’s earlier email on the Overwhelmed by GCC frustration thread, as I think Apple’s documentation, which presumably documents the Clang/LLVM -Os policy, describes what I would call an ideal -Os (perhaps using -O2 as a starting point), with the idea that the current -Os is renamed to -Oz.
>         -Oz
>                (APPLE ONLY) Optimize for size, regardless of performance. -Oz
>                enables the same optimization flags that -Os uses, but -Oz also
>                enables other optimizations intended solely to reduce code size.
>                In particular, instructions that encode into fewer bytes are
>                preferred over longer instructions that execute in fewer cycles.
>                -Oz on Darwin is very similar to -Os in FSF distributions of GCC.
>                -Oz employs the same inlining limits and avoids string instructions
>                just like -Os.
>         -Os
>                Optimize for size, but not at the expense of speed. -Os enables all
>                -O2 optimizations that do not typically increase code size.
>                However, instructions are chosen for best performance, regardless
>                of size. To optimize solely for size on Darwin, use -Oz (APPLE
>                ONLY).
> I have recently been working on a benchmark suite to test a RISC-V JIT engine. I have performed all testing using GCC 7.1 as the baseline compiler, and during the process I have collected several performance metrics, some of which are neutral to the JIT runtime environment. In particular I have made performance comparisons between -Os and -O3 on x86, along with capturing executable file sizes, dynamic retired instruction and micro-op counts for x86, dynamic retired instruction counts for RISC-V, as well as dynamic register and instruction usage histograms for RISC-V, for both -Os and -O3.
> See the Optimisation section for a charted performance comparison between -O3 and -Os. There are dozens of other plots that show the differences between -Os and -O3.
>         - https://rv8.io/bench
> The Geomean shows a 19% performance hit for -Os vs -O3 on x86. The Geomean of course smooths over some pathological cases where -Os performance is severely degraded versus -O3 without significant or commensurate savings in size.
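
To make the smoothing effect of a geomean concrete, here is a minimal Python sketch. The benchmark names and ratios are made-up illustration values, not measurements from the rv8.io/bench results, and the geomean() helper is hypothetical rather than part of any benchmark script mentioned in this thread.

```python
import math

def geomean(values):
    """Geometric mean of positive values, computed via logarithms."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-benchmark runtime ratios, time(-Os) / time(-O3).
# Made-up numbers for illustration only, NOT taken from rv8.io/bench.
ratios = {"bench_a": 1.0, "bench_b": 1.0, "bench_c": 1.0, "bench_d": 2.0}

overall = geomean(ratios.values())
worst = max(ratios, key=ratios.get)

print(f"geomean slowdown: {overall:.3f}x")
print(f"worst case: {worst} at {ratios[worst]:.2f}x")
```

With these made-up values the geomean comes out at roughly 1.19x even though one benchmark runs twice as slowly, which is why the per-benchmark numbers are worth reading alongside the summary figure.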

First let me put -Os usage into some perspective, with some history:
1) -Os is not useful for non-embedded users
2) the embedded folks really need the smallest code possible and
usually will be willing to afford the performance hit
3) -Os was a mistake for Apple to use in the first place; they used it,
and then GCC got better at using the string instructions on PowerPC,
which is why -Oz was added :)
4) -Os is used heavily by the arm/thumb2 folks in bare metal applications.

Comparing -O3 to -Os is not totally fair on x86 due to the many
different instructions and encodings.
Compare it on ARM/Thumb2 or MIPS/MIPS16 (or micromips) where size is a
big issue.
I will soon need to keep overall (bare-metal) application size down
to just 256k.
Micro-controllers are places where -Os matters the most.

> I don’t currently have -O2 in my results however it seems like I should add -O2 to the benchmark suite. If you take a look at the web page you’ll see that there is already a huge amount of data given we have captured dynamic register frequencies and dynamic instruction frequencies for -Os and -O3. The tables and charts are all generated by scripts so if there is interest I could add -O2. I can also pretty easily perform runs with new compiler versions as everything is completely automated. The biggest factor is that it currently takes 4 hours for a full run as we run all of the benchmarks in a simulator to capture dynamic register usage and dynamic instruction usage.
> After looking at the results, one has to question the utility of -Os in its present form, and indeed question how it is actually used in practice, given the proportion of savings in executable size. After my assessment I would not recommend anyone to use -Os because its savings in size are not proportionate to the loss in performance. I feel discouraged from using it after looking at the results. I really don’t believe -Os makes the right trade-offs, e.g. reducing icache pressure through reduced code size can indeed lead to better performance.

This comment does not help my application usage.  It rather hurts it
and goes against what -Os is really about.  It is not about reducing
icache pressure but overall application code size.  I really need the
code to fit into a specific size.


> I also wonder whether -O2 level optimisations may be a good starting point for a more useful -Os and how one would proceed towards selecting optimisations to add back to -Os to increase its usability, or rename the current -Os to -Oz and make -Os an alias for -O2. A similar profile to -O2 would probably produce less shock for anyone who does quantitative performance analysis of -Os.
> In fact there are some interesting issues for the RISC-V backend given the assembler performs RVC compression and GCC doesn’t really see the size of emitted instructions. It would be an interesting backend to investigate improving -Os presuming that a backend can opt in to various optimisations for a given optimisation level. RISC-V would gain most of its size and runtime icache pressure reduction improvements by getting the highest frequency registers allocated within the 8 register set that is accessible by the RVC instructions. Merely controlling register allocation to favour the RVC accessible registers would produce the largest savings in executable size, and may indeed be good for performance due to reduced icache pressure.
> I have Dynamic Register Frequency Charts but they are not presently labeled or coloured to show whether the registers are RVC accessible registers (x8 to x15). I did however work on some crude ASCII histograms that indicate register access frequency and whether the register is RVC accessible. Ideally the register allocator would allocate highest frequency registers first from the RVC set. The register order is already correctly defined in the RISC-V backend. I have been experimenting with riscv_register_priority to try to nudge LRA but have not yet had success. riscv_register_priority currently returns 1 for RVC registers (if the C extension is present) and 0 for regular registers; however, the loop frequency information is obviously not accurate enough or LRA does not completely honour the register order and priority. It’s likely it may not make a lot of difference on platforms with very regular register files. See this gist for one of the benchmarks’ register access frequency, labeled as to whether the register is accessible from compressed instructions:
> - https://gist.github.com/michaeljclark/8ba727e56084833e4f838c941eeca6be
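
The kind of ASCII histogram described above could be sketched roughly as follows. This is only an illustration in Python: the access counts are invented (not taken from the gist), the helper names rvc_share() and ascii_histogram() are hypothetical, and the only fact carried over from the text is that x8 to x15 are the RVC accessible registers.

```python
# RVC-accessible integer registers, per the description above (x8 to x15).
RVC_REGS = {f"x{n}" for n in range(8, 16)}

def rvc_share(freq):
    """Fraction of dynamic register accesses that hit RVC-accessible registers."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    return sum(n for reg, n in freq.items() if reg in RVC_REGS) / total

def ascii_histogram(freq, width=40):
    """One line per register, most frequent first; '*' marks RVC registers."""
    peak = max(freq.values(), default=0)
    lines = []
    for reg, n in sorted(freq.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * (n * width // peak if peak else 0)
        mark = "*" if reg in RVC_REGS else " "
        lines.append(f"{reg:>4}{mark} {n:>8} {bar}")
    return lines

# Hypothetical access counts, for illustration only (not from the gist).
freq = {"x10": 400, "x18": 300, "x8": 200, "x5": 100}

print("\n".join(ascii_histogram(freq)))
print(f"RVC share: {rvc_share(freq):.0%}")
```

A low RVC share for the hottest registers would suggest that the allocator is not favouring the compressed-accessible set, which is the effect the riscv_register_priority experiment was trying to change.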
> Question. Who uses -Os on GCC?
> I have for many years used -Os on macOS for Clang builds, as it has been an Xcode default, but I’m considering using -O2 instead of -Os with FSF GCC. I was using FSF GCC’s -Os under the mistaken impression that it operates similarly to -Os in Xcode. i.e. produces code that performs well.
> In any case, despite my rant, I hope the quantitative stats in the link above prove to be useful.
> Thanks and Regards,
> Michael.
