Bug 84327 - Copy pasting the documented optimize flags is not equal to -O1
Status: RESOLVED INVALID
Alias: None
Product: gcc
Classification: Unclassified
Component: c++
Version: 8.0.1
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2018-02-12 10:52 UTC by xyzdr4gon333
Modified: 2018-02-12 13:05 UTC (History)
Attachments
Program showing that the single optimization flags don't work as well as O1 (1.69 KB, text/x-csrc)
2018-02-12 10:52 UTC, xyzdr4gon333
Description xyzdr4gon333 2018-02-12 10:52:31 UTC
Created attachment 43392
Program showing that the single optimization flags don't work as well as O1

My program's runtime drops from 20s to 5s when using -O1. So I wanted to know which optimization is responsible for that and used the optimization flags found here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

But even when copy-pasting ALL the flags listed there, up to -O3, I cannot reproduce the speedup to 5s!

See the attached file, which also contains code comments on how I compiled it.

This seems to be a very long-standing bug (5+ years):
https://stackoverflow.com/questions/12769173/selecting-gcc-optimisation-flags-equivalent-to-o1
https://stackoverflow.com/questions/20246357/gcc-using-o1-and-spelling-the-o1-options-out-leads-to-different-result-one-w

I also tried to find the difference using -Q --help=optimizers, which showed this diff:
    17d16
    <   -fdelayed-branch
    100a100
    >   -ftree-builtin-call-dce
But even when adding -ftree-builtin-call-dce I still don't get the same speedup. In fact, nothing changes:

    g++ "${O1Flags[@]}" -ftree-builtin-call-dce -std=c++11 optimizeFlags.cpp && ./a.out

Tested with:

    g++ (Debian 7.3.0-3) 7.3.0
    g++-8 (Debian 8-20180207-2) 8.0.1 20180207 (experimental) [trunk revision 257435]
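
The -Q --help=optimizers comparison described above can be scripted. The following is only a minimal sketch: the helper name `opt_flag_diff` and its argument order are made up for illustration and are not part of the report or of GCC. It assumes a POSIX shell with `mktemp`, `awk`, `sort` and `comm` available.

```shell
# Hypothetical helper (not from the report): print the optimizer flags that
# are [enabled] at <level-b> but not at <level-a>.
# Usage: opt_flag_diff <compiler> <level-a> <level-b>
opt_flag_diff() {
    cxx=$1; level_a=$2; level_b=$3
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    # -Q --help=optimizers lists every optimizer flag together with its state.
    "$cxx" "$level_a" -Q --help=optimizers | awk '/\[enabled\]/ { print $1 }' | sort > "$tmp_a"
    "$cxx" "$level_b" -Q --help=optimizers | awk '/\[enabled\]/ { print $1 }' | sort > "$tmp_b"
    # comm -13 keeps the lines that appear only in the second file.
    comm -13 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

For example, `opt_flag_diff g++ -O0 -O1` would list the flags that -O1 switches on. As the later comments in this thread explain, that list is only a starting point for `-fno-...` experiments on top of -O1; passing the flags without an -O level does not enable the passes.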
Comment 1 xyzdr4gon333 2018-02-12 11:23:09 UTC
This bug matters more for my actual real-life example, which becomes slower at -O2 than at -O1! In the previously attached file you only have to replace the `interleaveZeros` function with this one:

unsigned int interleaveTwoZeros( unsigned int n )
{
    n &= 0x000003ff;
    n = (n ^ (n << 16)) & 0xFF0000FF;
    n = (n ^ (n <<  8)) & 0x0300F00F;
    n = (n ^ (n <<  4)) & 0x030C30C3;
    n = (n ^ (n <<  2)) & 0x09249249;
    return n;
}

I.e. the only differences are slightly different constants, nothing else! The timings:

        1234567890 iterations took 19.151s and resulted in 806157809
    -O0 1234567890 iterations took 19.1547s and resulted in 1772082360
    -O1 1234567890 iterations took 5.69619s and resulted in 2085417644
    -O2 1234567890 iterations took 6.21504s and resulted in 32256352
    -O3 1234567890 iterations took 6.14414s and resulted in 357018037

Not sure if this is worth another bug. Can reproduce this for the following compiler versions:

for GPP in g++-4.9 g++-5 g++-6 g++-7 g++-8; do 
    $GPP --version | head -1
    for flag in '   ' -O0 -O1 -O2 -O3; do 
        echo -n "$flag "
        $GPP $flag -std=c++11 optimizeFlags.cpp && ./a.out
    done
done

    g++-4.9 (Debian 4.9.4-2) 4.9.4
        1234567890 iterations took 19.1979s and resulted in 1918993912
    -O0 1234567890 iterations took 19.1785s and resulted in 710267642
    -O1 1234567890 iterations took 5.6609s and resulted in 1898524753
    -O2 1234567890 iterations took 5.71375s and resulted in 1117037030
    -O3 1234567890 iterations took 5.67933s and resulted in 1451088646
    g++-5 (Debian 5.5.0-8) 5.5.0 20171010
        1234567890 iterations took 19.2387s and resulted in 999898210
    -O0 1234567890 iterations took 19.1464s and resulted in 1358121256
    -O1 1234567890 iterations took 5.64181s and resulted in 642760018
    -O2 1234567890 iterations took 5.65094s and resulted in 191105767
    -O3 1234567890 iterations took 5.68849s and resulted in 1555980094
    g++-6 (Debian 6.4.0-12) 6.4.0 20180123
        1234567890 iterations took 19.1786s and resulted in 1613186065
    -O0 1234567890 iterations took 19.2001s and resulted in 424276129
    -O1 1234567890 iterations took 5.73263s and resulted in 1828427433
    -O2 1234567890 iterations took 6.16005s and resulted in 814826690
    -O3 1234567890 iterations took 6.1438s and resulted in 867162058
    g++-7 (Debian 7.3.0-3) 7.3.0
        1234567890 iterations took 19.1302s and resulted in 1147954921
    -O0 1234567890 iterations took 19.1694s and resulted in 734785107
    -O1 1234567890 iterations took 5.72652s and resulted in 1133709951
    -O2 1234567890 iterations took 6.15633s and resulted in 352136223
    -O3 1234567890 iterations took 6.14089s and resulted in 1468150013
    g++-8 (Debian 8-20180207-2) 8.0.1 20180207 (experimental) [trunk revision 257435]
        1234567890 iterations took 19.1278s and resulted in 694826541
    -O0 1234567890 iterations took 19.1454s and resulted in 249938642
    -O1 1234567890 iterations took 5.72959s and resulted in 365780913
    -O2 1234567890 iterations took 6.20064s and resulted in 2033700921
    -O3 1234567890 iterations took 6.12829s and resulted in 1244532281

=> This looks like a regression since g++-6!

A mix of -O1 with the additional -O2 flags does reproduce the odd slowdown:

   g++ -O1 "${O2Flags[@]}" -std=c++11 optimizeFlags.cpp && ./a.out
     => 6.16161s

Actually by bisecting the additional O2-flags this can be traced down to -finline-small-functions ... I will open another bug for this.
Comment 2 Jonathan Wakely 2018-02-12 12:47:41 UTC
(In reply to xyzdr4gon333 from comment #0)
> This seems to be a very long-standing bug (5+ years):
> https://stackoverflow.com/questions/12769173/selecting-gcc-optimisation-flags-equivalent-to-o1
> https://stackoverflow.com/questions/20246357/gcc-using-o1-and-spelling-the-o1-options-out-leads-to-different-result-one-w

This is expected, not a bug:
https://gcc.gnu.org/wiki/FAQ#optimization-options
Comment 3 Richard Biener 2018-02-12 12:52:13 UTC
.
Comment 4 Jonathan Wakely 2018-02-12 12:56:56 UTC
(In reply to xyzdr4gon333 from comment #1)
> Actually by bisecting the additional O2-flags this can be traced down to
> -finline-small-functions ... I will open another bug for this.

I see you've opened Bug 84328 for this, so I'm closing this one because it's not a bug.
Comment 5 xyzdr4gon333 2018-02-12 12:59:44 UTC
Too bad. Before I take a longer look at the assembler code: any quick thoughts on which optimization, not available as a single option, could lead to the 4x speedup?
Comment 6 Jonathan Wakely 2018-02-12 13:03:42 UTC
I think you misunderstand. Listing all the individual -fxxx options without -O1 results in NO OPTIMIZATION. The difference you see is due to all the passes enabled by -O1, not just to passes that have no flag of their own.

As I wrote on stackoverflow:

If you don't use one of the -O1, -O2, -O3, -Ofast, or -Og optimization options (and not -O0) then no optimization happens at all, so adjusting which optimization passes are active doesn't do anything.

To find which optimization pass makes the difference you can turn on -O1 and then disable individual optimization passes until you find the one that makes a difference.

i.e. instead of gcc -fxxx -fyyy -fzzz ... use gcc -O1 -fno-xxx -fno-yyy -fno-zzz
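
The "-O1 minus one pass at a time" approach suggested here could be automated roughly as follows. This is a sketch under assumptions: the function names `negate_flag` and `bisect_o1` and the output file `./bisect.out` are hypothetical, and the list of flags to try has to be supplied by the caller. Flags that take a value (for example ones ending in `=...`) have no plain `-fno-` form and should be left out of the list.

```shell
# Hypothetical helper: turn -fxxx into -fno-xxx.
negate_flag() {
    printf '%s\n' "-fno-${1#-f}"
}

# Hypothetical bisection loop: build at -O1 with one pass disabled per run.
# Usage: bisect_o1 <compiler> <source-file> <flag>...
bisect_o1() {
    cxx=$1; src=$2; shift 2
    for flag in "$@"; do
        neg=$(negate_flag "$flag")
        echo "== -O1 $neg"
        # The test program prints its own timing, so just run the result.
        if "$cxx" -O1 "$neg" -std=c++11 "$src" -o ./bisect.out; then
            ./bisect.out
        else
            echo "   compile failed"
        fi
    done
}
```

A call such as `bisect_o1 g++ optimizeFlags.cpp -ftree-ccp -ftree-dce` would then print one timing per disabled pass; a run whose time jumps back towards the unoptimized figure points at a pass that matters. Several passes may each contribute part of the speedup, so a single run is not necessarily conclusive.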
Comment 7 xyzdr4gon333 2018-02-12 13:05:03 UTC
Ah, thank you very much! And sorry for misusing the bug tracker out of lack of knowledge.