84327 – Copy pasting the documented optimize flags is not equal to -O1

Status:	RESOLVED INVALID

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	c++
Version:	8.0.1

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

Reported:	2018-02-12 10:52 UTC by xyzdr4gon333
Modified:	2018-02-12 13:05 UTC
CC List:	0 users

Attachments
Program showing that the single optimization flags don't work as well as O1 (1.69 KB, text/x-csrc) 2018-02-12 10:52 UTC, xyzdr4gon333

Description xyzdr4gon333 2018-02-12 10:52:31 UTC

Created attachment 43392
Program showing that the single optimization flags don't work as well as O1

My program's runtime drops from 20s to 5s when compiled with -O1. I wanted to know which optimization is responsible for that, so I used the optimization flags listed here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

But even when copy-pasting ALL the flags listed there, up to -O3, I cannot reproduce the speedup to 5s!

See the attached file, which also contains code comments on how I compiled it.

This seems to be a very long-standing bug (5+ years):
https://stackoverflow.com/questions/12769173/selecting-gcc-optimisation-flags-equivalent-to-o1
https://stackoverflow.com/questions/20246357/gcc-using-o1-and-spelling-the-o1-options-out-leads-to-different-result-one-w

I also tried to find the difference using -Q --help=optimizers, which showed this diff:
    17d16
    <   -fdelayed-branch
    100a100
    >   -ftree-builtin-call-dce
Even when adding -ftree-builtin-call-dce I still don't get the same speedup ?!?! In fact nothing changes...

    g++ "${O1Flags[@]}" -ftree-builtin-call-dce -std=c++11 optimizeFlags.cpp && ./a.out
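A diff like the one above can be produced by reducing each `-Q --help=optimizers` listing to the names of the flags it reports as `[enabled]` and comparing the results. The following is only a sketch: the helper name `enabled_flags` and the output file names are made up for illustration, and it is not necessarily how the diff in this report was generated.

```shell
#!/bin/sh
# Hypothetical helper: read `g++ -Q --help=optimizers` output on stdin and
# print only the names of the flags whose last column is [enabled], sorted.
enabled_flags() {
    awk '$NF == "[enabled]" { print $1 }' | sort
}

# Typical use (needs g++ on PATH; O1Flags is the bash array from the report):
#   g++ -O1 -Q --help=optimizers | enabled_flags > with-O1.txt
#   g++ "${O1Flags[@]}" -Q --help=optimizers | enabled_flags > spelled-out.txt
#   diff spelled-out.txt with-O1.txt
```

Note that such a listing only shows which flags are set. It does not show whether the optimizers actually run, which also depends on the -O level.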

Tested with:

    g++ (Debian 7.3.0-3) 7.3.0
    g++-8 (Debian 8-20180207-2) 8.0.1 20180207 (experimental) [trunk revision 257435]

Comment 1 xyzdr4gon333 2018-02-12 11:23:09 UTC

This bug matters more for my actual real-life example, which is slower at -O2 than at -O1! To reproduce it, you only have to replace the `interleaveZeros` function in the earlier attached file with this one:

// Spreads the low 10 bits of n so that two zero bits separate each original bit.
unsigned int interleaveTwoZeros( unsigned int n )
{
    n &= 0x000003ff;
    n = (n ^ (n << 16)) & 0xFF0000FF;
    n = (n ^ (n <<  8)) & 0x0300F00F;
    n = (n ^ (n <<  4)) & 0x030C30C3;
    n = (n ^ (n <<  2)) & 0x09249249;
    return n;
}

I.e. the only differences are slightly different constants, nothing else! The timings:

        1234567890 iterations took 19.151s and resulted in 806157809
    -O0 1234567890 iterations took 19.1547s and resulted in 1772082360
    -O1 1234567890 iterations took 5.69619s and resulted in 2085417644
    -O2 1234567890 iterations took 6.21504s and resulted in 32256352
    -O3 1234567890 iterations took 6.14414s and resulted in 357018037

Not sure if this is worth another bug. I tested this with the following compiler versions:

for GPP in g++-4.9 g++-5 g++-6 g++-7 g++-8; do 
    $GPP --version | head -1
    for flag in '   ' -O0 -O1 -O2 -O3; do 
        echo -n "$flag "
        $GPP $flag -std=c++11 optimizeFlags.cpp && ./a.out
    done
done

    g++-4.9 (Debian 4.9.4-2) 4.9.4
        1234567890 iterations took 19.1979s and resulted in 1918993912
    -O0 1234567890 iterations took 19.1785s and resulted in 710267642
    -O1 1234567890 iterations took 5.6609s and resulted in 1898524753
    -O2 1234567890 iterations took 5.71375s and resulted in 1117037030
    -O3 1234567890 iterations took 5.67933s and resulted in 1451088646
    g++-5 (Debian 5.5.0-8) 5.5.0 20171010
        1234567890 iterations took 19.2387s and resulted in 999898210
    -O0 1234567890 iterations took 19.1464s and resulted in 1358121256
    -O1 1234567890 iterations took 5.64181s and resulted in 642760018
    -O2 1234567890 iterations took 5.65094s and resulted in 191105767
    -O3 1234567890 iterations took 5.68849s and resulted in 1555980094
    g++-6 (Debian 6.4.0-12) 6.4.0 20180123
        1234567890 iterations took 19.1786s and resulted in 1613186065
    -O0 1234567890 iterations took 19.2001s and resulted in 424276129
    -O1 1234567890 iterations took 5.73263s and resulted in 1828427433
    -O2 1234567890 iterations took 6.16005s and resulted in 814826690
    -O3 1234567890 iterations took 6.1438s and resulted in 867162058
    g++-7 (Debian 7.3.0-3) 7.3.0
        1234567890 iterations took 19.1302s and resulted in 1147954921
    -O0 1234567890 iterations took 19.1694s and resulted in 734785107
    -O1 1234567890 iterations took 5.72652s and resulted in 1133709951
    -O2 1234567890 iterations took 6.15633s and resulted in 352136223
    -O3 1234567890 iterations took 6.14089s and resulted in 1468150013
    g++-8 (Debian 8-20180207-2) 8.0.1 20180207 (experimental) [trunk revision 257435]
        1234567890 iterations took 19.1278s and resulted in 694826541
    -O0 1234567890 iterations took 19.1454s and resulted in 249938642
    -O1 1234567890 iterations took 5.72959s and resulted in 365780913
    -O2 1234567890 iterations took 6.20064s and resulted in 2033700921
    -O3 1234567890 iterations took 6.12829s and resulted in 1244532281

=> this looks like a regression since g++ 6!

Combining -O1 with the additional -O2 flags does reproduce the weird slowdown!

   g++ -O1 "${O2Flags[@]}" -std=c++11 optimizeFlags.cpp && ./a.out
     => 6.16161s

Actually by bisecting the additional O2-flags this can be traced down to -finline-small-functions ... I will open another bug for this.

Comment 2 Jonathan Wakely 2018-02-12 12:47:41 UTC

(In reply to xyzdr4gon333 from comment #0)
> This seems to be a very long-standing bug (5+ years):
> https://stackoverflow.com/questions/12769173/selecting-gcc-optimisation-flags-equivalent-to-o1
> https://stackoverflow.com/questions/20246357/gcc-using-o1-and-spelling-the-o1-options-out-leads-to-different-result-one-w

This is expected, not a bug:
https://gcc.gnu.org/wiki/FAQ#optimization-options

Comment 3 Richard Biener 2018-02-12 12:52:13 UTC

Comment 4 Jonathan Wakely 2018-02-12 12:56:56 UTC

(In reply to xyzdr4gon333 from comment #1)
> Actually by bisecting the additional O2-flags this can be traced down to
> -finline-small-functions ... I will open another bug for this.

I see you've opened Bug 84328 for this, so I'm closing this one because it's not a bug.

Comment 5 xyzdr4gon333 2018-02-12 12:59:44 UTC

Too bad. Before I take a longer look at the assembly code: any quick thoughts on which optimization, not available as a single option, could lead to the 4x speedup?

Comment 6 Jonathan Wakely 2018-02-12 13:03:42 UTC

I think you misunderstand. Listing all the individual -fxxx options without -O1 results in NO OPTIMIZATION. The difference you see is due to all the passes enabled by -O1, not by the ones without flags.

As I wrote on stackoverflow:

If you don't use one of the -O1, -O2, -O3, -Ofast, or -Og optimization options (and not -O0) then no optimization happens at all, so adjusting which optimization passes are active doesn't do anything.

To find which optimization pass makes the difference you can turn on -O1 and then disable individual optimization passes until you find the one that makes a difference.

i.e. instead of gcc -fxxx -fyyy -fzzz ... use gcc -O1 -fno-xxx -fno-yyy -fno-zzz
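The "keep -O1 and disable one pass at a time" approach can be sketched as a small script. This is a minimal illustration and makes some assumptions: the helper names `negate_flag` and `print_bisect_commands` are invented here, the two flags are just the ones from the diff in comment #0 (neither is claimed to be the culprit), and flags that take a value (`-fxxx=value`) are not handled. The script only prints the commands, so it can be inspected before anything is compiled.

```shell
#!/bin/sh
# Hypothetical helper: turn an enabling flag such as -ftree-builtin-call-dce
# into its disabling form -fno-tree-builtin-call-dce.
negate_flag() {
    printf '%s\n' "-fno-${1#-f}"
}

# Print one compile-and-run command per flag: -O1 with that one pass disabled.
# The source file name and -std option are the ones used in this report.
print_bisect_commands() {
    for flag in "$@"; do
        printf 'g++ -O1 %s -std=c++11 optimizeFlags.cpp && ./a.out\n' \
            "$(negate_flag "$flag")"
    done
}

# Example flag list for illustration only
print_bisect_commands -fdelayed-branch -ftree-builtin-call-dce
```

Piping the output to `sh` runs each variant in turn. The run whose timing falls back toward the unoptimized figure points at the pass that matters.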

Comment 7 xyzdr4gon333 2018-02-12 13:05:03 UTC

Ah, thank you very much! And sorry for misusing the bug tracker out of lack of knowledge.