Bug 58863 - for loop not aligned at -O2 or -O3
Summary: for loop not aligned at -O2 or -O3
Status: RESOLVED INVALID
Alias: None
Product: gcc
Classification: Unclassified
Component: other
Version: 4.7.2
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL: http://stackoverflow.com/q/19470873/3...
Reported: 2013-10-24 17:14 UTC by Ali Baharev
Modified: 2015-06-16 15:44 UTC (History)
Description Ali Baharev 2013-10-24 17:14:28 UTC
The for loop in work() is the hotspot:

const int LOOP_BOUND = 200000000;

__attribute__((noinline))
static int add(const int& x, const int& y) {
    return x + y;
}

__attribute__((noinline))
static int work(int xval, int yval) {
    int sum(0);
    for (int i=0; i<LOOP_BOUND; ++i) {
        int x(xval+sum);
        int y(yval+sum);
        int z = add(x, y);
        sum += z;
    }
    return sum;
}

int main(int , char* argv[]) {
    int result = work(*argv[1], *argv[2]);
    return result;
}


Running 

g++ -O2 main.cpp && objdump -d a.out | c++filt 

gives

  400598:       41 8d 34 1c             lea    (%r12,%rbx,1),%esi
  [...]
  4005ab:       75 eb                   jne    400598 <work(int, int)+0x18>

According to the documentation:

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-falign-loops   Enabled at levels -O2, -O3. 

Judging from the assembly code, gcc appears to align things to the next 16-byte boundary by default on this machine in other cases.

If I pass -falign-loops=16 it becomes:

  4005a0:       41 8d 34 1c             lea    (%r12,%rbx,1),%esi
  [...]
  4005b3:       75 eb                   jne    4005a0 <work(int, int)+0x20>

I guess it is supposed to look like this when just -O2 is passed as well; at least that is what the documentation suggests to me.
Comment 1 Andrew Pinski 2013-10-24 17:23:55 UTC
We have:
	.p2align 4,,10
	.p2align 3

So we skip at most 10 bytes to reach a 16-byte boundary; if that would take more than 10 bytes, we still align the loop to an 8-byte boundary.
Comment 2 Ali Baharev 2013-10-24 17:31:21 UTC
Please check with objdump. It's not what I get in the executable.
Comment 3 Andrew Pinski 2013-10-24 17:33:49 UTC
(In reply to Ali Baharev from comment #2)
> Please check with objdump. It's not what I get in the executable.

Yes it is.  Read my comment again.  We align loops to 8 bytes by default, but try to align them to 16 bytes if that does not require more than 10 bytes of padding.
Comment 4 Ali Baharev 2013-10-24 17:37:24 UTC
My mistake, sorry. 

So, you are saying that the default alignment is 8 byte for loops?

The funny thing is, this code runs 15% faster if any of the following is passed:

 -Os
 -O2 -fno-align-loops -fno-align-functions
 -O2 -fno-omit-frame-pointer

At least on my machine and in this case, 16-byte alignment (or any multiple of 16 bytes) is better. -march=native has no effect on the performance.
Comment 5 Ali Baharev 2013-10-24 17:39:16 UTC
OK, so 8-byte alignment is the default for loops. If you think it is not a bug, then let's close this. Sorry for the false alarm.
Comment 6 Richard Biener 2013-10-25 10:19:08 UTC
It works as designed.