The for loop in work() is the hotspot:

    const int LOOP_BOUND = 200000000;

    __attribute__((noinline))
    static int add(const int& x, const int& y) {
        return x + y;
    }

    __attribute__((noinline))
    static int work(int xval, int yval) {
        int sum(0);
        for (int i=0; i<LOOP_BOUND; ++i) {
            int x(xval+sum);
            int y(yval+sum);
            int z = add(x, y);
            sum += z;
        }
        return sum;
    }

    int main(int , char* argv[]) {
        int result = work(*argv[1], *argv[2]);
        return result;
    }

Running g++ -O2 main.cpp && objdump -d | c++filt gives

    400598: 41 8d 34 1c    lea (%r12,%rbx,1),%esi
    [...]
    4005ab: 75 eb          jne 400598 <work(int, int)+0x18>

According to the documentation:

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

    -falign-loops
    Enabled at levels -O2, -O3.

Judging from the assembly code, gcc seems to align things to the next 16-byte boundary by default on this machine in other cases. If I pass -falign-loops=16, the loop becomes:

    4005a0: 41 8d 34 1c    lea (%r12,%rbx,1),%esi
    [...]
    4005b3: 75 eb          jne 4005a0 <work(int, int)+0x20>

I guess it is supposed to look like this when just -O2 is passed as well; at least that is what the documentation suggests to me.
We have:

    .p2align 4,,10
    .p2align 3

so we skip at most 10 bytes to reach a 16-byte boundary, and otherwise still align the loop to an 8-byte boundary.
Please check with objdump. It's not what I get in the executable.
(In reply to Ali Baharev from comment #2)
> Please check with objdump. It's not what I get in the executable.

Yes it is. Read my comment again. We align loops to 8 bytes by default, but try to align them to 16 bytes if that does not require more than 10 bytes of padding.
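To make the rule above concrete, here is a minimal sketch that models the effect of the directive pair on an address. The helper names padding_to and align_loop_head are made up for illustration, and the pre-padding addresses in the checks are hypothetical: the objdump output in this report only shows where the loop ended up, not where it would have started without padding.

```cpp
#include <cstdint>

// Bytes of padding needed to move addr up to the next multiple of align.
// align must be a power of two. (Hypothetical helper, not part of gcc or gas.)
constexpr std::uint64_t padding_to(std::uint64_t addr, std::uint64_t align) {
    return (align - (addr & (align - 1))) & (align - 1);
}

// Models ".p2align 4,,10" followed by ".p2align 3":
//   - if a 16-byte boundary is at most 10 bytes away, pad up to it;
//   - otherwise fall back to the next 8-byte boundary.
constexpr std::uint64_t align_loop_head(std::uint64_t addr) {
    return padding_to(addr, 16) <= 10
        ? addr + padding_to(addr, 16)
        : addr + padding_to(addr, 8);
}

// Hypothetical starting address: 11 bytes short of 0x4005a0, so only the
// 8-byte alignment applies and the loop head lands on 0x400598.
static_assert(align_loop_head(0x400595) == 0x400598, "8-byte fallback");

// Hypothetical starting address: exactly 10 bytes short, so the 16-byte
// alignment is still taken.
static_assert(align_loop_head(0x400596) == 0x4005a0, "16-byte alignment");
```

Under this model, a loop head at 0x400598 is consistent with the default behaviour whenever the unpadded address was more than 10 bytes below 0x4005a0.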
My mistake, sorry. So, you are saying that the default alignment for loops is 8 bytes?

The funny thing is, this code runs 15% faster if any of the following is passed:

    -Os
    -O2 -fno-align-loops -fno-align-functions
    -O2 -fno-omit-frame-pointer

At least on my machine and in this case, 16-byte alignment (or any multiple of 16 bytes) is better. -march=native has no effect on the performance.
OK, so 8-byte alignment is the default for loops. If you think it is not a bug, then let's close this. Sorry for the false alarm.
It works as designed.