Created attachment 42200 [details] "slow" testcase I have attached a test case (slow_add.cpp) that allocates a large vector of double, initializes each double then adds them. When I compile the attached test case with "g++ -std=c++11 -march=nehalem -O3" (gcc 7.1), the adding part (the program prints the timing) takes on my machine ~14 seconds. The same program written without a vector (fast_add.cpp) takes ~5 seconds to add the doubles. Both programs print the same result as the end. Slow add loop (slow_add compiled with the options above): const double * ptr = array.data(); const double *const end = array.data() + array.size(); double result = 0.0; startTime = std::chrono::system_clock::now(); while (ptr != end) { result += *ptr; ++ptr; } endTime = std::chrono::system_clock::now(); 400c00: e8 eb fd ff ff callq 4009f0 <std::chrono::_V2::system_clock::now()@plt> 400c05: 49 89 c4 mov %rax,%r12 400c08: 4c 39 f3 cmp %r14,%rbx 400c0b: 0f 84 09 01 00 00 je 400d1a <main+0x27a> 400c11: 48 8b 2d 78 03 00 00 mov 0x378(%rip),%rbp # 400f90 <_IO_stdin_used+0x28> 400c18: 4c 89 f0 mov %r14,%rax 400c1b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 400c20: 66 48 0f 6e cd movq %rbp,%xmm1 <== BAD 400c25: f2 0f 58 08 addsd (%rax),%xmm1 400c29: 48 83 c0 08 add $0x8,%rax 400c2d: 66 48 0f 7e cd movq %xmm1,%rbp <== BAD 400c32: 48 39 d8 cmp %rbx,%rax 400c35: 75 e9 jne 400c20 <main+0x180> 400c37: e8 b4 fd ff ff callq 4009f0 <std::chrono::_V2::system_clock::now()@plt> As you can see the result is copied back and forth between %xmm1 and %rbp un-necessarily Pretty much the same program without a vector produces a much better version: Fast add loop: (fast_add compiled with the options above): startTime = std::chrono::system_clock::now(); while (ptr != end) { result += *ptr; ++ptr; } endTime = std::chrono::system_clock::now(); 400a68: e8 93 fe ff ff callq 400900 <std::chrono::_V2::system_clock::now()@plt> 400a6d: 66 0f ef c9 pxor %xmm1,%xmm1 400a71: 49 89 c4 mov %rax,%r12 400a74: 48 b8 00 00 00 00 08 movabs $0x800000000,%rax 400a7b: 00 00 00 400a7e: 48 01 e8 add %rbp,%rax 400a81: eb 09 jmp 400a8c <main+0x11c> 400a83: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 400a88: 48 83 c3 08 add $0x8,%rbx 400a8c: f2 0f 58 4d 00 addsd 0x0(%rbp),%xmm1 400a91: 48 89 dd mov %rbx,%rbp 400a94: 48 39 c3 cmp %rax,%rbx 400a97: 75 ef jne 400a88 <main+0x118> 400a99: f2 0f 11 4c 24 08 movsd %xmm1,0x8(%rsp) 400a9f: e8 5c fe ff ff callq 400900 <std::chrono::_V2::system_clock::now()@plt> If I remove -march when compiling slow_add.cpp, the performance and the generated assembly is in line wth fast_add.cpp. Compiling with -fno-exceptions but keeping -march also solves the issue. I tried -march=native on both Westemere and Haswell machines as well and it produces slow code as well on both. Removing -march or adding -fno-exceptions fixes the issue on both. I see the same issue with gcc 6.3
Created attachment 42201 [details] "fast" testcase
Nothing gets vectorized :-( Note that to fill the vector, this would be better std::vector<double> array(size, 1e-9); In the reduction, we seem to do strange things with the accumulator. addsd (%rax), %xmm1 addq $8, %rax cmpq %rbx, %rax movsd %xmm1, (%rsp) jne .L13 or vmovq %rbp, %xmm2 vaddsd (%rax), %xmm2, %xmm1 addq $8, %rax vmovq %xmm1, %rbp cmpq %rbx, %rax jne .L13 We aren't happy with xmm1, we save the value to memory in the first case, and to an integer register in the second case where we even restore the value from that register...
(Marc, for vectorization -fassociative-math would be required) I think it's due to handling of possibly-throwing insns in register allocation: allocno for the accumulator is considered to conflict with all SSE registers, but the throwing call is outside of the loop, and we don't want spills inside the loop. Minimal C++ testcase, needs just -O2: void c(); struct S {~S();}; double f(double *x, int n) { S s; double r = 0; for (; n; n--) r += *x++; c(); return r; } I don't understand why throwing calls are more special than normal calls for IRA, it seems values in call-clobbered registers would need to be spilled either way...
FWIW, removing 'if (can_throw_internal (insn)) { ... }' code in ira-lives.c:process_bb_node_lives introduced by fix for PR 38811 improves this testcase and doesn't seem to introduce testsuite regressions on both 32 and 64-bit x86, but this might just mean that there's not enough runtime testcases for catching exceptions. The place for a proper fix may be elsewhere, I don't know what special considerations are required for throwing calls.
The small testcase from comment 3 is now improved on trunk, possibly thanks to work in PR 110215.