Bug 82242 - IRA spills allocno in loop body if it crosses throwing call outside the loop
Summary: IRA spills allocno in loop body if it crosses throwing call outside the loop
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 7.1.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2017-09-19 01:14 UTC by Guillaume Morin
Modified: 2023-11-10 17:53 UTC (History)
4 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-09-20 00:00:00


Attachments
"slow" testcase (432 bytes, text/x-csrc)
2017-09-19 01:14 UTC, Guillaume Morin
Details
"fast" testcase (420 bytes, text/x-csrc)
2017-09-19 01:14 UTC, Guillaume Morin
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Guillaume Morin 2017-09-19 01:14:04 UTC
Created attachment 42200 [details]
"slow" testcase

I have attached a test case (slow_add.cpp) that allocates a large vector of double, initializes each double then adds them.

When I compile the attached test case with "g++ -std=c++11 -march=nehalem -O3" (gcc 7.1), the adding part (the program prints the timing) takes on my machine ~14 seconds.  The same program written without a vector (fast_add.cpp) takes ~5 seconds to add the doubles.  Both programs print the same result as the end. 


Slow add loop (slow_add compiled with the options above):
    const double * ptr = array.data();
    const double *const end = array.data() + array.size();

    double result = 0.0;
    startTime = std::chrono::system_clock::now();
    while (ptr != end) {
        result += *ptr;
        ++ptr;
    }
    endTime = std::chrono::system_clock::now();
  400c00:       e8 eb fd ff ff          callq  4009f0 <std::chrono::_V2::system_clock::now()@plt>
  400c05:       49 89 c4                mov    %rax,%r12
  400c08:       4c 39 f3                cmp    %r14,%rbx
  400c0b:       0f 84 09 01 00 00       je     400d1a <main+0x27a>
  400c11:       48 8b 2d 78 03 00 00    mov    0x378(%rip),%rbp        # 400f90 <_IO_stdin_used+0x28>
  400c18:       4c 89 f0                mov    %r14,%rax
  400c1b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400c20:       66 48 0f 6e cd          movq   %rbp,%xmm1              <== BAD
  400c25:       f2 0f 58 08             addsd  (%rax),%xmm1
  400c29:       48 83 c0 08             add    $0x8,%rax
  400c2d:       66 48 0f 7e cd          movq   %xmm1,%rbp              <== BAD
  400c32:       48 39 d8                cmp    %rbx,%rax
  400c35:       75 e9                   jne    400c20 <main+0x180>
  400c37:       e8 b4 fd ff ff          callq  4009f0 <std::chrono::_V2::system_clock::now()@plt>

As you can see the result is copied back and forth between %xmm1 and %rbp un-necessarily

Pretty much the same program without a vector produces a much better version:
Fast add loop: (fast_add compiled with the options above):
    startTime = std::chrono::system_clock::now();
    while (ptr != end) {
        result += *ptr;
        ++ptr;
    }
    endTime = std::chrono::system_clock::now();
  400a68:       e8 93 fe ff ff          callq  400900 <std::chrono::_V2::system_clock::now()@plt>
  400a6d:       66 0f ef c9             pxor   %xmm1,%xmm1
  400a71:       49 89 c4                mov    %rax,%r12
  400a74:       48 b8 00 00 00 00 08    movabs $0x800000000,%rax
  400a7b:       00 00 00 
  400a7e:       48 01 e8                add    %rbp,%rax
  400a81:       eb 09                   jmp    400a8c <main+0x11c>
  400a83:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  400a88:       48 83 c3 08             add    $0x8,%rbx
  400a8c:       f2 0f 58 4d 00          addsd  0x0(%rbp),%xmm1
  400a91:       48 89 dd                mov    %rbx,%rbp
  400a94:       48 39 c3                cmp    %rax,%rbx
  400a97:       75 ef                   jne    400a88 <main+0x118>
  400a99:       f2 0f 11 4c 24 08       movsd  %xmm1,0x8(%rsp)
  400a9f:       e8 5c fe ff ff          callq  400900 <std::chrono::_V2::system_clock::now()@plt>

If I remove -march when compiling slow_add.cpp, the performance and the generated assembly is in line wth fast_add.cpp.  Compiling with -fno-exceptions but keeping -march also solves the issue.

I tried -march=native on both Westemere and Haswell machines as well and it produces slow code as well on both.  Removing -march or adding -fno-exceptions fixes the issue on both.

I see the same issue with gcc 6.3
Comment 1 Guillaume Morin 2017-09-19 01:14:34 UTC
Created attachment 42201 [details]
"fast" testcase
Comment 2 Marc Glisse 2017-09-19 06:53:32 UTC
Nothing gets vectorized :-(
Note that to fill the vector, this would be better

  std::vector<double> array(size, 1e-9);

In the reduction, we seem to do strange things with the accumulator.

        addsd   (%rax), %xmm1
        addq    $8, %rax
        cmpq    %rbx, %rax
        movsd   %xmm1, (%rsp)
        jne     .L13

or

        vmovq   %rbp, %xmm2
        vaddsd  (%rax), %xmm2, %xmm1
        addq    $8, %rax
        vmovq   %xmm1, %rbp
        cmpq    %rbx, %rax
        jne     .L13

We aren't happy with xmm1, we save the value to memory in the first case, and to an integer register in the second case where we even restore the value from that register...
Comment 3 Alexander Monakov 2017-09-20 13:01:32 UTC
(Marc, for vectorization -fassociative-math would be required)

I think it's due to handling of possibly-throwing insns in register allocation: allocno for the accumulator is considered to conflict with all SSE registers, but the throwing call is outside of the loop, and we don't want spills inside the loop.  Minimal C++ testcase, needs just -O2:

void c();
struct S {~S();};
double f(double *x, int n)
{
  S s;
  double r = 0;
  for (; n; n--)
    r += *x++;
  c();
  return r;
}

I don't understand why throwing calls are more special than normal calls for IRA, it seems values in call-clobbered registers would need to be spilled either way...
Comment 4 Alexander Monakov 2017-09-21 17:53:43 UTC
FWIW, removing 'if (can_throw_internal (insn)) { ... }' code in ira-lives.c:process_bb_node_lives introduced by fix for PR 38811 improves this testcase and doesn't seem to introduce testsuite regressions on both 32 and 64-bit x86, but this might just mean that there's not enough runtime testcases for catching exceptions. The place for a proper fix may be elsewhere, I don't know what special considerations are required for throwing calls.
Comment 5 Alexander Monakov 2023-11-10 17:52:14 UTC
The small testcase from comment 3 is now improved on trunk, possibly thanks to work in PR 110215.