Bug 106688 - out of ssa Coalescing sometimes chooses the wrong thing causing an extra move
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 13.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2022-08-19 19:41 UTC by Alexander Monakov
Modified: 2022-08-23 17:20 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-08-19 00:00:00


Description Alexander Monakov 2022-08-19 19:41:17 UTC
For the following testcase, gcc -O2

unsigned foo(const unsigned char *buf, long size);
unsigned bar(const unsigned char *buf, long size)
{
        typedef char  i8v8  __attribute__((vector_size(8)));
        typedef short i16v8 __attribute__((vector_size(16)));
        long chunk_sz = 15*16;
        for (; size >= chunk_sz; size -= chunk_sz) {
                i16v8 vs1 = { 0 };
                const unsigned char *end = buf + chunk_sz;
                for (; buf != end; buf += 16) {
                        i16v8 b;
                        asm("pmovzxbw %1, %0" : "=x"(b) : "m"(*(i8v8*)buf));
                        vs1 += b;
                        asm("pmovzxbw %1, %0" : "=x"(b) : "m"(*(i8v8*)(buf+8)));
                        vs1 += b;
                }
                asm("" :: "x"(vs1));
        }
        return foo(buf, size);
}

(asms needed due to PR 31667)

generates

bar:
        cmp     rsi, 239
        jle     .L2
        lea     rdx, [rdi+240]
.L4:
        lea     rax, [rdx-240]
        pxor    xmm0, xmm0
.L3:
        pmovzxbw QWORD PTR [rax], xmm1
        add     rax, 16
        paddw   xmm0, xmm1

        mov     rdi, rdx ; <<< ehhh

        pmovzxbw QWORD PTR [rax-8], xmm1
        paddw   xmm0, xmm1
        cmp     rax, rdx
        jne     .L3
        sub     rsi, 240
        add     rdx, 240
        cmp     rsi, 239
        jg      .L4
.L2:
        jmp     foo

It looks as if going out of SSA places in the loop a register copy corresponding to a phi node which is outside of the loop. Strangely, RTL optimizations do not clean it up either.
Comment 1 Andrew Pinski 2022-08-19 20:01:46 UTC
(In reply to Alexander Monakov from comment #0)
> It looks as if going out of SSA places in the loop a register copy
> corresponding to a phi node which is outside of the loop. Strangely, RTL
> optimizations do not clean it up either.

No, it is IVOPTs that places the copy inside the loop:
  <bb 5> [local count: 1006632961]:
  # buf_25 = PHI <buf_21(5), buf_22(4)>
  # vs1_28 = PHI <vs1_20(5), { 0, 0, 0, 0, 0, 0, 0, 0 }(4)>
  __asm__("pmovzxbw %1, %0" : "=x" b_17 : "m" MEM[(i8v8 *)buf_25]);
  vs1_18 = b_17 + vs1_28;
  _15 = (unsigned long) buf_25;
  _14 = _15 + 8;
  _2 = (const unsigned char *) _14;
  __asm__("pmovzxbw %1, %0" : "=x" b_19 : "m" MEM[(i8v8 *)_2]);
  vs1_20 = vs1_18 + b_19;
  buf_21 = buf_25 + 16;
  _33 = (const unsigned char *) ivtmp.18_7;
  if (buf_21 != _33)
    goto <bb 5>; [93.75%]
  else
    goto <bb 6>; [6.25%]

Notice the cast of ivtmp.18_7, assigned to _33 here. The cast is loop-invariant.

I don't know why LIM4 didn't pull out the invariant.
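
At source level, the effect of hoisting that invariant cast can be sketched roughly as follows. This is only an illustration in plain C, not GIMPLE: the function names `sum_cast_in_loop` and `sum_cast_hoisted` and the `uintptr_t` stand-in for `ivtmp.18_7` are made up for the example, and an optimizing compiler may well turn both functions into the same code.

```c
#include <stddef.h>
#include <stdint.h>

/* The loop bound is kept as an integer (a hypothetical stand-in for
   ivtmp.18_7) and cast back to a pointer on every iteration, like _33
   in the dump above. */
unsigned sum_cast_in_loop(const unsigned char *buf, size_t n)
{
    uintptr_t ivtmp = (uintptr_t)(buf + n);
    unsigned sum = 0;
    for (;;) {
        /* Loop-invariant: ivtmp never changes inside the loop. */
        const unsigned char *end = (const unsigned char *)ivtmp;
        if (buf == end)
            break;
        sum += *buf++;
    }
    return sum;
}

/* Same loop with the invariant cast hoisted in front of the loop, which
   is the kind of motion LIM performs when its cost model allows it. */
unsigned sum_cast_hoisted(const unsigned char *buf, size_t n)
{
    uintptr_t ivtmp = (uintptr_t)(buf + n);
    const unsigned char *end = (const unsigned char *)ivtmp; /* computed once */
    unsigned sum = 0;
    for (; buf != end; buf++)
        sum += *buf;
    return sum;
}
```

Both functions compute the same sum; the only difference is where the cast is evaluated.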
Comment 2 Andrew Pinski 2022-08-19 20:02:18 UTC
_33 = (const unsigned char *) ivtmp.18_7;
  invariant up to level 2, cost 1.
Comment 3 Andrew Pinski 2022-08-19 20:04:47 UTC
With --param lim-expensive=0, the cast is pulled out of the loop and the move is gone too.

But then again the register move should have been removed during register allocation ...
Comment 4 Andrew Pinski 2022-08-19 20:10:13 UTC
Oh yes out of ssa used the wrong coalescing:
Coalesce list: (24)buf_24 & (33)_33 [map: 9, 16] : Success -> 9

Just by accident.
Comment 5 Andrew Pinski 2022-08-19 20:12:13 UTC
(In reply to Andrew Pinski from comment #4)
> Oh yes out of ssa used the wrong coalescing:
> Coalesce list: (24)buf_24 & (33)_33 [map: 9, 16] : Success -> 9
> 
> Just by accident.

Nothing in the RTL passes can clean it up because the pseudo-register is used both inside and outside of the loop.
Comment 6 Andrew Pinski 2022-08-19 20:15:21 UTC
-fno-tree-coalesce-vars also fixes the issue.
Comment 7 Andrew Pinski 2022-08-19 20:43:14 UTC
I knew there was an older bug about a similar thing. See PR 86270 and PR 70359.
Comment 8 Richard Biener 2022-08-22 06:25:20 UTC
(In reply to Andrew Pinski from comment #2)
> _33 = (const unsigned char *) ivtmp.18_7;
>   invariant up to level 2, cost 1.

Yep, that's an "old" thing: PRE would have moved it, but LIM has something they call a cost model.
Comment 9 Richard Biener 2022-08-22 06:27:27 UTC
Btw, as for coalescing, I think the only coalescing we should do at out-of-SSA time
is coalescing to avoid copies on edges (aka for PHIs and abnormals); the rest should be left to the RA (OTOH we're going to flood early RTL with a lot more pseudos that way, possibly increasing compile time in DF).
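
To make "copies on edges" concrete, here is a small C sketch of what going out of SSA has to do for a PHI. It is only an analogy written in C: the function names and the variables `x_1`, `x_2`, `x_3` are invented for illustration and do not correspond to anything in GCC's dumps. In the first function each SSA name keeps its own variable, so every incoming edge needs a copy into the PHI result; in the second, the names are coalesced into one variable and the copies disappear.

```c
/* SSA-like form: x_3 = PHI <x_1 (then edge), x_2 (else edge)>.
   Without coalescing, each edge carries an explicit copy into x_3. */
int phi_with_edge_copies(int cond, int a, int b)
{
    int x_1, x_2, x_3;
    if (cond) {
        x_1 = a + 1;
        x_3 = x_1;      /* copy inserted on the "then" edge */
    } else {
        x_2 = b * 2;
        x_3 = x_2;      /* copy inserted on the "else" edge */
    }
    return x_3;
}

/* Coalesced form: the live ranges of x_1, x_2 and x_3 never overlap,
   so they can share one variable and no copy is needed. */
int phi_coalesced(int cond, int a, int b)
{
    int x;
    if (cond)
        x = a + 1;
    else
        x = b * 2;
    return x;
}
```

Coalescing of this PHI-related kind removes copies outright; the coalescing seen in this bug merged names that were not tied together by a PHI, which is the part suggested above to be left to the register allocator.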