For the following testcase, gcc -O2

unsigned foo(const unsigned char *buf, long size);

unsigned bar(const unsigned char *buf, long size)
{
    typedef char i8v8 __attribute__((vector_size(8)));
    typedef short i16v8 __attribute__((vector_size(16)));

    long chunk_sz = 15*16;
    for (; size >= chunk_sz; size -= chunk_sz) {
        i16v8 vs1 = { 0 };
        const unsigned char *end = buf + chunk_sz;
        for (; buf != end; buf += 16) {
            i16v8 b;
            asm("pmovzxbw %1, %0" : "=x"(b) : "m"(*(i8v8*)buf));
            vs1 += b;
            asm("pmovzxbw %1, %0" : "=x"(b) : "m"(*(i8v8*)(buf+8)));
            vs1 += b;
        }
        asm("" :: "x"(vs1));
    }
    return foo(buf, size);
}

(asms needed due to PR 31667) generates

bar:
        cmp     rsi, 239
        jle     .L2
        lea     rdx, [rdi+240]
.L4:
        lea     rax, [rdx-240]
        pxor    xmm0, xmm0
.L3:
        pmovzxbw QWORD PTR [rax], xmm1
        add     rax, 16
        paddw   xmm0, xmm1
        mov     rdi, rdx        ; <<< ehhh
        pmovzxbw QWORD PTR [rax-8], xmm1
        paddw   xmm0, xmm1
        cmp     rax, rdx
        jne     .L3
        sub     rsi, 240
        add     rdx, 240
        cmp     rsi, 239
        jg      .L4
.L2:
        jmp     foo

It looks as if going out of SSA places in the loop a register copy corresponding to a phi node which is outside of the loop. Strangely, RTL optimizations do not clean it up either.
(In reply to Alexander Monakov from comment #0)
> It looks as if going out of SSA places in the loop a register copy
> corresponding to a phi node which is outside of the loop. Strangely, RTL
> optimizations do not clean it up either.

No, it is IVOPTs that places the copy inside the loop:

  <bb 5> [local count: 1006632961]:
  # buf_25 = PHI <buf_21(5), buf_22(4)>
  # vs1_28 = PHI <vs1_20(5), { 0, 0, 0, 0, 0, 0, 0, 0 }(4)>
  __asm__("pmovzxbw %1, %0" : "=x" b_17 : "m" MEM[(i8v8 *)buf_25]);
  vs1_18 = b_17 + vs1_28;
  _15 = (unsigned long) buf_25;
  _14 = _15 + 8;
  _2 = (const unsigned char *) _14;
  __asm__("pmovzxbw %1, %0" : "=x" b_19 : "m" MEM[(i8v8 *)_2]);
  vs1_20 = vs1_18 + b_19;
  buf_21 = buf_25 + 16;
  _33 = (const unsigned char *) ivtmp.18_7;
  if (buf_21 != _33)
    goto <bb 5>; [93.75%]
  else
    goto <bb 6>; [6.25%]

Notice the cast of ivtmp.18_7 that is assigned to _33 here. The cast is loop-invariant. I don't know why LIM4 didn't pull out the invariant.
_33 = (const unsigned char *) ivtmp.18_7;
  invariant up to level 2, cost 1.
With --param lim-expensive=0, the cast was pulled out of the loop and the move is gone too. But then again the register move should have been removed during register allocation ...
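For illustration only, the effect of hoisting the invariant cast can be sketched by hand at the source level. This is a minimal sketch, not the testcase from comment #0: the function names sum_cast_in_loop and sum_cast_hoisted and the uintptr_t parameter end_bits are hypothetical, chosen to mimic an integer IV (like ivtmp.18_7) being cast to a pointer for the exit test. It only shows that the two forms compute the same result; whether GCC emits an extra move for the first form depends on the pass pipeline discussed here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: the cast of end_bits does not depend on the loop,
   yet it is written inside the loop body, like _33 in the dump above. */
unsigned sum_cast_in_loop(const unsigned char *buf, uintptr_t end_bits)
{
    unsigned sum = 0;
    for (;;) {
        /* Loop-invariant: same value on every iteration. */
        const unsigned char *end = (const unsigned char *)end_bits;
        if (buf == end)
            break;
        sum += *buf++;
    }
    return sum;
}

/* Same computation with the invariant cast moved to the preheader,
   which is what invariant motion would do if it treated the cast as
   worth moving. */
unsigned sum_cast_hoisted(const unsigned char *buf, uintptr_t end_bits)
{
    const unsigned char *end = (const unsigned char *)end_bits;
    unsigned sum = 0;
    for (; buf != end; buf++)
        sum += *buf;
    return sum;
}
```

Both functions walk the bytes from buf up to the address encoded in end_bits and return their sum, so the hoisting is purely a code-placement change.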
Oh yes out of ssa used the wrong coalescing:

Coalesce list: (24)buf_24 & (33)_33 [map: 9, 16] : Success -> 9

Just by accident.
(In reply to Andrew Pinski from comment #4)
> Oh yes out of ssa used the wrong coalescing:
> Coalesce list: (24)buf_24 & (33)_33 [map: 9, 16] : Success -> 9
>
> Just by accident.

Nothing in the RTL passes can clean it up because the pseudo-register is used both inside and outside of the loop.
-fno-tree-coalesce-vars also fixes the issue.
I knew there was an older bug about a similar thing. See PR 86270 and PR 70359.
(In reply to Andrew Pinski from comment #2)
> _33 = (const unsigned char *) ivtmp.18_7;
>   invariant up to level 2, cost 1.

Yep, that's an "old" thing, PRE would have moved it but LIM has something they call a cost model.
Btw, as for coalescing, I think the only coalescing we should do at out-of-SSA time is coalescing to avoid copies on edges (aka for PHIs and abnormals); the rest should be left to the RA (OTOH we're going to flood early RTL with a lot more pseudos that way, possibly increasing compile-time in DF).