[PATCH] New hook adjust_iv_update_pos

Xionghu Luo <luoxhu@linux.ibm.com>
Tue Jun 29 09:19:25 GMT 2021



On 2021/6/28 16:25, Richard Biener wrote:
> On Mon, Jun 28, 2021 at 10:07 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>
>>
>>
>> On 2021/6/25 18:02, Richard Biener wrote:
>>> On Fri, Jun 25, 2021 at 11:41 AM Xionghu Luo <luoxhu@linux.ibm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2021/6/25 16:54, Richard Biener wrote:
>>>>> On Fri, Jun 25, 2021 at 10:34 AM Xionghu Luo via Gcc-patches
>>>>> <gcc-patches@gcc.gnu.org> wrote:
>>>>>>
>>>>>> From: Xiong Hu Luo <luoxhu@linux.ibm.com>
>>>>>>
>>>>>> adjust_iv_update_pos in tree-ssa-loop-ivopts doesn't help performance
>>>>>> on Power.  For example, it generates mismatched address offsets after
>>>>>> adjusting the IV update statement position:
>>>>>>
>>>>>> <bb 32> [local count: 70988443]:
>>>>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>>> ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>>> _34 = ref_180 + 18446744073709551615;
>>>>>> _86 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>>>>> if (_84 == _86)
>>>>>>   goto <bb 56>; [94.50%]
>>>>>> else
>>>>>>   goto <bb 87>; [5.50%]
>>>>>>
>>>>>> Disabling it produces:
>>>>>>
>>>>>> <bb 32> [local count: 70988443]:
>>>>>> _84 = MEM[(uint8_t *)ip_229 + ivtmp.30_414 * 1];
>>>>>> _86 = MEM[(uint8_t *)ref_180 + ivtmp.30_414 * 1];
>>>>>> ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>>> if (_84 == _86)
>>>>>>   goto <bb 56>; [94.50%]
>>>>>> else
>>>>>>   goto <bb 87>; [5.50%]
>>>>>>
>>>>>> Then the later loop unrolling pass can benefit from the same address
>>>>>> offsets with different base addresses, which reduces register
>>>>>> dependencies.  This patch can improve performance by 10% for a typical
>>>>>> case on Power; no performance change is observed on X86 or AArch64,
>>>>>> since these small loops are not unrolled on those platforms.  Any comments?
>>>>>
>>>>> The case you quote is special in that if we hoisted the IV update before
>>>>> the other MEM _also_ used in the condition it would be fine again.
>>>>
>>>> Thanks.  I tried to hoist the IV update statement before the first MEM (Fix 2);
>>>> it shows even worse performance because the loop is not unrolled (two more
>>>> "base-1" statements are generated in gimple, so loop->ninsns is 11 and small
>>>> loops are not unrolled).  Changing the threshold from 10 to 12 in
>>>> rs6000_loop_unroll_adjust makes it unroll 2 times as well, and then the
>>>> performance is the SAME as with the IV update statement in the *MIDDLE* (trunk).
>>>> From the ASM, we can see the index register %r4 is used in two iterations,
>>>> which may be a bottleneck for hiding instruction latency.
>>>>
>>>> Then it seems reasonable that the performance would be better if we keep
>>>> the IV update statement *LAST* (Fix 1).
>>>>
>>>> (Fix 2):
>>>>     <bb 32> [local count: 70988443]:
>>>>     ivtmp.30_415 = ivtmp.30_414 + 1;
>>>>     _34 = ip_229 + 18446744073709551615;
>>>>     _84 = MEM[(uint8_t *)_34 + ivtmp.30_415 * 1];
>>>>     _33 = ref_180 + 18446744073709551615;
>>>>     _86 = MEM[(uint8_t *)_33 + ivtmp.30_415 * 1];
>>>>     if (_84 == _86)
>>>>       goto <bb 56>; [94.50%]
>>>>     else
>>>>       goto <bb 87>; [5.50%]
>>>>
>>>>
>>>> .L67:
>>>>           lbzx %r12,%r24,%r4
>>>>           lbzx %r25,%r7,%r4
>>>>           cmpw %cr0,%r12,%r25
>>>>           bne %cr0,.L11
>>>>           mr %r26,%r4
>>>>           addi %r4,%r4,1
>>>>           lbzx %r12,%r24,%r4
>>>>           lbzx %r25,%r7,%r4
>>>>           mr %r6,%r26
>>>>           cmpw %cr0,%r12,%r25
>>>>           bne %cr0,.L11
>>>>           mr %r26,%r4
>>>> .L12:
>>>>           cmpdi %cr0,%r10,1
>>>>           addi %r4,%r26,1
>>>>           mr %r6,%r26
>>>>           addi %r10,%r10,-1
>>>>           bne %cr0,.L67
>>>>
>>>>>
>>>>> Now, adjust_iv_update_pos doesn't seem to check that the
>>>>> condition actually uses the IV use stmt def, so it likely applies to
>>>>> too many cases.
>>>>>
>>>>> Unfortunately the introducing rev didn't come with a testcase,
>>>>> but still I think fixing up adjust_iv_update_pos is better than
>>>>> introducing a way to short-cut it via a per-target decision.
>>>>>
>>>>> One "fix" might be to add a check that either the condition
>>>>> lhs or rhs is the def of the IV use and the other operand
>>>>> is invariant.  Or if it's of similar structure hoist across the
>>>>> other iv-use as well.  Not that I understand the argument
>>>>> about the overlapping life-range.
>>>>>
>>>>> You also don't provide a complete testcase ...
>>>>>
>>>>
>>>> Attached is the test code; I will also add it to the patch in a future
>>>> version.  The issue comes from a very small hot loop:
>>>>
>>>>           do {
>>>>             len++;
>>>>           } while(len < maxlen && ip[len] == ref[len]);
>>>
>>> unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
>>> {
>>>     unsigned int len = 2;
>>>     do {
>>>         len++;
>>>     }while(len < maxlen && ip[len] == ref[len]);
>>>     return len;
>>> }
>>>
>>> I can see the effect on this loop on x86_64 as well, we end up with
>>>
>>> .L6:
>>>           movzbl  (%rdi,%rax), %ecx
>>>           addq    $1, %rax
>>>           cmpb    -1(%rsi,%rax), %cl
>>>           jne     .L1
>>> .L3:
>>>           movl    %eax, %r8d
>>>           cmpl    %edx, %eax
>>>           jb      .L6
>>>
>>> but without the trick it is
>>>
>>> .L6:
>>>           movzbl  (%rdi,%rax), %r8d
>>>           movzbl  (%rsi,%rax), %ecx
>>>           addq    $1, %rax
>>>           cmpb    %cl, %r8b
>>>           jne     .L1
>>> .L3:
>>>           movl    %eax, %r9d
>>>           cmpl    %edx, %eax
>>>           jb      .L6
>>
>> Verified this small piece of code on X86; there is no performance
>> change with or without adjust_iv_update_pos (I've checked that the ASM
>> is exactly the same as yours):
>>
>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ gcc -O2 test.c
>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out  1
>>
>> real    0m7.003s
>> user    0m6.996s
>> sys     0m0.000s
>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ /home/luoxhu/workspace/build/gcc/xgcc -B/home/luoxhu/workspace/build/gcc/ -O2 test.c
>> luoxhu@gcc14:~/workspace/lzf_compress_testcase$ time ./a.out  1
>>
>> real    0m7.070s
>> user    0m7.068s
>> sys     0m0.000s
>>
>>
>> But for AArch64, current GCC generates similar code with or without
>> adjust_iv_update_pos; the runtime is 10.705s in both cases.
>>
>> .L6:
>>          ldrb    w4, [x6, x3]
>>          add     x3, x3, 1
>>          ldrb    w5, [x1, x3]
>>          cmp     w5, w4
>>          bne     .L1
>> .L3:
>>          mov     w0, w3
>>          cmp     w2, w3
>>          bhi     .L6
>>
>> No adjust_iv_update_pos:
>>
>> .L6:
>>          ldrb    w5, [x6, x3]
>>          ldrb    w4, [x1, x3]
>>          add     x3, x3, 1
>>          cmp     w5, w4
>>          bne     .L1
>> .L3:
>>          mov     w0, w3
>>          cmp     w2, w3
>>          bhi     .L6
>>
>>
>> While built with an old GCC (gcc version 7.4.1 20190424), it generates
>> worse code, and the runtime is 11.664s:
>>
>> .L6:
>>          ldrb    w4, [x6, x3]
>>          add     x3, x3, 1
>>          add     x5, x1, x3
>>          ldrb    w5, [x5, -1]
>>          cmp     w5, w4
>>          bne     .L1
>> .L3:
>>          cmp     w3, w2
>>          mov     w0, w3
>>          bcc     .L6
>>
>>
>> In general, only Power shows a performance regression with
>> adjust_iv_update_pos; that's why I tried to add a target hook for it.
>> Is this reasonable?  Or should we just remove adjust_iv_update_pos,
>> since it doesn't help performance on X86 or other targets?
> 
> It was Bin who added the functionality - maybe he can remember the cases
> he saw improvements for.
> 
> I think we should instead try to change the code so that it doesn't apply to
> CONDs where both operands are defined by stmts using the IV?
> And this time add a testcase ;)

Thanks.  I am not sure whether we should check both cond_lhs and cond_rhs, or
whether checking cond_lhs is enough if the COND expr's LHS is always processed
first?

For example, adjust_iv_update_pos will be called twice for this case; after
the first call, the gimple is updated to:

<bb 4> [local count: 1014686026]:
_1 = (sizetype) len_8;
_2 = ip_10(D) + _1;
_3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
_5 = ref_12(D) + _1;
_6 = *_5;
ivtmp.15_16 = ivtmp.15_15 + 1;
if (_3 == _6)
  goto <bb 6>; [94.50%]
else
  goto <bb 10>; [5.50%]


At this moment, use->stmt is "_6 = *_5;", so it is not using the IV directly?
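
For comparison, the narrower check suggested above (either the COND lhs or rhs
is the def of the IV use and the other operand is invariant) might look roughly
like the sketch below.  This is untested and only illustrative: "loop" is not
in scope in adjust_iv_update_pos today, so it would have to be passed in or
recovered from the use, and use->stmt is known to be an assignment only after
the earlier checks in the function.

  /* Untested sketch: allow the reordering only when one operand of the
     COND is the def of this IV use and the other operand is invariant
     in LOOP (assumed to be made available here).  */
  gcond *cond = as_a <gcond *> (cond_stmt);
  tree lhs = gimple_cond_lhs (cond);
  tree rhs = gimple_cond_rhs (cond);
  tree def = gimple_assign_lhs (use->stmt);
  if (!((lhs == def && expr_invariant_in_loop_p (loop, rhs))
	|| (rhs == def && expr_invariant_in_loop_p (loop, lhs))))
    return;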



[PATCH] ivopts: Don't adjust IV update statement if both operands use the IV in COND [PR101250]

gcc/ChangeLog:

	PR middle-end/101250
	* tree-ssa-loop-ivopts.c (adjust_iv_update_pos): Check whether both
	COND operands use the IV.
---
 gcc/testsuite/gcc.dg/tree-ssa/pr101250.c | 15 ++++++++++++
 gcc/tree-ssa-loop-ivopts.c               | 31 +++++++++++++++++++++++-
 2 files changed, 45 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr101250.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c b/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
new file mode 100644
index 00000000000..70d3701de49
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr101250.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-ivopts-details" } */
+
+unsigned int foo (unsigned char *ip, unsigned char *ref, unsigned int maxlen)
+{
+  unsigned int len = 2;
+  do
+    {
+      len++;
+    }
+  while (len < maxlen && ip[len] == ref[len]);
+  return len;
+}
+
+/* { dg-final { scan-tree-dump-not "Reordering" "ivopts" } } */
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 12a8a49a307..082097bff38 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -7352,7 +7352,7 @@ static void
 adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
 {
   tree var_after;
-  gimple *iv_update, *stmt;
+  gimple *iv_update, *stmt, *cond_stmt;
   basic_block bb;
   gimple_stmt_iterator gsi, gsi_iv;
 
@@ -7370,6 +7370,7 @@ adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
   if (gimple_code (stmt) != GIMPLE_COND)
     return;
 
+  cond_stmt = stmt;
   gsi_prev_nondebug (&gsi);
   stmt = gsi_stmt (gsi);
   if (stmt != iv_update)
@@ -7386,6 +7387,34 @@ adjust_iv_update_pos (struct iv_cand *cand, struct iv_use *use)
   if (stmt != use->stmt)
     return;
 
+  gcond *gcond_stmt = as_a <gcond *> (cond_stmt);
+  tree *cond_lhs = gimple_cond_lhs_ptr (gcond_stmt);
+  tree *cond_rhs = gimple_cond_rhs_ptr (gcond_stmt);
+  if (TREE_CODE (*cond_lhs) == SSA_NAME && TREE_CODE (*cond_rhs) == SSA_NAME)
+    {
+      /* If both sides of the COND are memory operations that use the iv
+	 var_before, moving the iv update stmt before the second memory
+	 operation would create a mismatched offset index, which may hurt
+	 performance after unrolling, so return here to avoid the reorder.
+
+       _1 = (sizetype) len_8;
+       _2 = ip_10(D) + _1;
+       _3 = MEM[(unsigned char *)ip_10(D) + ivtmp.15_15 * 1];
+       _5 = ref_12(D) + _1;
+       _6 = *_5;
+       ivtmp.15_16 = ivtmp.15_15 + 1;
+       if (_3 == _6).   */
+
+      gimple *lhs_mem = SSA_NAME_DEF_STMT (*cond_lhs);
+      gimple *rhs_mem = SSA_NAME_DEF_STMT (*cond_rhs);
+      gimple *use_stmt;
+      imm_use_iterator use_iter;
+      FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, cand->var_before)
+	if ((use_stmt == lhs_mem || use_stmt == rhs_mem)
+	    && use_stmt != use->stmt)
+	  return;
+    }
+
   if (TREE_CODE (gimple_assign_lhs (stmt)) != SSA_NAME)
     return;
 
-- 
2.27.0.90.geebb51ba8c



-- 
Thanks,
Xionghu

