Bug 107090 - [aarch64] sequence logic should be combined with mul and umulh
Summary: [aarch64] sequence logic should be combined with mul and umulh
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 13.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 103216
Blocks:
 
Reported: 2022-09-29 17:51 UTC by vfdff
Modified: 2022-10-29 01:54 UTC

See Also:
Host:
Target: aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-09-29 00:00:00


Attachments
Add A ? B + CST : B match and simplify optimizations (737 bytes, patch)
2022-10-10 04:22 UTC, vfdff
[PHIOPT] Add A ? B + CST : B match and simplify optimizations (829 bytes, patch)
2022-10-10 09:49 UTC, vfdff
the huge bb slightly changes after matching ResLo (36.40 KB, image/png)
2022-10-13 02:57 UTC, vfdff
has different operand order based on different commit nodes (83.91 KB, image/jpeg)
2022-10-29 01:54 UTC, vfdff

Description vfdff 2022-09-29 17:51:35 UTC
* test case: https://godbolt.org/z/x5jMhqW8s
```
#  define BN_BITS4        32
#  define BN_MASK2        (0xffffffffffffffffL)
#  define BN_MASK2l       (0xffffffffL)
#  define BN_MASK2h       (0xffffffff00000000L)
#  define BN_MASK2h1      (0xffffffff80000000L)
#  define LBITS(a)        ((a)&BN_MASK2l)
#  define HBITS(a)        (((a)>>BN_BITS4)&BN_MASK2l)
#  define L2HBITS(a)      (((a)<<BN_BITS4)&BN_MASK2)

void mul64(unsigned long in0, unsigned long in1,
           unsigned long &l, unsigned long &h) {
    unsigned long m, m1, lt, ht, bl, bh;
    lt = LBITS(in0);
    ht = HBITS(in0);
    bl = LBITS(in1);
    bh = HBITS(in1);
    m  = bh * lt;
    lt = bl * lt;
    m1 = bl * ht;
    ht = bh * ht;
    m  = (m + m1) & BN_MASK2;
    if (m < m1) ht += L2HBITS((unsigned long)1);
    ht += HBITS(m);
    m1 = L2HBITS(m);
    lt = (lt + m1) & BN_MASK2; if (lt < m1) ht++;
    l  = lt;
    h  = ht;
}
```
* The above source is equivalent to a full multiply of two 64-bit integer values, so it should be folded to assembly similar to:
```
   mul   x8,x1,x0
   umulh x9,x0,x1
   str   x8,[x2]
   str   x9,[x3]
   ret
```
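For comparison, a minimal reference version using GCC's unsigned __int128 extension already compiles to exactly this mul/umulh pair on aarch64 (a sketch; the name mul64_ref is hypothetical):
```
void mul64_ref(unsigned long in0, unsigned long in1,
               unsigned long &l, unsigned long &h) {
    /* GCC expands the 128-bit product to mul (low half) + umulh (high half). */
    unsigned __int128 p = (unsigned __int128)in0 * in1;
    l = (unsigned long)p;
    h = (unsigned long)(p >> 64);
}
```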
Comment 1 Andrew Pinski 2022-09-29 21:32:35 UTC
A few issues.
First is:

  if (_26 != 0)
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 3> [local count: 536870913]:
  ht_15 = ht_13 + 4294967296;

  <bb 4> [local count: 1073741824]:
  # ht_2 = PHI <ht_13(2), ht_15(3)>

This should be done as:
tmp_ = _26 != 0
tmp1_ = (unsigned long) tmp_
tmp2_ = tmp1_ << 32;
ht_2 = ht_13 + tmp2_;
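
In source terms, the rewrite turns the branch into a branchless add (a sketch using the testcase's variable names):
```
ht += (unsigned long)(m < m1) << 32;  /* replaces: if (m < m1) ht += L2HBITS(1); */
```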

And then there is the huge pattern matching needed with respect to doing a widening multiply here.
Comment 2 vfdff 2022-10-01 23:42:29 UTC
Thanks for your suggestion.

Since the combine pass can't handle sequences of more than 4 insns, which pass would be more suitable for matching the huge pattern after the 1st issue is fixed?
Comment 3 Andrew Pinski 2022-10-01 23:51:47 UTC
(In reply to vfdff from comment #2)
> Thanks for your suggestion.
> 
> Since the combine pass can't handle sequences of more than 4 insns, which
> pass would be more suitable for matching the huge pattern after the 1st
> issue is fixed?

match.pd can handle more than 4 gimple statements when doing the matching.
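For instance, the A ? B + CST : B shape from the 1st issue spans a condition, an add, and a PHI; a match.pd sketch for it (illustrative only, not necessarily the committed form) could look like:
```
/* A ? B + CST : B  -->  B + (A ? CST : 0); sketch only.  */
(simplify
 (cond @0 (plus @1 INTEGER_CST@2) @1)
 (plus @1 (cond @0 @2 { build_zero_cst (type); })))
```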
Comment 4 vfdff 2022-10-07 09:44:52 UTC
(In reply to Andrew Pinski from comment #1)
> A few issues.
> First is:
> 
>   if (_26 != 0)
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 4>; [50.00%]
> 
>   <bb 3> [local count: 536870913]:
>   ht_15 = ht_13 + 4294967296;
> 
>   <bb 4> [local count: 1073741824]:
>   # ht_2 = PHI <ht_13(2), ht_15(3)>
> 

For the 1st issue, I see GCC handled this well before GCC 8 with if-conversion: https://godbolt.org/z/99a9e59Ge
Comment 5 vfdff 2022-10-10 04:22:50 UTC
Created attachment 53684 [details]
Add A ? B + CST : B match and simplify optimizations

Fix the 1st issue of the pattern match
Comment 6 Andrew Pinski 2022-10-10 04:30:09 UTC
(In reply to vfdff from comment #5)
> Created attachment 53684 [details]
> Add A ? B + CST : B match and simplify optimizations
> 
> Fix the 1st issue of the pattern match

There is a generic way of implementing this which I had posted at https://gcc.gnu.org/pipermail/gcc-patches/2021-November/584411.html (you can finish up that patch if you want).
Comment 7 vfdff 2022-10-10 09:49:42 UTC
Created attachment 53685 [details]
[PHIOPT] Add A ? B + CST : B match and simplify optimizations

Thank you for your guidance. I referred to your linked patch to cover the various types of operations.
Due to an error when building (op @1 (cond^ @0 @2 { build_zero_cst (type); })), I have not incorporated the modifications into your patch at this time. With your consent, I would like to merge these modifications separately and then finish up that patch after asking for further information.
Comment 8 vfdff 2022-10-12 13:36:59 UTC
hi @Andrew Pinski
  For the 2nd issue, I also matched the huge pattern, but it needs to return two values, which does not seem to work with the current framework. Should I split it into two simpler patterns and match the high and low values, ResHi and ResLo, separately?
```
 (i64 ResLo, i64 ResHi) = Mul64(i64 In0, i64 In1) {
    In0Lo = In0(D) & 4294967295;
    In0Hi = In0(D) >> 32;
    In1Lo = In1(D) & 4294967295;
    In1Hi = In1(D) >> 32;
    Mull_01 = In0Hi * In1Lo;
    Addc = In0Lo * In1Hi + Mull_01;
    addc32 = Addc << 32;
    ResLo = In0Lo * In1Lo + addc32;
    ResHi = ((long unsigned int) (addc32 > ResLo)) + In0Hi * In1Hi +
	     (((long unsigned int) (Mull_01 > Addc)) << 32) + (Addc >> 32);
 }
```
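If the pattern is split, each single-result target is small (a sketch; the function names are hypothetical, and the assumption is that ResHi maps to the high-part multiply that aarch64 expands to umulh):
```
unsigned long res_lo(unsigned long in0, unsigned long in1) {
    return in0 * in1;  /* the whole ResLo chain folds to a plain 64-bit mul */
}
unsigned long res_hi(unsigned long in0, unsigned long in1) {
    /* high half of the 64x64->128 product, i.e. umulh on aarch64 */
    return (unsigned long)(((unsigned __int128)in0 * in1) >> 64);
}
```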
Comment 9 Andrew Pinski 2022-10-12 21:33:05 UTC
Look at how ctz_table_index is done and used.
The matching is done in the match.pd language, and then it is used inside simplify_count_trailing_zeroes (tree-ssa-forwprop.cc).

nop_atomic_bit_test_and_p is another example, but that one is more complex and is used inside tree-ssa-ccp.cc.
Comment 10 vfdff 2022-10-13 02:57:50 UTC
Created attachment 53698 [details]
the huge bb slightly changes after matching ResLo

Thanks for your suggestion. I think both ctz_table_index and nop_atomic_bit_test_and_p return only one value, so I'll try to match ResHi and ResLo separately, as the bb changes only slightly after ResLo is matched first.
Comment 11 vfdff 2022-10-29 01:54:05 UTC
Created attachment 53787 [details]
has different operand order based on different commit nodes

hi @Andrew Pinski

* As shown in the attached figure swap_order.jpg, we can add the :c flag to the plus node m_13 to match its commutated form, per https://gcc.gnu.org/onlinedocs/gccint/The-Language.html.

And for the plus node _24, is there a similar flag to simplify the matching?
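
For reference, a minimal sketch of the :c flag from that page (an illustrative pattern, not the one needed for m_13/_24):
```
/* Illustrative sketch: :c makes the matcher try both operand orders,
   so this also matches y*x + x without writing a second pattern.  */
(simplify
 (plus:c @0 (mult:c @0 @1))
 (mult @0 (plus @1 { build_one_cst (type); })))
```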