Bug 86544 - Popcount detection generates different code on C and C++
Summary: Popcount detection generates different code on C and C++
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: unknown
: P3 normal
Target Milestone: ---
Assignee: kugan
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-07-17 09:22 UTC by ktkachov
Modified: 2018-07-19 11:20 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-07-18 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description ktkachov 2018-07-17 09:22:17 UTC
Great to see that GCC now detects the popcount loop in PR 82479!
I am seeing some curious differences between gcc and g++ though.
int
pc (unsigned long long b)
{
    int c = 0;

    while (b) {
        b &= b - 1;
        c++;
    }

    return c;
}

If compiled with gcc -O3 on aarch64 this gives:
pc:
        fmov    d0, x0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        umov    w0, v0.b[0]
        ret

whereas if compiled with g++ -O3 it gives:
_Z2pcy:
.LFB0:
        .cfi_startproc
        fmov    d0, x0
        cmp     x0, 0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        umov    w0, v0.b[0]
        and     x0, x0, 255
        csel    w0, w0, wzr, ne
        ret

which is suboptimal. It seems that phiopt3 manages to optimise the C version better. The GIMPLE dumps just before the phiopt pass are:
For the C (good version):

  int c;
  int _7;

  <bb 2> [local count: 118111601]:
  if (b_4(D) != 0)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 3> [local count: 105119324]:
  _7 = __builtin_popcountl (b_4(D));

  <bb 4> [local count: 118111601]:
  # c_12 = PHI <0(2), _7(3)>
  return c_12;


For the C++ (bad version):

  int c;
  int _7;

  <bb 2> [local count: 118111601]:
  if (b_4(D) == 0)
    goto <bb 4>; [11.00%]
  else
    goto <bb 3>; [89.00%]

  <bb 3> [local count: 105119324]:
  _7 = __builtin_popcountl (b_4(D));

  <bb 4> [local count: 118111601]:
  # c_12 = PHI <0(2), _7(3)>
  return c_12;

As you can see the order of the gotos and the jump conditions is inverted.

It seems to me that the two are equivalent and GCC could be doing a better job of optimising.

Can we improve phiopt to handle this more effectively?
Comment 1 kugan 2018-07-17 09:30:07 UTC
(In reply to ktkachov from comment #0)
> Great to see that GCC now detects the popcount loop in PR 82479!
> I am seeing some curious differences between gcc and g++ though.
> int
> pc (unsigned long long b)
> {
>     int c = 0;
> 
>     while (b) {
>         b &= b - 1;
>         c++;
>     }
> 
>     return c;
> }
> 
> If compiled with gcc -O3 on aarch64 this gives:
> pc:
>         fmov    d0, x0
>         cnt     v0.8b, v0.8b
>         addv    b0, v0.8b
>         umov    w0, v0.b[0]
>         ret
> 
> whereas if compiled with g++ -O3 it gives:
> _Z2pcy:
> .LFB0:
>         .cfi_startproc
>         fmov    d0, x0
>         cmp     x0, 0
>         cnt     v0.8b, v0.8b
>         addv    b0, v0.8b
>         umov    w0, v0.b[0]
>         and     x0, x0, 255
>         csel    w0, w0, wzr, ne
>         ret
> 
> which is suboptimal. It seems that phiopt3 manages to optimise the C version
> better. The GIMPLE dumps just before the phiopt pass are:
> For the C (good version):
> 
>   int c;
>   int _7;
> 
>   <bb 2> [local count: 118111601]:
>   if (b_4(D) != 0)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 3> [local count: 105119324]:
>   _7 = __builtin_popcountl (b_4(D));
> 
>   <bb 4> [local count: 118111601]:
>   # c_12 = PHI <0(2), _7(3)>
>   return c_12;
> 
> 
> For the C++ (bad version):
> 
>   int c;
>   int _7;
> 
>   <bb 2> [local count: 118111601]:
>   if (b_4(D) == 0)
>     goto <bb 4>; [11.00%]
>   else
>     goto <bb 3>; [89.00%]
> 
>   <bb 3> [local count: 105119324]:
>   _7 = __builtin_popcountl (b_4(D));
> 
>   <bb 4> [local count: 118111601]:
>   # c_12 = PHI <0(2), _7(3)>
>   return c_12;
> 
> As you can see the order of the gotos and the jump conditions is inverted.
> 
> It seems to me that the two are equivalent and GCC could be doing a better
> job of optimising.
> 
> Can we improve phiopt to handle this more effectively?

Thanks for the test case. I will look at it.
Comment 2 kugan 2018-07-18 02:22:18 UTC
Patch posted at https://gcc.gnu.org/ml/gcc-patches/2018-07/msg00975.html
Comment 3 ktkachov 2018-07-18 09:01:50 UTC
(In reply to kugan from comment #2)
> Patch posted at https://gcc.gnu.org/ml/gcc-patches/2018-07/msg00975.html

Thanks for picking this up.
Marking as assign. Can you please add yourself to the assignee field?
Comment 4 kugan 2018-07-18 22:11:56 UTC
Author: kugan
Date: Wed Jul 18 22:11:24 2018
New Revision: 262864

URL: https://gcc.gnu.org/viewcvs?rev=262864&root=gcc&view=rev
Log:
gcc/ChangeLog:

2018-07-18  Kugan Vivekanandarajah  <kuganv@linaro.org>

	PR middle-end/86544
	* tree-ssa-phiopt.c (cond_removal_in_popcount_pattern): Handle comparision with EQ_EXPR
	in last stmt.

gcc/testsuite/ChangeLog:

2018-07-18  Kugan Vivekanandarajah  <kuganv@linaro.org>

	PR middle-end/86544
	* g++.dg/tree-ssa/pr86544.C: New test.


Added:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr86544.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-phiopt.c
Comment 5 ktkachov 2018-07-19 11:20:47 UTC
Can confirm this is fixed now. Thanks!