Bug 105253 - [13 Regression] __popcountdi2 calls generated in kernel code with gcc12
Summary: [13 Regression] __popcountdi2 calls generated in kernel code with gcc12
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 12.0
: P1 normal
Target Milestone: 12.0
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2022-04-13 05:28 UTC by Khem Raj
Modified: 2022-11-07 18:01 UTC (History)
2 users (show)

See Also:
Host: x86_64-linux
Target:
Build: x86_64-linux
Known to work: 11.2.0
Known to fail: 12.0
Last reconfirmed: 2022-04-13 00:00:00


Attachments
test case (582.36 KB, application/x-xz)
2022-04-13 05:28 UTC, Khem Raj
Details
gcc12-pr105253.patch (1.08 KB, patch)
2022-04-13 09:52 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Khem Raj 2022-04-13 05:28:23 UTC
Created attachment 52794 [details]
test case

gcc 12 when using -march=core2 generates __popcountdi2 calls which go unresolved in kernel and build fails. When I use -march=corei7 then the calls are optimized away.
Interestingly, gcc-11 did not emit these calls even with -march=core2,

a similar error is seen when compiling kernel for arm

arm-yoe-linux-gnueabi-ld.bfd: arch/arm/probes/kprobes/actions-common.o: in function `simulate_ldm1stm1':
actions-common.c:(.kprobes.text+0x74): undefined reference to `__popcountsi2'

Attached testcase demonstrates the behaviour on x86_64. Is it expected behavior in gcc 12 ?
Comment 1 Andrew Pinski 2022-04-13 05:37:48 UTC
Two things. First the error message is for arm target but you then reference x86_64 target options.

Second there is a loop for popcount which gets translated but that was guarded by a check for optab for popcount existing but maybe that changed.
Comment 2 Andrew Pinski 2022-04-13 05:40:34 UTC
https://bugzilla.kernel.org/show_bug.cgi?id=200671

Looks like a bug was filed in the kernel for this but no action.

Is this with -Os?
Comment 3 Khem Raj 2022-04-13 05:47:03 UTC
(In reply to Andrew Pinski from comment #1)
> Two things. First the error message is for arm target but you then reference
> x86_64 target options.
> 
> Second there is a loop for popcount which gets translated but that was
> guarded by a check for optab for popcount existing but maybe that changed.

yes the issue is seen on both arm and x86_64, So test case is from x86_64 but the error message is from arm build. I can reduce that too tomorrow. Its with -O2, I have a shell script in tarball which has all the options used to compile.
Comment 4 Andrew Pinski 2022-04-13 06:40:35 UTC
With 12.0.1 20220213, I didn't get any popcountdi2 calls.
Comment 5 Richard Biener 2022-04-13 07:00:28 UTC
We're not checking for an optab here (that's on purpose IIRC).  The code using the niter analysis result is responsible to do cost checking.  The builtin
is recognized via number_of_iterations_popcount.

If you use GCC you have to link against libgcc.a at least.
Comment 6 Jakub Jelinek 2022-04-13 08:52:15 UTC
Note, there is some code to not do this if we'd not expand the builtin inline
since r9-4030-g06a6b46a16f9287a98aa , so I wonder what changed.
Comment 7 Richard Biener 2022-04-13 09:15:38 UTC
(In reply to Jakub Jelinek from comment #6)
> Note, there is some code to not do this if we'd not expand the builtin inline
> since r9-4030-g06a6b46a16f9287a98aa , so I wonder what changed.

That only applies to final value replacement - the niter result might be used by other passes.
Comment 8 Jakub Jelinek 2022-04-13 09:27:34 UTC
Reduced to just -O2:
int
foo (unsigned long long *p)
{
  int i, cnt = 0;
  unsigned long long elem;
  for (i = 0; i < (256 / 64); i++)
    {
      elem = p[i];
      for (; elem; cnt++)
        elem &= elem - 1;
    }
  return cnt;
}
Started with r12-8041-g973a2ce71f8dab559fbbfc34b59e39e047df74a6
Wonder how that affects the testcase...
Comment 9 Jakub Jelinek 2022-04-13 09:39:00 UTC
Ah, I see, expression_expensive_p is called, but number_of_iterations_popcount when creating a CALL_EXPR only cares about TYPE_PRECISION and happily creates
__builtin_popcountl with unsigned long long argument because it has the same precision.  tree_builtin_call_types_compatible_p says it is not valid builtin call and so we don't consider it expensive.

Either we make number_of_iterations_popcount more accurate on the types, or better in tree_builtin_call_types_compatible_p use something like:
  bool in_gimple_form = (cfun && (cfun->curr_properties & PROP_gimple)) != 0;
and if in_gimple_form, use useless_type_conversion_p instead of the TYPE_MAIN_VARIANT / tree_nop_conversion_p checks.
That is my strong preference...

What do you think?
Comment 10 Jakub Jelinek 2022-04-13 09:52:15 UTC
Created attachment 52796 [details]
gcc12-pr105253.patch

Now in (untested) patch form.
Note, the kernel still should provide __popcount[sd]i2, but in this case the intent was that we don't emit calls and only hw instructions.
Comment 11 Jakub Jelinek 2022-04-13 10:01:33 UTC
This is a regression (as in we don't want such a call in that case for performance reasons and in the past we didn't emit it).
Comment 12 Richard Biener 2022-04-13 12:22:48 UTC
(In reply to Jakub Jelinek from comment #9)
> Ah, I see, expression_expensive_p is called, but
> number_of_iterations_popcount when creating a CALL_EXPR only cares about
> TYPE_PRECISION and happily creates
> __builtin_popcountl with unsigned long long argument because it has the same
> precision.  tree_builtin_call_types_compatible_p says it is not valid
> builtin call and so we don't consider it expensive.

Shouldn't it run into

      if (!is_inexpensive_builtin (get_callee_fndecl (expr)))
        return true;

then?  Or does that simply consider all popcount as inexpensive?
Yes it does.

I think we should change the above to

diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc
index 44157265ce8..b53d7aaa71d 100644
--- a/gcc/tree-scalar-evolution.cc
+++ b/gcc/tree-scalar-evolution.cc
@@ -3420,12 +3420,15 @@ expression_expensive_p (tree expr, hash_map<tree, uint64_t> &cache,
                  break;
              return true;
            }
+         break;
+
        default:
+         if (cfn == CFN_LAST
+             || !is_inexpensive_builtin (get_callee_fndecl (expr)))
+           return true;
          break;
        }
 
-      if (!is_inexpensive_builtin (get_callee_fndecl (expr)))
-       return true;
       FOR_EACH_CALL_EXPR_ARG (arg, iter, expr)
        if (expression_expensive_p (arg, cache, op_cost))
          return true;

> Either we make number_of_iterations_popcount more accurate on the types, or
> better in tree_builtin_call_types_compatible_p use something like:
>   bool in_gimple_form = (cfun && (cfun->curr_properties & PROP_gimple)) != 0;
> and if in_gimple_form, use useless_type_conversion_p instead of the
> TYPE_MAIN_VARIANT / tree_nop_conversion_p checks.
> That is my strong preference...
> 
> What do you think?

Hmm, but still tree_nop_conversion_p would work there, no?
Comment 13 Richard Biener 2022-04-13 12:31:52 UTC
(In reply to Jakub Jelinek from comment #10)
> Created attachment 52796 [details]
> gcc12-pr105253.patch
> 
> Now in (untested) patch form.
> Note, the kernel still should provide __popcount[sd]i2, but in this case the
> intent was that we don't emit calls and only hw instructions.

Meh.  Adding (useless) conversions in niter analysis to make the predicate
happy would be against what we do everywhere else where we simply rewrite
GIMPLE back to GENERIC without considering GENERICs stronger type system :/
Comment 14 GCC Commits 2022-04-13 13:45:34 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:29c46490de4616b911fccb34a9479f768fb51e94

commit r12-8141-g29c46490de4616b911fccb34a9479f768fb51e94
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Wed Apr 13 15:44:51 2022 +0200

    tree.cc: Use useless_type_conversion_p in tree_builtin_call_types_compatible_p while in gimple form [PR105253]
    
    tree_builtin_call_types_compatible_p uses TYPE_MAIN_VARIANT comparisons
    or tree_nop_conversion_p to ensure a builtin has correct GENERIC arguments.
    Unfortunately this regressed when get_call_combined_fn is called during
    GIMPLE optimizations.  E.g. when number_of_iterations_popcount is called,
    it doesn't ensure TYPE_MAIN_VARIABLE compatible argument type, it picks
    __builtin_popcount{,l,ll} based just on types' precision and doesn't
    fold_convert the arg to the right type.  We are in GIMPLE, such conversions
    are useless...
    So, either we'd need to fix number_of_iterations_popcount to add casts
    and inspect anything else that creates CALL_EXPRs late, or we can
    in tree_builtin_call_types_compatible_p just use the GIMPLE type
    comparisons (useless_type_conversion_p) when we are in GIMPLE form and
    the TYPE_MAIN_VARIANT comparison or tree_nop_conversion_p test otherwise.
    
    I think especially this late in stage4 the latter seems safer to me.
    
    2022-04-13  Jakub Jelinek  <jakub@redhat.com>
    
            PR middle-end/105253
            * tree.cc (tree_builtin_call_types_compatible_p): If PROP_gimple,
            use useless_type_conversion_p checks instead of TYPE_MAIN_VARIANT
            comparisons or tree_nop_conversion_p checks.
    
            * gcc.target/i386/pr105253.c: New test.
Comment 15 Jakub Jelinek 2022-04-13 13:46:46 UTC
Fixed.
Comment 16 Hongtao.liu 2022-04-15 07:48:25 UTC
(In reply to Jakub Jelinek from comment #11)
> This is a regression (as in we don't want such a call in that case for
> performance reasons and in the past we didn't emit it).

Yes, just note it regresses 531.deepsjeng_r by 2% for -O2 -march=x86-64 since below function.

typedef unsigned long long BITBOARD;
int PopCount (BITBOARD b) {
    int c = 0;

    while (b) {
        b &= b - 1;
        c++;
    }

    return c;
}

popcnt is supported by x86-64_v2.