Bug 82405 - Function not inlined for switch and suboptimal assembly is generated
Summary: Function not inlined for switch and suboptimal assembly is generated
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa
Version: 8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-10-02 13:51 UTC by Antony Polukhin
Modified: 2022-05-27 08:05 UTC
4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-10-03 00:00:00


Attachments

Description Antony Polukhin 2017-10-02 13:51:05 UTC
The following code

enum class eShape { eSquare, eCircle, eShpere, eTetraeder };

static inline double eval_square(double r) { return 4 * r*r; }
static inline double eval_circle(double r) { return 3.1415 * r * r; }
static inline double eval_sphere(double r) { return 4 * 3.1415 * r * r; }
static inline double eval_tetraeder(double r) { return 24 * 1.73205 * r*r; }

double test_switch_native(eShape shape, double r) {
    switch(shape) {
    case eShape::eSquare:    return eval_square(r);
    case eShape::eCircle:    return eval_circle(r);
    case eShape::eShpere:    return eval_sphere(r);
    case eShape::eTetraeder: return eval_tetraeder(r);
    }
}

produces very long, suboptimal assembly output with -O2 and -O3.

Clang inlines the functions and merges the common `r*r` part:

test_switch_native(eShape, double):         # @test_switch_native(eShape, double)
        movsxd  rax, edi
        movsd   xmm1, qword ptr [8*rax + .Lswitch.table._Z18test_switch_native6eShaped] # xmm1 = mem[0],zero
        mulsd   xmm1, xmm0
        mulsd   xmm0, xmm1
        ret
.Lswitch.table._Z18test_switch_native6eShaped:
        .quad   4616189618054758400     # double 4
        .quad   4614256447914709615     # double 3.1415000000000002
        .quad   4623263647169450607     # double 12.566000000000001
        .quad   4631047162110439693     # double 41.569200000000002
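The transformation clang performs here is equivalent to this hand-written lookup-table version (a sketch; the table name is illustrative, and the entries are the folded per-case constants):

```cpp
enum class eShape { eSquare, eCircle, eShpere, eTetraeder };

// Per-shape coefficient table; each entry folds the constants from the
// corresponding eval_* helper (constant-first order is preserved below).
static const double kCoeff[4] = {4.0, 3.1415, 4 * 3.1415, 24 * 1.73205};

double test_switch_table(eShape shape, double r) {
    // coeff * r * r binds left to right: (coeff * r) * r, which is
    // exactly the two mulsd instructions in clang's output above.
    return kCoeff[static_cast<int>(shape)] * r * r;
}
```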
Comment 1 Martin Liška 2017-10-03 08:03:11 UTC
Confirmed; we also inline the functions (-fdump-ipa-inline):

  <bb 2> [100.00%]:
  switch (shape_2(D)) <default: <L4> [20.00%], case 0: <L6> [20.00%], case 1: <L7> [20.00%], case 2: <L8> [20.00%], case 3: <L9> [20.00%]>

<L6> [20.00%]:
  _5 = r_4(D) * 4.0e+0;
  _6 = r_4(D) * _5;
  goto <bb 8>; [100.00%]

<L7> [20.00%]:
  _7 = r_4(D) * 3.141500000000000181188397618825547397136688232421875e+0;
  _8 = r_4(D) * _7;
  goto <bb 8>; [100.00%]

<L8> [20.00%]:
  _9 = r_4(D) * 1.25660000000000007247535904753021895885467529296875e+1;
  _10 = r_4(D) * _9;
  goto <bb 8>; [100.00%]

<L9> [20.00%]:
  _11 = r_4(D) * 4.156920000000000214868123293854296207427978515625e+1;
  _12 = r_4(D) * _11;
  goto <bb 8>; [100.00%]

<L4> [20.00%]:
  return;

  <bb 8> [80.00%]:
  # _1 = PHI <_6(3), _8(4), _10(5), _12(6)>
  return _1;

There are probably two issues here:

1) clang knows that the switch handles all enum values, thus the default branch is removed:

$ clang++ -O2 pr82405.c -S -o/dev/stdout
_Z18test_switch_native6eShaped:         # @_Z18test_switch_native6eShaped
	.cfi_startproc
# BB#0:
	movslq	%edi, %rax
	movsd	.Lswitch.table(,%rax,8), %xmm1 # xmm1 = mem[0],zero
...
.Lswitch.table:
	.quad	4616189618054758400     # double 4
	.quad	4614256447914709615     # double 3.1415000000000002
	.quad	4623263647169450607     # double 12.566000000000001
	.quad	4631047162110439693     # double 41.569200000000002
	.size	.Lswitch.table, 32

Having a simpler test-case:
$ cat pr82405-2.c
enum class eShape { eSquare, eCircle, eShpere, eTetraeder };

double test_switch_native(eShape shape, double r) {
    switch(shape) {
    case eShape::eSquare:    return 2;
    case eShape::eCircle:    return 3;
    case eShape::eShpere:    return 4;
    case eShape::eTetraeder: return 5;
    }
}

we're unable to process switch conversion because:
beginning to process the following SWITCH statement (pr82405-2.c:4) : ------- 
switch (shape_2(D)) <default: <L4> [0.00%], case 0: <L6> [0.00%], case 1: <L1> [0.00%], case 2: <L2> [0.00%], case 3: <L3> [0.00%]>
Bailing out - no common successor to all case label target blocks found

I know I discussed this with Jakub a bit: one can use the | operator on enum values, but I still believe we should optimize switch
statements where all enum values are handled. Thoughts?

2) predictive commoning is not triggered for a similar reason (there's still a default edge not using r * r).
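A known source-level workaround for 1) is to tell the compiler the default path is unreachable (using the GCC/Clang builtin; this assumes callers never pass an out-of-enum value):

```cpp
enum class eShape { eSquare, eCircle, eShpere, eTetraeder };

double test_switch_workaround(eShape shape, double r) {
    switch (shape) {
    case eShape::eSquare:    return 4 * r * r;
    case eShape::eCircle:    return 3.1415 * r * r;
    case eShape::eShpere:    return 4 * 3.1415 * r * r;
    case eShape::eTetraeder: return 24 * 1.73205 * r * r;
    }
    // Assumption: callers never pass an out-of-range value; this removes
    // the default edge that blocks switch conversion.
    __builtin_unreachable();
}
```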

Thanks.
Comment 2 Martin Liška 2017-10-03 08:04:32 UTC
> Having a simpler test-case:
> $ cat pr82405-2.c
> enum class eShape { eSquare, eCircle, eShpere, eTetraeder };
> 
> double test_switch_native(eShape shape, double r) {
>     switch(shape) {
>     case eShape::eSquare:    return 2;
>     case eShape::eCircle:    return 3;
>     case eShape::eShpere:    return 4;
>     case eShape::eTetraeder: return 5;
>     }
> }

It's PR82404.
Comment 3 Martin Liška 2017-10-03 14:50:25 UTC
Isolated test-case where we do not reassociate expressions:

$ cat pr82405-reduced2.c
static inline double eval_square(double r) { return r * r * 4; }
static inline double eval_circle(double r) { return r * r * 3.1415; }

static inline double eval_square2(double r) { return 4 * r * r; }
static inline double eval_circle2(double r) { return 3.1415 * r * r; }

double test_switch_native_slow(int shape, double r) {
    if (shape == 123)
      return eval_circle2(r);
    else
      return eval_square2(r);
}

double test_switch_native_fast(int shape, double r) {
    if (shape == 123)
      return eval_circle(r);
    else
      return eval_square(r);
}

=================
g++  pr82405-reduced2.c -std=c++11 -O2 -fdump-tree-optimized=/dev/stdout

double test_switch_native_slow(int, double) (int shape, double r)
{
  double _1;
  double _5;
  double _6;
  double _7;
  double _8;

  <bb 2> [100.00%]:
  if (shape_2(D) == 123)
    goto <bb 3>; [30.50%]
  else
    goto <bb 4>; [69.50%]

  <bb 3> [30.50%]:
  _5 = r_4(D) * 3.141500000000000181188397618825547397136688232421875e+0;
  _6 = r_4(D) * _5;
  goto <bb 5>; [100.00%]

  <bb 4> [69.50%]:
  _7 = r_4(D) * 4.0e+0;
  _8 = r_4(D) * _7;

  <bb 5> [100.00%]:
  # _1 = PHI <_6(3), _8(4)>
  return _1;

}

double test_switch_native_fast(int, double) (int shape, double r)
{
  double _1;
  double _6;
  double _8;
  double _9;

  <bb 2> [100.00%]:
  _9 = r_4(D) * r_4(D);
  if (shape_2(D) == 123)
    goto <bb 3>; [30.50%]
  else
    goto <bb 4>; [69.50%]

  <bb 3> [30.50%]:
  _6 = _9 * 3.141500000000000181188397618825547397136688232421875e+0;
  goto <bb 5>; [100.00%]

  <bb 4> [69.50%]:
  _8 = _9 * 4.0e+0;

  <bb 5> [100.00%]:
  # _1 = PHI <_6(3), _8(4)>
  return _1;

}
Comment 4 Andrew Pinski 2017-10-03 14:56:53 UTC
> Isolated test-case where we do not reassociate expressions:

And I don't think you can reassociate here validly unless -ffast-math.
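IEEE multiplication is commutative but not associative, so (c * r) * r and c * (r * r) can differ. An extreme, overflow-based illustration (ordinary rounding differences arise too; the function names are just for this sketch):

```cpp
#include <cmath>

// Same operands, different grouping, different result: for x = 1e308
// the first product below overflows to +inf, while in the other
// grouping 10.0 * 0.1 rounds to exactly 1.0 and the result is finite.
inline double left_grouped(double x)  { return (x * 10.0) * 0.1; }
inline double right_grouped(double x) { return x * (10.0 * 0.1); }
```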
Comment 5 Martin Liška 2017-10-03 15:02:22 UTC
(In reply to Andrew Pinski from comment #4)
> > Isolated test-case where we do not reassociate expressions:
> 
> And I don't think you can reassociate here validly unless -ffast-math.

Yep, you're right. With -ffast-math we do the reassociation.
Clang does that w/ -O2:

$ ~/bin/llvm/bin/clang++ -v
clang version 6.0.0 (http://llvm.org/git/clang.git 80f50978299e20c5b2e444e9a4fc06bfaf0183cc) (http://llvm.org/git/llvm.git a29bdba93eaec8e2cf820532c02261ef93ba82b5)
Target: x86_64-unknown-linux-gnu

$ ~/bin/llvm/bin/clang++ -B.  pr82405-reduced2.c -std=c++11 -O2 -o /dev/stdout -S
...
.LCPI0_0:
	.quad	4616189618054758400     # double 4
	.quad	4614256447914709615     # double 3.1415000000000002
	.text
	.globl	_Z23test_switch_native_slowid
	.p2align	4, 0x90
	.type	_Z23test_switch_native_slowid,@function
_Z23test_switch_native_slowid:          # @_Z23test_switch_native_slowid
	.cfi_startproc
# BB#0:                                 # %entry
	xorl	%eax, %eax
	cmpl	$123, %edi
	sete	%al
	movsd	.LCPI0_0(,%rax,8), %xmm1 # xmm1 = mem[0],zero
	mulsd	%xmm0, %xmm1
	mulsd	%xmm1, %xmm0
	retq

...

The reporter says the same.
Comment 6 Antony Polukhin 2017-10-03 15:48:22 UTC
> And I don't think you can reassociate here validly unless -ffast-math.

But you can. In the isolated test case, instead of computing r*r first, just move the constant into xmm1 first and then multiply it by r twice. Clang does that:

.LCPI0_0:
        .quad   4616189618054758400     # double 4
        .quad   4614256447914709615     # double 3.1415000000000002
test_switch_native_slow(int, double):          # @test_switch_native_slow(int, double)
        xor     eax, eax
        cmp     edi, 123
        sete    al
        movsd   xmm1, qword ptr [8*rax + .LCPI0_0] # xmm1 = mem[0],zero
        mulsd   xmm1, xmm0
        mulsd   xmm0, xmm1
        ret

Moreover, the current approach "multiply r by itself and then multiply by the constant" changes the observable behavior; such an optimization could be enabled only with -ffast-math.
Comment 7 Andrew Pinski 2017-10-03 15:52:39 UTC
Multiply binds left to right.

That is, a * b * c is the same as (a * b) * c and not the same as a * (b * c).
Comment 8 Antony Polukhin 2017-10-03 16:07:22 UTC
Yes, in the isolated test case constants are first:

static inline double eval_square2(double r) { return 4 * r * r; }
static inline double eval_circle2(double r) { return 3.1415 * r * r; }

double test_switch_native_slow(int shape, double r) {
    if (shape == 123)
      return eval_circle2(r);
    else
      return eval_square2(r);
}

So computing r * r first is not valid.
Comment 9 Jakub Jelinek 2017-10-03 16:52:43 UTC
So it seems LLVM performs a more advanced switch conversion in this case, at least judging from the #c0 assembly snippet.  We look solely for PHIs whose arguments are SSA_NAMEs initialized to constants in the cases, while in order to optimize this without -ffast-math, we would need to handle at least a couple of stmts where the operands match except for some changing constant argument.

<L6> [20.00%]:
  _5 = r_4(D) * 4.0e+0;
  _6 = r_4(D) * _5;
  goto <bb 8>; [100.00%]

<L7> [20.00%]:
  _7 = r_4(D) * 3.141500000000000181188397618825547397136688232421875e+0;
  _8 = r_4(D) * _7;
  goto <bb 8>; [100.00%]

So, in the above case we'd look from the PHI with _6 and _8 arguments; because the def stmt isn't an assignment from a constant, we'd notice it is a binary (or unary or ternary) assign where one of the operands is identical (r_4(D)), while the other one is another SSA_NAME defined in the case, and we'd loop to that, seeing another assign where one operand is the same and the other one is a constant.  Thus, we'd build a table with the 4.0e+0 and 3.141500000000000181188397618825547397136688232421875e+0 constants, and after the load from the table do _21 = r_4(D) * value_loaded_from_table_20; _22 = r_4(D) * _21;
The question is if we'd require all operands matching except for one which could be a constant eventually, or something different (allow some small number of constant arguments to a computation).

Or should we have a separate pass that performs such an optimization (noticing similar code blocks with just changing constant parameters and merge the blocks except for computing the parameters)?
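A source-level sketch of the transformation described above (the table and function names are hypothetical; the entries are the folded per-case constants from the GIMPLE dump):

```cpp
// Table the extended switch conversion would generate.
static const double table[4] = {4.0, 3.1415, 4 * 3.1415, 24 * 1.73205};

double test_switch_converted(int shape, double r) {
    double c = table[shape];  // value_loaded_from_table_20
    double t = r * c;         // _21 = r_4(D) * value_loaded_from_table_20
    return r * t;             // _22 = r_4(D) * _21
}
```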
Comment 10 rguenther@suse.de 2017-10-04 12:17:03 UTC
On Tue, 3 Oct 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82405
> 
> --- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> So it means maybe llvm performs more advanced switchconv in this case, at least
> judging from the #c0 assembly snippet.  We look solely for PHIs which have
> arguments SSA_NAMEs initialized in the cases to constants, while in order to
> optimize this without -ffast-math, it would need to handle at least a couple of
> stmts where the operands match except for some constant argument that is
> changing.
> 
> <L6> [20.00%]:
>   _5 = r_4(D) * 4.0e+0;
>   _6 = r_4(D) * _5;
>   goto <bb 8>; [100.00%]
> 
> <L7> [20.00%]:
>   _7 = r_4(D) * 3.141500000000000181188397618825547397136688232421875e+0;
>   _8 = r_4(D) * _7;
>   goto <bb 8>; [100.00%]
> 
> So, in the above case we'd look from the PHI with _6 and _8 arguments, and see
> that the because the def stmt isn't assignment from constant, we'd notice it is
> a binary (or unary or ternary) assign where one of the operands is identical
> (r_4(D), while the other one is another SSA_NAME defined in the case, and we'd
> loop to that, seeing another assign where one operand is the same and another
> one is a constant.  Thus, we'd build a table with the 4.0e+0 and
> 3.141500000000000181188397618825547397136688232421875e+0 constants, and after
> the load from the table did _21 = r_4(D) * value_loaded_from_table_20; _22 =
> r_4(D) * _21;
> The question is if we'd require all operands matching except for one which
> could be a constant eventually, or something different (allow some small number
> of constant arguments to a computation).
> 
> Or should we have a separate pass that performs such an optimization (noticing
> similar code blocks with just changing constant parameters and merge the blocks
> except for computing the parameters)?

Looks like something for tail-merging / code hoisting?  Of course we need to
reassoc first (if valid).  If the whole thing were if()s, it may be that
PRE would already catch it.
Comment 11 Andrew Pinski 2017-10-04 16:29:58 UTC
Thinking about this some more, this is not a hoisting problem but a sinking problem.

Basically we have:
int f(void);
int h(void);

int g(int a)
{
  if (a)
    return f() + 10;
  return h() + 10;
}

Which should be converted into:
int g1(int a)
{
  int t;
  if (a)
    t = f();
  else
    t = h();
  return t + 10;
}

We handle hoisting just fine, just not sinking of common code.

So we don't need reassoc in the original testcase if we do sinking.
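Hand-applying that sinking to the reduced two-branch testcase from comment #3 gives something like this (illustrative; only the constant selection stays per-branch):

```cpp
// After sinking, each branch merely selects a constant; the shared
// multiply chain r * (r * c) is emitted once, so no FP reassociation
// is needed and semantics are preserved.
double test_sunk(int shape, double r) {
    double c = (shape == 123) ? 3.1415 : 4.0;  // only the constant differs
    double t = r * c;
    return r * t;
}
```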
Comment 12 Jakub Jelinek 2019-05-03 09:14:39 UTC
GCC 9.1 has been released.
Comment 13 Jakub Jelinek 2019-08-12 08:54:03 UTC
GCC 9.2 has been released.
Comment 14 Jakub Jelinek 2020-03-12 11:58:53 UTC
GCC 9.3.0 has been released, adjusting target milestone.
Comment 15 Richard Biener 2021-06-01 08:09:23 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.