The following code:

enum class eShape { eSquare, eCircle, eShpere, eTetraeder };

static inline double eval_square(double r) { return 4 * r * r; }
static inline double eval_circle(double r) { return 3.1415 * r * r; }
static inline double eval_sphere(double r) { return 4 * 3.1415 * r * r; }
static inline double eval_tetraeder(double r) { return 24 * 1.73205 * r * r; }

double test_switch_native(eShape shape, double r) {
  switch (shape) {
    case eShape::eSquare:    return eval_square(r);
    case eShape::eCircle:    return eval_circle(r);
    case eShape::eShpere:    return eval_sphere(r);
    case eShape::eTetraeder: return eval_tetraeder(r);
  }
}

produces very long, suboptimal assembly output with -O2 and -O3. Clang inlines the functions and merges the common `r*r` part:

test_switch_native(eShape, double):     # @test_switch_native(eShape, double)
        movsxd  rax, edi
        movsd   xmm1, qword ptr [8*rax + .Lswitch.table._Z18test_switch_native6eShaped] # xmm1 = mem[0],zero
        mulsd   xmm1, xmm0
        mulsd   xmm0, xmm1
        ret

.Lswitch.table._Z18test_switch_native6eShaped:
        .quad   4616189618054758400     # double 4
        .quad   4614256447914709615     # double 3.1415000000000002
        .quad   4623263647169450607     # double 12.566000000000001
        .quad   4631047162110439693     # double 41.569200000000002
Confirmed, we also inline the function (-fdump-ipa-inline):

  <bb 2> [100.00%]:
  switch (shape_2(D)) <default: <L4> [20.00%], case 0: <L6> [20.00%], case 1: <L7> [20.00%], case 2: <L8> [20.00%], case 3: <L9> [20.00%]>

  <L6> [20.00%]:
  _5 = r_4(D) * 4.0e+0;
  _6 = r_4(D) * _5;
  goto <bb 8>; [100.00%]

  <L7> [20.00%]:
  _7 = r_4(D) * 3.141500000000000181188397618825547397136688232421875e+0;
  _8 = r_4(D) * _7;
  goto <bb 8>; [100.00%]

  <L8> [20.00%]:
  _9 = r_4(D) * 1.25660000000000007247535904753021895885467529296875e+1;
  _10 = r_4(D) * _9;
  goto <bb 8>; [100.00%]

  <L9> [20.00%]:
  _11 = r_4(D) * 4.156920000000000214868123293854296207427978515625e+1;
  _12 = r_4(D) * _11;
  goto <bb 8>; [100.00%]

  <L4> [20.00%]:
  return;

  <bb 8> [80.00%]:
  # _1 = PHI <_6(3), _8(4), _10(5), _12(6)>
  return _1;

There are probably two issues I see:

1) clang knows that the switch handles all enum values, and thus the default branch is removed:

$ clang++ -O2 pr82405.c -S -o/dev/stdout
_Z18test_switch_native6eShaped:         # @_Z18test_switch_native6eShaped
        .cfi_startproc
# BB#0:
        movslq  %edi, %rax
        movsd   .Lswitch.table(,%rax,8), %xmm1 # xmm1 = mem[0],zero
...
.Lswitch.table:
        .quad   4616189618054758400     # double 4
        .quad   4614256447914709615     # double 3.1415000000000002
        .quad   4623263647169450607     # double 12.566000000000001
        .quad   4631047162110439693     # double 41.569200000000002
        .size   .Lswitch.table, 32

With a simpler test case:

$ cat pr82405-2.c
enum class eShape { eSquare, eCircle, eShpere, eTetraeder };

double test_switch_native(eShape shape, double r) {
  switch (shape) {
    case eShape::eSquare:    return 2;
    case eShape::eCircle:    return 3;
    case eShape::eShpere:    return 4;
    case eShape::eTetraeder: return 5;
  }
}

we're unable to perform switch conversion because:

beginning to process the following SWITCH statement (pr82405-2.c:4) : -------
switch (shape_2(D)) <default: <L4> [0.00%], case 0: <L6> [0.00%], case 1: <L1> [0.00%], case 2: <L2> [0.00%], case 3: <L3> [0.00%]>

Bailing out - no common successor to all case label target blocks found

I know I discussed this with Jakub a bit (one can use the | operator on enum values), but I still believe we should optimize switch statements where all enum values are handled. Thoughts?

2) Predictive commoning is not triggered for a similar reason (there's still a default edge not using r * r).

Thanks.
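For illustration, a hand-written sketch of the table lookup that switch conversion could emit for the simpler test case (function and table names are hypothetical, not GCC output; the missing default case is what currently blocks the transform):

```cpp
enum class eShape { eSquare, eCircle, eShpere, eTetraeder };

// Hand-written sketch of the converted form: the per-case return values
// become a constant table indexed by the enum value.
double test_switch_converted(eShape shape, double r) {
    static const double table[4] = { 2.0, 3.0, 4.0, 5.0 };
    (void)r; // r is unused in this reduced case
    return table[static_cast<int>(shape)];
}
```

This is only valid if out-of-range enum values can be assumed not to occur, which is exactly the "switch handles all enum values" reasoning discussed above.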
> With a simpler test case:
> $ cat pr82405-2.c
> enum class eShape { eSquare, eCircle, eShpere, eTetraeder };
>
> double test_switch_native(eShape shape, double r) {
>   switch (shape) {
>     case eShape::eSquare:    return 2;
>     case eShape::eCircle:    return 3;
>     case eShape::eShpere:    return 4;
>     case eShape::eTetraeder: return 5;
>   }
> }

It's PR82404.
Isolated test case where we do not reassociate expressions:

$ cat pr82405-reduced2.c
static inline double eval_square(double r) { return r * r * 4; }
static inline double eval_circle(double r) { return r * r * 3.1415; }
static inline double eval_square2(double r) { return 4 * r * r; }
static inline double eval_circle2(double r) { return 3.1415 * r * r; }

double test_switch_native_slow(int shape, double r) {
  if (shape == 123)
    return eval_circle2(r);
  else
    return eval_square2(r);
}

double test_switch_native_fast(int shape, double r) {
  if (shape == 123)
    return eval_circle(r);
  else
    return eval_square(r);
}

=================

g++ pr82405-reduced2.c -std=c++11 -O2 -fdump-tree-optimized=/dev/stdout

double test_switch_native_slow(int, double) (int shape, double r)
{
  double _1;
  double _5;
  double _6;
  double _7;
  double _8;

  <bb 2> [100.00%]:
  if (shape_2(D) == 123)
    goto <bb 3>; [30.50%]
  else
    goto <bb 4>; [69.50%]

  <bb 3> [30.50%]:
  _5 = r_4(D) * 3.141500000000000181188397618825547397136688232421875e+0;
  _6 = r_4(D) * _5;
  goto <bb 5>; [100.00%]

  <bb 4> [69.50%]:
  _7 = r_4(D) * 4.0e+0;
  _8 = r_4(D) * _7;

  <bb 5> [100.00%]:
  # _1 = PHI <_6(3), _8(4)>
  return _1;
}

double test_switch_native_fast(int, double) (int shape, double r)
{
  double _1;
  double _6;
  double _8;
  double _9;

  <bb 2> [100.00%]:
  _9 = r_4(D) * r_4(D);
  if (shape_2(D) == 123)
    goto <bb 3>; [30.50%]
  else
    goto <bb 4>; [69.50%]

  <bb 3> [30.50%]:
  _6 = _9 * 3.141500000000000181188397618825547397136688232421875e+0;
  goto <bb 5>; [100.00%]

  <bb 4> [69.50%]:
  _8 = _9 * 4.0e+0;

  <bb 5> [100.00%]:
  # _1 = PHI <_6(3), _8(4)>
  return _1;
}
> Isolated test case where we do not reassociate expressions:

And I don't think you can reassociate here validly unless -ffast-math.
(In reply to Andrew Pinski from comment #4)
> > Isolated test case where we do not reassociate expressions:
>
> And I don't think you can reassociate here validly unless -ffast-math.

Yep, you're right. With -ffast-math we do the reassociation. Clang does it with plain -O2:

$ ~/bin/llvm/bin/clang++ -v
clang version 6.0.0 (http://llvm.org/git/clang.git 80f50978299e20c5b2e444e9a4fc06bfaf0183cc) (http://llvm.org/git/llvm.git a29bdba93eaec8e2cf820532c02261ef93ba82b5)
Target: x86_64-unknown-linux-gnu

$ ~/bin/llvm/bin/clang++ -B. pr82405-reduced2.c -std=c++11 -O2 -o /dev/stdout -S
...
.LCPI0_0:
        .quad   4616189618054758400     # double 4
        .quad   4614256447914709615     # double 3.1415000000000002
        .text
        .globl  _Z23test_switch_native_slowid
        .p2align        4, 0x90
        .type   _Z23test_switch_native_slowid,@function
_Z23test_switch_native_slowid:          # @_Z23test_switch_native_slowid
        .cfi_startproc
# BB#0:                                 # %entry
        xorl    %eax, %eax
        cmpl    $123, %edi
        sete    %al
        movsd   .LCPI0_0(,%rax,8), %xmm1 # xmm1 = mem[0],zero
        mulsd   %xmm0, %xmm1
        mulsd   %xmm1, %xmm0
        retq
...

The reporter says the same.
> And I don't think you can reassociate here validly unless -ffast-math.

But you can. In the isolated test case, instead of computing r*r first, move the constant into xmm1 first and then multiply it by r twice. Clang does that:

.LCPI0_0:
        .quad   4616189618054758400     # double 4
        .quad   4614256447914709615     # double 3.1415000000000002
test_switch_native_slow(int, double):   # @test_switch_native_slow(int, double)
        xor     eax, eax
        cmp     edi, 123
        sete    al
        movsd   xmm1, qword ptr [8*rax + .LCPI0_0] # xmm1 = mem[0],zero
        mulsd   xmm1, xmm0
        mulsd   xmm0, xmm1
        ret

Moreover, the other approach (multiply r by itself first, then multiply by the constant) changes the observable behavior; such an optimization could be enabled only with -ffast-math.
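At the source level, the association-preserving rewrite described above would look roughly like this (hand-written sketch, function name hypothetical):

```cpp
// Sketch of the association-preserving rewrite: select the constant first,
// then multiply by r twice. Each path still evaluates (c * r) * r, the same
// left-to-right association as the original 'c * r * r', so no -ffast-math
// is needed.
double test_switch_native_slow_tbl(int shape, double r) {
    static const double table[2] = { 4.0, 3.1415 };
    double c = table[shape == 123];
    return c * r * r; // evaluates as (c * r) * r
}
```

The boolean index mirrors the cmp/sete/movsd sequence in the Clang assembly above.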
Multiplication binds left to right. That is, a * b * c is the same as (a * b) * c and not the same as a * (b * c).
Yes, in the isolated test case the constants come first:

static inline double eval_square2(double r) { return 4 * r * r; }
static inline double eval_circle2(double r) { return 3.1415 * r * r; }

double test_switch_native_slow(int shape, double r) {
  if (shape == 123)
    return eval_circle2(r);
  else
    return eval_square2(r);
}

So multiplying r by itself first is not valid.
So it seems llvm performs a more advanced switch conversion in this case, at least judging from the assembly snippet in comment #0. We look solely for PHIs whose arguments are SSA_NAMEs initialized in the cases to constants, while in order to optimize this without -ffast-math, we would need to handle at least a couple of stmts where the operands match except for some constant argument that is changing.

<L6> [20.00%]:
_5 = r_4(D) * 4.0e+0;
_6 = r_4(D) * _5;
goto <bb 8>; [100.00%]

<L7> [20.00%]:
_7 = r_4(D) * 3.141500000000000181188397618825547397136688232421875e+0;
_8 = r_4(D) * _7;
goto <bb 8>; [100.00%]

So, in the above case we'd look at the PHI with the _6 and _8 arguments and see that, because the def stmt isn't an assignment from a constant, it is instead a binary (or unary or ternary) assign where one of the operands is identical (r_4(D)) while the other one is another SSA_NAME defined in the case; we'd loop to that one, seeing another assign where one operand is the same and the other one is a constant. Thus, we'd build a table with the 4.0e+0 and 3.141500000000000181188397618825547397136688232421875e+0 constants, and after the load from the table emit:

_21 = r_4(D) * value_loaded_from_table_20;
_22 = r_4(D) * _21;

The question is whether we'd require all operands to match except for one which could eventually be a constant, or something different (allow some small number of constant arguments to a computation).

Or should we have a separate pass that performs such an optimization (noticing similar code blocks with just changing constant parameters, and merging the blocks except for the computation of the parameters)?
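In source form, the transformed code sketched above would look roughly like this (hand-written illustration; the function and table names are hypothetical):

```cpp
// Hand-written sketch of the proposed switch-conversion extension: the
// varying constants go into a table, and the common multiply chain is
// replayed once after the load. This mirrors
//   _21 = r_4(D) * value_loaded_from_table_20;
//   _22 = r_4(D) * _21;
// and preserves each case's original association, so it needs no -ffast-math.
double test_switch_table(int shape, double r) {
    static const double coef[4] = { 4.0, 3.1415, 12.566, 41.5692 };
    double t = r * coef[shape]; // _21
    return r * t;               // _22
}
```

A real implementation would of course also have to prove that shape is in range, per the default-edge discussion earlier in the thread.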
On Tue, 3 Oct 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82405
>
> --- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Or should we have a separate pass that performs such an optimization
> (noticing similar code blocks with just changing constant parameters,
> and merging the blocks except for the computation of the parameters)?

Looks like something for tail-merging / code hoisting? Of course we need to reassoc first (if valid).
If the whole thing were if()s, it may be that PRE would already catch it.
Thinking about this some more, this is not a hoisting problem but a sinking problem. Basically we have:

int f(void);
int h(void);

int g(int a)
{
  if (a)
    return f() + 10;
  return h() + 10;
}

which should be converted into:

int g1(int a)
{
  int t;
  if (a)
    t = f();
  else
    t = h();
  return t + 10;
}

We handle hoisting just fine, just not sinking of common code. So we don't need reassoc in the original testcase if we do sinking.
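Applied to the earlier test_switch_native_slow reduction, the sinking described above would leave only the selection of the constant in the branches (hand-written sketch, function name hypothetical):

```cpp
// Sketch of sinking applied to test_switch_native_slow: the trailing
// '* r) * r' multiplies are common to both arms and can be sunk past the
// PHI; only the constant differs per branch. No reassociation is needed,
// since each path keeps its original (c * r) * r evaluation order.
double test_switch_native_slow_sunk(int shape, double r) {
    double c;
    if (shape == 123)
        c = 3.1415;
    else
        c = 4.0;
    return c * r * r; // sunk common tail
}
```
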
GCC 9.1 has been released.
GCC 9.2 has been released.
GCC 9.3.0 has been released, adjusting target milestone.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.