77877 – missed optimization in switch of modulus value

Bug 77877 - missed optimization in switch of modulus value

Summary: missed optimization in switch of modulus value

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	7.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2016-10-06 00:14 UTC by Ulrich Drepper
Modified:	2022-02-21 05:22 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2016-10-06 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Ulrich Drepper 2016-10-06 00:14:52 UTC

Compile this code with a recent trunk gcc:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
int zero;
int one;
int two;

extern unsigned compute_mod(unsigned);

void cnt(unsigned n) {
#ifdef HIDE
  n = compute_mod(n);
#else
  n %= 3;
#endif
  switch (n) {
  case 0: ++zero; break;
  case 1: ++one; break;
  case 2: ++two; break;
#ifdef OPT
  default: __builtin_unreachable();
#endif
  }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On x86-64, without HIDE defined, the code compiles with -O3 to:

   0:	89 f8                	mov    %edi,%eax
   2:	ba ab aa aa aa       	mov    $0xaaaaaaab,%edx
   7:	f7 e2                	mul    %edx
   9:	d1 ea                	shr    %edx
   b:	8d 04 52             	lea    (%rdx,%rdx,2),%eax
   e:	29 c7                	sub    %eax,%edi
  10:	83 ff 01             	cmp    $0x1,%edi
  13:	74 1b                	je     30 <_Z3cntj+0x30>
  15:	83 ff 02             	cmp    $0x2,%edi
  18:	75 0e                	jne    28 <_Z3cntj+0x28>
  1a:	83 05 00 00 00 00 01 	addl   $0x1,0x0(%rip)        # 21 <_Z3cntj+0x21>
  21:	c3                   	retq   
  22:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  28:	83 05 00 00 00 00 01 	addl   $0x1,0x0(%rip)        # 2f <_Z3cntj+0x2f>
  2f:	c3                   	retq   
  30:	83 05 00 00 00 00 01 	addl   $0x1,0x0(%rip)        # 37 <_Z3cntj+0x37>
  37:	c3                   	retq   


This is good, the compiler knows there are only three possible values and does not emit any code for a default case.

If I make sure the compiler doesn't know anything about the arithmetic operation by calling a function but telling the compiler there is no other case by using __builtin_unreachable() then the generated code is even better:

   0:	48 83 ec 08          	sub    $0x8,%rsp
   4:	e8 00 00 00 00       	callq  9 <_Z3cntj+0x9>
   9:	83 f8 01             	cmp    $0x1,%eax
   c:	74 22                	je     30 <_Z3cntj+0x30>
   e:	72 10                	jb     20 <_Z3cntj+0x20>
  10:	83 05 00 00 00 00 01 	addl   $0x1,0x0(%rip)        # 17 <_Z3cntj+0x17>
  17:	48 83 c4 08          	add    $0x8,%rsp
  1b:	c3                   	retq   
  1c:	0f 1f 40 00          	nopl   0x0(%rax)
  20:	83 05 00 00 00 00 01 	addl   $0x1,0x0(%rip)        # 27 <_Z3cntj+0x27>
  27:	48 83 c4 08          	add    $0x8,%rsp
  2b:	c3                   	retq   
  2c:	0f 1f 40 00          	nopl   0x0(%rax)
  30:	83 05 00 00 00 00 01 	addl   $0x1,0x0(%rip)        # 37 <_Z3cntj+0x37>
  37:	48 83 c4 08          	add    $0x8,%rsp
  3b:	c3                   	retq   

There is only one compare instruction.  This is how even the first case should look like.

Even more interesting: just defining the OPT macro does not change anything.  So, there is currently no way to get the optimal behaviour.  We certainly don't want to artificially the function calls.

Comment 1 Richard Biener 2016-10-06 10:50:31 UTC

The point is that with the modulus visible VRP will remove the default case
(or rather take a random one -- here the first one -- as new default):

  <bb 2>:
  n_9 = n_8(D) % 3;
  switch (n_9) <default: <L0>, case 1: <L1>, case 2: <L2>>

<L0>:
  zero.0_1 = zero;
  _2 = zero.0_1 + 1;
  zero = _2;
  goto <bb 6>;

<L1>:
  one.1_3 = one;
  _4 = one.1_3 + 1;
  one = _4;
  goto <bb 6>;

<L2>:
  two.2_5 = two;
  _6 = two.2_5 + 1;
  two = _6;

  <bb 6>:
  return;

if we don't have the modulo then the default case will not be optimized away
(not the one with the unreachable either) before RTL expansion.  So it looks
like the four-cases variant where one case later vanishes happens to be
expanded better.

Which means we should improve expansion for the IL case above (stmt.c:expand_case).  Or work around by choosing another value as
the "default" if that makes expand_case happy.