Bug 86952 - Avoid jump table for switch statement with -mindirect-branch=thunk
Summary: Avoid jump table for switch statement with -mindirect-branch=thunk
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Martin Liška
URL:
Keywords: missed-optimization
Depends on:
Blocks: 84072
Reported: 2018-08-14 15:24 UTC by H.J. Lu
Modified: 2019-06-15 00:29 UTC
CC List: 2 users

See Also:
Host:
Target: i386,x86-64
Build:
Known to work: 7.4.1, 8.3.1, 9.0
Known to fail:
Last reconfirmed: 2018-11-23 00:00:00


Description H.J. Lu 2018-08-14 15:24:18 UTC
Why is GCC generating a jump table for a five-entry switch statement if
retpolines are on?  This has got to be a *huge* performance loss.  The
retpoline sequence is very, very slow, and branches aren't that slow.
A five-entry switch is only three branches deep.
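
To illustrate the "three branches deep" remark, here is a minimal sketch of the five-entry switch from this report next to a hand-written comparison tree. The names foo3_switch and foo3_tree are made up for this example, and the tree is only one possible shape, not necessarily what GCC emits with -fno-jump-tables.

```c
int global;

/* The switch from the test case in this report. */
int foo3_switch(int x)
{
  switch (x) {
    case 0: return 11;
    case 1: return 123;
    case 2: global += 1; return 3;
    case 3: return 44;
    case 4: return 444;
    default: return 0;
  }
}

/* Hand-written balanced tree: every path evaluates at most three
   comparisons, including the paths to the default result.  */
int foo3_tree(int x)
{
  if (x < 2) {
    if (x < 1)
      return x < 0 ? 0 : 11;   /* default (negative) or case 0 */
    return 123;                /* case 1 */
  }
  if (x < 4) {
    if (x < 3) {               /* case 2 */
      global += 1;
      return 3;
    }
    return 44;                 /* case 3 */
  }
  return x < 5 ? 444 : 0;      /* case 4 or default (x > 4) */
}
```

Five cases plus the two default ranges give seven outcomes, which fit into a binary tree of depth three.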
Comment 1 Will Schmidt 2018-09-24 15:47:54 UTC
Author: willschm
Date: Mon Sep 24 15:47:22 2018
New Revision: 264538

URL: https://gcc.gnu.org/viewcvs?rev=264538&root=gcc&view=rev
Log:
[testsuite]

2018-09-24  Will Schmidt  <will_schmidt@vnet.ibm.com>

	PR testsuite/86952
	* gcc.target/powerpc/p8-vec-xl-xst-v2.c: Add and
	update expected codegen

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/powerpc/p8-vec-xl-xst-v2.c
Comment 2 Martin Liška 2018-11-20 08:39:33 UTC
H.J. I can write a patch for it. Do you expect more expensive costs when retpolines are enabled?
Comment 3 H.J. Lu 2018-11-20 13:14:19 UTC
(In reply to Martin Liška from comment #2)
> H.J. I can write a patch for it. Do you expect more expensive costs when
> retpolines are enabled?

retpoline is more expensive than 4 branches.
Comment 4 Martin Liška 2018-11-22 11:55:20 UTC
(In reply to H.J. Lu from comment #3)
> (In reply to Martin Liška from comment #2)
> > H.J. I can write a patch for it. Do you expect more expensive costs when
> > retpolines are enabled?
> 
> retpoline is more expensive than 4 branches.

Can you please write a microbenchmark that shows exactly how expensive it is? Based on that I can tune the current costs.
Comment 5 H.J. Lu 2018-11-23 14:51:39 UTC
(In reply to Martin Liška from comment #4)
> (In reply to H.J. Lu from comment #3)
> > (In reply to Martin Liška from comment #2)
> > > H.J. I can write a patch for it. Do you expect more expensive costs when
> > > retpolines are enabled?
> > 
> > retpoline is more expensive than 4 branches.
> 
> Can you please make a microbenchmark that will expose how exactly is that
> expensive? Based on that I can tune current costs.

Is there a testcase where GCC generates a jump table for a five-entry
switch statement?
Comment 6 Martin Liška 2018-11-23 15:02:01 UTC
(In reply to H.J. Lu from comment #5)
> (In reply to Martin Liška from comment #4)
> > (In reply to H.J. Lu from comment #3)
> > > (In reply to Martin Liška from comment #2)
> > > > H.J. I can write a patch for it. Do you expect more expensive costs when
> > > > retpolines are enabled?
> > > 
> > > retpoline is more expensive than 4 branches.
> > 
> > Can you please make a microbenchmark that will expose how exactly is that
> > expensive? Based on that I can tune current costs.
> 
> Is there a testcase where GCC generates a jump table for a five-entry
> switch statement?

$ cat jt.c
int global;

int foo3 (int x)
{
  switch (x) {
    case 0:
      return 11;
    case 1:
      return 123;
    case 2:
      global += 1;
      return 3;
    case 3:
      return 44;
    case 4:
      return 444;
    default:
      return 0;
  }
}

$ gcc jt.c -O2  -S -o/dev/stdout
	.file	"jt.c"
	.text
	.p2align 4,,15
	.globl	foo3
	.type	foo3, @function
foo3:
.LFB0:
	.cfi_startproc
	cmpl	$4, %edi
	ja	.L2
	movl	%edi, %edi
	jmp	*.L4(,%rdi,8)
	.section	.rodata
	.align 8
	.align 4
.L4:
	.quad	.L9
	.quad	.L7
	.quad	.L6
	.quad	.L5
	.quad	.L3
	.text
	.p2align 4,,10
	.p2align 3
...
Comment 7 H.J. Lu 2018-11-23 15:17:13 UTC
Please try retpoline-table branch at

https://github.com/hjl-tools/microbenchmark

I got

[hjl@gnu-cfl-1 microbenchmark]$ make
gcc -g -I. -O2 -mindirect-branch=thunk   -c -o test.o test.c
gcc -g -I. -O2 -mindirect-branch=thunk -fno-jump-tables   -c -o switch-no-table.o switch-no-table.c
gcc -g -I. -O2 -mindirect-branch=thunk   -c -o switch.o switch.c
gcc -o test test.o switch-no-table.o switch.o
./test
no jump table: 189484
jump table   : 333016
[hjl@gnu-cfl-1 microbenchmark]$
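
For readers without the repository at hand, a round-robin timing loop of this kind could look roughly like the sketch below. This is not the code from the linked benchmark: run_round_robin and time_round_robin are hypothetical names, foo3 is the function from comment 6, and noinline only stands in for the separate object files used in the Makefile above. To compare the variants, the same file would be built once with -mindirect-branch=thunk and once with -mindirect-branch=thunk -fno-jump-tables.

```c
#include <time.h>

int global;

/* Switch under test (from comment 6); noinline keeps the call in the loop. */
__attribute__((noinline)) int foo3(int x)
{
  switch (x) {
    case 0: return 11;
    case 1: return 123;
    case 2: global += 1; return 3;
    case 3: return 44;
    case 4: return 444;
    default: return 0;
  }
}

/* Call foo3 with 0, 1, 2, 3, 4, 0, 1, ... and sum the results so the
   calls cannot be optimized away.  */
long run_round_robin(long calls)
{
  long total = 0;
  for (long i = 0; i < calls; i++)
    total += foo3((int)(i % 5));
  return total;
}

/* CPU seconds spent in `calls` round-robin calls; the sum goes to *total_out. */
double time_round_robin(long calls, long *total_out)
{
  clock_t t0 = clock();
  *total_out = run_round_robin(calls);
  clock_t t1 = clock();
  return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

One pass over the five cases sums to 11 + 123 + 3 + 44 + 444 = 625, which is a handy sanity check on the loop.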
Comment 8 Martin Liška 2018-11-23 15:26:26 UTC
Thanks for it, will work on that next week.
Comment 9 Martin Liška 2018-11-26 14:18:31 UTC
Ok, I've slightly updated the micro-benchmark and I see the following difference:
https://github.com/marxin/microbenchmark/tree/retpoline-table

on my Haswell desktop:

./test
no jump table: 4265908653
jump table   : 5118680921 (119.99%)

which is quite small, I would say.
Comment 10 Martin Liška 2018-12-31 09:13:07 UTC
H.J.: Can you please run the updated benchmark on a recent machine and provide slowdown numbers for it?
Comment 11 H.J. Lu 2018-12-31 16:57:06 UTC
(In reply to Martin Liška from comment #10)
> H.J. : Can you please run updated benchmark on a recent machine and provide
> slow down numbers for that?

The numbers aren't stable:

[hjl@gnu-cfl-1 microbenchmark]$ make
./test
30000 loops:
global: 21, total: 625
no jump table: 178424
global: 21, total: 625
jump table   : 266792 (149.53%)
[hjl@gnu-cfl-1 microbenchmark]$ make
./test
30000 loops:
global: 21, total: 625
no jump table: 185068
global: 21, total: 625
jump table   : 266678 (144.10%)
[hjl@gnu-cfl-1 microbenchmark]$ make
./test
30000 loops:
global: 21, total: 625
no jump table: 292810
global: 21, total: 625
jump table   : 214840 (73.37%)
[hjl@gnu-cfl-1 microbenchmark]$ 

Close it for now.
Comment 12 Daniel Borkmann 2019-03-01 13:07:01 UTC
I've been looking into this issue recently and improved the benchmark tool a bit along the way. Several ways of traversing the switch cases need to be considered; the test here does a round-robin walk, but additional distributions / tests could be added. Pushed here just in case: https://github.com/borkmann/microbenchmark

Numbers I'm getting are stable:

* Xeon E3-1240, packet.net c1.small.x86 instance:

 # make prep
 [...]
 # make
 gcc -g -I. -O2   -c -o test.o test.c
 gcc -g -I. -O2 -mindirect-branch=thunk --param=case-values-threshold=20   -c -o switch-no-table.o switch-no-table.c
 gcc -g -I. -O2 -mindirect-branch=thunk   -c -o switch.o switch.c
 gcc -g -I. -O2   -c -o switch-no-retpol.o switch-no-retpol.c
 gcc -o test test.o switch-no-table.o switch.o switch-no-retpol.o
 taskset 1 ./test
 no retpoline :      6098325270
 no jump table:      6298192058 (no retpoline: 103.28%)
 jump table   :     22081802856 (no retpoline: 362.10%, no jump table: 350.61%)
 # make
 taskset 1 ./test
 no retpoline :      6098439816
 no jump table:      6298242270 (no retpoline: 103.28%)
 jump table   :     22107872854 (no retpoline: 362.52%, no jump table: 351.02%)
 # make
 taskset 1 ./test
 no retpoline :      6098187038
 no jump table:      6298308128 (no retpoline: 103.28%)
 jump table   :     22071053524 (no retpoline: 361.93%, no jump table: 350.43%)

* Xeon Gold 5120, packet.net m2.xlarge.x86 instance:

 # make prep
 [...]
 # make
 gcc -g -I. -O2   -c -o test.o test.c
 gcc -g -I. -O2 -mindirect-branch=thunk --param=case-values-threshold=20   -c -o switch-no-table.o switch-no-table.c
 gcc -g -I. -O2 -mindirect-branch=thunk   -c -o switch.o switch.c
 gcc -g -I. -O2   -c -o switch-no-retpol.o switch-no-retpol.c
 gcc -o test test.o switch-no-table.o switch.o switch-no-retpol.o
 taskset 1 ./test
 no retpoline :      5450356814
 no jump table:      5620673036 (no retpoline: 103.12%)
 jump table   :     21448285314 (no retpoline: 393.52%, no jump table: 381.60%)
 # make
 taskset 1 ./test
 no retpoline :      5450356100
 no jump table:      5620678302 (no retpoline: 103.12%)
 jump table   :     21448119720 (no retpoline: 393.52%, no jump table: 381.59%)
 # make
 taskset 1 ./test
 no retpoline :      5450331258
 no jump table:      5620839740 (no retpoline: 103.13%)
 jump table   :     21446922902 (no retpoline: 393.50%, no jump table: 381.56%)

I've also looked into clang for their -mretpoline flag, and they generally turn off jump table generation in this case. For gcc, the s390 folks implemented a target override for the default case-values-threshold to raise it to 20. For x86 something similar could be done. Anyway, H.J. Lu asked me to reopen this issue (but seems like I cannot make this change from my account).
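
The percentages in the output above appear to be plain ratios of the reported counts. A minimal sketch of that arithmetic, with percent_of as a made-up helper name:

```c
/* Express `value` as a percentage of `baseline`. */
double percent_of(double value, double baseline)
{
  return 100.0 * value / baseline;
}
```

For the first Xeon E3-1240 run, percent_of(22081802856.0, 6098325270.0) is about 362.10, matching the "no retpoline: 362.10%" figure, i.e. the retpoline jump table takes roughly 3.6 times as long as the build without retpolines.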
Comment 13 H.J. Lu 2019-03-01 13:11:49 UTC
Reopened with new info.
Comment 14 H.J. Lu 2019-03-01 13:12:07 UTC
Reopened.
Comment 15 Martin Liška 2019-03-01 15:21:42 UTC
(In reply to Daniel Borkmann from comment #12)
> I've been looking into this issue quite recently and improved the benchmark
> tool a bit along the way. There need to be multiple considerations wrt to
> traversing the switch cases, the case is here is doing round robin, but
> additional distributions / tests could be added. Pushed here just in case:
> https://github.com/borkmann/microbenchmark

Thanks a lot for the benchmark.

> 
> Numbers I'm getting are stable:
> 
> * Xeon E3-1240, packet.net c1.small.x86 instance:
> 
>  # make prep
>  [...]
>  # make
>  gcc -g -I. -O2   -c -o test.o test.c
>  gcc -g -I. -O2 -mindirect-branch=thunk --param=case-values-threshold=20  
> -c -o switch-no-table.o switch-no-table.c
>  gcc -g -I. -O2 -mindirect-branch=thunk   -c -o switch.o switch.c
>  gcc -g -I. -O2   -c -o switch-no-retpol.o switch-no-retpol.c
>  gcc -o test test.o switch-no-table.o switch.o switch-no-retpol.o
>  taskset 1 ./test
>  no retpoline :      6098325270
>  no jump table:      6298192058 (no retpoline: 103.28%)
>  jump table   :     22081802856 (no retpoline: 362.10%, no jump table:
> 350.61%)
>  # make
>  taskset 1 ./test
>  no retpoline :      6098439816
>  no jump table:      6298242270 (no retpoline: 103.28%)
>  jump table   :     22107872854 (no retpoline: 362.52%, no jump table:
> 351.02%)
>  # make
>  taskset 1 ./test
>  no retpoline :      6098187038
>  no jump table:      6298308128 (no retpoline: 103.28%)
>  jump table   :     22071053524 (no retpoline: 361.93%, no jump table:
> 350.43%)
> 
> * Xeon Gold 5120, packet.net m2.xlarge.x86 instance:
> 
>  # make prep
>  [...]
>  # make
>  gcc -g -I. -O2   -c -o test.o test.c
>  gcc -g -I. -O2 -mindirect-branch=thunk --param=case-values-threshold=20  
> -c -o switch-no-table.o switch-no-table.c
>  gcc -g -I. -O2 -mindirect-branch=thunk   -c -o switch.o switch.c
>  gcc -g -I. -O2   -c -o switch-no-retpol.o switch-no-retpol.c
>  gcc -o test test.o switch-no-table.o switch.o switch-no-retpol.o
>  taskset 1 ./test
>  no retpoline :      5450356814
>  no jump table:      5620673036 (no retpoline: 103.12%)
>  jump table   :     21448285314 (no retpoline: 393.52%, no jump table:
> 381.60%)
>  # make
>  taskset 1 ./test
>  no retpoline :      5450356100
>  no jump table:      5620678302 (no retpoline: 103.12%)
>  jump table   :     21448119720 (no retpoline: 393.52%, no jump table:
> 381.59%)
>  # make
>  taskset 1 ./test
>  no retpoline :      5450331258
>  no jump table:      5620839740 (no retpoline: 103.13%)
>  jump table   :     21446922902 (no retpoline: 393.50%, no jump table:
> 381.56%)

I can confirm the numbers. I've got:
model name	: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz

taskset 1 ./test
no retpoline :      4311969467
no jump table:      5146081372 (no retpoline: 119.34%)
jump table   :     18845846887 (no retpoline: 437.06%, no jump table: 366.22%)


> 
> I've also looked into clang for their -mretpoline flag, and they generally
> turn off jump table generation in this case. For gcc, the s390 folks
> implemented a target override for the default case-values-threshold to raise
> it to 20. 

Note that GCC has a similar parameter:

--param case-values-threshold
               The smallest number of different values for which it is best to use a jump-table instead of a tree of conditional branches.  If the value is 0, use the default for the machine.  The default is 0.

For 20 branches, I've got even worse numbers:
https://github.com/marxin/microbenchmark-1/tree/retpoline-table

taskset 1 ./test
no retpoline :      5096377521
no jump table:      5169400990 (no retpoline: 101.43%)
jump table   :     28830137876 (no retpoline: 565.70%, no jump table: 557.71%)

So are you suggesting disabling jump tables with retpolines altogether?

> For x86 something similar could be done. Anyway, H.J. Lu asked me
> to reopen this issue (but seems like I cannot make this change from my
> account).

Yep, you would need an account ending with @gcc.gnu.org to change a bug.
Comment 16 Daniel Borkmann 2019-03-01 15:39:33 UTC
(In reply to Martin Liška from comment #15)
> (In reply to Daniel Borkmann from comment #12)
> > I've been looking into this issue quite recently and improved the benchmark
> > tool a bit along the way. There need to be multiple considerations wrt to
> > traversing the switch cases, the case is here is doing round robin, but
> > additional distributions / tests could be added. Pushed here just in case:
> > https://github.com/borkmann/microbenchmark
> 
> Thanks a lot for the benchmark.
> 
> > Numbers I'm getting are stable:
> > 
> > * Xeon E3-1240, packet.net c1.small.x86 instance:
> > 
> >  # make prep
> >  [...]
> >  # make
> >  gcc -g -I. -O2   -c -o test.o test.c
> >  gcc -g -I. -O2 -mindirect-branch=thunk --param=case-values-threshold=20  
> > -c -o switch-no-table.o switch-no-table.c
> >  gcc -g -I. -O2 -mindirect-branch=thunk   -c -o switch.o switch.c
> >  gcc -g -I. -O2   -c -o switch-no-retpol.o switch-no-retpol.c
> >  gcc -o test test.o switch-no-table.o switch.o switch-no-retpol.o
> >  taskset 1 ./test
> >  no retpoline :      6098325270
> >  no jump table:      6298192058 (no retpoline: 103.28%)
> >  jump table   :     22081802856 (no retpoline: 362.10%, no jump table:
> > 350.61%)
> >  # make
> >  taskset 1 ./test
> >  no retpoline :      6098439816
> >  no jump table:      6298242270 (no retpoline: 103.28%)
> >  jump table   :     22107872854 (no retpoline: 362.52%, no jump table:
> > 351.02%)
> >  # make
> >  taskset 1 ./test
> >  no retpoline :      6098187038
> >  no jump table:      6298308128 (no retpoline: 103.28%)
> >  jump table   :     22071053524 (no retpoline: 361.93%, no jump table:
> > 350.43%)
> > 
> > * Xeon Gold 5120, packet.net m2.xlarge.x86 instance:
> > 
> >  # make prep
> >  [...]
> >  # make
> >  gcc -g -I. -O2   -c -o test.o test.c
> >  gcc -g -I. -O2 -mindirect-branch=thunk --param=case-values-threshold=20  
> > -c -o switch-no-table.o switch-no-table.c
> >  gcc -g -I. -O2 -mindirect-branch=thunk   -c -o switch.o switch.c
> >  gcc -g -I. -O2   -c -o switch-no-retpol.o switch-no-retpol.c
> >  gcc -o test test.o switch-no-table.o switch.o switch-no-retpol.o
> >  taskset 1 ./test
> >  no retpoline :      5450356814
> >  no jump table:      5620673036 (no retpoline: 103.12%)
> >  jump table   :     21448285314 (no retpoline: 393.52%, no jump table:
> > 381.60%)
> >  # make
> >  taskset 1 ./test
> >  no retpoline :      5450356100
> >  no jump table:      5620678302 (no retpoline: 103.12%)
> >  jump table   :     21448119720 (no retpoline: 393.52%, no jump table:
> > 381.59%)
> >  # make
> >  taskset 1 ./test
> >  no retpoline :      5450331258
> >  no jump table:      5620839740 (no retpoline: 103.13%)
> >  jump table   :     21446922902 (no retpoline: 393.50%, no jump table:
> > 381.56%)
> 
> I can confirm the numbers. I've got:
> model name	: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
> 
> taskset 1 ./test
> no retpoline :      4311969467
> no jump table:      5146081372 (no retpoline: 119.34%)
> jump table   :     18845846887 (no retpoline: 437.06%, no jump table:
> 366.22%)

Ok, great, thanks for testing on your side as well!

> > I've also looked into clang for their -mretpoline flag, and they generally
> > turn off jump table generation in this case. For gcc, the s390 folks
> > implemented a target override for the default case-values-threshold to raise
> > it to 20. 
> 
> Note that GCC has similar parameter:
> 
> --param case-values-threshold
>                The smallest number of different values for which it is best
> to use a jump-table instead of a tree of conditional branches.  If the value
> is 0, use the default for the machine.  The default is 0.

Yeah, I know, I've used it above for the test case (see the gcc cmdline parts).

> For 20 branches, I've got even worse numbers:
> https://github.com/marxin/microbenchmark-1/tree/retpoline-table
> 
> taskset 1 ./test
> no retpoline :      5096377521
> no jump table:      5169400990 (no retpoline: 101.43%)
> jump table   :     28830137876 (no retpoline: 565.70%, no jump table:
> 557.71%)
> 
> So are you suggesting to disable jump tables with retpolines at all?

I leave that up to you guys, but at a minimum I would probably implement something like what the s390 folks did for gcc in commit db7a90aa0de5 ("S/390: Disable prediction of indirect branches"); see s390_case_values_threshold(), which does:

+unsigned int
+s390_case_values_threshold (void)
+{
+  /* Disabling branch prediction for indirect jumps makes jump tables
+     much more expensive.  */
+  if (TARGET_INDIRECT_BRANCH_NOBP_JUMP)
+    return 20;
+
+  return default_case_values_threshold ();
+}

> > For x86 something similar could be done. Anyway, H.J. Lu asked me
> > to reopen this issue (but seems like I cannot make this change from my
> > account).
> 
> Yep, I would need an account ending with @gcc.org to change a bug.
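
The effect of the s390 hook quoted in this comment can be modeled with a small standalone sketch. Everything here is a toy: toy_case_values_threshold and toy_use_jump_table are hypothetical names, the retpoline flag and default_threshold are placeholders, and GCC's real decision weighs more than the case count alone.

```c
#include <stdbool.h>

/* Toy stand-in for a target's case-values-threshold hook: raise the
   threshold to 20 when indirect branches are expensive.  */
unsigned int toy_case_values_threshold(bool retpoline,
                                       unsigned int default_threshold)
{
  return retpoline ? 20 : default_threshold;
}

/* The threshold is the smallest number of case values for which a jump
   table is preferred over a tree of conditional branches.  */
bool toy_use_jump_table(unsigned int ncases, unsigned int threshold)
{
  return ncases >= threshold;
}
```

With a placeholder default of 4, the five-entry switch from this report gets a jump table normally, while the raised threshold of 20 keeps it as a branch tree.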
Comment 17 Martin Liška 2019-03-04 08:52:23 UTC
> I leave that up to you guys, but I would at min probably implement something
> like s390 folks did for gcc, commit db7a90aa0de5 ("S/390: Disable prediction
> of indirect branches"), see s390_case_values_threshold() which does:

Sure, that's probably the right approach. I would appreciate help with the benchmark.
Can you please come up with a --param case-values-threshold value that will show
when a jump table (w/ retpolines) is equally fast as a decision tree (-fno-jump-tables)?

> 
> +unsigned int
> +s390_case_values_threshold (void)
> +{
> +  /* Disabling branch prediction for indirect jumps makes jump tables
> +     much more expensive.  */
> +  if (TARGET_INDIRECT_BRANCH_NOBP_JUMP)
> +    return 20;
> +
> +  return default_case_values_threshold ();
> +}
> 
> > > For x86 something similar could be done. Anyway, H.J. Lu asked me
> > > to reopen this issue (but seems like I cannot make this change from my
> > > account).
> > 
> > Yep, I would need an account ending with @gcc.org to change a bug.
Comment 18 Martin Liška 2019-03-05 16:29:27 UTC
I'm working on a more complex test-case generator. I'll post results tomorrow.
Comment 19 Martin Liška 2019-03-06 08:32:07 UTC
Ok, I updated the benchmark and pushed it here:
https://github.com/marxin/microbenchmark-1

And I see the following on my Haswell machine:

$ ./test.py 
             normal       retpoline    retpo+no-JT  retpo+JT=20  retpo+JT=40
cases:    8: 0.34 (100%)  1.80 (529%)  0.39 (114%)  0.39 (115%)  0.39 (115%) 
cases:   16: 0.33 (100%)  1.77 (541%)  0.51 (156%)  0.51 (157%)  0.51 (157%) 
cases:   32: 1.01 (100%)  1.82 (179%)  0.57 ( 56%)  1.82 (179%)  0.54 ( 54%) 
cases:   64: 0.78 (100%)  1.76 (225%)  0.58 ( 74%)  1.76 (225%)  1.75 (224%) 
cases:  128: 0.34 (100%)  1.94 (577%)  0.64 (191%)  1.93 (574%)  1.93 (573%) 
cases:  256: 0.34 (100%)  1.94 (579%)  0.76 (225%)  1.95 (581%)  1.94 (580%) 
cases: 1024: 1.21 (100%)  2.00 (166%)  0.97 ( 80%)  2.00 (165%)  2.00 (166%) 
cases: 2048: 1.48 (100%)  2.03 (137%)  2.06 (139%)  2.01 (136%)  2.00 (135%) 
cases: 4096: 1.67 (100%)  2.09 (125%)  3.78 (226%)  2.10 (126%)  2.20 (132%) 

From the numbers, I recommend disabling jump tables with -mindirect-branch=*.
Thoughts?
Comment 20 Daniel Borkmann 2019-03-06 09:48:55 UTC
(In reply to Martin Liška from comment #19)
> Ok, I updated the benchmark and push it here:
> https://github.com/marxin/microbenchmark-1
> 
> And I see following on my Haswell machine:

Thanks for working on it! It's a bit strange that some of your numbers fluctuate so much, e.g. in your 'normal' column. What do you use to tune your setup for testing? I've been running the `make prep` step which I added back then, and the numbers I see are quite stable. I ran a quick test this morning with your repo, and here's what I got for the round-robin walk:

* Xeon E3-1240 (3.7GHz):

# ./test.py 
             normal       retpoline    retpo+no-JT  retpo+JT=20  retpo+JT=40
cases:    8: 0.49 (100%)  2.09 (426%)  0.53 (108%)  0.53 (108%)  0.53 (108%) 
cases:   16: 0.49 (100%)  2.09 (426%)  0.58 (119%)  0.58 (119%)  0.58 (119%) 
cases:   32: 0.49 (100%)  2.09 (426%)  0.61 (125%)  2.09 (426%)  0.61 (125%) 
cases:   64: 0.49 (100%)  2.26 (458%)  0.69 (140%)  2.27 (459%)  2.27 (459%) 
cases:  128: 0.50 (100%)  2.37 (476%)  0.76 (153%)  2.32 (466%)  2.41 (483%) 
cases:  256: 0.52 (100%)  2.33 (451%)  0.91 (175%)  2.33 (450%)  2.36 (456%) 
cases: 1024: 1.05 (100%)  2.54 (242%)  1.08 (103%)  2.59 (246%)  2.54 (242%) 
cases: 2048: 1.63 (100%)  2.56 (157%)  1.94 (119%)  2.61 (160%)  2.59 (159%) 
cases: 4096: 2.19 (100%)  3.12 (143%)  3.22 (147%)  3.09 (142%)  3.13 (143%) 

* Xeon Gold 5120 (2.6GHz):

# ./test.py 
             normal       retpoline    retpo+no-JT  retpo+JT=20  retpo+JT=40
cases:    8: 0.70 (100%)  2.98 (425%)  0.75 (107%)  0.75 (107%)  0.75 (107%) 
cases:   16: 0.70 (100%)  2.98 (425%)  0.82 (117%)  0.82 (117%)  0.82 (117%) 
cases:   32: 0.70 (100%)  3.01 (430%)  0.87 (124%)  2.98 (426%)  0.87 (124%) 
cases:   64: 0.70 (100%)  3.52 (501%)  0.94 (134%)  3.52 (501%)  3.52 (501%) 
cases:  128: 0.71 (100%)  3.51 (495%)  1.07 (151%)  3.50 (495%)  3.50 (494%) 
cases:  256: 0.76 (100%)  3.14 (414%)  1.27 (167%)  3.14 (414%)  3.14 (414%) 
cases: 1024: 1.46 (100%)  3.36 (230%)  1.49 (102%)  3.36 (230%)  3.36 (230%) 
cases: 2048: 2.25 (100%)  3.19 (142%)  2.70 (120%)  3.19 (142%)  3.19 (142%) 
cases: 4096: 2.90 (100%)  3.74 (129%)  4.48 (155%)  3.73 (129%)  3.72 (129%) 

It probably makes sense to also add other walk tests, i.e. input distributions for foo{,_no_table,_no_retpol}(<x>), for further comparison if the plan is to disable jump tables entirely.
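
One way to read the tables above against the question from comment 17 (at what size a retpoline jump table catches up with a decision tree) is to scan for the first row where the "retpo+no-JT" column stops being faster than the "retpoline" column. The sketch below does that for the Xeon E3-1240 rows; struct row and first_crossover are names invented for this example, and the result only applies to the round-robin walk and to the coarse set of sizes that were measured.

```c
#include <stddef.h>

struct row {
  int cases;               /* number of switch cases */
  double retpoline_jt;     /* "retpoline" column: jump table + retpoline */
  double retpoline_no_jt;  /* "retpo+no-JT" column: decision tree */
};

/* Xeon E3-1240 rows copied from the table in this comment. */
static const struct row e3_1240[] = {
  {8, 2.09, 0.53},    {16, 2.09, 0.58},   {32, 2.09, 0.61},
  {64, 2.26, 0.69},   {128, 2.37, 0.76},  {256, 2.33, 0.91},
  {1024, 2.54, 1.08}, {2048, 2.56, 1.94}, {4096, 3.12, 3.22},
};

/* Smallest measured case count at which the decision tree is no longer
   faster than the retpoline jump table, or -1 if that never happens.  */
int first_crossover(const struct row *rows, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (rows[i].retpoline_no_jt >= rows[i].retpoline_jt)
      return rows[i].cases;
  return -1;
}
```

For these rows the crossover is at 4096 cases; up to 2048 cases the decision tree is still the faster of the two.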
Comment 21 Martin Liška 2019-03-06 11:43:51 UTC
(In reply to Daniel Borkmann from comment #20)
> (In reply to Martin Liška from comment #19)
> > Ok, I updated the benchmark and push it here:
> > https://github.com/marxin/microbenchmark-1
> > 
> > And I see following on my Haswell machine:
> 
> Thanks for working on it! Bit strange why some of your numbers are quite
> fluctuating e.g. in your 'normal' column. What do you use to tune your setup
> for testing? I've been running the `make prep` part which I added back then,
> and the numbers I see are quite stable. I ran a quick test this morning with
> your repo, and here's what I got for the round-robin walk:

Yes, my numbers were taken without taskset and without tuned. I don't have any experience with tuned.

> 
> * Xeon E3-1240 (3.7GHz):
> 
> # ./test.py 
>              normal       retpoline    retpo+no-JT  retpo+JT=20  retpo+JT=40
> cases:    8: 0.49 (100%)  2.09 (426%)  0.53 (108%)  0.53 (108%)  0.53 (108%) 
> cases:   16: 0.49 (100%)  2.09 (426%)  0.58 (119%)  0.58 (119%)  0.58 (119%) 
> cases:   32: 0.49 (100%)  2.09 (426%)  0.61 (125%)  2.09 (426%)  0.61 (125%) 
> cases:   64: 0.49 (100%)  2.26 (458%)  0.69 (140%)  2.27 (459%)  2.27 (459%) 
> cases:  128: 0.50 (100%)  2.37 (476%)  0.76 (153%)  2.32 (466%)  2.41 (483%) 
> cases:  256: 0.52 (100%)  2.33 (451%)  0.91 (175%)  2.33 (450%)  2.36 (456%) 
> cases: 1024: 1.05 (100%)  2.54 (242%)  1.08 (103%)  2.59 (246%)  2.54 (242%) 
> cases: 2048: 1.63 (100%)  2.56 (157%)  1.94 (119%)  2.61 (160%)  2.59 (159%) 
> cases: 4096: 2.19 (100%)  3.12 (143%)  3.22 (147%)  3.09 (142%)  3.13 (143%) 
> 
> * Xeon Gold 5120 (2.6GHz):
> 
> # ./test.py 
>              normal       retpoline    retpo+no-JT  retpo+JT=20  retpo+JT=40
> cases:    8: 0.70 (100%)  2.98 (425%)  0.75 (107%)  0.75 (107%)  0.75 (107%) 
> cases:   16: 0.70 (100%)  2.98 (425%)  0.82 (117%)  0.82 (117%)  0.82 (117%) 
> cases:   32: 0.70 (100%)  3.01 (430%)  0.87 (124%)  2.98 (426%)  0.87 (124%) 
> cases:   64: 0.70 (100%)  3.52 (501%)  0.94 (134%)  3.52 (501%)  3.52 (501%) 
> cases:  128: 0.71 (100%)  3.51 (495%)  1.07 (151%)  3.50 (495%)  3.50 (494%) 
> cases:  256: 0.76 (100%)  3.14 (414%)  1.27 (167%)  3.14 (414%)  3.14 (414%) 
> cases: 1024: 1.46 (100%)  3.36 (230%)  1.49 (102%)  3.36 (230%)  3.36 (230%) 
> cases: 2048: 2.25 (100%)  3.19 (142%)  2.70 (120%)  3.19 (142%)  3.19 (142%) 
> cases: 4096: 2.90 (100%)  3.74 (129%)  4.48 (155%)  3.73 (129%)  3.72 (129%) 
> 
> Probably makes sense to also add other walk tests aka input distributions
> for foo{,_no_table,_no_retpol}(<x>) for further comparison if plan would be
> to disable jump tables entirely.

Here are the numbers for the:
+    int x = i % 57;
+    foo ((3 * x * x + 17 * x) / 100);

distribution:

             normal       retpoline    retpo+no-JT  retpo+JT=20  retpo+JT=40
cases:    8: 1.55 (100%)  2.65 (171%)  0.59 ( 38%)  0.60 ( 39%)  0.60 ( 39%) 
cases:   16: 1.53 (100%)  2.66 (174%)  0.67 ( 44%)  0.66 ( 43%)  0.66 ( 43%) 
cases:   32: 1.76 (100%)  2.68 (152%)  0.70 ( 40%)  2.69 (153%)  0.70 ( 39%) 
cases:   64: 1.31 (100%)  2.71 (206%)  0.75 ( 57%)  2.69 (205%)  2.66 (202%) 
cases:  128: 0.53 (100%)  2.75 (515%)  0.78 (147%)  2.73 (513%)  2.75 (516%) 
cases:  256: 0.55 (100%)  2.76 (504%)  0.85 (154%)  2.76 (504%)  2.76 (503%) 
cases: 1024: 0.54 (100%)  2.73 (506%)  0.96 (178%)  2.76 (511%)  2.74 (507%) 
cases: 2048: 0.54 (100%)  2.74 (507%)  1.23 (228%)  2.73 (505%)  2.71 (501%) 
cases: 4096: 0.54 (100%)  2.73 (503%)  1.44 (266%)  2.73 (502%)  2.73 (503%) 

My conclusion is the same: I'm going to prepare a patch that disables jump tables for retpolines.
Thank you for testing.
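
For reference, the quadratic walk used in this comment can be reproduced in isolation to see which switch arguments it generates. walk_index and walk_max are hypothetical helper names; only the formula itself comes from the comment.

```c
/* Map the loop counter i to the switch argument, as in the snippet above. */
int walk_index(long i)
{
  int x = (int)(i % 57);
  return (3 * x * x + 17 * x) / 100;
}

/* Largest argument produced over one full period of 57 steps. */
int walk_max(void)
{
  int max = 0;
  for (long i = 0; i < 57; i++) {
    int v = walk_index(i);
    if (v > max)
      max = v;
  }
  return max;
}
```

The arguments run from 0 (for x = 0) up to 103 (for x = 56), so, assuming the generated cases are numbered from 0, the larger switches only ever see their lowest case values with this walk.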
Comment 22 Martin Liška 2019-03-08 12:56:12 UTC
Author: marxin
Date: Fri Mar  8 12:55:40 2019
New Revision: 269492

URL: https://gcc.gnu.org/viewcvs?rev=269492&root=gcc&view=rev
Log:
x86: Disable jump tables when retpolines are used (PR target/86952).

2019-03-08  Martin Liska  <mliska@suse.cz>

	PR target/86952
	* config/i386/i386.c (ix86_option_override_internal): Disable
	jump tables when retpolines are used.
2019-03-08  Martin Liska  <mliska@suse.cz>

	PR target/86952
	* gcc.target/i386/pr86952.c: New test.
	* gcc.target/i386/indirect-thunk-7.c: Use jump tables to match
	scanned pattern.
	* gcc.target/i386/indirect-thunk-inline-7.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr86952.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/indirect-thunk-7.c
    trunk/gcc/testsuite/gcc.target/i386/indirect-thunk-inline-7.c
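
The behavior described by this ChangeLog entry can be pictured with a standalone toy model. This is not the code from i386.c: the toy_ names and the struct are invented, and the assumption that an explicit -fjump-tables still wins is only inferred from the test changes above ("Use jump tables to match scanned pattern").

```c
#include <stdbool.h>

/* Toy mirror of the -mindirect-branch= choices. */
enum toy_indirect_branch {
  TOY_KEEP, TOY_THUNK, TOY_THUNK_INLINE, TOY_THUNK_EXTERN
};

struct toy_options {
  enum toy_indirect_branch indirect_branch;
  bool jump_tables;           /* current -fjump-tables state */
  bool jump_tables_explicit;  /* assumed: set if given on the command line */
};

/* Turn jump tables off when indirect branches go through a thunk,
   unless the user asked for jump tables explicitly (assumption).  */
void toy_option_override(struct toy_options *opts)
{
  if (opts->indirect_branch != TOY_KEEP && !opts->jump_tables_explicit)
    opts->jump_tables = false;
}
```

In this model a plain -mindirect-branch=thunk build ends up without jump tables, while a build without retpolines is left unchanged.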
Comment 23 Martin Liška 2019-03-08 12:56:54 UTC
Fixed on trunk so far.
Comment 24 Martin Liška 2019-03-11 09:38:37 UTC
Author: marxin
Date: Mon Mar 11 09:38:06 2019
New Revision: 269572

URL: https://gcc.gnu.org/viewcvs?rev=269572&root=gcc&view=rev
Log:
Backport r269492

2019-03-11  Martin Liska  <mliska@suse.cz>

	Backport from mainline
	2019-03-08  Martin Liska  <mliska@suse.cz>

	PR target/86952
	* config/i386/i386.c (ix86_option_override_internal): Disable
	jump tables when retpolines are used.
2019-03-11  Martin Liska  <mliska@suse.cz>

	Backport from mainline
	2019-03-08  Martin Liska  <mliska@suse.cz>

	PR target/86952
	* gcc.target/i386/indirect-thunk-7.c: Use jump tables to match
	scanned pattern.
	* gcc.target/i386/indirect-thunk-inline-7.c: Likewise.

Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/config/i386/i386.c
    branches/gcc-8-branch/gcc/testsuite/ChangeLog
    branches/gcc-8-branch/gcc/testsuite/gcc.target/i386/indirect-thunk-7.c
    branches/gcc-8-branch/gcc/testsuite/gcc.target/i386/indirect-thunk-inline-7.c
Comment 25 Martin Liška 2019-03-11 09:40:46 UTC
Fixed now.
Comment 26 Martin Liška 2019-04-11 09:00:31 UTC
Author: marxin
Date: Thu Apr 11 08:59:48 2019
New Revision: 270277

URL: https://gcc.gnu.org/viewcvs?rev=270277&root=gcc&view=rev
Log:
Backport r269492

2019-04-11  Martin Liska  <mliska@suse.cz>

	Backport from mainline
	2019-03-08  Martin Liska  <mliska@suse.cz>

	PR target/86952
	* config/i386/i386.c (ix86_option_override_internal): Disable
	jump tables when retpolines are used.
2019-04-11  Martin Liska  <mliska@suse.cz>

	Backport from mainline
	2019-03-08  Martin Liska  <mliska@suse.cz>

	PR target/86952
	* gcc.target/i386/pr86952.c: New test.
	* gcc.target/i386/indirect-thunk-7.c: Use jump tables to match
	scanned pattern.
	* gcc.target/i386/indirect-thunk-inline-7.c: Likewise.

Added:
    branches/gcc-7-branch/gcc/testsuite/gcc.target/i386/pr86952.c
Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/config/i386/i386.c
    branches/gcc-7-branch/gcc/testsuite/ChangeLog
    branches/gcc-7-branch/gcc/testsuite/gcc.target/i386/indirect-thunk-7.c
    branches/gcc-7-branch/gcc/testsuite/gcc.target/i386/indirect-thunk-extern-7.c
    branches/gcc-7-branch/gcc/testsuite/gcc.target/i386/indirect-thunk-inline-7.c