91776 – `-fsplit-paths` generates slower code on arm

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	8.3.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Andrew Pinski

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2019-09-16 11:00 UTC by yhr-_-yhr
Modified:	2023-11-09 20:46 UTC
CC List:	1 user

See Also:	112402
Host:
Target:	arm
Build:
Known to work:
Known to fail:
Last reconfirmed:	2019-09-18 00:00:00

Description yhr-_-yhr 2019-09-16 11:00:54 UTC

I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.
I wrote a silly program that calculates the cycle length of the Fibonacci sequence modulo m.

version: gcc (Raspbian 8.3.0-6+rpi1) 8.3.0

#include <stdio.h>
#include <time.h>
typedef unsigned int uint;
typedef unsigned long long ullong;
int main(){
	uint m;
	ullong cyc=0,lastcyc=0;
	clock_t lastclock=0;
	for(m=2;;m++){
		uint
			a=0,
			b=1,
			n=0;
		do{
			b+=a;
			a=b-a;
			n++;
			if(b>=m)
				b-=m;
		}while(
			a!=0||
			b!=1
		);
		cyc+=n;
		//if(n>=4*m)
		//	printf("%u: %u %.2f\n",m,n,(double)n/m);
		if(cyc-lastcyc>100000000){
			clock_t now=clock();
			printf("~ %.0f loop/s\n",(double)(cyc-lastcyc)/(now-lastclock)*CLOCKS_PER_SEC);
			lastclock=now;
			lastcyc=cyc;
		}
	}
}
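
For anyone who wants to check the inner loop in isolation, here is a minimal sketch that wraps it in a function. The name `fib_cycle_len` is hypothetical (it is not part of the test case above), and the sketch assumes m >= 2, because for m = 1 the loop as written never returns to the state (0, 1).

```c
typedef unsigned int uint;

/* Hypothetical helper, not in the original test case: counts how many
 * steps the Fibonacci pair (a, b) mod m takes to return to (0, 1).
 * Assumes m >= 2; with m == 1 the loop would not terminate. */
static uint fib_cycle_len(uint m)
{
	uint a = 0, b = 1, n = 0;
	do {
		b += a;		/* next Fibonacci term, not yet reduced */
		a = b - a;	/* a takes the previous value of b */
		n++;
		if (b >= m)	/* the single conditional subtract under discussion */
			b -= m;
	} while (a != 0 || b != 1);
	return n;
}
```

For example, `fib_cycle_len(10)` returns 60, the period of the Fibonacci sequence modulo 10.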

(1)
pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2  fibmod.c 
pi@rpi:~/Desktop $ ./fibmod
~ 240755135 loop/s
~ 277965738 loop/s
~ 276675919 loop/s
~ 277244469 loop/s
~ 277207289 loop/s
~ 277303633 loop/s
^C

(2)
pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2 -fsplit-paths fibmod.c 
pi@rpi:~/Desktop $ ./fibmod
~ 137691044 loop/s
~ 144593838 loop/s
~ 144397428 loop/s
~ 144519131 loop/s
~ 144392500 loop/s
^C

I also tested with `-Ofast -fno-split-paths`; the measured speed is almost the same as (1).

On other hardware with the x86_64 architecture, this option doesn't seem to make an observable difference in running time.

btw, clang without `-march=native -mtune=native` also produces the same speed as (1), but with these two options, the speed is even higher.

(3)
pi@rpi:~/Desktop $ clang -Wall -march=native -mtune=native -o fibmodclang -Ofast fibmod.c 
pi@rpi:~/Desktop $ ./fibmodclang 
~ 291343047 loop/s
~ 347350967 loop/s
~ 349217005 loop/s
~ 349320149 loop/s
~ 349367926 loop/s
~ 349372536 loop/s
^C

Comment 1 Wilco 2019-09-16 16:45:09 UTC

(In reply to yhr-_-yhr from comment #0)
> I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.

I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?

> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2 
> fibmod.c 
> pi@rpi:~/Desktop $ ./fibmod
> ~ 240755135 loop/s
> ~ 277965738 loop/s
> ~ 276675919 loop/s
> ~ 277244469 loop/s
> ~ 277207289 loop/s
> ~ 277303633 loop/s
> ^C
> 
> (2)
> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2
> -fsplit-paths fibmod.c 
> pi@rpi:~/Desktop $ ./fibmod
> ~ 137691044 loop/s
> ~ 144593838 loop/s
> ~ 144397428 loop/s
> ~ 144519131 loop/s
> ~ 144392500 loop/s
> ^C

Can you list the assembly code for both inner loops please? This doesn't seem like -fsplit-paths, but more likely related to -mstrict-it in Armv8. I can reproduce a 2x slowdown with this loop if the subtract is not conditionally executed. This happens if the register allocator uses a high register:

fast case:
	cmp	r4, r3
	it	ls
	subls	r3, r3, r4

slow case:
	cmp	r10, r3
	bhi	.L2
	sub	r3, r3, r10
.L2:

Can you try using -mno-strict-it on your examples and see whether that helps?
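
To make the two shapes above concrete at source level, here is a hedged C sketch of the reduction step written once with a branch and once without. The function names are hypothetical, and the mask form is a swapped-in branchless idiom used only for illustration; it is not what GCC emits, and how either function is actually compiled depends on the compiler version and flags.

```c
typedef unsigned int uint;

/* Branchy form: a compiler may turn this into either the conditional
 * "subls" (fast case) or compare + branch + sub (slow case). */
static uint reduce_branch(uint b, uint m)
{
	if (b >= m)
		b -= m;
	return b;
}

/* Branchless form (illustrative assumption, not GCC output):
 * mask is all ones when b >= m and zero otherwise, so m is
 * subtracted only when the condition holds. */
static uint reduce_mask(uint b, uint m)
{
	uint mask = -(uint)(b >= m);
	return b - (m & mask);
}
```

Both functions return the same value for every input; only the control flow differs.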

Comment 2 yhr-_-yhr 2019-09-18 04:38:47 UTC

> I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?
oops you're right, I just got this pointed out when I showed this post to my friend. I just copied it from `cat /proc/cpuinfo`.

> Can you try using -mno-strict-it on your examples and see whether that helps?
Did you mean -mno-restrict-it? I followed gcc's correction info.

(4)
pi@rpi:~/Desktop $ gcc -v -save-temps -Wall -march=native -mtune=native -mno-restrict-it -o fibmod -O2 -fsplit-paths fibmod.c 
[...]
pi@rpi:~/Desktop $ ./fibmod
~ 129358055 loop/s
~ 144338387 loop/s
~ 143361058 loop/s
~ 143191701 loop/s
~ 143414626 loop/s
~ 143312006 loop/s
^C
[fibmod.S]
.L7:
	mov	r1, #0
	mov	r2, #1
	mov	r0, r1
	b	.L5
.L13:
	sub	r3, r3, r10
	cmp	r2, #0
	cmpeq	r3, #1
	beq	.L4
.L3:
	mov	r0, r2
	mov	r2, r3
.L5:
	add	r3, r0, r2
	add	r1, r1, #1
	cmp	r10, r3
	bls	.L13
	cmp	r3, #1
	cmpeq	r2, #0
	bne	.L3
.L4:
	adds	r4, r4, r1
	adc	r5, r5, #0
	subs	r6, r4, ip
	sbc	r7, r5, lr
	cmp	r7, r9
	cmpeq	r6, r8
	bls	.L6
	bl	clock
	mov	r1, r7
	str	r0, [sp]
	mov	r0, r6
	bl	__aeabi_ul2d
	ldr	r3, [sp]
	vmov	d6, r0, r1
	ldr	r0, [sp, #4]
	sub	r2, r3, fp
	vmov	s14, r2	@ int
	mov	fp, r3
	vcvt.f64.s32	d7, s14
	vdiv.f64	d6, d6, d7
	vmul.f64	d7, d6, d8
	vmov	r2, r3, d7
	bl	printf
	mov	ip, r4
	mov	lr, r5
.L6:
	add	r10, r10, #1
	b	.L7

(5)
pi@rpi:~/Desktop $ gcc -v -save-temps -Wall -march=native -mtune=native -mno-restrict-it -o fibmod -O2 fibmod.c
[...]
pi@rpi:~/Desktop $ ./fibmod
~ 277312518 loop/s
~ 279153709 loop/s
~ 278075227 loop/s
~ 277919398 loop/s
~ 277167351 loop/s
~ 278028104 loop/s
~ 278017452 loop/s
^C
[fibmod.S]
.L5:
	mov	r1, #0
	mov	r2, #1
	mov	r0, r1
.L3:
	add	r3, r0, r2
	add	r1, r1, #1
	cmp	r10, r3
	mov	r0, r2
	subls	r3, r3, r10
	cmp	r3, #1
	cmpeq	r2, #0
	mov	r2, r3
	bne	.L3
	adds	r4, r4, r1
	adc	r5, r5, #0
	subs	r6, r4, ip
	sbc	r7, r5, lr
	cmp	r7, r9
	cmpeq	r6, r8
	bls	.L4
	bl	clock
	mov	r1, r7
	str	r0, [sp]
	mov	r0, r6
	bl	__aeabi_ul2d
	ldr	r3, [sp]
	vmov	d6, r0, r1
	ldr	r0, [sp, #4]
	sub	r2, r3, fp
	vmov	s14, r2	@ int
	mov	fp, r3
	vcvt.f64.s32	d7, s14
	vdiv.f64	d6, d6, d7
	vmul.f64	d7, d6, d8
	vmov	r2, r3, d7
	bl	printf
	mov	ip, r4
	mov	lr, r5
.L4:
	add	r10, r10, #1
	b	.L5

I also checked the two fibmod.S files without `-mno-restrict-it`, but there seems to be no difference.

Oh, but I found another change that actually makes a small (~7%) difference: dropping `-march=native -mtune=native`.

(6)
pi@rpi:~/Desktop $ gcc -v -save-temps -Wall -mno-restrict-it -o fibmod -O2 -fsplit-paths fibmod.c
[...]
pi@rpi:~/Desktop $ ./fibmod
~ 140006573 loop/s
~ 153067683 loop/s
~ 153172437 loop/s
~ 152992126 loop/s
~ 153133548 loop/s
^C
[fibmod.S]
.L7:
	mov	r1, #0
	mov	r0, r1		@ here
	mov	r2, #1		@ here
	b	.L5
.L13:
	sub	r3, r3, r10
	cmp	r2, #0
	cmpeq	r3, #1
	beq	.L4
.L3:
	mov	r0, r2
	mov	r2, r3
.L5:
	add	r3, r0, r2
	cmp	r10, r3		@ here
	add	r1, r1, #1	@ here
	bls	.L13
	cmp	r3, #1
	cmpeq	r2, #0
	bne	.L3
.L4:
	adds	r4, r4, r1
	adc	r5, r5, #0
	subs	r6, r4, ip
	sbc	r7, r5, lr
	cmp	r7, r9
	cmpeq	r6, r8
	bls	.L6
	bl	clock
	mov	r1, r7
	str	r0, [sp, #4]
	mov	r0, r6
	bl	__aeabi_ul2d
	ldr	r3, [sp, #4]
	sub	r2, r3, fp
	mov	fp, r3
	vmov	s14, r2	@ int
	vcvt.f64.s32	d7, s14
	vmov	d6, r0, r1
	ldr	r0, .L14+16
	vdiv.f64	d6, d6, d7
	vmul.f64	d7, d6, d8
	vmov	r2, r3, d7
	bl	printf
	mov	ip, r4
	mov	lr, r5
.L6:
	add	r10, r10, #1
	b	.L7

With neither `-fsplit-paths` nor `-march=native -mtune=native`, the speed is identical to (5).
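
Read back at source level, the difference between listings (4) and (5) looks roughly like the sketch below. In the split shape the loop-exit test is duplicated into both arms of the `if`, so the subtract sits on its own branch target (`.L13`) instead of being a single conditional `subls`. This is a hand-written approximation with hypothetical function names, not compiler output.

```c
typedef unsigned int uint;

/* Joined shape, as in listing (5): one exit test after the
 * conditional subtract. Assumes m >= 2. */
static uint cycle_len_joined(uint m)
{
	uint a = 0, b = 1, n = 0;
	do {
		b += a;
		a = b - a;
		n++;
		if (b >= m)
			b -= m;
	} while (a != 0 || b != 1);
	return n;
}

/* Split shape, approximating listing (4): the exit test is
 * duplicated on both paths leading to the loop back edge. */
static uint cycle_len_split(uint m)
{
	uint a = 0, b = 1, n = 0;
	for (;;) {
		b += a;
		a = b - a;
		n++;
		if (b >= m) {
			b -= m;			/* the .L13 path */
			if (a == 0 && b == 1)
				break;
		} else {
			if (a == 0 && b == 1)	/* the fall-through path */
				break;
		}
	}
	return n;
}
```

The two functions compute the same count for any m >= 2; the sketch only shows why the split form leaves the compiler with a branch around the subtract.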

Comment 3 Richard Earnshaw 2019-09-18 09:13:16 UTC

(In reply to Wilco from comment #1)
> (In reply to yhr-_-yhr from comment #0)
> > I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.
> 
> > I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?

BCM2835 is the Linux driver name for the BCM2[78]xx series.  You get the same on a Pi4 as well, even though it uses a BCM2711.

Comment 4 Wilco 2019-09-18 11:47:15 UTC

(In reply to yhr-_-yhr from comment #2)
> > I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?
> oops you're right, I just got this pointed out when I showed this post to my
> friend. I just copied it from `cat /proc/cpuinfo`.
> 
> > Can you try using -mno-strict-it on your examples and see whether that helps?
> Did you mean -mno-restrict-it? I followed gcc's correction info.

Yes - but it looks like your compiler defaults to Arm (which is strange), so it has no effect.

With GCC8 I can reproduce this for Arm, but not on newer compilers. On Thumb-2 it still is an issue due to -mrestrict-it (comment 1). Basically it shows how important conditional execution is for performance even on modern CPUs.

Comment 5 Andrew Pinski 2023-11-09 20:46:30 UTC

Mine. I almost have a full patch; I just need to improve the cost model slightly for have_conditional_execution targets (which arm is).