I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.
I wrote a silly program that calculates the cycle length of the Fibonacci sequence modulo n.

version: gcc (Raspbian 8.3.0-6+rpi1) 8.3.0

#include <stdio.h>
#include <time.h>
typedef unsigned int uint;
typedef unsigned long long ullong;
int main(){
    uint m;
    ullong cyc=0,lastcyc=0;
    clock_t lastclock=0;
    for(m=2;;m++){
        uint a=0, b=1, n=0;
        do{
            b+=a; a=b-a; n++;
            if(b>=m) b-=m;
        }while( a!=0|| b!=1 );
        cyc+=n;
        //if(n>=4*m)
        //    printf("%u: %u %.2f\n",m,n,(double)n/m);
        if(cyc-lastcyc>100000000){
            clock_t now=clock();
            printf("~ %.0f loop/s\n",(double)(cyc-lastcyc)/(now-lastclock)*CLOCKS_PER_SEC);
            lastclock=now;
            lastcyc=cyc;
        }
    }
}

(1)
pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2 fibmod.c
pi@rpi:~/Desktop $ ./fibmod
~ 240755135 loop/s
~ 277965738 loop/s
~ 276675919 loop/s
~ 277244469 loop/s
~ 277207289 loop/s
~ 277303633 loop/s
^C

(2)
pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2 -fsplit-paths fibmod.c
pi@rpi:~/Desktop $ ./fibmod
~ 137691044 loop/s
~ 144593838 loop/s
~ 144397428 loop/s
~ 144519131 loop/s
~ 144392500 loop/s
^C

Also tested with `-Ofast -fno-split-paths`; the measured speed is almost the same as (1).

On other hardware with the x86_64 architecture, this option doesn't seem to make an observable difference in running time.

btw, clang without `-march=native -mtune=native` also produces the same speed as (1), but with these two options the speed is even higher.

(3)
pi@rpi:~/Desktop $ clang -Wall -march=native -mtune=native -o fibmodclang -Ofast fibmod.c
pi@rpi:~/Desktop $ ./fibmodclang
~ 291343047 loop/s
~ 347350967 loop/s
~ 349217005 loop/s
~ 349320149 loop/s
~ 349367926 loop/s
~ 349372536 loop/s
^C
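
For inspecting the generated assembly, it may help to pull the inner loop out of main() into its own function. The following is only a sketch of such a reduced test case: the function name `fib_cycle_len` is hypothetical and not part of the original program, and the loop body is copied unchanged from the program above.

```c
/* Reduced test case (sketch): the inner loop from the report, isolated.
 * fib_cycle_len is a hypothetical name chosen for illustration.
 * Returns the cycle length of the Fibonacci sequence modulo m.
 * Assumes 2 <= m < 2^31, so that b + a cannot overflow and the
 * loop terminates (for m == 1 the state (0, 1) is never reached). */
unsigned int fib_cycle_len(unsigned int m)
{
    unsigned int a = 0, b = 1, n = 0;
    do {
        b += a;          /* b = F(k) + F(k+1), not yet reduced */
        a = b - a;       /* a = old b */
        n++;
        if (b >= m)      /* the conditional subtract under discussion */
            b -= m;
    } while (a != 0 || b != 1);
    return n;
}
```

Compiling this file with `-O2 -S` and again with `-O2 -fsplit-paths -S` should make it easier to see whether the subtract is emitted as a conditional instruction or as a branch, without the printf and clock code in the way.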
(In reply to yhr-_-yhr from comment #0)
> I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.

I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?

> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2
> fibmod.c
> pi@rpi:~/Desktop $ ./fibmod
> ~ 240755135 loop/s
> ~ 277965738 loop/s
> ~ 276675919 loop/s
> ~ 277244469 loop/s
> ~ 277207289 loop/s
> ~ 277303633 loop/s
> ^C
>
> (2)
> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2
> -fsplit-paths fibmod.c
> pi@rpi:~/Desktop $ ./fibmod
> ~ 137691044 loop/s
> ~ 144593838 loop/s
> ~ 144397428 loop/s
> ~ 144519131 loop/s
> ~ 144392500 loop/s
> ^C

Can you list the assembly code for both inner loops please? This doesn't seem like -fsplit-paths, but more likely related to -mstrict-it in Armv8. I can reproduce a 2x slowdown with this loop if the subtract is not conditionally executed. This happens if the register allocator uses a high register:

fast case:
        cmp     r4, r3
        it      ls
        subls   r3, r3, r4

slow case:
        cmp     r10, r3
        bhi     .L2
        sub     r3, r3, r10
.L2:

Can you try using -mno-strict-it on your examples and see whether that helps?
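
To illustrate the two shapes of the subtract at the C level, here is a minimal sketch. The function names `reduce_branch` and `reduce_mask` are hypothetical and introduced only for this example. The first is the form used in the test program; the second expresses the same computation without a branch in the source. Whether the compiler actually emits a conditional instruction, a branch, or something else for either form depends on the target and options, so this is not a guaranteed workaround.

```c
#include <stdint.h>

/* Form used in the test program: subtract m only when b >= m.
 * The compiler may turn this into a conditional subtract (fast case)
 * or into a compare-and-branch around the subtract (slow case). */
uint32_t reduce_branch(uint32_t b, uint32_t m)
{
    if (b >= m)
        b -= m;
    return b;
}

/* Same result without a branch in the source: build an all-ones
 * mask when b >= m and an all-zeros mask otherwise, then subtract
 * (m & mask). This is an illustration, not the compiler's method. */
uint32_t reduce_mask(uint32_t b, uint32_t m)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(b >= m);
    return b - (m & mask);
}
```

Both functions return the same value for every input, so the second can be swapped into the loop to check whether the slowdown follows the branch.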
> I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?
Oops, you're right. I just got this pointed out when I showed this post to my friend. I just copied it from `cat /proc/cpuinfo`.

> Can you try using -mno-strict-it on your examples and see whether that helps?
Did you mean -mno-restrict-it? I followed gcc's correction info.

(4)
pi@rpi:~/Desktop $ gcc -v -save-temps -Wall -march=native -mtune=native -mno-restrict-it -o fibmod -O2 -fsplit-paths fibmod.c
[...]
pi@rpi:~/Desktop $ ./fibmod
~ 129358055 loop/s
~ 144338387 loop/s
~ 143361058 loop/s
~ 143191701 loop/s
~ 143414626 loop/s
~ 143312006 loop/s
^C

[fibmod.S]
.L7:
        mov     r1, #0
        mov     r2, #1
        mov     r0, r1
        b       .L5
.L13:
        sub     r3, r3, r10
        cmp     r2, #0
        cmpeq   r3, #1
        beq     .L4
.L3:
        mov     r0, r2
        mov     r2, r3
.L5:
        add     r3, r0, r2
        add     r1, r1, #1
        cmp     r10, r3
        bls     .L13
        cmp     r3, #1
        cmpeq   r2, #0
        bne     .L3
.L4:
        adds    r4, r4, r1
        adc     r5, r5, #0
        subs    r6, r4, ip
        sbc     r7, r5, lr
        cmp     r7, r9
        cmpeq   r6, r8
        bls     .L6
        bl      clock
        mov     r1, r7
        str     r0, [sp]
        mov     r0, r6
        bl      __aeabi_ul2d
        ldr     r3, [sp]
        vmov    d6, r0, r1
        ldr     r0, [sp, #4]
        sub     r2, r3, fp
        vmov    s14, r2 @ int
        mov     fp, r3
        vcvt.f64.s32    d7, s14
        vdiv.f64        d6, d6, d7
        vmul.f64        d7, d6, d8
        vmov    r2, r3, d7
        bl      printf
        mov     ip, r4
        mov     lr, r5
.L6:
        add     r10, r10, #1
        b       .L7

(5)
pi@rpi:~/Desktop $ gcc -v -save-temps -Wall -march=native -mtune=native -mno-restrict-it -o fibmod -O2 fibmod.c
[...]
pi@rpi:~/Desktop $ ./fibmod
~ 277312518 loop/s
~ 279153709 loop/s
~ 278075227 loop/s
~ 277919398 loop/s
~ 277167351 loop/s
~ 278028104 loop/s
~ 278017452 loop/s
^C

[fibmod.S]
.L5:
        mov     r1, #0
        mov     r2, #1
        mov     r0, r1
.L3:
        add     r3, r0, r2
        add     r1, r1, #1
        cmp     r10, r3
        mov     r0, r2
        subls   r3, r3, r10
        cmp     r3, #1
        cmpeq   r2, #0
        mov     r2, r3
        bne     .L3
        adds    r4, r4, r1
        adc     r5, r5, #0
        subs    r6, r4, ip
        sbc     r7, r5, lr
        cmp     r7, r9
        cmpeq   r6, r8
        bls     .L4
        bl      clock
        mov     r1, r7
        str     r0, [sp]
        mov     r0, r6
        bl      __aeabi_ul2d
        ldr     r3, [sp]
        vmov    d6, r0, r1
        ldr     r0, [sp, #4]
        sub     r2, r3, fp
        vmov    s14, r2 @ int
        mov     fp, r3
        vcvt.f64.s32    d7, s14
        vdiv.f64        d6, d6, d7
        vmul.f64        d7, d6, d8
        vmov    r2, r3, d7
        bl      printf
        mov     ip, r4
        mov     lr, r5
.L4:
        add     r10, r10, #1
        b       .L5

I also checked the two fibmod.S files generated without `-mno-restrict-it`, but there seems to be no difference.

Oh, but I found another change that actually makes a small (~7%) difference: building without `-march=native -mtune=native`.

(6)
pi@rpi:~/Desktop $ gcc -v -save-temps -Wall -mno-restrict-it -o fibmod -O2 -fsplit-paths fibmod.c
[...]
pi@rpi:~/Desktop $ ./fibmod
~ 140006573 loop/s
~ 153067683 loop/s
~ 153172437 loop/s
~ 152992126 loop/s
~ 153133548 loop/s
^C

[fibmod.S]
.L7:
        mov     r1, #0
        mov     r0, r1 @ here
        mov     r2, #1 @ here
        b       .L5
.L13:
        sub     r3, r3, r10
        cmp     r2, #0
        cmpeq   r3, #1
        beq     .L4
.L3:
        mov     r0, r2
        mov     r2, r3
.L5:
        add     r3, r0, r2
        cmp     r10, r3 @ here
        add     r1, r1, #1 @ here
        bls     .L13
        cmp     r3, #1
        cmpeq   r2, #0
        bne     .L3
.L4:
        adds    r4, r4, r1
        adc     r5, r5, #0
        subs    r6, r4, ip
        sbc     r7, r5, lr
        cmp     r7, r9
        cmpeq   r6, r8
        bls     .L6
        bl      clock
        mov     r1, r7
        str     r0, [sp, #4]
        mov     r0, r6
        bl      __aeabi_ul2d
        ldr     r3, [sp, #4]
        sub     r2, r3, fp
        mov     fp, r3
        vmov    s14, r2 @ int
        vcvt.f64.s32    d7, s14
        vmov    d6, r0, r1
        ldr     r0, .L14+16
        vdiv.f64        d6, d6, d7
        vmul.f64        d7, d6, d8
        vmov    r2, r3, d7
        bl      printf
        mov     ip, r4
        mov     lr, r5
.L6:
        add     r10, r10, #1
        b       .L7

With neither `-fsplit-paths` nor `-march=native -mtune=native`, the speed is identical to (5).
(In reply to Wilco from comment #1)
> (In reply to yhr-_-yhr from comment #0)
> > I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.
>
> I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?

BCM2835 is the Linux driver name for the BCM2[78]xx series. You get the same on a Pi4 as well, even though it uses a BCM2711.
(In reply to yhr-_-yhr from comment #2)
> > I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?
> oops you're right, I just got this pointed out when I showed this post to my
> friend. I just copied it from `cat /proc/cpuinfo`.
>
> > Can you try using -mno-strict-it on your examples and see whether that helps?
> Did you mean -mno-restrict-it? I followed gcc's correction info.

Yes - but it looks like your compiler defaults to Arm (which is strange), so it has no effect.

With GCC8 I can reproduce this for Arm, but not on newer compilers. On Thumb-2 it is still an issue due to -mrestrict-it (comment 1).

Basically it shows how important conditional execution is for performance even on modern CPUs.
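
One way to check which instruction set a given compiler invocation targets is to look at its predefined macros. The sketch below assumes the macros `__thumb2__`, `__thumb__`, `__arm__` and `__aarch64__` as GCC and clang define them for Arm targets; the function name `target_isa` is hypothetical and only for illustration.

```c
/* Sketch: report the instruction set selected at compile time.
 * target_isa is a hypothetical helper name. The checks are ordered
 * so that Thumb-2 is reported before plain Thumb, since a Thumb-2
 * build defines both __thumb__ and __thumb2__. */
const char *target_isa(void)
{
#if defined(__thumb2__)
    return "Thumb-2";
#elif defined(__thumb__)
    return "Thumb";
#elif defined(__arm__)
    return "Arm";
#elif defined(__aarch64__)
    return "AArch64";
#else
    return "not an Arm target";
#endif
}
```

Printing the result from a build with the same flags as the test program shows whether `-mrestrict-it` can apply at all, since that option only matters for Thumb-2 code. Passing `-mthumb` or `-marm` explicitly should change the reported value on a 32-bit Arm toolchain.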
Mine, I almost have a full patch, just need to improve the cost model slightly for have_conditional_execution targets (which arm is).