gcc version 4.7.0
--with-arch=armv7-a --with-float=hard --with-fpu=neon --with-mode=thumb

$ cd /tmp
$ wget http://www.phoronix-test-suite.com/benchmark-files/c-ray-1.1.tar.gz
$ tar -xzf c-ray-1.1.tar.gz
$ cd c-ray-1.1
$ make clean && make
gcc -O3 -ffast-math -c -o c-ray-mt.o c-ray-mt.c
gcc -o c-ray-mt c-ray-mt.o -lm -lpthread
$ ./c-ray-mt -t 32 -s 160x120 -r 8 -i sphfract -o output.ppm
c-ray-mt v1.1
Rendering took: 6 seconds (6683 milliseconds)

$ sed -i "s,-O3,-O3 -mcpu=cortex-a9,g" Makefile
$ make clean && make
gcc -O3 -mcpu=cortex-a9 -ffast-math -c -o c-ray-mt.o c-ray-mt.c
gcc -o c-ray-mt c-ray-mt.o -lm -lpthread
$ ./c-ray-mt -t 32 -s 160x120 -r 8 -i sphfract -o output.ppm
c-ray-mt v1.1
Rendering took: 7 seconds (7906 milliseconds)

Comparing to the default -march=armv7-a configuration, -mcpu=cortex-a9 caused a ~18% slowdown (7906 milliseconds vs. 6683 milliseconds).

The test was run on a dual-core ARM Cortex-A9 @1.2GHz
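As a quick sanity check on the ~18% figure, the slowdown can be recomputed from the two timings quoted in the report. This is only an illustrative sketch; the helper name `slowdown_percent` is made up here and is not part of c-ray or GCC.

```python
def slowdown_percent(baseline_ms: float, candidate_ms: float) -> float:
    """Return how much slower the candidate run is than the baseline, in percent."""
    return (candidate_ms / baseline_ms - 1.0) * 100.0


# Timings quoted in the report (milliseconds).
default_ms = 6683     # -O3 -ffast-math with the default -march=armv7-a
cortex_a9_ms = 7906   # same flags plus -mcpu=cortex-a9

print(f"-mcpu=cortex-a9 is {slowdown_percent(default_ms, cortex_a9_ms):.1f}% slower")
```

A single run per configuration is all the report provides, so run-to-run noise is not accounted for here.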
What platform are you running on (GCC configuration)? Please can you do some profiling and try to identify where the slowdown is coming from. We need more information if we are to progress this.
Even though I've tested this on a Cortex-A5, the 18% difference does reproduce on gcc 6.1.1 (2694 vs 3304 ms):

First, the slower profile for A9 codegen:

CPU: ARM Cortex-A5, speed 1.728e+06 MHz (estimated)
Counted CPU_CYCLES events (CPU cycle) with a unit mask of 0x00 (No unit mask) count 900000
samples %        linenr info                  image name    symbol name
14460   65.7422  c-ray-mt.c:377               c-ray-mt      shade
3901    17.7358  c-ray-mt.c:336               c-ray-mt      trace
3181    14.4624  c-ray-mt.c:308               c-ray-mt      render_scanline
186     0.8456   e_pow.c:70                   libm-2.19.so  __pow_finite
68      0.3092   e_exp.c:240                  libm-2.19.so  __exp1
55      0.2501   (no location information)    no-vmlinux    /no-vmlinux
40      0.1819   c-ray-mt.c:454               c-ray-mt      get_primary_ray
38      0.1728   c-ray-mt.c:497               c-ray-mt      get_sample_pos
17      0.0773   fraiseexcpt.c:27             libm-2.19.so  feraiseexcept
13      0.0591   e_pow.c:430                  libm-2.19.so  checkint
6       0.0273   fesetround.c:31              libm-2.19.so  fesetround
5       0.0227   fputc.c:37                   libc-2.19.so  fputc
5       0.0227   feupdateenv.c:27             libm-2.19.so  feupdateenv@@GLIBC_2.4
4       0.0182   feholdexcpt.c:32             libm-2.19.so  feholdexcept
4       0.0182   fesetenv.c:31                libm-2.19.so  fesetenv@@GLIBC_2.4
3       0.0136   mpa.c:767                    libm-2.19.so  __sqr
2       0.0091   strtod_l.c:483               libc-2.19.so  ____strtod_l_internal
1       0.0045   c-ray-mt.c:170               c-ray-mt      main
1       0.0045   dl-tls.c:770                 ld-2.19.so    __tls_get_addr
1       0.0045   dl-reloc.c:154               ld-2.19.so    _dl_relocate_object
1       0.0045   (no location information)    libc-2.19.so  .udivsi3_skip_div0_test
1       0.0045   malloc.c:3302                libc-2.19.so  _int_malloc
1       0.0045   random_r.c:366               libc-2.19.so  random_r
1       0.0045   strtod_l.c:201               libc-2.19.so  round_and_return

compared to the default codegen:

samples %        linenr info                  image name    symbol name
11657   64.6211  c-ray-mt.c:377               c-ray-mt      shade
3396    18.8259  c-ray-mt.c:336               c-ray-mt      trace
2586    14.3356  c-ray-mt.c:308               c-ray-mt      render_scanline
172     0.9535   e_pow.c:70                   libm-2.19.so  __pow_finite
49      0.2716   (no location information)    no-vmlinux    /no-vmlinux
47      0.2605   e_exp.c:240                  libm-2.19.so  __exp1
41      0.2273   c-ray-mt.c:454               c-ray-mt      get_primary_ray
39      0.2162   c-ray-mt.c:497               c-ray-mt      get_sample_pos
16      0.0887   e_pow.c:430                  libm-2.19.so  checkint
12      0.0665   fraiseexcpt.c:27             libm-2.19.so  feraiseexcept
7       0.0388   fputc.c:37                   libc-2.19.so  fputc
2       0.0111   c-ray-mt.c:170               c-ray-mt      main
2       0.0111   strtod_l.c:483               libc-2.19.so  ____strtod_l_internal
2       0.0111   mpa.c:767                    libm-2.19.so  __sqr
2       0.0111   feholdexcpt.c:32             libm-2.19.so  feholdexcept
2       0.0111   fesetround.c:31              libm-2.19.so  fesetround
1       0.0055   cxa_thread_atexit_impl.c:83  libc-2.19.so  __call_tls_dtors
1       0.0055   memchr.S:58                  libc-2.19.so  memchr
1       0.0055   random_r.c:366               libc-2.19.so  random_r
1       0.0055   strtok.c:38                  libc-2.19.so  strtok
1       0.0055   mpa.c:614                    libm-2.19.so  __mul
1       0.0055   fesetenv.c:31                libm-2.19.so  fesetenv@@GLIBC_2.4
1       0.0055   feupdateenv.c:27             libm-2.19.so  feupdateenv@@GLIBC_2.4
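To see where the extra cycles land, the sample counts of the three hot symbols in the two profiles can be compared directly. The sketch below is not part of the original analysis; it assumes both profiles were taken with the same sampling period (the header is only shown for the first one), and the helper name `sample_increase` is hypothetical.

```python
# Sample counts copied from the two opreport listings (hot symbols only).
a9_samples = {"shade": 14460, "trace": 3901, "render_scanline": 3181}
default_samples = {"shade": 11657, "trace": 3396, "render_scanline": 2586}


def sample_increase(slow: dict, fast: dict) -> dict:
    """Percent increase in samples per symbol, slow profile relative to fast."""
    return {sym: (slow[sym] / fast[sym] - 1.0) * 100.0
            for sym in fast if sym in slow}


increase = sample_increase(a9_samples, default_samples)
for sym, pct in sorted(increase.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{sym:<16} {pct:+.1f}% samples")
```

The percentage column in the listings stays almost the same between the two runs, so only the absolute counts show that all three symbols got slower.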
Curiously, up to gcc 6, targeting Cortex-A5 made virtually no difference, but in gcc 7, generic codegen takes an 8% hit while -mcpu=cortex-a5 produces roughly the same performance as before. (but that's a different issue so FWIW)
I've just done the obvious and run the resulting ARMv7 binaries on a Cortex A53 in aarch32 mode and the difference is there (GCC 6.2.1 and 7.0.0) so I can confirm the issue is present to this day. Cortex-A5 vs Cortex-A9 codegen yields a 0.81x performance ratio.
Created attachment 39649
Annotated ARMv7 assembly
Testing different 32-bit codegen options in aarch32 mode on a Cortex-A53 shows that A15 is probably also affected. Full comparison below:

$ for i in 8 5 7 9 15 ; do gcc -marm -Ofast -o c-ray-a$i c-ray-mt.c -lm -lpthread -mcpu=cortex-a$i; done
$ for i in 8 5 7 9 15 ; do echo Cortex-A$i ; ./c-ray-a$i -t 32 -s 160x120 -r 8 -i sphfract -o output.ppm ; done
Cortex-A8
c-ray-mt v1.1
Rendering took: 1 seconds (1660 milliseconds)
Cortex-A5
c-ray-mt v1.1
Rendering took: 1 seconds (1638 milliseconds)
Cortex-A7
c-ray-mt v1.1
Rendering took: 1 seconds (1645 milliseconds)
Cortex-A9
c-ray-mt v1.1
Rendering took: 2 seconds (2027 milliseconds)
Cortex-A15
c-ray-mt v1.1
Rendering took: 1 seconds (1922 milliseconds)
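For anyone repeating this comparison, the per-core timings can be pulled out of the benchmark output and ranked against the fastest build. This is a rough sketch only: `parse_timings` is a made-up helper, and it assumes the output format shown in this comment (a `Cortex-A<n>` label line followed by a `Rendering took: ... (N milliseconds)` line).

```python
import re


def parse_timings(log: str) -> dict:
    """Map each 'Cortex-A<n>' label to the millisecond figure that follows it."""
    timings = {}
    current = None
    for line in log.splitlines():
        line = line.strip()
        if line.startswith("Cortex-"):
            current = line
            continue
        match = re.search(r"\((\d+) milliseconds\)", line)
        if match and current is not None:
            timings[current] = int(match.group(1))
    return timings


# Output copied from the run on the Cortex-A53 in aarch32 mode.
log = """\
Cortex-A8
c-ray-mt v1.1
Rendering took: 1 seconds (1660 milliseconds)
Cortex-A5
c-ray-mt v1.1
Rendering took: 1 seconds (1638 milliseconds)
Cortex-A7
c-ray-mt v1.1
Rendering took: 1 seconds (1645 milliseconds)
Cortex-A9
c-ray-mt v1.1
Rendering took: 2 seconds (2027 milliseconds)
Cortex-A15
c-ray-mt v1.1
Rendering took: 1 seconds (1922 milliseconds)
"""

timings = parse_timings(log)
best = min(timings.values())
for core, ms in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{core:<11} {ms} ms  {ms / best:.2f}x")
```

Ranking against the fastest build makes it easy to spot that the A9 and A15 tunings stand apart from A5, A7 and A8.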
*** This bug has been marked as a duplicate of bug 68664 ***
Since my report predates bug 68664 by several years, shouldn't bug 68664 be the duplicate? In addition, my report was much more detailed, since it also provided a practical use case showcasing the importance of this problem. Also, if I understand correctly, you have still not fixed the issue, so closing it seems a bit premature. I'll keep watching bug 68664 and will be sure to reopen my bug report if the fix does not help on ARM Cortex-A9. Thanks for generating some sort of activity anyway. It's surely better than nothing.
@jgreenhalgh Please have a look at the profiled assembly for both fast and slow codegen. (attached) According to @aldyh's bisection in #68664 this probably isn't the same issue.
(In reply to PeteVine from comment #9)
> @jgreenhalgh Please have a look at the profiled assembly for both fast and
> slow codegen. (attached)
>
> According to @aldyh's bisection in #68664 this probably isn't the same issue.

In the attached code I once again see the vdiv moved before the branch in the slow case.

Looking at the bisection is one way to triage a bug, but it points to a change in scheduling model for Cortex-A53, and the analysis in this report indicates that the same bad scheduling decision is made with the Cortex-A9 and Cortex-A15 scheduling models.

If the scheduler is making bad decisions across a range of models, it is (in my opinion) more instructive to look for the pattern shared across those models and fix the scheduler than it is to tweak each scheduling model individually to avoid the abnormal case here.
Super cool, thanks! That makes the OP a true prophet before his time ;)
Nice, PR68664 patch has fixed the issue.

FWIW, unlike previously, running on a Cortex-A53, showed perfect alignment with core type (-mfpu=vfpv3) on the first run:

Cortex-A8
Rendering took: 1 seconds (1801 milliseconds)
Cortex-A5
Rendering took: 1 seconds (1708 milliseconds)
Cortex-A7
Rendering took: 1 seconds (1699 milliseconds)
Cortex-A9
Rendering took: 1 seconds (1644 milliseconds)
Cortex-A15
Rendering took: 1 seconds (1637 milliseconds)

whereas using -mfpu=vfpv4 favours Cortex-A5 code's execution:

Cortex-A8
Rendering took: 1 seconds (1803 milliseconds)
Cortex-A5
Rendering took: 1 seconds (1506 milliseconds)
Cortex-A7
Rendering took: 1 seconds (1636 milliseconds)
Cortex-A9
Rendering took: 1 seconds (1645 milliseconds)
Cortex-A15
Rendering took: 1 seconds (1643 milliseconds)

but that's probably expected. Not sure about A8's codegen performance though.