Created attachment 53055
The problematic function, adapted for standalone compilation

Hello,

I found out that the blake2b implementation in monocypher runs much slower on a SPARC T4 when compiled with `-O3 -mvis3`, as opposed to plain `-O3`:

With plain -O3:
Blake2b : 184 megabytes per second

With -O3 -mvis3:
Blake2b : 118 megabytes per second

(Results are from monocypher's `make speed` benchmark)

Looking at the generated assembly, it seems that when the code is compiled with -mvis3, GCC emits a lot of questionable `movxtod`/`movdtox` instructions?

I'm using sparc64-linux-gnu-gcc (GCC) 12.1.0.
You can check -fopt-info-vec for vectorization. Note the sparc backend doesn't implement any of GCC's vectorizer cost modeling hooks.
Created attachment 53066
Vectorization log from -fopt-info-vec-all

(In reply to Richard Biener from comment #1)
> You can check -fopt-info-vec for vectorization.

I tried recompiling it with -fopt-info-vec-all and I got a long message that ends with:

> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis:
> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis for part in loop 0:
> Vector cost: 2282
> Scalar cost: 181
> blake2b-monocypher-standalone.c:75:18: missed: not vectorized: vectorization is not profitable.

So I don't think that GCC vectorized that function.

Also, I tried recompiling with -fno-tree-vectorize and it doesn't improve anything. Seems like the problem isn't in the vectorizer? (It still produces the same slow code with many `movxtod`/`movdtox` instructions.)
I guess that, under high register pressure, the register allocator uses floating-point registers rather than spilling values on the stack.
(In reply to Eric Botcazou from comment #3)
> I guess that, under high register pressure, the register allocator uses
> floating-point registers rather than spilling values on the stack.

I suppose so? However, I found that when compiling the source from the previous comment with -mvis3, it emits over 1400 movXtoY instructions, resulting in 1300-ish extra instructions compared to the version without VIS 3, which seems quite odd to me.
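
For anyone who wants to look at the register-pressure theory without the full attachment, here is a minimal sketch of the kind of code involved. It is not the monocypher source: `mix_rounds` and the `MIX` macro are hypothetical names, the message schedule is simplified (no permutation), and I have not verified that this small kernel actually triggers the same `movxtod`/`movdtox` traffic. It only keeps 16 live 64-bit state words plus temporaries in flight, which is the situation described above.

```c
/*
 * Hypothetical reduced kernel: BLAKE2b-style mixing on 16 live 64-bit words.
 * Assumed usage (file names are placeholders):
 *   sparc64-linux-gnu-gcc -O3 -S mix.c -o plain.s
 *   sparc64-linux-gnu-gcc -O3 -mvis3 -S mix.c -o vis3.s
 *   grep -cE 'mov(xtod|dtox)' plain.s vis3.s
 */
#include <stdint.h>

/* Rotate a 64-bit word right by n bits (0 < n < 64). */
static inline uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* One mixing step using the BLAKE2b rotation amounts 32, 24, 16, 63. */
#define MIX(a, b, c, d, x, y)                  \
    do {                                       \
        a += b + (x); d = rotr64(d ^ a, 32);   \
        c += d;       b = rotr64(b ^ c, 24);   \
        a += b + (y); d = rotr64(d ^ a, 16);   \
        c += d;       b = rotr64(b ^ c, 63);   \
    } while (0)

/* Apply `rounds` column+diagonal rounds to v, using m as fixed message words. */
void mix_rounds(uint64_t v[16], const uint64_t m[16], int rounds)
{
    /* Pull the whole state into locals so all 16 words are live at once. */
    uint64_t v0 = v[0],  v1 = v[1],  v2 = v[2],   v3 = v[3];
    uint64_t v4 = v[4],  v5 = v[5],  v6 = v[6],   v7 = v[7];
    uint64_t v8 = v[8],  v9 = v[9],  v10 = v[10], v11 = v[11];
    uint64_t v12 = v[12], v13 = v[13], v14 = v[14], v15 = v[15];

    for (int r = 0; r < rounds; r++) {
        /* Column step. */
        MIX(v0, v4, v8,  v12, m[0],  m[1]);
        MIX(v1, v5, v9,  v13, m[2],  m[3]);
        MIX(v2, v6, v10, v14, m[4],  m[5]);
        MIX(v3, v7, v11, v15, m[6],  m[7]);
        /* Diagonal step. */
        MIX(v0, v5, v10, v15, m[8],  m[9]);
        MIX(v1, v6, v11, v12, m[10], m[11]);
        MIX(v2, v7, v8,  v13, m[12], m[13]);
        MIX(v3, v4, v9,  v14, m[14], m[15]);
    }

    v[0] = v0;   v[1] = v1;   v[2] = v2;   v[3] = v3;
    v[4] = v4;   v[5] = v5;   v[6] = v6;   v[7] = v7;
    v[8] = v8;   v[9] = v9;   v[10] = v10; v[11] = v11;
    v[12] = v12; v[13] = v13; v[14] = v14; v[15] = v15;
}
```

The file has no `main` on purpose, since it is only meant to be compiled to assembly with `-S` and compared between the two flag sets. If the counts differ a lot here too, that would be consistent with the allocator moving integer values through floating-point registers, but that is still a guess until someone confirms it.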
Investigating.