Created attachment 53055
The problematic function, adapted for standalone compilation
Hello, I found out that the blake2b implementation in monocypher runs much slower on a SPARC T4 when compiled with `-O3 -mvis3`, as opposed to plain `-O3`:
With plain -O3: Blake2b : 184 megabytes per second
With -O3 -mvis3: Blake2b : 118 megabytes per second
(Results are from monocypher's `make speed` benchmark)
Looking at the generated assembly, it seems that when the code is compiled with -mvis3, GCC emits a lot of questionable `movxtod`/`movdtox` instructions.
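For readers unfamiliar with these VIS 3 instructions: `movxtod` copies the 64 bits of an integer register into a double-precision floating-point register, and `movdtox` copies them back, with no numeric conversion in either direction. A rough C-level equivalent is a bit-for-bit copy, sketched below. The helper names `bits_to_double` and `double_to_bits` are made up for illustration and are not part of monocypher, and whether GCC actually emits `movxtod`/`movdtox` for them depends on the target flags.

```c
#include <stdint.h>
#include <string.h>

/* Copy the 64 bits of an integer into a double without converting the value.
   Assumption: this is the C-level analogue of what movxtod does between
   the integer and floating-point register files. */
static inline double bits_to_double(uint64_t x)
{
    double d;
    memcpy(&d, &x, sizeof d);
    return d;
}

/* The reverse direction (the analogue of movdtox). */
static inline uint64_t double_to_bits(double d)
{
    uint64_t x;
    memcpy(&x, &d, sizeof x);
    return x;
}
```

Since blake2b only does 64-bit integer arithmetic, any such moves in its assembly are pure data shuffling between register files rather than useful work.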
I'm using sparc64-linux-gnu-gcc (GCC) 12.1.0.
You can check -fopt-info-vec for vectorization. Note the sparc backend doesn't implement any of GCC's vectorizer cost modeling hooks.
Created attachment 53066
Vectorization log from -fopt-info-vec-all
(In reply to Richard Biener from comment #1)
> You can check -fopt-info-vec for vectorization.
I tried recompiling it with -fopt-info-vec-all and got a long message that includes:
> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis:
> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis for part in loop 0:
> Vector cost: 2282
> Scalar cost: 181
> blake2b-monocypher-standalone.c:75:18: missed: not vectorized: vectorization is not profitable.
So I don't think that GCC vectorized that function.
Also, I tried recompiling with -fno-tree-vectorize and it doesn't improve anything.
Seems like the problem isn't in the vectorizer?
(it still produces the same slow code with many `movxtod`/`movdtox`s)
I guess that, under high register pressure, the register allocator would rather use floating-point registers than spill values to the stack.
(In reply to Eric Botcazou from comment #3)
> I guess that, under high register pressure, the register allocator would
> rather use floating-point registers than spill values to the stack.
I suppose so?
However, I found that when compiling the source from the previous comment with -mvis3, it emits over 1400 movXtoY instructions, resulting in 1300-ish extra instructions compared to the version without VIS 3, which seems quite odd to me.
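A smaller kernel with the same shape might make the register-pressure theory easier to check. The following is only a sketch, not the attached function: `rotr64`, `MIX` and `pressure_kernel` are hypothetical names, the BLAKE2b message schedule permutation is left out, and it is untested whether it triggers the same moves. It keeps 16 64-bit state words live across a loop, which is more than fits comfortably in the integer registers.

```c
#include <stdint.h>

/* Rotate a 64-bit word right by n bits (0 < n < 64). */
static inline uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* BLAKE2b-style mixing step on four state words and two message words. */
#define MIX(a, b, c, d, x, y)                      \
    do {                                           \
        a += b + (x); d = rotr64(d ^ a, 32);       \
        c += d;       b = rotr64(b ^ c, 24);       \
        a += b + (y); d = rotr64(d ^ a, 16);       \
        c += d;       b = rotr64(b ^ c, 63);       \
    } while (0)

/* Hypothetical reduced kernel: all 16 state words stay live in locals.
   Not real BLAKE2b: the per-round message permutation is omitted. */
void pressure_kernel(uint64_t v[16], const uint64_t m[16], int rounds)
{
    uint64_t v0 = v[0],   v1 = v[1],   v2 = v[2],   v3 = v[3];
    uint64_t v4 = v[4],   v5 = v[5],   v6 = v[6],   v7 = v[7];
    uint64_t v8 = v[8],   v9 = v[9],   v10 = v[10], v11 = v[11];
    uint64_t v12 = v[12], v13 = v[13], v14 = v[14], v15 = v[15];

    for (int r = 0; r < rounds; r++) {
        /* Column step. */
        MIX(v0, v4, v8,  v12, m[0],  m[1]);
        MIX(v1, v5, v9,  v13, m[2],  m[3]);
        MIX(v2, v6, v10, v14, m[4],  m[5]);
        MIX(v3, v7, v11, v15, m[6],  m[7]);
        /* Diagonal step. */
        MIX(v0, v5, v10, v15, m[8],  m[9]);
        MIX(v1, v6, v11, v12, m[10], m[11]);
        MIX(v2, v7, v8,  v13, m[12], m[13]);
        MIX(v3, v4, v9,  v14, m[14], m[15]);
    }

    v[0] = v0;   v[1] = v1;   v[2] = v2;    v[3] = v3;
    v[4] = v4;   v[5] = v5;   v[6] = v6;    v[7] = v7;
    v[8] = v8;   v[9] = v9;   v[10] = v10;  v[11] = v11;
    v[12] = v12; v[13] = v13; v[14] = v14;  v[15] = v15;
}
```

Compiling this once with `-O3 -S` and once with `-O3 -mvis3 -S` and counting the `movxtod`/`movdtox` lines in each output would show whether the extra moves appear without the rest of the blake2b code.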