Summary: emission of inefficient movxtod/movdtox with -mvis3
Component: target
Assignee: Eric Botcazou <ebotcazou>
Last reconfirmed: 2022-06-07 00:00:00
Description Koakuma 2022-05-30 23:22:51 UTC
Created attachment 53055
The problematic function, adapted for standalone compilation

Hello,

I found out that the blake2b implementation in monocypher runs much slower on a SPARC T4 when compiled with `-O3 -mvis3`, as opposed to plain `-O3`:

With plain -O3: Blake2b : 184 megabytes per second
With -O3 -mvis3: Blake2b : 118 megabytes per second

(Results are from monocypher's `make speed` benchmark)

Looking at the generated assembly, it seems that when the code is compiled with -mvis3, GCC emits a lot of questionable `movxtod`/`movdtox` instructions?

I'm using sparc64-linux-gnu-gcc (GCC) 12.1.0.
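For readers without the attachment at hand, here is a minimal sketch of the BLAKE2b mixing step as specified in RFC 7693. It is not the attached file and not monocypher's source; the names `rotr64` and `blake2b_g` are illustrative assumptions. It only shows the kind of code involved: the compression function applies this step repeatedly to a 16-word state of 64-bit integers, so many 64-bit values are live at the same time.

```c
#include <stdint.h>

/* Rotate a 64-bit word right by n bits (n is reduced modulo 64). */
static inline uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> (n & 63)) | (x << ((64 - n) & 63));
}

/* One BLAKE2b "G" mixing step as given in RFC 7693.
 * a, b, c, d point to four words of the working state;
 * x and y are two message words.
 * The function name is hypothetical, chosen for this sketch only. */
static inline void blake2b_g(uint64_t *a, uint64_t *b,
                             uint64_t *c, uint64_t *d,
                             uint64_t x, uint64_t y)
{
    *a = *a + *b + x;
    *d = rotr64(*d ^ *a, 32);
    *c = *c + *d;
    *b = rotr64(*b ^ *c, 24);
    *a = *a + *b + y;
    *d = rotr64(*d ^ *a, 16);
    *c = *c + *d;
    *b = rotr64(*b ^ *c, 63);
}
```

Compiling a function built from many such steps with and without `-mvis3` and comparing the assembly output (`-S`) is one way to look for the extra `movxtod`/`movdtox` instructions, although this sketch alone is not guaranteed to reproduce the slowdown.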
Comment 1 Richard Biener 2022-06-01 11:53:13 UTC
You can check -fopt-info-vec for vectorization. Note the sparc backend doesn't implement any of GCC's vectorizer cost modeling hooks.
Comment 2 Koakuma 2022-06-01 23:31:13 UTC
Created attachment 53066
Vectorization log from -fopt-info-vec-all

(In reply to Richard Biener from comment #1)
> You can check -fopt-info-vec for vectorization.

I tried recompiling it with -fopt-info-vec-all and I got a long message that ends with:

> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis:
> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis for part in loop 0:
> Vector cost: 2282
> Scalar cost: 181
> blake2b-monocypher-standalone.c:75:18: missed: not vectorized: vectorization is not profitable.

So I don't think that GCC vectorized that function.

Also, I tried recompiling with -fno-tree-optimize and it doesn't improve anything. Seems like the problem isn't in the vectorizer? (It still produces the same slow code with many `movxtod`/`movdtox`s.)
Comment 3 Eric Botcazou 2022-06-07 06:16:13 UTC
I guess that, under high register pressure, the register allocator prefers using floating-point registers to spilling values to the stack.
Comment 4 Koakuma 2022-06-08 14:01:27 UTC
(In reply to Eric Botcazou from comment #3)
> I guess that, under high register pressure, the register allocator rather
> uses floating-point registers than spilling values on the stack.

I suppose so? However, I found that when compiling the source from the previous comment with -mvis3, it emits over 1400 movXtoY instructions, resulting in 1300-ish extra instructions compared to the version without VIS 3, which seems quite weird to me.
Comment 5 Eric Botcazou 2022-06-30 10:39:58 UTC