Bug 105782

Summary: emission of inefficient movxtod/movdtox with -mvis3
Product: gcc
Reporter: Koakuma <koachan+gccbugs>
Component: target
Assignee: Eric Botcazou <ebotcazou>
Status: ASSIGNED ---
Severity: normal
CC: ebotcazou, sjames
Priority: P3
Keywords: missed-optimization
Version: 12.1.0
Target Milestone: ---
Host:
Target: sparc64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-06-07 00:00:00
Attachments:
  The problematic function, adapted for standalone compilation
  Vectorization log from -fopt-info-vec-all

Description Koakuma 2022-05-30 23:22:51 UTC
Created attachment 53055 [details]
The problematic function, adapted for standalone compilation

Hello, I found out that the blake2b implementation in monocypher runs much slower on a SPARC T4 when compiled with `-O3 -mvis3`, as opposed to plain `-O3`:

With plain -O3:  Blake2b : 184 megabytes  per second
With -O3 -mvis3: Blake2b : 118 megabytes  per second

(Results are from monocypher's `make speed` benchmark)

Looking at the generated assembly, it seems that when the code is compiled with -mvis3, GCC emits a lot of questionable `movxtod`/`movdtox` instructions?

I'm using sparc64-linux-gnu-gcc (GCC) 12.1.0.
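
For context, `movxtod` and `movdtox` are VIS 3 instructions that copy 64 bits directly between an integer register and a double-precision floating-point register, without going through memory. A rough C analogue is a bit-for-bit copy between `uint64_t` and `double`. The helper names below are hypothetical, and whether a compiler maps them to these instructions depends on the target and options:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret 64 integer bits as a double.
   Conceptually what movxtod does (integer register -> FP register). */
double bits_to_double(uint64_t x)
{
    double d;
    memcpy(&d, &x, sizeof d);   /* bit copy, no numeric conversion */
    return d;
}

/* Hypothetical helper: reinterpret a double as 64 integer bits.
   Conceptually what movdtox does (FP register -> integer register). */
uint64_t double_to_bits(double d)
{
    uint64_t x;
    memcpy(&x, &d, sizeof x);   /* bit copy, no numeric conversion */
    return x;
}
```

Nothing in the blake2b code asks for such moves explicitly, which is why seeing many of them in the output is surprising.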
Comment 1 Richard Biener 2022-06-01 11:53:13 UTC
You can check -fopt-info-vec for vectorization.  Note the sparc backend doesn't implement any of GCC's vectorizer cost modeling hooks.
Comment 2 Koakuma 2022-06-01 23:31:13 UTC
Created attachment 53066 [details]
Vectorization log from -fopt-info-vec-all

(In reply to Richard Biener from comment #1)
> You can check -fopt-info-vec for vectorization.

I tried recompiling it with -fopt-info-vec-all and I got a long message that
ends with:

> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis: 
> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis for part in loop 0:
>   Vector cost: 2282
>   Scalar cost: 181
> blake2b-monocypher-standalone.c:75:18: missed: not vectorized: vectorization is not profitable.

So I don't think that GCC vectorized that function.

Also, I tried recompiling with -fno-tree-optimize and it doesn't improve anything.
Seems like the problem isn't in the vectorizer?
(it still produces the same slow code with many `movxtod`/`movdtox`s)
Comment 3 Eric Botcazou 2022-06-07 06:16:13 UTC
I guess that, under high register pressure, the register allocator would rather use floating-point registers than spill values to the stack.
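
To illustrate where the register pressure could come from, here is a minimal sketch of the BLAKE2b mixing step as specified in RFC 7693. It is not the attached monocypher code, and the function names are only illustrative. The compression function applies this step to sixteen 64-bit state words while also reading sixteen 64-bit message words, so a fully unrolled round can keep more 64-bit values live than there are free integer registers:

```c
#include <stdint.h>

/* Rotate a 64-bit word right by n bits (0 < n < 64). */
uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* One G mixing step (RFC 7693) on state words v[a], v[b], v[c], v[d],
   mixing in two message words x and y. */
void blake2b_g(uint64_t v[16], int a, int b, int c, int d,
               uint64_t x, uint64_t y)
{
    v[a] += v[b] + x;  v[d] = rotr64(v[d] ^ v[a], 32);
    v[c] += v[d];      v[b] = rotr64(v[b] ^ v[c], 24);
    v[a] += v[b] + y;  v[d] = rotr64(v[d] ^ v[a], 16);
    v[c] += v[d];      v[b] = rotr64(v[b] ^ v[c], 63);
}
```

Each round calls this step eight times (four columns, then four diagonals), so every state word is read and written repeatedly in a short span.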
Comment 4 Koakuma 2022-06-08 14:01:27 UTC
(In reply to Eric Botcazou from comment #3)
> I guess that, under high register pressure, the register allocator would
> rather use floating-point registers than spill values to the stack.

I suppose so?
However, I found that when compiling the source from the previous comment with -mvis3, GCC emits over 1400 movXtoY instructions, resulting in 1300-ish extra instructions compared to the version without VIS 3, which seems quite weird to me.
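
A count like this can be reproduced by scanning the `-S` output for the mnemonics. The sketch below is one hypothetical way to do that in C. It only looks for the two instructions named in this report, and the function names are made up for illustration:

```c
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of needle in text. */
size_t count_occurrences(const char *text, const char *needle)
{
    size_t count = 0;
    size_t len = strlen(needle);

    if (len == 0)
        return 0;
    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + len, needle))
        count++;
    return count;
}

/* Count movxtod and movdtox mnemonics in a buffer of assembly text.
   This is a plain substring match, so a comment or label containing
   either name would be counted as well. */
size_t count_vis3_moves(const char *asm_text)
{
    return count_occurrences(asm_text, "movxtod")
         + count_occurrences(asm_text, "movdtox");
}
```

Reading the assembly file into a buffer and passing it to `count_vis3_moves` for both the `-O3` and the `-O3 -mvis3` build gives the two numbers to compare.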
Comment 5 Eric Botcazou 2022-06-30 10:39:58 UTC
Investigating.