Bug 105782

Summary:	emission of inefficient movxtod/movdtox with -mvis3
Product:	gcc	Reporter:	Koakuma <koachan+gccbugs>
Component:	target	Assignee:	Eric Botcazou <ebotcazou>
Status:	ASSIGNED ---
Severity:	normal	CC:	ebotcazou, sjames
Priority:	P3	Keywords:	missed-optimization
Version:	12.1.0
Target Milestone:	---
Host:		Target:	sparc64--
Build:		Known to work:
Known to fail:		Last reconfirmed:	2022-06-07 00:00:00
Attachments:	The problematic function, adapted for standalone compilation; Vectorization log from -fopt-info-vec-all

Description Koakuma 2022-05-30 23:22:51 UTC

Created attachment 53055 [details]
The problematic function, adapted for standalone compilation

Hello, I found out that the blake2b implementation in monocypher runs much slower on a SPARC T4 when compiled with `-O3 -mvis3`, as opposed to plain `-O3`:

With plain -O3:  Blake2b : 184 megabytes  per second
With -O3 -mvis3: Blake2b : 118 megabytes  per second

(Results are from monocypher's `make speed` benchmark)

Looking at the generated assembly, it seems that when the code is compiled with -mvis3, GCC emits a lot of questionable `movxtod`/`movdtox` instructions.

I'm using sparc64-linux-gnu-gcc (GCC) 12.1.0.

Comment 1 Richard Biener 2022-06-01 11:53:13 UTC

You can check -fopt-info-vec for vectorization.  Note the sparc backend doesn't implement any of GCC's vectorizer cost modeling hooks.

Comment 2 Koakuma 2022-06-01 23:31:13 UTC

Created attachment 53066 [details]
Vectorization log from -fopt-info-vec-all

(In reply to Richard Biener from comment #1)
> You can check -fopt-info-vec for vectorization.

I tried recompiling it with -fopt-info-vec-all and I got a long message that
ends with:

> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis: 
> blake2b-monocypher-standalone.c:75:18: note: Cost model analysis for part in loop 0:
>   Vector cost: 2282
>   Scalar cost: 181
> blake2b-monocypher-standalone.c:75:18: missed: not vectorized: vectorization is not profitable.

So I don't think that GCC vectorized that function.

Also, I tried recompiling with -fno-tree-optimize and it doesn't improve anything.
Seems like the problem isn't in the vectorizer?
(it still produces the same slow code with many `movxtod`/`movdtox`s)

Comment 3 Eric Botcazou 2022-06-07 06:16:13 UTC

I guess that, under high register pressure, the register allocator would rather use floating-point registers than spill values to the stack.
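
To illustrate the kind of code where this pressure can arise, here is a minimal sketch of one BLAKE2b round. It is not the code from attachment 53055: the names `rotr64`, `G`, and `mix_round` are hypothetical, and only the mixing steps and rotation constants follow RFC 7693. The 16 state words and 16 message words are all 64-bit integers, so a fully unrolled compression function may want more live integer values than there are allocatable integer registers. With -mvis3, a direct integer/floating-point register move (`movxtod`/`movdtox`) becomes available, which may make parking a value in a floating-point register look cheaper to the allocator than a stack spill.

```c
#include <stdint.h>

/* Rotate a 64-bit word right by n bits (valid for 0 < n < 64). */
static inline uint64_t rotr64(uint64_t x, unsigned n)
{
    return (x >> n) | (x << (64 - n));
}

/* BLAKE2b mixing function G (rotation constants 32, 24, 16, 63 as in
 * RFC 7693). The macro name and layout are illustrative only. */
#define G(a, b, c, d, x, y)                  \
    do {                                     \
        a += b + (x); d = rotr64(d ^ a, 32); \
        c += d;       b = rotr64(b ^ c, 24); \
        a += b + (y); d = rotr64(d ^ a, 16); \
        c += d;       b = rotr64(b ^ c, 63); \
    } while (0)

/* Hypothetical stand-in for one round with the identity message schedule:
 * four column steps followed by four diagonal steps. All 16 state words
 * are read and written, and all 16 message words are read. */
void mix_round(uint64_t v[16], const uint64_t m[16])
{
    uint64_t w[16];
    int i;

    /* Work on a local copy so the compiler is free to keep it in registers. */
    for (i = 0; i < 16; i++)
        w[i] = v[i];

    G(w[0], w[4], w[8],  w[12], m[0],  m[1]);
    G(w[1], w[5], w[9],  w[13], m[2],  m[3]);
    G(w[2], w[6], w[10], w[14], m[4],  m[5]);
    G(w[3], w[7], w[11], w[15], m[6],  m[7]);
    G(w[0], w[5], w[10], w[15], m[8],  m[9]);
    G(w[1], w[6], w[11], w[12], m[10], m[11]);
    G(w[2], w[7], w[8],  w[13], m[12], m[13]);
    G(w[3], w[4], w[9],  w[14], m[14], m[15]);

    for (i = 0; i < 16; i++)
        v[i] = w[i];
}
```

Whether this reduced sketch reproduces the slowdown has not been checked here; the attached standalone file remains the actual test case.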

Comment 4 Koakuma 2022-06-08 14:01:27 UTC

(In reply to Eric Botcazou from comment #3)
> I guess that, under high register pressure, the register allocator would rather
> use floating-point registers than spill values to the stack.

I suppose so?
However, I found that compiling the source from the previous comment with -mvis3 emits over 1400 movXtoY instructions, resulting in 1300-ish extra instructions compared to the version without VIS 3, which seems quite weird to me.

Comment 5 Eric Botcazou 2022-06-30 10:39:58 UTC

Investigating.