This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: RFA: pervasive SSE codegen inefficiency
- From: Andrew Pinski <pinskia at physics dot uc dot edu>
- To: Dale Johannesen <dalej at apple dot com>
- Cc: gcc mailing list <gcc at gcc dot gnu dot org>
- Date: Thu, 15 Sep 2005 00:50:12 -0400
- Subject: Re: RFA: pervasive SSE codegen inefficiency
- References: <842d40888007e4e33ece107d868e027d@apple.com>
On Sep 14, 2005, at 9:21 PM, Dale Johannesen wrote:
Consider the following SSE code
(-march=pentium4 -mtune=prescott -O2 -mfpmath=sse -msse2)
<4256776a.c>
The first inner loop compiles to
paddq %xmm0, %xmm1
Good. The second compiles to
movdqa %xmm2, %xmm0
paddw %xmm1, %xmm0
movdqa %xmm0, %xmm1
when it could be using a single paddw. The basic problem is that
our approach defines __m128i to be V2DI even though all the operations
on the object are V4SI, so there are a lot of subreg's that don't need
to generate code. I'd like to fix this, but am not sure how to go
about it.
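For readers without the attachment, here is a minimal sketch of the kind of source being discussed. It is not the attached test case; the type and function names are made up for illustration, and it uses the GCC/Clang vector extensions rather than the real intrinsics header. It only shows why a value declared with a V2DI-like type but operated on as V8HI needs a reinterpreting cast at each use, and that cast is what appears as a subreg in the RTL.

```c
/* GCC/Clang vector extensions; these type names are local to this sketch. */
typedef long long v2di __attribute__((vector_size(16)));
typedef short     v8hi __attribute__((vector_size(16)));

/* Same shape as an __m128i add of 16-bit lanes: the value travels as
   v2di, but the arithmetic is done in v8hi. Each cast is a bit-for-bit
   reinterpretation and should not need any instruction of its own. */
static v2di add_epi16_sketch(v2di a, v2di b)
{
    return (v2di)((v8hi)a + (v8hi)b);
}

/* Hypothetical inner loop: accumulate z into a, n times.
   Ideally each iteration is a single paddw. */
static v2di accumulate(v2di a, v2di z, int n)
{
    for (int i = 0; i < n; i++)
        a = add_epi16_sketch(a, z);
    return a;
}
```

Compiling something like this with the options quoted above and looking at the loop body is one way to see whether the extra movdqa instructions appear.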
From the looks of it, this is more of a register allocation issue and has
nothing to do with subregs at all, except that subregs happen to be there.
Take a look at .greg:
;; 4 regs to allocate: 64 (4) 61 63 (4) 65
;; 61 conflicts: 61 63 64 65 66 7 21
;; 63 conflicts: 61 63 64 65 66 7 21 22
;; 64 conflicts: 61 63 64 65 7
;; 64 preferences: 21 22
;; 65 conflicts: 61 63 64 65 66 7 21
;; 66 conflicts: 61 63 65 66 7 21
;; 66 preferences: 22
;; 67 conflicts: 67 7 21
;; 67 preferences: 22
and then look at the allocation:
(reg:V8HI 21 xmm0 [66])
(reg:V8HI 22 xmm1 [orig:64 a ] [64])
(reg/v:V2DI 23 xmm2 [orig:63 z ] [63])
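The conflict lists in the .greg dump can be checked mechanically. The following is only a sketch for reading the dump, not anything from GCC itself: the table is copied from the lists above, and the struct and function names are invented for illustration. Numbers 21 and 22 are the hard registers xmm0 and xmm1, as the allocation lines show.

```c
#include <stddef.h>

/* One row per pseudo: the pseudo number, then the registers it is listed
   as conflicting with in the dump, terminated by -1. */
struct conflict_row { int pseudo; int conflicts[10]; };

static const struct conflict_row greg_conflicts[] = {
    {61, {61, 63, 64, 65, 66, 7, 21, -1}},
    {63, {61, 63, 64, 65, 66, 7, 21, 22, -1}},
    {64, {61, 63, 64, 65, 7, -1}},
    {65, {61, 63, 64, 65, 66, 7, 21, -1}},
    {66, {61, 63, 65, 66, 7, 21, -1}},
    {67, {67, 7, 21, -1}},
};

/* Return 1 if register b appears in the conflict list of pseudo a. */
static int listed(int a, int b)
{
    size_t rows = sizeof greg_conflicts / sizeof greg_conflicts[0];
    for (size_t i = 0; i < rows; i++) {
        if (greg_conflicts[i].pseudo != a)
            continue;
        for (int j = 0; greg_conflicts[i].conflicts[j] != -1; j++)
            if (greg_conflicts[i].conflicts[j] == b)
                return 1;
    }
    return 0;
}

/* Two registers conflict if either one lists the other. */
static int conflicts(int a, int b)
{
    return listed(a, b) || listed(b, a);
}
```

With this table, conflicts(63, 64) is 1 while conflicts(64, 66) is 0, so according to the dump nothing prevents pseudos 64 and 66 from sharing a hard register.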
Original instructions:
(insn:HI 23 21 25 2 (set (reg:V8HI 66)
(plus:V8HI (subreg:V8HI (reg/v:V2DI 63 [ z ]) 0)
(subreg:V8HI (reg/v:V2DI 64 [ a ]) 0))) 680 {*addv8hi3}
(nil)
(expr_list:REG_DEAD (reg/v:V2DI 64 [ a ])
(nil)))
(insn:HI 25 23 27 2 (set (reg/v:V2DI 64 [ a ])
(subreg:V2DI (reg:V8HI 66) 0)) 542 {*movv2di_internal}
(insn_list:REG_DEP_TRUE 23 (nil))
(expr_list:REG_DEAD (reg:V8HI 66)
(nil)))
(insn:HI 33 31 38 3 (set (reg:V8HI 67)
(plus:V8HI (subreg:V8HI (reg/v:V2DI 64 [ a ]) 0)
(subreg:V8HI (reg/v:V2DI 64 [ a ]) 0))) 680 {*addv8hi3}
(nil)
(expr_list:REG_DEAD (reg/v:V2DI 64 [ a ])
(nil)))
(note:HI 38 33 41 3 NOTE_INSN_FUNCTION_END)
(insn:HI 41 38 47 3 (set (reg/i:V2DI 21 xmm0 [ <result> ])
(subreg:V2DI (reg:V8HI 67) 0)) 542 {*movv2di_internal}
(insn_list:REG_DEP_TRUE 33 (nil))
(expr_list:REG_DEAD (reg:V8HI 67)
(nil)))
Instructions after allocation:
(insn 60 21 23 2 (set (reg:V8HI 21 xmm0 [66])
(reg:V8HI 23 xmm2)) 540 {*movv8hi_internal} (nil)
(nil))
(insn:HI 23 60 25 2 (set (reg:V8HI 21 xmm0 [66])
(plus:V8HI (reg:V8HI 21 xmm0 [66])
(reg:V8HI 22 xmm1 [orig:64 a ] [64]))) 680 {*addv8hi3} (nil)
(nil))
(insn:HI 25 23 27 2 (set (reg/v:V2DI 22 xmm1 [orig:64 a ] [64])
(reg:V2DI 21 xmm0 [66])) 542 {*movv2di_internal}
(insn_list:REG_DEP_TRUE 23 (nil))
(nil))
...
(insn 61 31 33 3 (set (reg:V8HI 21 xmm0 [67])
(reg:V8HI 22 xmm1)) 540 {*movv8hi_internal} (nil)
(nil))
(insn:HI 33 61 38 3 (set (reg:V8HI 21 xmm0 [67])
(plus:V8HI (reg:V8HI 21 xmm0 [67])
(reg:V8HI 22 xmm1 [orig:64 a ] [64]))) 680 {*addv8hi3} (nil)
(nil))
(note:HI 38 33 41 3 NOTE_INSN_FUNCTION_END)
(insn:HI 41 38 47 3 (set (reg/i:V2DI 21 xmm0 [ <result> ])
(reg:V2DI 21 xmm0 [67])) 542 {*movv2di_internal}
(insn_list:REG_DEP_TRUE 33 (nil))
(nil))
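The extra moves in the dump above follow from paddw being a two-operand instruction: the destination must also be one of the sources. A rough model of that cost, assuming the hard register numbers from the dump (21, 22, 23 for xmm0, xmm1, xmm2) and ignoring liveness entirely; the function names are hypothetical:

```c
/* Copies needed to emit dst = a + b with a destructive two-operand add
   (dst = dst + src). The add is commutative, so no copy is needed when
   dst already holds either input. Liveness of the inputs is ignored. */
static int copies_for_add(int dst, int a, int b)
{
    return (dst == a || dst == b) ? 0 : 1;
}

/* Copies for one iteration of the first loop, modelled as
   tmp = z + acc; acc = tmp. */
static int copies_for_iteration(int reg_tmp, int reg_z, int reg_acc)
{
    int n = copies_for_add(reg_tmp, reg_z, reg_acc);
    if (reg_tmp != reg_acc)
        n += 1;   /* the move back from tmp to acc cannot be deleted */
    return n;
}
```

For the allocation shown (tmp in 21, z in 23, acc in 22) this gives 2, matching the two movdqa instructions around the paddw. If tmp were placed in 22 together with acc, it gives 0, which is the single-paddw loop.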
If we had allocated 64 and 66 to the same register (the dump shows 63 and 64
conflicting, so those two cannot share one), it would have worked correctly.
Yes, removing the extra set helps, but it does not solve the real issue of
the register allocator being stupid.
Thanks,
Andrew Pinski