RFC: ARM 64-bit shifts in NEON

Mon Dec 12 16:29:00 GMT 2011

On 07/12/11 13:42, Richard Earnshaw wrote:
> So it looks like the code generated for core registers with thumb2 is
> pretty rubbish (no real surprise there).  To get the best code you need
> to make use of the fact that on ARM a shift by a small negative number
> (> -128) will give zero.  This gives us sequences like:
>
> For ARM state it's something like (untested)
>
>                                         @ shft < 32                      , shft >= 32
> __ashldi3_v3:
>         sub     r3, r2, #32             @ -ve                            , shft - 32
>         lsl     ah, ah, r2              @ ah << shft                     , 0
>         rsb     ip, r2, #32             @ 32 - shft                      , -ve
>         orr     ah, ah, al, lsl r3      @ ah << shft                     , al << (shft - 32)
>         orr     ah, ah, al, lsr ip      @ ah << shft | al >> (32 - shft) , al << (shft - 32)
>         lsl     al, al, r2              @ al << shft                     , 0
>
> For Thumb2 (where there is no orr with register shift)
>
>         lsls    ah, ah, r2              @ ah << shft                     , 0
>         sub     r3, r2, #32             @ -ve                            , shft - 32
>         lsl     ip, al, r3              @ 0                              , al << (shft - 32)
>         negs    r3, r3                  @ 32 - shft                      , -ve
>         orr     ah, ah, ip              @ ah << shft                     , al << (shft - 32)
>         lsr     r3, al, r3              @ al >> (32 - shft)              , 0
>         orrs    ah, ah, r3              @ ah << shft | al >> (32 - shft) , al << (shft - 32)
>         lsls    al, al, r2              @ al << shft                     , 0
>
> Neither of which needs the condition flags during execution (and indeed
> is probably better in both cases than the code currently in lib1funcs.asm
> for a modern core).  The flag clobbering behaviour in the thumb2 variant
> is only for code size saving; that would normally be added by a late
> optimization pass.
>
> None of this directly helps with your neon usage, but it does show that we
> really don't need to clobber the condition code register to get an
> efficient sequence.
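
To make the two-column comments in the quoted sequences concrete, here
is a rough C model of both of them. It is only a sketch: the names
(arm_lsl, arm_lsr, ashldi3_arm, ashldi3_thumb2, ashldi3_selfcheck) are
made up for illustration, it assumes shift counts of 0..63, and it
models a register-specified shift as using only the bottom byte of the
shift register, with counts of 32 or more giving zero.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of a register-specified LSL: bottom byte of the shift
   register only, and a count of 32 or more yields zero. */
static uint32_t arm_lsl(uint32_t x, uint32_t rs)
{
    uint32_t n = rs & 0xff;
    return n >= 32 ? 0 : x << n;
}

/* Same model for a register-specified LSR. */
static uint32_t arm_lsr(uint32_t x, uint32_t rs)
{
    uint32_t n = rs & 0xff;
    return n >= 32 ? 0 : x >> n;
}

/* Mirrors the quoted ARM-state sequence; r3 and ip are the scratches. */
static uint64_t ashldi3_arm(uint64_t v, uint32_t shft)
{
    uint32_t al = (uint32_t)v, ah = (uint32_t)(v >> 32);
    uint32_t r3 = shft - 32;        /* sub r3, r2, #32        */
    ah = arm_lsl(ah, shft);         /* lsl ah, ah, r2         */
    uint32_t ip = 32 - shft;        /* rsb ip, r2, #32        */
    ah |= arm_lsl(al, r3);          /* orr ah, ah, al, lsl r3 */
    ah |= arm_lsr(al, ip);          /* orr ah, ah, al, lsr ip */
    al = arm_lsl(al, shft);         /* lsl al, al, r2         */
    return ((uint64_t)ah << 32) | al;
}

/* Mirrors the quoted Thumb2 sequence (flag setting is not modelled). */
static uint64_t ashldi3_thumb2(uint64_t v, uint32_t shft)
{
    uint32_t al = (uint32_t)v, ah = (uint32_t)(v >> 32);
    ah = arm_lsl(ah, shft);             /* lsls ah, ah, r2  */
    uint32_t r3 = shft - 32;            /* sub  r3, r2, #32 */
    uint32_t ip = arm_lsl(al, r3);      /* lsl  ip, al, r3  */
    r3 = 0u - r3;                       /* negs r3, r3      */
    ah |= ip;                           /* orr  ah, ah, ip  */
    r3 = arm_lsr(al, r3);               /* lsr  r3, al, r3  */
    ah |= r3;                           /* orrs ah, ah, r3  */
    al = arm_lsl(al, shft);             /* lsls al, al, r2  */
    return ((uint64_t)ah << 32) | al;
}

/* Compares both models against a plain 64-bit shift for 0..63. */
static void ashldi3_selfcheck(uint64_t v)
{
    for (uint32_t s = 0; s < 64; s++) {
        assert(ashldi3_arm(v, s) == v << s);
        assert(ashldi3_thumb2(v, s) == v << s);
    }
}
```

Calling ashldi3_selfcheck() with a few bit patterns exercises both
columns of the comments: counts below 32 rely on the "negative" shift
collapsing to zero, and counts of 32 or more rely on the other one
doing so.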

Unfortunately, both these sequences use two scratch registers, as shown, 
and that's worse than clobbering CC.

Now, I can implement this for non-Neon easily enough, I think, and that
would be a win, but I'm trying to figure out how best to do it for both
that case and the case where Neon is available but the compiler chooses
not to use it.

The problem is that when there is no Neon available, this can be
converted at expand or split1 time, but when Neon *is* available we have
to wait until a post-reload split, and then we'd be forced to expand
this in early-clobber mode, which is far less efficient.

Any suggestions on how to do this without pessimizing the code in the
case that Neon is available but not used?

In fact, is the general shift operation sufficiently expensive that I
should just abandon the fall-back alternatives and *always* use Neon
when available? In this case, what about A8 vs. A9?

Thanks

Andrew