[PATCH 1/7]: SVE: Add CLOBBER_HIGH expression

Alan Hayward Alan.Hayward@arm.com
Thu Nov 30 11:16:00 GMT 2017


> On 27 Nov 2017, at 17:29, Jeff Law <law@redhat.com> wrote:
> 
> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>> 
>>> On 22 Nov 2017, at 17:33, Jeff Law <law@redhat.com> wrote:
>>> 
>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>> 
>>>>> On 21 Nov 2017, at 03:13, Jeff Law <law@redhat.com> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  I'd
>>>>>>> totally forgotten about it.  And in fact it seems to come pretty close
>>>>>>> to what you need…
>>>>>> 
>>>>>> Yes, some of the code is similar to the way
>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>> CLOBBER expr code served as a starting point for writing the patch. The main difference
>>>>>> here is that _PART_CLOBBERED applies around all calls and is not tied to a specific instruction;
>>>>>> it’s part of the calling ABI. Whereas clobber_high is explicitly tied to an expression (tls_desc).
>>>>>> It meant there wasn’t really any opportunity to reuse any existing code.
>>>>> Understood.  Though your first patch mentions that you're trying to
>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>> represented as normal insns, not call_insn.
>>>>> 
>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>> the register.
>>>>> 
>>>>> That should tell the generic parts of the compiler that you're just
>>>>> clobbering the upper part and at least in theory you can implement in
>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>> because that's the existing semantics of a subreg store.
>>>>> 
>>>>> The only worry would be if a pass tried to get overly smart and
>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>> wrong given the semantics of a partial store.
>>>>> 
>>>> 
>>>> So, instead of using clobber_high(reg X), the suggestion is to use set(reg X, reg X).
>>>> It’s something we considered, and then dismissed.
>>>> 
>>>> The problem is that you are then using SET semantics on those registers, which
>>>> would make the register live across the function - and that might not be the case.
>>>> Whereas clobber semantics will just make the register dead - which is exactly
>>>> what we want (but only conditionally).
>>> ?!?  A set of the subreg is the *exact* semantics you want.  It says the
>>> low part is preserved while the upper part is clobbered across the TLS
>>> insns.
>>> 
>>> jeff
>> 
>> Consider where the TLS call is inside a loop. The compiler would normally want
>> to hoist that out of the loop. By adding a set(x,x) into the parallel of the tls_desc we
>> are now making x live across the loop: x is dependent on the value from the previous
>> iteration, and the tls_desc can no longer be hoisted.
> Hmm.  I think I see the problem you're trying to point out.  Let me
> restate it and see if you agree.
> 
> The low subreg set does clearly indicate the upper part of the SVE
> register is clobbered.  The problem is from a liveness standpoint the
> compiler is considering the low part live, even though it's a self-set.
> 

Yes.

> In fact, if that is the case, then a single TLS call (independent of a
> loop) would make the low part of the register globally live.  This
> should be testable.  Include one of these low part self sets on the
> existing TLS calls and compile a little test function and let's look at
> the liveness data.
> 


Ok, so I’ve put 
(set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
(set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
etc up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all the neon registers.
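To be concrete, the experiment gives the pattern roughly the following shape. This is only a sketch, not the actual aarch64.md change - the real UNSPEC_TLSDESC pattern has operands and clobbers that I’m eliding here:

```
(parallel
 [;; ... the existing tlsdesc set/use/clobber operands ...
  (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
  (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
  ;; ... one TImode self-set per neon register ...
  (set (reg:TI V31_REGNUM) (reg:TI V31_REGNUM))])
```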

In an ideal world, this change would do nothing as neon reg size is TI.

First, I compiled sve_tls_preserve_1.c (Test1 below).

Without the SET changes, gcc does not back up any neon registers before the tlsdesc call (good).
With the SET changes, gcc backs up all the neon registers (not ideal).

Without the SET changes, dump logs give:

cse1:
;;  regs ever live 	 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 66 [cc]

cprop1:
;; lr  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; lr  use 	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; lr  def 	 0 [x0] 30 [x30] 32 [v0] 66 [cc] 79 80 81 82 84 85 86 87 88 89
;; live  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; live  gen 	 0 [x0] 32 [v0] 78 79 80 81 82 83 84 85 86 87 88 89
;; live  kill	 30 [x30] 66 [cc]
;; lr  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
;; live  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]

reload:
;;  regs ever live 	 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 66 [cc]

With the SET changes:

cse1:
;;  regs ever live 	 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]

cprop1:
;; lr  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
;; lr  use 	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
;; lr  def 	 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 79 80 81 82 84 85 86 87 88 89
;; live  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
;; live  gen 	 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 78 79 80 81 82 83 84 85 86 87 88 89
;; live  kill	 30 [x30] 66 [cc]
;; lr  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
;; live  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]

reload:
;;  regs ever live 	 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]


Is there any extra dump data you think would be useful?


For completeness, I also ran through some more tests.



Second, I tried putting the tls call inside a simple for loop (Test2 below).

Both before and after the SET changes, gcc hoisted the tlsdesc call out of the loop.
This had already happened before the expand pass.
(And, as before, with the SET changes gcc backs up all the neon registers before the tlsdesc call.)


Third, I made the tls variable an array and accessed elements from it in the loop (Test3 below).

Both before and after the SET changes, gcc fails to hoist the tlsdesc call.
That’s a missed optimisation - I’m not sure how tricky or worthwhile it would be to fix (at the
tree level the tls load looks just like a MEM base-plus-index access; I guess you’d need something
specific in an rtl pass to do the hoist).
Anyway, we now have a tlsdesc inside a loop, which is what I wanted.
Without the SET changes, nothing is backed up inside the loop.
With the SET changes, gcc tries to back up the live neon registers, and ICEs with
"error: unable to find a register to spill" in lra-assigns.c.


Fourth, I took the original test and made two tls accesses in a row with an asm volatile in between (Test4 below).

With the SET changes, gcc ICEs in the same way.

Looking at the dump files, without the SET changes the logs for the final test are similar to the very first test:

;;  regs ever live 	 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]

;; lr  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; lr  use 	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; lr  def 	 0 [x0] 30 [x30] 32 [v0] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
;; live  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
;; live  gen 	 0 [x0] 32 [v0] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
;; live  kill	 30 [x30] 66 [cc]
;; lr  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
;; live  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]


With the SET changes:

;;  regs ever live 	 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 66 [cc]

;; lr  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
;; lr  use 	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
;; lr  def 	 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
;; live  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
;; live  gen 	 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
;; live  kill	 30 [x30] 66 [cc]
;; lr  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
;; live  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]




Test1:

__thread v4si tx;

v4si foo (v4si a, v4si b, v4si c)
{
  v4si y;

  y = a + tx + b + c;

  return y + 7;
}


Test2:


__thread v4si tx;

v4si foo (v4si a, v4si *b, v4si *c)
{
  v4si y;

  for(int i=0; i<MAX; i++)
    y += a + tx + b[i] + c[i];

  return y + 7;
}


Test3:

__thread v4si tx[MAX];

v4si foo (v4si a, v4si *b, v4si *c)
{
  v4si y;

  for(int i=0; i<MAX; i++)
    y += a + tx[i] + b[i] + c[i];

  return y + 7;
}


Test4:

__thread v4si tx;
__thread v4si ty;

v4si foo (v4si a, v4si b, v4si c)
{
  v4si y;

  y = a + tx + b + c;

  asm volatile ("" : : : "memory");

  y += a + ty + b + c;

  return y + 7;
}


> 
> Now it could be the case that various local analysis could sub-optimally
> handle things.  You mention LICM.  I know our original LICM did have a
> problem in that if it saw a use of a hard reg in a loop without seeing a
> set of that hard reg it considered the register varying within the loop.
> I have no idea if we carried that forward when the loop code was
> rewritten (when I looked at this it was circa 1992).
> 
> 
>> 
>> Or consider a stream of code containing two tls_desc calls (ok, the compiler might
>> optimise one of the tls calls away, but this approach should be reusable for other exprs).
>> Between the two set(x,x)’s x is considered live so the register allocator can’t use that
>> register.
>> Given that we are applying this to all the neon registers, the register allocator now throws
>> an ICE because it can’t find any free hard neon registers to use.
> Given your statements it sounds like the liveness infrastructure is
> making those neon regs globally live when it sees the low part subreg
> self-set.  Let's confirm that one way or the other and see where it
> takes us.
> 
> Jeff


