This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression


Ping ping.
I think there should be enough information in the first test to show that any "set to self”
registers become live. Let me know if there’s anything I’ve missed.


Thanks,
Alan.

> On 12 Dec 2017, at 11:11, Alan Hayward <Alan.Hayward@arm.com> wrote:
> 
> Ping.
> 
>> On 30 Nov 2017, at 11:03, Alan Hayward <Alan.Hayward@arm.com> wrote:
>> 
>> 
>>> On 27 Nov 2017, at 17:29, Jeff Law <law@redhat.com> wrote:
>>> 
>>> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>>>> 
>>>>> On 22 Nov 2017, at 17:33, Jeff Law <law@redhat.com> wrote:
>>>>> 
>>>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>>>> 
>>>>>>> On 21 Nov 2017, at 03:13, Jeff Law <law@redhat.com> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED.  I'd
>>>>>>>>> totally forgotten about it.  And in fact it seems to come pretty close
>>>>>>>>> to what you need…
>>>>>>>> 
>>>>>>>> Yes, some of the code is similar to the way
>>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>>>> CLOBBER expr code served as a starting point for writing the patch. The main difference
>>>>>>>> here, is that _PART_CLOBBERED is around all calls and is not tied to a specific Instruction,
>>>>>>>> it’s part of the calling abi. Whereas clobber_high is explicitly tied to an expression (tls_desc).
>>>>>>>> It meant there wasn’t really any opportunity to resume any existing code.
>>>>>>> Understood.  Though your first patch mentions that you're trying to
>>>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>>>> represented as normal insns, not call_insn.
>>>>>>> 
>>>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>>>> the register.
>>>>>>> 
>>>>>>> That should tell the generic parts of the compiler that you're just
>>>>>>> clobbering the upper part and at least in theory you can implement in
>>>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>>>> because that's the existing semantics of a subreg store.
>>>>>>> 
>>>>>>> The only worry would be if a pass tried to get overly smart and
>>>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>>>> wrong given the semantics of a partial store.
>>>>>>> 
>>>>>> 
>>>>>> So, the instead of using clobber_high(reg X), to use set(reg X, reg X).
>>>>>> It’s something we considered, and then dismissed.
>>>>>> 
>>>>>> The problem then is you are now using SET semantics on those registers, and it
>>>>>> would make the register live around the function, which might not be the case.
>>>>>> Whereas clobber semantics will just make the register dead - which is exactly
>>>>>> what we want (but only conditionally).
>>>>> ?!?  A set of the subreg is the *exact* semantics you want.  It says the
>>>>> low part is preserved while the upper part is clobbered across the TLS
>>>>> insns.
>>>>> 
>>>>> jeff
>>>> 
>>>> Consider where the TLS call is inside a loop. The compiler would normally want
>>>> to hoist that out of the loop. By adding a set(x,x) into the parallel of the tls_desc we
>>>> are now making x live across the loop, x is dependant on the value from the previous
>>>> iteration, and the tls_desc can no longer be hoisted.
>>> Hmm.  I think I see the problem you're trying to point out.  Let me
>>> restate it and see if you agree.
>>> 
>>> The low subreg set does clearly indicate the upper part of the SVE
>>> register is clobbered.  The problem is from a liveness standpoint the
>>> compiler is considering the low part live, even though it's a self-set.
>>> 
>> 
>> Yes.
>> 
>>> In fact, if that is the case, then a single TLS call (independent of a
>>> loop) would make the low part of the register globally live.  This
>>> should be testable.  Include one of these low part self sets on the
>>> existing TLS calls and compile a little test function and let's look at
>>> the liveness data.
>>> 
>> 
>> 
>> Ok, so I’ve put 
>> (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
>> (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
>> etc up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all the neon registers.
>> 
>> In an ideal world, this change would do nothing as neon reg size is TI.
>> 
>> First I compiled up  sve_tls_preserve_1.c (Test1 below)
>> 
>> Without the SET changes, gcc does not backup any neon registers before the tlsdesccall (good).
>> With the SET changes, gcc backs up all the neon registers (not ideal).
>> 
>> Without the SET changes, dump logs give:
>> 
>> cse1:
>> ;;  regs ever live 	 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 66 [cc]
>> 
>> cprop1:
>> ;; lr  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr  use 	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr  def 	 0 [x0] 30 [x30] 32 [v0] 66 [cc] 79 80 81 82 84 85 86 87 88 89
>> ;; live  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; live  gen 	 0 [x0] 32 [v0] 78 79 80 81 82 83 84 85 86 87 88 89
>> ;; live  kill	 30 [x30] 66 [cc]
>> ;; lr  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> 
>> reload:
>> ;;  regs ever live 	 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 66 [cc]
>> 
>> With the set changes:
>> 
>> cse1:
>> ;;  regs ever live 	 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>> 
>> cprop1:
>> ;; lr  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr  use 	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr  def 	 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 79 80 81 82 84 85 86 87 88 89
>> ;; live  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
>> ;; live  gen 	 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 78 79 80 81 82 83 84 85 86 87 88 89
>> ;; live  kill	 30 [x30] 66 [cc]
>> ;; lr  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> 
>> reload:
>> ;;  regs ever live 	 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>> 
>> 
>> Is there any extra dump data you think would be useful?
>> 
>> 
>> For completeness, I also ran through some more tests…..
>> 
>> 
>> 
>> Second I tried putting the tls call inside a simple for loop (Test2 below)
>> 
>> Both before and after the SET changes, gcc hoisted the tlsdesccall call out of the loop.
>> This has already happened before the expand pass.
>> (and as before, with the SET changes, gcc backs up all the neon registers before the tlsdesccall).
>> 
>> 
>> Third, I made the tls variable an array, and accessed elements from it in the loop. (Test3 below)
>> 
>> Both before and after the SET changes, gcc fails to hoist the tlsdesccall call.
>> That’s a missed optimisation chance - I’m not sure how tricky or worthwhile that’d be to fix (at the
>> tree level the tls load looks just like a MEM base plus index access. I guess you’d need something
>> specific in an rtl pass to be the hoist).
>> Anyway, we now have a tlsdec inside a loop, which is what I wanted.
>> Without the SET changes, nothing is backed up whilst inside the loop.
>> With the SET changes, gcc tries to back up the live neon registers, and ICEs with
>> "error: unable to find a register to spill” in lra-assigns.c.
>> 
>> 
>> Forth, I took the original test and made two tls accesses in a row with an asm volatile in-between (Test4 below).
>> 
>> With the SET changes, gcc ICEs in the same way.
>> Looking at the dump files,
>> 
>> Without the SET changes, dump logs for the final test are similar to the very first test:
>> 
>> ;;  regs ever live 	 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>> 
>> ;; lr  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr  use 	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr  def 	 0 [x0] 30 [x30] 32 [v0] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
>> ;; live  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; live  gen 	 0 [x0] 32 [v0] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
>> ;; live  kill	 30 [x30] 66 [cc]
>> ;; lr  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> 
>> 
>> With the SET changes:
>> 
>> ;;  regs ever live 	 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 66 [cc]
>> 
>> ;; lr  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr  use 	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr  def 	 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
>> ;; live  in  	 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
>> ;; live  gen 	 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
>> ;; live  kill	 30 [x30] 66 [cc]
>> ;; lr  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live  out 	 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> 
>> 
>> 
>> 
>> Test1:
>> 
>> __thread v4si tx;
>> 
>> v4si foo (v4si a, v4si b, v4si c)
>> {
>> v4si y;
>> 
>> y = a + tx + b + c;
>> 
>> return y + 7;
>> }
>> 
>> 
>> Test2:
>> 
>> 
>> __thread v4si tx;
>> 
>> v4si foo (v4si a, v4si *b, v4si *c)
>> {
>> v4si y;
>> 
>> for(int i=0; i<MAX; i++)
>>   y += a + tx + b[i] + c[i];
>> 
>> return y + 7;
>> }
>> 
>> 
>> Test3:
>> 
>> __thread v4si tx[MAX];
>> 
>> v4si foo (v4si a, v4si *b, v4si *c)
>> {
>> v4si y;
>> 
>> for(int i=0; i<MAX; i++)
>>   y += a + tx[i] + b[i] + c[i];
>> 
>> return y + 7;
>> }
>> 
>> 
>> Test4:
>> 
>> __thread v4si tx;
>> __thread v4si ty;
>> 
>> v4si foo (v4si a, v4si b, v4si c)
>> {
>> v4si y;
>> 
>> y = a + tx + b + c;
>> 
>> asm volatile ("" : : : "memory");
>> 
>> y += a + ty + b + c;
>> 
>> return y + 7;
>> }
>> 
>> 
>>> 
>>> Now it could be the case that various local analysis could sub-optimally
>>> handle things.  You mention LICM.  I know our original LICM did have a
>>> problem in that if it saw a use of a hard reg in a loop without seeing a
>>> set of that hard reg it considered the register varying within the loop.
>>> I have no idea if we carried that forward when the loop code was
>>> rewritten (when I looked at this it was circa 1992).
>>> 
>>> 
>>>> 
>>>> Or consider a stream of code containing two tls_desc calls (ok, the compiler might
>>>> optimise one of the tls calls away, but this approach should be reusable for other exprs).
>>>> Between the two set(x,x)’s x is considered live so the register allocator can’t use that
>>>> register.
>>>> Given that we are applying this to all the neon registers, the register allocator now throws
>>>> an ICE because it can’t find any free hard neon registers to use.
>>> Given your statements it sounds like the liveness infrastructure is
>>> making those neon regs globally live when it sees the low part subreg
>>> self-set.  Let's confirm that one way or the other and see where it
>>> takes us.
>>> 
>>> Jeff
>> 
> 


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]