This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
- From: Alan Hayward <Alan.Hayward@arm.com>
- To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>, Jeff Law <law@redhat.com>, Richard Biener <richard.guenther@gmail.com>
- Cc: nd <nd@arm.com>
- Date: Tue, 19 Dec 2017 10:12:47 +0000
- Subject: Re: [PATCH 1/7]: SVE: Add CLOBBER_HIGH expression
Ping ping.
I think there should be enough information in the first test to show that any "set to self"
registers become live. Let me know if there’s anything I’ve missed.
Thanks,
Alan.
> On 12 Dec 2017, at 11:11, Alan Hayward <Alan.Hayward@arm.com> wrote:
>
> Ping.
>
>> On 30 Nov 2017, at 11:03, Alan Hayward <Alan.Hayward@arm.com> wrote:
>>
>>
>>> On 27 Nov 2017, at 17:29, Jeff Law <law@redhat.com> wrote:
>>>
>>> On 11/23/2017 04:11 AM, Alan Hayward wrote:
>>>>
>>>>> On 22 Nov 2017, at 17:33, Jeff Law <law@redhat.com> wrote:
>>>>>
>>>>> On 11/22/2017 04:31 AM, Alan Hayward wrote:
>>>>>>
>>>>>>> On 21 Nov 2017, at 03:13, Jeff Law <law@redhat.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> You might also look at TARGET_HARD_REGNO_CALL_PART_CLOBBERED. I'd
>>>>>>>>> totally forgotten about it. And in fact it seems to come pretty close
>>>>>>>>> to what you need…
>>>>>>>>
>>>>>>>> Yes, some of the code is similar to the way
>>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED works. Both that code and the
>>>>>>>> CLOBBER expr code served as a starting point for writing the patch. The main difference
>>>>>>>> here is that _PART_CLOBBERED applies around all calls and is not tied to a specific instruction;
>>>>>>>> it’s part of the calling ABI. Whereas clobber_high is explicitly tied to an expression (tls_desc).
>>>>>>>> It meant there wasn’t really any opportunity to reuse any existing code.
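>>>>>>>>
>>>>>>>> As a sketch of the intent (not the exact pattern text from the patch), the tls_desc
>>>>>>>> parallel gains entries such as:
>>>>>>>>
>>>>>>>>   (clobber_high (reg:TI V0_REGNUM))
>>>>>>>>
>>>>>>>> i.e. "this particular insn clobbers the bits of V0 above TImode", whereas
>>>>>>>> TARGET_HARD_REGNO_CALL_PART_CLOBBERED makes that kind of statement for every call.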
>>>>>>> Understood. Though your first patch mentions that you're trying to
>>>>>>> describe partial preservation "around TLS calls". Presumably those are
>>>>>>> represented as normal insns, not call_insn.
>>>>>>>
>>>>>>> That brings me back to Richi's idea of exposing a set of the low subreg
>>>>>>> to itself using whatever mode is wide enough to cover the neon part of
>>>>>>> the register.
>>>>>>>
>>>>>>> That should tell the generic parts of the compiler that you're just
>>>>>>> clobbering the upper part and, at least in theory, you can implement it in
>>>>>>> the aarch64 backend and the rest of the compiler should "just work"
>>>>>>> because that's the existing semantics of a subreg store.
>>>>>>>
>>>>>>> The only worry would be if a pass tried to get overly smart and
>>>>>>> considered that kind of set a nop -- but I think I'd argue that's simply
>>>>>>> wrong given the semantics of a partial store.
>>>>>>>
>>>>>>
>>>>>> So, instead of using clobber_high(reg X), use set(reg X, reg X).
>>>>>> It’s something we considered, and then dismissed.
>>>>>>
>>>>>> The problem then is you are now using SET semantics on those registers, and it
>>>>>> would make the register live around the function, which might not be the case.
>>>>>> Whereas clobber semantics will just make the register dead - which is exactly
>>>>>> what we want (but only conditionally).
>>>>> ?!? A set of the subreg is the *exact* semantics you want. It says the
>>>>> low part is preserved while the upper part is clobbered across the TLS
>>>>> insns.
>>>>>
>>>>> jeff
>>>>
>>>> Consider where the TLS call is inside a loop. The compiler would normally want
>>>> to hoist that out of the loop. By adding a set(x,x) into the parallel of the tls_desc we
>>>> are now making x live across the loop: x is dependent on the value from the previous
>>>> iteration, and the tls_desc can no longer be hoisted.
>>> Hmm. I think I see the problem you're trying to point out. Let me
>>> restate it and see if you agree.
>>>
>>> The low subreg set does clearly indicate the upper part of the SVE
>>> register is clobbered. The problem is from a liveness standpoint the
>>> compiler is considering the low part live, even though it's a self-set.
>>>
>>
>> Yes.
>>
>>> In fact, if that is the case, then a single TLS call (independent of a
>>> loop) would make the low part of the register globally live. This
>>> should be testable. Include one of these low part self sets on the
>>> existing TLS calls and compile a little test function and let's look at
>>> the liveness data.
>>>
>>
>>
>> Ok, so I’ve put
>> (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
>> (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
>> etc up to V31 in the UNSPEC_TLSDESC pattern in aarch64.md, using up all the neon registers.
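>>
>> As a rough sketch (not the exact aarch64.md text, and with the template and condition
>> omitted), the experiment makes the tlsdesc parallel look something like:
>>
>> (define_insn "tlsdesc_small_<mode>"
>>   [(set (reg:PTR R0_REGNUM)
>>         (unspec:PTR [(match_operand 0 "aarch64_valid_symref" "S")] UNSPEC_TLSDESC))
>>    (set (reg:TI V0_REGNUM) (reg:TI V0_REGNUM))
>>    (set (reg:TI V1_REGNUM) (reg:TI V1_REGNUM))
>>    ;; ... and so on up to V31_REGNUM ...
>>    (clobber (reg:DI LR_REGNUM))
>>    (clobber (reg:CC CC_REGNUM))
>>    (clobber (match_scratch:DI 1 "=r"))]
>>   ...)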
>>
>> In an ideal world, this change would do nothing as neon reg size is TI.
>>
>> First I compiled up sve_tls_preserve_1.c (Test1 below)
>>
>> Without the SET changes, gcc does not back up any neon registers before the tlsdesccall (good).
>> With the SET changes, gcc backs up all the neon registers (not ideal).
>>
>> Without the SET changes, dump logs give:
>>
>> cse1:
>> ;; regs ever live 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 66 [cc]
>>
>> cprop1:
>> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr def 0 [x0] 30 [x30] 32 [v0] 66 [cc] 79 80 81 82 84 85 86 87 88 89
>> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; live gen 0 [x0] 32 [v0] 78 79 80 81 82 83 84 85 86 87 88 89
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>> reload:
>> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 66 [cc]
>>
>> With the set changes:
>>
>> cse1:
>> ;; regs ever live 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>>
>> cprop1:
>> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr def 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 79 80 81 82 84 85 86 87 88 89
>> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
>> ;; live gen 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 78 79 80 81 82 83 84 85 86 87 88 89
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>> reload:
>> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>>
>>
>> Is there any extra dump data you think would be useful?
>>
>>
>> For completeness, I also ran through some more tests…..
>>
>>
>>
>> Second I tried putting the tls call inside a simple for loop (Test2 below)
>>
>> Both before and after the SET changes, gcc hoisted the tlsdesc call out of the loop.
>> This has already happened before the expand pass.
>> (and as before, with the SET changes, gcc backs up all the neon registers before the tlsdesccall).
>>
>>
>> Third, I made the tls variable an array, and accessed elements from it in the loop. (Test3 below)
>>
>> Both before and after the SET changes, gcc fails to hoist the tlsdesc call.
>> That’s a missed optimisation chance - I’m not sure how tricky or worthwhile that’d be to fix (at the
>> tree level the tls load looks just like a MEM base plus index access; I guess you’d need something
>> specific in an rtl pass to do the hoist).
>> Anyway, we now have a tlsdesc inside a loop, which is what I wanted.
>> Without the SET changes, nothing is backed up whilst inside the loop.
>> With the SET changes, gcc tries to back up the live neon registers, and ICEs with
>> "error: unable to find a register to spill” in lra-assigns.c.
>>
>>
>> Fourth, I took the original test and made two tls accesses in a row with an asm volatile in-between (Test4 below).
>>
>> With the SET changes, gcc ICEs in the same way.
>> Looking at the dump files:
>>
>> Without the SET changes, dump logs for the final test are similar to the very first test:
>>
>> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc]
>>
>> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; lr def 0 [x0] 30 [x30] 32 [v0] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
>> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 64 [sfp] 65 [ap]
>> ;; live gen 0 [x0] 32 [v0] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>>
>> With the SET changes:
>>
>> ;; regs ever live 0 [x0] 1 [x1] 2 [x2] 29 [x29] 30 [x30] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 66 [cc]
>>
>> ;; lr in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr use 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 64 [sfp] 65 [ap]
>> ;; lr def 0 [x0] 30 [x30] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 66 [cc] 81 84 85 86 88 89 90 91 95 96 97 98 99 100
>> ;; live in 29 [x29] 31 [sp] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 64 [sfp] 65 [ap]
>> ;; live gen 0 [x0] 32 [v0] 33 [v1] 34 [v2] 35 [v3] 36 [v4] 37 [v5] 38 [v6] 39 [v7] 40 [v8] 41 [v9] 42 [v10] 43 [v11] 44 [v12] 45 [v13] 46 [v14] 47 [v15] 48 [v16] 49 [v17] 50 [v18] 51 [v19] 52 [v20] 53 [v21] 54 [v22] 55 [v23] 56 [v24] 57 [v25] 58 [v26] 59 [v27] 60 [v28] 61 [v29] 62 [v30] 63 [v31] 81 84 85 86 88 89 90 91 92 95 96 97 98 99 100
>> ;; live kill 30 [x30] 66 [cc]
>> ;; lr out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>> ;; live out 29 [x29] 31 [sp] 32 [v0] 64 [sfp] 65 [ap]
>>
>>
>>
>>
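>> (The supporting definitions weren’t included in this mail; for reference, the tests assume
>> something along the lines of the usual GNU vector-extension typedef, e.g.:
>>
>> typedef int v4si __attribute__ ((vector_size (16)));
>> #define MAX 100   /* placeholder value, not taken from the original tests */
>>
>> where the exact MAX value doesn’t matter for the behaviour described.)
>>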
>> Test1:
>>
>> __thread v4si tx;
>>
>> v4si foo (v4si a, v4si b, v4si c)
>> {
>>   v4si y;
>>
>>   y = a + tx + b + c;
>>
>>   return y + 7;
>> }
>>
>>
>> Test2:
>>
>>
>> __thread v4si tx;
>>
>> v4si foo (v4si a, v4si *b, v4si *c)
>> {
>>   v4si y;
>>
>>   for (int i = 0; i < MAX; i++)
>>     y += a + tx + b[i] + c[i];
>>
>>   return y + 7;
>> }
>>
>>
>> Test3:
>>
>> __thread v4si tx[MAX];
>>
>> v4si foo (v4si a, v4si *b, v4si *c)
>> {
>>   v4si y;
>>
>>   for (int i = 0; i < MAX; i++)
>>     y += a + tx[i] + b[i] + c[i];
>>
>>   return y + 7;
>> }
>>
>>
>> Test4:
>>
>> __thread v4si tx;
>> __thread v4si ty;
>>
>> v4si foo (v4si a, v4si b, v4si c)
>> {
>>   v4si y;
>>
>>   y = a + tx + b + c;
>>
>>   asm volatile ("" : : : "memory");
>>
>>   y += a + ty + b + c;
>>
>>   return y + 7;
>> }
>>
>>
>>>
>>> Now it could be the case that various local analyses could sub-optimally
>>> handle things. You mention LICM. I know our original LICM did have a
>>> problem in that if it saw a use of a hard reg in a loop without seeing a
>>> set of that hard reg it considered the register varying within the loop.
>>> I have no idea if we carried that forward when the loop code was
>>> rewritten (when I looked at this it was circa 1992).
>>>
>>>
>>>>
>>>> Or consider a stream of code containing two tls_desc calls (ok, the compiler might
>>>> optimise one of the tls calls away, but this approach should be reusable for other exprs).
>>>> Between the two set(x,x)’s, x is considered live, so the register allocator can’t use that
>>>> register.
>>>> Given that we are applying this to all the neon registers, the register allocator now throws
>>>> an ICE because it can’t find any free hard neon registers to use.
>>> Given your statements it sounds like the liveness infrastructure is
>>> making those neon regs globally live when it sees the low part subreg
>>> self-set. Let's confirm that one way or the other and see where it
>>> takes us.
>>>
>>> Jeff
>>
>