This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
- From: Bernd Edlinger <bernd dot edlinger at hotmail dot de>
- To: Christophe Lyon <christophe dot lyon at linaro dot org>, Kyrill Tkachov <kyrylo dot tkachov at foss dot arm dot com>
- Cc: Ramana Radhakrishnan <ramana dot gcc at googlemail dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>, Richard Earnshaw <richard dot earnshaw at arm dot com>, "Wilco Dijkstra" <wilco dot dijkstra at arm dot com>
- Date: Tue, 5 Sep 2017 14:25:27 +0000
- Subject: Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)
- Authentication-results: sourceware.org; auth=none
- Authentication-results: linaro.org; dkim=none (message not signed) header.d=none;linaro.org; dmarc=none action=none header.from=hotmail.de;
- References: <HE1PR0701MB2169CD4BF5110F84B68E4AF9E4A40@HE1PR0701MB2169.eurprd07.prod.outlook.com> <CAJA7tRZGYttnYYCsbqFuc88jt8DySFvLY9J+1+88sfofY8Gweg@mail.gmail.com> <AM4PR0701MB2162C11BF479CD62542E2E8AE48A0@AM4PR0701MB2162.eurprd07.prod.outlook.com> <AM4PR0701MB21629F5BB692295C62E09DFCE4120@AM4PR0701MB2162.eurprd07.prod.outlook.com> <59AD6894.7010605@foss.arm.com> <CAKdteObRYyL-T6+E4eJM22o+48QCoto6ndXTCLuL4BAqUFZj5A@mail.gmail.com>
- Spamdiagnosticmetadata: NSPM
- Spamdiagnosticoutput: 1:99
Hi Christophe,
On 09/05/17 10:45, Christophe Lyon wrote:
> Hi Bernd,
>
>
> On 4 September 2017 at 16:52, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>>
>> On 29/04/17 18:45, Bernd Edlinger wrote:
>>>
>>> Ping...
>>>
>>> I attached a rebased version since there was a merge conflict in
>>> the xordi3 pattern, otherwise the patch is still identical.
>>> It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2
>>> early when the target has no neon or iwmmxt.
>>>
>>>
>>> Thanks
>>> Bernd.
>>>
>>>
>>>
>>> On 11/28/16 20:42, Bernd Edlinger wrote:
>>>>
>>>> On 11/25/16 12:30, Ramana Radhakrishnan wrote:
>>>>>
>>>>> On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger
>>>>> <bernd.edlinger@hotmail.de> wrote:
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> This improves the stack usage on the sha512 test case for the case
>>>>>> without hardware fpu and without iwmmxt by splitting all di-mode
>>>>>> patterns right while expanding which is similar to what the
>>>>>> shift-pattern
>>>>>> does. It does nothing in the case iwmmxt and fpu=neon or vfp as well
>>>>>> as
>>>>>> thumb1.
>>>>>>
>>>>> I would go further and do this in the absence of Neon, the VFP unit
>>>>> being there doesn't help with DImode operations i.e. we do not have 64
>>>>> bit integer arithmetic instructions without Neon. The main reason why
>>>>> we have the DImode patterns split so late is to give a chance for
>>>>> folks who want to do 64 bit arithmetic in Neon a chance to make this
>>>>> work as well as support some of the 64 bit Neon intrinsics which IIRC
>>>>> map down to these instructions. Doing this just for soft-float doesn't
>>>>> improve the default case only. I don't usually test iwmmxt and I'm not
>>>>> sure who has the ability to do so, thus keeping this restriction for
>>>>> iwMMX is fine.
>>>>>
>>>>>
>>>> Yes I understand, thanks for pointing that out.
>>>>
>>>> I was not aware what iwmmxt exists at all, but I noticed that most
>>>> 64bit expansions work completely different, and would break if we split
>>>> the pattern early.
>>>>
>>>> I can however only look at the assembler outout for iwmmxt, and make
>>>> sure that the stack usage does not get worse.
>>>>
>>>> Thus the new version of the patch keeps only thumb1, neon and iwmmxt as
>>>> it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack
>>>> for the test cases, and vfp and soft-float at around 270 bytes stack
>>>> usage.
>>>>
>>>>>> It reduces the stack usage from 2300 to near optimal 272 bytes (!).
>>>>>>
>>>>>> Note this also splits many ldrd/strd instructions and therefore I will
>>>>>> post a followup-patch that mitigates this effect by enabling the
>>>>>> ldrd/strd
>>>>>> peephole optimization after the necessary reg-testing.
>>>>>>
>>>>>>
>>>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf.
>>>>>
>>>>> What do you mean by arm-linux-gnueabihf - when folks say that I
>>>>> interpret it as --with-arch=armv7-a --with-float=hard
>>>>> --with-fpu=vfpv3-d16 or (--with-fpu=neon).
>>>>>
>>>>> If you've really bootstrapped and regtested it on armhf, doesn't this
>>>>> patch as it stand have no effect there i.e. no change ?
>>>>> arm-linux-gnueabihf usually means to me someone has configured with
>>>>> --with-float=hard, so there are no regressions in the hard float ABI
>>>>> case,
>>>>>
>>>> I know it proves little. When I say arm-linux-gnueabihf
>>>> I do in fact mean --enable-languages=all,ada,go,obj-c++
>>>> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
>>>> --with-float=hard.
>>>>
>>>> My main interest in the stack usage is of course not because of linux,
>>>> but because of eCos where we have very small task stacks and in fact
>>>> no fpu support by the O/S at all, so that patch is exactly what we need.
>>>>
>>>>
>>>> Bootstrapped and reg-tested on arm-linux-gnueabihf
>>>> Is it OK for trunk?
>>
>>
>> The code is ok.
>> AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as
>> the splits
>> will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and
>> TARGET_IWMMXT2
>> is irrelevant here).
>>
>> So this is ok for trunk.
>> Thanks, and sorry again for the delay.
>> Kyrill
>>
>
> This patch (r251663) causes a regression on armeb-none-linux-gnueabihf
> --with-mode arm
> --with-cpu cortex-a9
> --with-fpu vfpv3-d16-fp16
> FAIL: gcc.dg/vect/vect-singleton_1.c (internal compiler error)
> FAIL: gcc.dg/vect/vect-singleton_1.c -flto -ffat-lto-objects
> (internal compiler error)
>
> the test passes if gcc is configured --with-fpu neon-fp16
>
Thank you very much for what you do!
I am able to reproduce this.
Combine creates an invalid insn out of these two insns:
(insn 12 8 18 2 (parallel [
(set (reg:DI 122)
(plus:DI (reg:DI 116 [ aD.5331 ])
(reg:DI 119 [ bD.5332 ])))
(clobber (reg:CC 100 cc))
]) "vect-singleton_1.c":28 1 {*arm_adddi3}
(expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ])
(expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ])
(expr_list:REG_UNUSED (reg:CC 100 cc)
(nil)))))
(insn 18 12 19 2 (set (reg/i:DI 16 s0)
(reg:DI 122)) "vect-singleton_1.c":28 650 {*movdi_vfp}
(expr_list:REG_DEAD (reg:DI 122)
(nil)))
=>
(insn 18 12 19 2 (parallel [
(set (reg/i:DI 16 s0)
(plus:DI (reg:DI 116 [ aD.5331 ])
(reg:DI 119 [ bD.5332 ])))
(clobber (reg:CC 100 cc))
]) "vect-singleton_1.c":28 1 {*arm_adddi3}
(expr_list:REG_UNUSED (reg:CC 100 cc)
(expr_list:REG_DEAD (reg:DI 116 [ aD.5331 ])
(expr_list:REG_DEAD (reg:DI 119 [ bD.5332 ])
(nil)))))
But after reload this is fixed again, so that is no issue for a
splitter that runs always after reload.
But now the splitter runs into an assertion in gen_highpart(op0).
The same problem should exist with the subdi3 as well, but
not with the other patterns in this patch.
I think the right way to fix this is use a non-vfp register
predicate to prevent combine to create the un-splittable
instruction in the first place.
See attached my proposed fix for this defect.
Christophe, You could do me a favor, and test it in your environment
as I have nothing comparable.
Kyrill, is this patch OK after bootstrap / reg-testing?
Thanks
Bernd.
2017-09-05 Bernd Edlinger <bernd.edlinger@hotmail.de>
PR target/77308
* config/arm/predicates.md (s_register_operand_nv,
arm_adddi_operand_nv): Create new non-vfp predicates.
* config/arm/arm.md (*arm_adddi3, *arm_subdi3): Use new predicates.
Index: gcc/config/arm/arm.md
===================================================================
--- gcc/config/arm/arm.md (revision 251663)
+++ gcc/config/arm/arm.md (working copy)
@@ -457,14 +457,13 @@
)
(define_insn_and_split "*arm_adddi3"
- [(set (match_operand:DI 0 "s_register_operand" "=&r,&r,&r,&r,&r")
- (plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r")
- (match_operand:DI 2 "arm_adddi_operand" "r, 0, r, Dd, Dd")))
+ [(set (match_operand:DI 0 "s_register_operand_nv" "=&r,&r,&r,&r,&r")
+ (plus:DI (match_operand:DI 1 "s_register_operand_nv" "%0, 0, r, 0, r")
+ (match_operand:DI 2 "arm_adddi_operand_nv" "r, 0, r, Dd, Dd")))
(clobber (reg:CC CC_REGNUM))]
"TARGET_32BIT && !TARGET_NEON"
"#"
- "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
- && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
+ "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
[(parallel [(set (reg:CC_C CC_REGNUM)
(compare:CC_C (plus:SI (match_dup 1) (match_dup 2))
(match_dup 1)))
@@ -1263,9 +1262,9 @@
)
(define_insn_and_split "*arm_subdi3"
- [(set (match_operand:DI 0 "s_register_operand" "=&r,&r,&r")
- (minus:DI (match_operand:DI 1 "s_register_operand" "0,r,0")
- (match_operand:DI 2 "s_register_operand" "r,0,0")))
+ [(set (match_operand:DI 0 "s_register_operand_nv" "=&r,&r,&r")
+ (minus:DI (match_operand:DI 1 "s_register_operand_nv" "0,r,0")
+ (match_operand:DI 2 "s_register_operand_nv" "r,0,0")))
(clobber (reg:CC CC_REGNUM))]
"TARGET_32BIT && !TARGET_NEON"
"#" ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
Index: gcc/config/arm/predicates.md
===================================================================
--- gcc/config/arm/predicates.md (revision 251663)
+++ gcc/config/arm/predicates.md (working copy)
@@ -653,3 +653,11 @@
(ior (and (match_code "symbol_ref")
(match_test "!arm_is_long_call_p (SYMBOL_REF_DECL (op))"))
(match_operand 0 "s_register_operand")))
+
+(define_predicate "s_register_operand_nv"
+ (and (match_operand 0 "s_register_operand")
+ (not (match_operand 0 "vfp_hard_register_operand"))))
+
+(define_predicate "arm_adddi_operand_nv"
+ (and (match_operand 0 "arm_adddi_operand")
+ (not (match_operand 0 "vfp_hard_register_operand"))))