This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PING**2] [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)



On 29/04/17 18:45, Bernd Edlinger wrote:
Ping...

I attached a rebased version since there was a merge conflict in
the xordi3 pattern, otherwise the patch is still identical.
It splits adddi3, subdi3, anddi3, iordi3, xordi3 and one_cmpldi2
early when the target has no neon or iwmmxt.


Thanks
Bernd.



On 11/28/16 20:42, Bernd Edlinger wrote:
On 11/25/16 12:30, Ramana Radhakrishnan wrote:
On Sun, Nov 6, 2016 at 2:18 PM, Bernd Edlinger
<bernd.edlinger@hotmail.de> wrote:
Hi!

This improves the stack usage on the sha512 test case for the case
without hardware fpu and without iwmmxt by splitting all di-mode
patterns right while expanding which is similar to what the
shift-pattern
does.  It does nothing in the case iwmmxt and fpu=neon or vfp as well as
thumb1.

I would go further and do this in the absence of Neon, the VFP unit
being there doesn't help with DImode operations i.e. we do not have 64
bit integer arithmetic instructions without Neon. The main reason why
we have the DImode patterns split so late is to give a chance for
folks who want to do 64 bit arithmetic in Neon a chance to make this
work as well as support some of the 64 bit Neon intrinsics which IIRC
map down to these instructions. Doing this just for soft-float doesn't
improve the default case only. I don't usually test iwmmxt and I'm not
sure who has the ability to do so, thus keeping this restriction for
iwMMX is fine.


Yes I understand, thanks for pointing that out.

I was not aware what iwmmxt exists at all, but I noticed that most
64bit expansions work completely different, and would break if we split
the pattern early.

I can however only look at the assembler outout for iwmmxt, and make
sure that the stack usage does not get worse.

Thus the new version of the patch keeps only thumb1, neon and iwmmxt as
it is: around 1570 (thumb1), 2300 (neon) and 2200 (wimmxt) bytes stack
for the test cases, and vfp and soft-float at around 270 bytes stack
usage.

It reduces the stack usage from 2300 to near optimal 272 bytes (!).

Note this also splits many ldrd/strd instructions and therefore I will
post a followup-patch that mitigates this effect by enabling the
ldrd/strd
peephole optimization after the necessary reg-testing.


Bootstrapped and reg-tested on arm-linux-gnueabihf.
What do you mean by arm-linux-gnueabihf - when folks say that I
interpret it as --with-arch=armv7-a --with-float=hard
--with-fpu=vfpv3-d16 or (--with-fpu=neon).

If you've really bootstrapped and regtested it on armhf, doesn't this
patch as it stand have no effect there i.e. no change ?
arm-linux-gnueabihf usually means to me someone has configured with
--with-float=hard, so there are no regressions in the hard float ABI
case,

I know it proves little.  When I say arm-linux-gnueabihf
I do in fact mean --enable-languages=all,ada,go,obj-c++
--with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
--with-float=hard.

My main interest in the stack usage is of course not because of linux,
but because of eCos where we have very small task stacks and in fact
no fpu support by the O/S at all, so that patch is exactly what we need.


Bootstrapped and reg-tested on arm-linux-gnueabihf
Is it OK for trunk?

The code is ok.
AFAICS testing this with --with-fpu=vfpv3-d16 does exercise the new code as the splits will happen for !TARGET_NEON (it is of course !TARGET_IWMMXT and TARGET_IWMMXT2
is irrelevant here).

So this is ok for trunk.
Thanks, and sorry again for the delay.
Kyrill


Thanks
Bernd.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]