Bug 77308 - surprisingly large stack usage for sha512 on arm
Summary: surprisingly large stack usage for sha512 on arm
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 7.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2016-08-21 17:42 UTC by Bernd Edlinger
Modified: 2019-08-23 16:44 UTC (History)
1 user

See Also:
Host:
Target: arm*
Build:
Known to work: 8.1.0
Known to fail:
Last reconfirmed: 2016-08-23 00:00:00


Attachments
test case (2.02 KB, text/plain)
2016-08-21 17:42 UTC, Bernd Edlinger
proposed patch (511 bytes, patch)
2016-10-26 15:47 UTC, Bernd Edlinger
proposed patch, v2 (376 bytes, patch)
2016-11-01 14:31 UTC, Bernd Edlinger
proposed patch, v2 (1.85 KB, patch)
2016-11-01 14:35 UTC, Bernd Edlinger
patch for di-mode early splitting (3.34 KB, patch)
2016-11-03 16:58 UTC, Bernd Edlinger
patch for enabling ldrdstrd peephole (1.31 KB, patch)
2016-11-03 17:02 UTC, Bernd Edlinger

Description Bernd Edlinger 2016-08-21 17:42:28 UTC
Created attachment 39479 [details]
test case

I've noticed that openssl built with no-asm but without -DOPENSSL_SMALL_FOOTPRINT
uses > 3600 bytes of stack for a simple sha512 computation with -O3 on ARM.

This is surprising, because on i386 the same function uses only 1016 bytes,
and x86_64 uses only 256 bytes of stack.

See the attached source code, which is stripped down from sha512.c.
Comment 1 Andrew Pinski 2016-08-21 19:12:41 UTC
Does -fno-schedule-insns help?  Sometimes the scheduler before the register allocator causes register pressure and forces more register spills.
Comment 2 Andrew Pinski 2016-08-21 19:16:18 UTC
For aarch64, the stack size is just 208 bytes.
Comment 3 Bernd Edlinger 2016-08-21 19:23:05 UTC
(In reply to Andrew Pinski from comment #1)
> Does -fno-schedule-insns help?  Sometimes the scheduler before the register
> allocator causes register pressure and forces more register spills.

Yes, with -fno-schedule-insns it is a little bit smaller:

sha512_block_data_order:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 3472
        @ frame_needed = 0, uses_anonymous_args = 0
        push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
        sub     sp, sp, #3472
        sub     sp, sp, #4


but it could be worse :-)

with -fno-ira-share-spill-slots:

sha512_block_data_order:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 6640
        @ frame_needed = 0, uses_anonymous_args = 0
        push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
        sub     sp, sp, #6592
        sub     sp, sp, #52
Comment 4 Bernd Edlinger 2016-08-21 21:32:47 UTC
Hmm, when I compare the aarch64 vs. arm sha512.c.260r.reload dumps
with -O3 -fno-schedule-insns,

I see a big difference:

aarch64 has only a few spill slots:

subreg regs:
  Slot 0 regnos (width = 8):     856
  Slot 1 regnos (width = 8):     857
  Slot 2 regnos (width = 8):     858
  Slot 3 regnos (width = 8):     859
  Slot 4 regnos (width = 8):     860
  Slot 5 regnos (width = 8):     861
  Slot 6 regnos (width = 8):     862
  Slot 7 regnos (width = 8):     2117
  Slot 8 regnos (width = 8):     1164
  Slot 9 regnos (width = 8):     1052


but arm has 415 (8 bytes each),
and the line "subreg regs:" before the spill slots contains ~1500 regs.

And while aarch64 does not have a single subreg in any pass,
arm has lots of subregs before LRA eliminates all of them,

like this, in sha512.c.217r.expand:

(insn 85 84 86 5 (set (subreg:SI (reg:DI 1670) 4)
        (ashift:SI (subreg:SI (reg:DI 1669) 0)
            (const_int 24 [0x18]))) sha512.c:98 -1
     (nil))
(insn 86 85 87 5 (set (subreg:SI (reg:DI 1670) 0)
        (const_int 0 [0])) sha512.c:98 -1
     (nil))

This funny instruction is generated in arm_emit_coreregs_64bit_shift:

          /* Shifts by a constant greater than 31.  */
          rtx adj_amount = GEN_INT (INTVAL (amount) - 32);

          emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
          if (code == ASHIFTRT)
            emit_insn (gen_ashrsi3 (out_up, in_up,
                                    GEN_INT (31)));
          else
            emit_insn (SET (out_up, const0_rtx));

From my past experience, I assume that using a subreg to write
one half of the out register creates more problems than it solves.

So I tried this:

Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 239624)
+++ gcc/config/arm/arm.c	(working copy)
@@ -29170,12 +29170,11 @@
 	  /* Shifts by a constant greater than 31.  */
 	  rtx adj_amount = GEN_INT (INTVAL (amount) - 32);
 
+	  emit_insn (SET (out, const0_rtx));
 	  emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
 	  if (code == ASHIFTRT)
 	    emit_insn (gen_ashrsi3 (out_up, in_up,
 				    GEN_INT (31)));
-	  else
-	    emit_insn (SET (out_up, const0_rtx));
 	}
     }
   else

and it reduced the stack from 3472 to 2960 bytes.
Comment 5 Bernd Edlinger 2016-08-21 21:41:27 UTC
Now I try to clear the out register also when the shift is < 32:


Index: gcc/config/arm/arm.c
===================================================================
--- gcc/config/arm/arm.c	(revision 239624)
+++ gcc/config/arm/arm.c	(working copy)
@@ -29159,6 +29159,7 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code,
 	  /* Shifts by a constant less than 32.  */
 	  rtx reverse_amount = GEN_INT (32 - INTVAL (amount));
 
+	  emit_insn (SET (out, const0_rtx));
 	  emit_insn (SET (out_down, LSHIFT (code, in_down, amount)));
 	  emit_insn (SET (out_down,
 			  ORR (REV_LSHIFT (code, in_up, reverse_amount),
@@ -29170,12 +29171,11 @@ arm_emit_coreregs_64bit_shift (enum rtx_code code,
 	  /* Shifts by a constant greater than 31.  */
 	  rtx adj_amount = GEN_INT (INTVAL (amount) - 32);
 
+	  emit_insn (SET (out, const0_rtx));
 	  emit_insn (SET (out_down, SHIFT (code, in_up, adj_amount)));
 	  if (code == ASHIFTRT)
 	    emit_insn (gen_ashrsi3 (out_up, in_up,
 				    GEN_INT (31)));
-	  else
-	    emit_insn (SET (out_up, const0_rtx));
 	}
     }
   else


Result: the stack use goes down from 2960 to 2320 bytes.
Note: this SET is simply redundant, but it makes the out register
well-defined from the beginning.
Comment 6 Bernd Edlinger 2016-08-21 21:46:36 UTC
Vladimir,

is this an LRA problem,
or should this be fixed in the target?
Comment 7 Bernd Edlinger 2016-08-22 06:09:19 UTC
Even more surprising is that:

While thumb2 code (-march=armv6t2 -mthumb) has about the same stack size
as arm code (-marm), thumb1 code has only 1588 bytes of stack, and that does
not change with -fno-schedule-insns nor with the patch above:

arm-unknown-eabi-gcc -O3 -march=armv6 -msoft-float -mthumb -S sha512.c

=>

sha512_block_data_order:
        push    {r4, r5, r6, r7, lr}
        mov     r6, r10
        mov     r4, r8
        mov     r7, fp
        mov     r5, r9
        ldr     r3, .L11
        push    {r4, r5, r6, r7}
        ldr     r4, .L11+4
        add     sp, sp, r4

...

.L11:
        .word   1580
        .word   -1588
Comment 8 Bernd Edlinger 2016-08-22 08:11:53 UTC
Analyzing the different thumb1/thumb2 reload dumps,
I see that t2 often uses code like this to access spill slots:

(insn 11576 8090 9941 5 (set (reg:SI 3 r3 [11890])
        (plus:SI (reg/f:SI 13 sp)
            (const_int 480 [0x1e0]))) sha512.c:147 4 {*arm_addsi3}
     (nil))
(insn 9941 11576 2978 5 (set (reg:DI 2 r2 [4210])
        (mem/c:DI (reg:SI 3 r3 [11890]) [5 %sfpD.4158+-3112 S8 A64])) sha512.c:147 170 {*arm_movdi}
     (nil))

while t1 often does it this way:

(insn 8221 8219 4591 6 (set (reg/v:DI 4 r4 [orig:1450 fD.4102 ] [1450])
        (mem/c:DI (plus:SI (reg/f:SI 13 sp)
                (const_int 152 [0x98])) [5 %sfpD.4164+-1432 S8 A64])) sha512.c:155 748 {*thumb1_movdi_insn}
     (nil))


grep "plus.*reg.*sp" t1/sha512.c.260r.reload |grep -v mem |wc -l
110

grep "plus.*reg.*sp" t2/sha512.c.260r.reload |grep -v mem |wc -l
602


I think in thumb1 the 110 memory accesses are all for frame-objects,
like X, a-h, etc, while thumb2 also uses pointers to access spill slots.

It must be pretty expensive for thumb1 to access something using
a pointer register.  But for thumb2 this strategy creates rather
significant register pressure.
Comment 9 ktkachov 2016-08-23 12:51:42 UTC
Note that the fpu option plays a role here as well

When I compile with -O3 -S -mfloat-abi=hard -march=armv7-a -mthumb -mtune=cortex-a8 -mfpu=neon

I get:
sha512_block_data_order:
        @ args = 0, pretend = 0, frame = 2384
        @ frame_needed = 0, uses_anonymous_args = 0
        push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
        subw    sp, sp, #2388
        subs    r4, r2, #1


whereas if you leave out -mfpu you get the default, which is probably 'vfp' if you didn't configure gcc with an explicit --with-fpu. This is usually not a good fit for recent targets.
With -mfpu=vfp I get the terrible:
sha512_block_data_order:
        @ args = 0, pretend = 0, frame = 3568
        @ frame_needed = 0, uses_anonymous_args = 0
        push    {r4, r5, r6, r7, r8, r9, r10, fp, lr}
        subw    sp, sp, #3572
        subs    r4, r2, #1

That said, I bet there's still room for improvement
Comment 10 Wilco 2016-08-23 16:01:54 UTC
The register allocator doesn't correctly track live ranges for SETs with a subreg, and this can cause terrible spilling indeed.

Also why aren't DI mode values split into native SI registers on 32-bit targets? Using DI mode registers means the register allocator has a much harder problem to solve as it can only use even/odd register pairs (and must allocate both registers even if one is dead).
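
A hedged illustration (hypothetical C example, not from the attached test case) of the even/odd pair problem: with DI-mode pseudos the allocator must reserve a register pair for x even though its low word never reaches the result, whereas after splitting into SImode operations the dead half is obvious.

unsigned long long high_only (unsigned long long x, unsigned int y)
{
  return (x >> 32) ^ y;   /* only the high word of x is live */
}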
Comment 11 Bernd Edlinger 2016-10-17 17:47:31 UTC
Author: edlinger
Date: Mon Oct 17 17:46:59 2016
New Revision: 241273

URL: https://gcc.gnu.org/viewcvs?rev=241273&root=gcc&view=rev
Log:
2016-10-17  Bernd Edlinger  <bernd.edlinger@hotmail.de>

        PR target/77308
        * config/arm/arm.c (arm_emit_coreregs_64bit_shift): Clear the result
        register explicitly.
        * config/arm/arm.md (ashldi3, ashrdi3, lshrdi3): Don't FAIL if
        optimizing for size.

testsuite:
2016-10-17  Bernd Edlinger  <bernd.edlinger@hotmail.de>

        PR target/77308
        * gcc.target/arm/pr77308.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr77308.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/config/arm/arm.md
    trunk/gcc/testsuite/ChangeLog
Comment 12 Wilco 2016-10-20 22:40:09 UTC
It looks like we need a different approach: I've seen the extra SETs use up more registers in some cases, and in other cases they are optimized away early on...

Doing shift expansion at the same time as all other DI mode operations should result in the same stack size as -mfpu=neon. However that's still well behind Thumb-1, and I would expect ARM/Thumb-2 to beat Thumb-1 easily with 6 extra registers.

The spill code for Thumb-2 seems incorrect:

(insn 11576 8090 9941 5 (set (reg:SI 3 r3 [11890])
        (plus:SI (reg/f:SI 13 sp)
            (const_int 480 [0x1e0]))) sha512.c:147 4 {*arm_addsi3}
     (nil))
(insn 9941 11576 2978 5 (set (reg:DI 2 r2 [4210])
        (mem/c:DI (reg:SI 3 r3 [11890]) [5 %sfpD.4158+-3112 S8 A64])) sha512.c:147 170 {*arm_movdi}
     (nil))

LDRD has a range of 1020 on Thumb-2 so I would expect this to be a single instruction.
Comment 13 Bernd Edlinger 2016-10-25 18:19:12 UTC
I am still trying to understand why thumb1 seems to outperform thumb2.

Obviously thumb1 does not have the shiftdi3 patterns,
but even if I remove these from thumb2, the result is still
not on par with thumb1.  Apparently other patterns that are not
enabled for thumb1 still produce DI values, namely xordi3 and
anddi3, which are often used.  Then there is adddi3, which is
enabled for both thumb1 and thumb2; I also disabled this one,
and now sha512 gets down to an incredible 1152 bytes of frame
(-Os -march=armv7 -mthumb -mfloat-abi=soft):

I know this is a hack, but 1K stack is what we should expect...

--- arm.md      2016-10-25 19:54:16.425736721 +0200
+++ arm.md.orig 2016-10-17 19:46:59.000000000 +0200
@@ -448,7 +448,7 @@
          (plus:DI (match_operand:DI 1 "s_register_operand" "")
                   (match_operand:DI 2 "arm_adddi_operand"  "")))
     (clobber (reg:CC CC_REGNUM))])]
-  "TARGET_EITHER && !TARGET_THUMB2"
+  "TARGET_EITHER"
   "
   if (TARGET_THUMB1)
     {
@@ -465,7 +465,7 @@
        (plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r")
                 (match_operand:DI 2 "arm_adddi_operand"  "r,  0, r, Dd, Dd")))
    (clobber (reg:CC CC_REGNUM))]
-  "TARGET_32BIT && !TARGET_NEON && !TARGET_THUMB2"
+  "TARGET_32BIT && !TARGET_NEON"
   "#"
   "TARGET_32BIT && reload_completed
    && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
@@ -2256,7 +2256,7 @@
   [(set (match_operand:DI         0 "s_register_operand" "")
        (and:DI (match_operand:DI 1 "s_register_operand" "")
                (match_operand:DI 2 "neon_inv_logic_op2" "")))]
-  "TARGET_32BIT && !TARGET_THUMB2"
+  "TARGET_32BIT"
   ""
 )

@@ -2264,7 +2264,7 @@
   [(set (match_operand:DI         0 "s_register_operand"     "=w,w ,&r,&r,&r,&r,?w,?w")
         (and:DI (match_operand:DI 1 "s_register_operand"     "%w,0 ,0 ,r ,0 ,r ,w ,0")
                 (match_operand:DI 2 "arm_anddi_operand_neon" "w ,DL,r ,r ,De,De,w ,DL")))]
-  "TARGET_32BIT && !TARGET_IWMMXT && !TARGET_THUMB2"
+  "TARGET_32BIT && !TARGET_IWMMXT"
 {
   switch (which_alternative)
     {
@@ -3310,7 +3310,7 @@
   [(set (match_operand:DI         0 "s_register_operand" "")
        (xor:DI (match_operand:DI 1 "s_register_operand" "")
                (match_operand:DI 2 "arm_xordi_operand" "")))]
-  "TARGET_32BIT && !TARGET_THUMB2"
+  "TARGET_32BIT"
   ""
 )

@@ -3318,7 +3318,7 @@
   [(set (match_operand:DI         0 "s_register_operand" "=w,&r,&r,&r,&r,?w")
        (xor:DI (match_operand:DI 1 "s_register_operand" "%w ,0,r ,0 ,r ,w")
                (match_operand:DI 2 "arm_xordi_operand"  "w ,r ,r ,Dg,Dg,w")))]
-  "TARGET_32BIT && !TARGET_IWMMXT && !TARGET_THUMB2"
+  "TARGET_32BIT && !TARGET_IWMMXT"
 {
   switch (which_alternative)
     {
@@ -3983,7 +3983,7 @@
   [(set (match_operand:DI            0 "s_register_operand" "")
         (ashift:DI (match_operand:DI 1 "s_register_operand" "")
                    (match_operand:SI 2 "general_operand" "")))]
-  "TARGET_32BIT && !TARGET_THUMB2"
+  "TARGET_32BIT"
   "
   if (TARGET_NEON)
     {
@@ -4058,7 +4058,7 @@
   [(set (match_operand:DI              0 "s_register_operand" "")
         (ashiftrt:DI (match_operand:DI 1 "s_register_operand" "")
                      (match_operand:SI 2 "reg_or_int_operand" "")))]
-  "TARGET_32BIT && !TARGET_THUMB2"
+  "TARGET_32BIT"
   "
   if (TARGET_NEON)
     {


What do you think?
Comment 14 Wilco 2016-10-25 18:41:19 UTC
(In reply to Bernd Edlinger from comment #13)
> I am still trying to understand why thumb1 seems to outperform thumb2.
> 
> Obviously thumb1 does not have the shiftdi3 pattern,
> but even if I remove these from thumb2, the result is still
> not par with thumb2.  Apparently other patterns still produce di
> values that are not enabled with thumb1, they are 
> xordi3 and anddi3, these are often used.  Then there is
> adddi3 that is enabled in thumb1 and thumb2, I also disabled
> this one, and now the sha512 gets down to inclredible 1152
> bytes frame (-Os -march=armv7 -mthumb -float-abi=soft):
> 
> I know this is a hack, but 1K stack is what we should expect...
> 
> --- arm.md      2016-10-25 19:54:16.425736721 +0200
> +++ arm.md.orig 2016-10-17 19:46:59.000000000 +0200
> @@ -448,7 +448,7 @@
>           (plus:DI (match_operand:DI 1 "s_register_operand" "")
>                    (match_operand:DI 2 "arm_adddi_operand"  "")))
>      (clobber (reg:CC CC_REGNUM))])]
> -  "TARGET_EITHER && !TARGET_THUMB2"
> +  "TARGET_EITHER"

So you're actually turning these instructions off for Thumb-2? What does it do instead then? Does the number of instructions go down?

I noticed that with or without -mfpu=neon, using -marm is significantly smaller than -mthumb. Most of the extra instructions appear to be moves, which means something is wrong (I would expect Thumb-2 to do better as it supports LDRD with larger offsets than ARM).
Comment 15 Bernd Edlinger 2016-10-25 20:47:53 UTC
(In reply to Wilco from comment #14)
> (In reply to Bernd Edlinger from comment #13)
> > I am still trying to understand why thumb1 seems to outperform thumb2.
> > 
> > Obviously thumb1 does not have the shiftdi3 pattern,
> > but even if I remove these from thumb2, the result is still
> > not par with thumb2.  Apparently other patterns still produce di
> > values that are not enabled with thumb1, they are 
> > xordi3 and anddi3, these are often used.  Then there is
> > adddi3 that is enabled in thumb1 and thumb2, I also disabled
> > this one, and now the sha512 gets down to inclredible 1152
> > bytes frame (-Os -march=armv7 -mthumb -float-abi=soft):
> > 
> > I know this is a hack, but 1K stack is what we should expect...
> > 
> > --- arm.md      2016-10-25 19:54:16.425736721 +0200
> > +++ arm.md.orig 2016-10-17 19:46:59.000000000 +0200
> > @@ -448,7 +448,7 @@
> >           (plus:DI (match_operand:DI 1 "s_register_operand" "")
> >                    (match_operand:DI 2 "arm_adddi_operand"  "")))
> >      (clobber (reg:CC CC_REGNUM))])]
> > -  "TARGET_EITHER && !TARGET_THUMB2"
> > +  "TARGET_EITHER"
> 
> So you're actually turning the these instructions off for Thumb-2? What does
> it do instead then? Do the number of instructions go down?
> 
> I noticed that with or without -mfpu=neon, using -marm is significantly
> smaller than -mthumb. Most of the extra instructions appear to be moves,
> which means something is wrong (I would expect Thumb-2 to do better as it
> supports LDRD with larger offsets than ARM).

The LDRD may be another detail that contributes to this mess.
Maybe, just a guess, LDRD simply does not work with DI registers but
only with two SI registers; at least the pattern looks like it targets two SI moves?

I would expect the n DI mode registers to fall apart into 2n SI mode registers;
that should happen when the expansion finds no DI pattern and falls back
to SI patterns, and then each SI mode register can be spilled independently and
can be dead independently of the other half word.

And frankly I am still puzzled by what my brutal patch did to the stack size;
it also reduced the code size:

frame       2328 ->   1152
code size 0x4188 -> 0x3ab8


I have not tested whether the code works, but I assume that it should,
or else fail with an ICE, which is apparently not the case.
Comment 16 Bernd Edlinger 2016-10-26 00:32:16 UTC
Wow.

look at this:

Index: arm.md
===================================================================
--- arm.md	(revision 241539)
+++ arm.md	(working copy)
@@ -448,7 +448,7 @@
 	  (plus:DI (match_operand:DI 1 "s_register_operand" "")
 	           (match_operand:DI 2 "arm_adddi_operand"  "")))
     (clobber (reg:CC CC_REGNUM))])]
-  "TARGET_EITHER"
+  "TARGET_EITHER && !TARGET_THUMB2"
   "
   if (TARGET_THUMB1)
     {
@@ -2256,7 +2256,7 @@
   [(set (match_operand:DI         0 "s_register_operand" "")
 	(and:DI (match_operand:DI 1 "s_register_operand" "")
 		(match_operand:DI 2 "neon_inv_logic_op2" "")))]
-  "TARGET_32BIT"
+  "TARGET_32BIT && !TARGET_THUMB2"
   ""
 )
 
@@ -3310,7 +3310,7 @@
   [(set (match_operand:DI         0 "s_register_operand" "")
 	(xor:DI (match_operand:DI 1 "s_register_operand" "")
 		(match_operand:DI 2 "arm_xordi_operand" "")))]
-  "TARGET_32BIT"
+  "TARGET_32BIT && !TARGET_THUMB2"
   ""
 )
 

Thus this only disables adddi3, anddi3, and xordi3,
but not ashldi3 and ashrdi3, because they are not part of the problem.

With
arm-linux-gnueabihf-gcc -mthumb -march=armv7 -mfloat-abi=soft -Os -c pr77308.c
this reduces the numbers to

frame = 272 (about the same as aarch64 had!)
code  = 0x374C

sha512_block_data_order:
        @ args = 0, pretend = 0, frame = 272
        @ frame_needed = 0, uses_anonymous_args = 0


Maybe I am dreaming, or something is completely wrong now...
Comment 17 Wilco 2016-10-26 12:05:29 UTC
(In reply to Bernd Edlinger from comment #16)
> Wow.

> 
> Maybe I am dreaming, or something is completely wrong now...

Well I can reproduce it so it is real - thanks for the insight! Basically you're forcing Thumb-2 to split early like Thumb-1. This allows the top and bottom halves to be independently optimized (unused halves or zeroes are very common in DI mode operations), resulting in much lower register pressure.
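
As a minimal sketch (hypothetical code, not from the test case) of the "zeroes in one half" case: after a shift by more than 32 the low word is a compile-time zero, so once the halves are exposed before register allocation the SImode OR on the low half folds away.

unsigned long long merge (unsigned long long a, unsigned long long b)
{
  unsigned long long t = b << 40;   /* low word of t is known to be zero */
  return a | t;                     /* low half is just the low word of a */
}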

Interestingly the subreg removal phase works fine if you split *all* DI mode operations early. This explains the bad code for shifts as they are split early but can't have their subregs removed due to other DI mode operations being split after reload.

This means the correct approach is to split all operations before register allocation. Looking at the phase list, Expand appears best as there isn't any constant propagation or dead code elimination done after Split1. Interestingly it will be a major simplification as we can get rid of a huge number of complex patterns and do exactly the same for ARM, Thumb-1, Thumb-2, VFP and NEON.
Comment 18 Bernd Edlinger 2016-10-26 15:47:53 UTC
Created attachment 39898 [details]
proposed patch

This disables the problematic DI patterns when no fpu is used and
there is absolutely no chance that the DI patterns can improve anything.

This is what I am going to test now, but any help with additional testing
would be highly welcome.
Comment 19 Bernd Edlinger 2016-10-26 16:01:18 UTC
I think the problem with the anddi3, iordi3 and xordi3 patterns is that
they obscure the data flow between the low and high half words.
When they are not enabled, we have the low and high parts
expanded independently, but in the case of the DI mode instructions
it is not clear which of the half words propagates from input to output.
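
A hedged illustration (hypothetical code): expanded as two SImode operations, the low word of the result depends only on the low words of a and b, so each half can be allocated and spilled independently; a monolithic xordi3 keeps all four input words tied to one 64-bit pseudo until it is split after reload.

unsigned long long xor64 (unsigned long long a, unsigned long long b)
{
  return a ^ b;
}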

With my new patch, we have 2328 bytes of stack for hard float,
and only 272 bytes for arm-none-eabi, which is a target I care about.


This is still not perfect, but certainly a big improvement.

Wilco, where have you seen the additional registers used with my
previous patch, maybe we can try to fix that somehow?
Comment 20 Wilco 2016-10-26 18:19:26 UTC
(In reply to Bernd Edlinger from comment #19)
> I think the problem with anddi iordi and xordi instructions is that
> they obscure the data flow between low and high half words.
> When they are not enabled, we have the low and high parts
> expanded independently, but in the case of the di mode instructions
> it is not clear which of the half words propagate from input to output.
> 
> With my new patch, we have 2328 bytes stack for hard float point,
> and only 272 bytes for arm-none-eabi which is a target I care about.
> 
> 
> This is still not perfect, but certainly a big improvement.
> 
> Wilco, where have you seen the additional registers used with my
> previous patch, maybe we can try to fix that somehow?

What happens is that the move of zero causes us to use extra registers in shifts as both source and destination are now always live at the same time. We generate worse code for simple examples like x | (y << 3):

-mfpu=vfp:
	push	{r4, r5}
	lsls	r5, r1, #3
	orr	r5, r5, r0, lsr #29
	lsls	r4, r0, #3
	orr	r0, r4, r2
	orr	r1, r5, r3
	pop	{r4, r5}
	bx	lr
-mfpu=neon:
	lsls	r1, r1, #3
	orr	r1, r1, r0, lsr #29
	lsls	r0, r0, #3
	orrs	r0, r0, r2
	orrs	r1, r1, r3
	bx	lr

So that means this is not a solution.

Note init_regs already does insert moves of zero before expanded shifts (I get the same code with -mfpu=vfp with or without your previous patch), so it shouldn't be necessary. Why does it still make a difference? Presumably init_regs doesn't find all cases or doesn't insert the moves at the right place, so we should fix that rather than do it in the shift expansion.

However the underlying issue is that DI mode operations are not all split at exactly the same time, and that is what needs to be fixed.
Comment 21 Bernd Edlinger 2016-10-27 14:02:44 UTC
(In reply to wilco from comment #20)
> > Wilco, where have you seen the additional registers used with my
> > previous patch, maybe we can try to fix that somehow?
> 
> What happens is that the move of zero causes us to use extra registers in
> shifts as both source and destination are now always live at the same time.
> We generate worse code for simple examples like x | (y << 3):
> 
> -mfpu=vfp:
> 	push	{r4, r5}
> 	lsls	r5, r1, #3
> 	orr	r5, r5, r0, lsr #29
> 	lsls	r4, r0, #3
> 	orr	r0, r4, r2
> 	orr	r1, r5, r3
> 	pop	{r4, r5}
> 	bx	lr
> -mfpu=neon:
> 	lsls	r1, r1, #3
> 	orr	r1, r1, r0, lsr #29
> 	lsls	r0, r0, #3
> 	orrs	r0, r0, r2
> 	orrs	r1, r1, r3
> 	bx	lr
> 

Hmm, I think with my patch reverted the code is the same.

I tried -O2 -marm -mfpu=vfp -mhard-float and get the first variant
with and without the patch.

For -O2 -marm -mfpu=vfp -msoft-float I get the second variant
with and without the patch.

For -O2 -marm -mfpu=neon -mhard-float I get the second variant.

With -O2 -marm -mfpu=neon -msoft-float I get a third variant
again with and without patch:

        lsl     r1, r1, #3
        mov     ip, r0
        orr     r0, r2, r0, lsl #3
        orr     r1, r1, ip, lsr #29
        orr     r1, r1, r3
        bx      lr



Am I missing something?
Comment 22 Wilco 2016-10-27 14:27:12 UTC
(In reply to Bernd Edlinger from comment #21)
> (In reply to wilco from comment #20)
> > > Wilco, where have you seen the additional registers used with my
> > > previous patch, maybe we can try to fix that somehow?
> > 
> > What happens is that the move of zero causes us to use extra registers in
> > shifts as both source and destination are now always live at the same time.
> > We generate worse code for simple examples like x | (y << 3):
> > 
> > -mfpu=vfp:
> > 	push	{r4, r5}
> > 	lsls	r5, r1, #3
> > 	orr	r5, r5, r0, lsr #29
> > 	lsls	r4, r0, #3
> > 	orr	r0, r4, r2
> > 	orr	r1, r5, r3
> > 	pop	{r4, r5}
> > 	bx	lr
> > -mfpu=neon:
> > 	lsls	r1, r1, #3
> > 	orr	r1, r1, r0, lsr #29
> > 	lsls	r0, r0, #3
> > 	orrs	r0, r0, r2
> > 	orrs	r1, r1, r3
> > 	bx	lr
> > 
> 
> hmm. I think with my patch reverted the code is the same.
> 
> I tried -O2 -marm -mfpu=vfp -mhard-float get the first variant
> with and without patch.

Yes that's what I get.

> For -O2 -marm -mfpu=vfp -msoft-float I get the second variant
> with and witout patch.

This still gives the first variant for me.

> For -O2 -marm -mfpu=neon -mhard-float I get the second variant

Right.

> With -O2 -marm -mfpu=neon -msoft-float I get a third variant
> again with and without patch:
> 
>         lsl     r1, r1, #3
>         mov     ip, r0
>         orr     r0, r2, r0, lsl #3
>         orr     r1, r1, ip, lsr #29
>         orr     r1, r1, r3
>         bx      lr

I don't see this...

> Am I missing something?

What I meant is that your patch still makes a large difference on the original test case despite making no difference in simple cases like the above.

Anyway, there is another bug: on AArch64 we correctly recognize there are 8 1-byte loads, shifts and orrs which can be replaced by a single 8-byte load and a byte reverse. Although it is recognized on ARM and works correctly if it is a little endian load, it doesn't perform the optimization if a byte reverse is needed. As a result there are lots of 64-bit shifts and orrs which create huge register pressure if not expanded early.
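
A minimal sketch of the byte-gathering idiom meant here, modeled on OpenSSL's PULL64 (the helper name below is hypothetical): eight 1-byte loads combined with shifts and ORs into one big-endian 64-bit value, which the bswap pass is meant to turn into a single 8-byte load plus a byte reverse (rev).

#include <stdint.h>

static uint64_t pull64 (const unsigned char *p)
{
  return ((uint64_t) p[0] << 56) | ((uint64_t) p[1] << 48)
       | ((uint64_t) p[2] << 40) | ((uint64_t) p[3] << 32)
       | ((uint64_t) p[4] << 24) | ((uint64_t) p[5] << 16)
       | ((uint64_t) p[6] << 8)  |  (uint64_t) p[7];
}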

This testcase is turning out to be a goldmine of bugs...
Comment 23 Bernd Edlinger 2016-10-27 16:51:34 UTC
(In reply to wilco from comment #22)
> 
> What I meant is that your patch still makes a large difference on the
> original test case despite making no difference in simple cases like the
> above.

For sure it is papering over something, just like the whole init-regs pass does;
it is even documented to do that:

/* Check all of the uses of pseudo variables.  If any use that is MUST
   uninitialized, add a store of 0 immediately before it.  For
   subregs, this makes combine happy.  For full word regs, this makes
   other optimizations, like the register allocator and the reg-stack
   happy as well as papers over some problems on the arm and other
   processors where certain isa constraints cannot be handled by gcc.

I have seen the DI = 0 decay into two SI = 0 and finally be removed
well before the init-regs pass runs, and the init-regs pass finds
nothing to do in this test case.  Nevertheless they have a very
positive influence on the LRA pass.  At the moment I do not see
what could replace this.  Magic.
 
> Anyway, there is another bug: on AArch64 we correctly recognize there are 8
> 1-byte loads, shifts and orrs which can be replaced by a single 8-byte load
> and a byte reverse. Although it is recognized on ARM and works correctly if
> it is a little endian load, it doesn't perform the optimization if a byte
> reverse is needed. As a result there are lots of 64-bit shifts and orrs
> which create huge register pressure if not expanded early.
> 
> This testcase is turning out to be a goldmine of bugs...


Yes, and the test case can be modified to exercise other insns too.

For instance I just added di-mode ~ to the sigma blocks:

#define Sigma0(x)       ~(ROTR((x),28) ^ ROTR((x),34) ^ ROTR((x),39))
#define Sigma1(x)       ~(ROTR((x),14) ^ ROTR((x),18) ^ ROTR((x),41))
#define sigma0(x)       ~(ROTR((x),1)  ^ ROTR((x),8)  ^ ((x)>>7))
#define sigma1(x)       ~(ROTR((x),19) ^ ROTR((x),61) ^ ((x)>>6))

and saw the stack use double to 528 with -marm -mfpu=vfp -msoft-float -Os;
when I disable the one_cmpldi2 pattern it goes back to 278 again.

Thus I will add this to the second patch:

@@ -5020,7 +5020,7 @@
 (define_insn_and_split "one_cmpldi2"
   [(set (match_operand:DI 0 "s_register_operand"	 "=w,&r,&r,?w")
 	(not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))]
-  "TARGET_32BIT"
+  "TARGET_32BIT && TARGET_HARD_FLOAT"
   "@
    vmvn\t%P0, %P1
    #


Not every di2 pattern is harmful; unary minus, for instance, does nothing.
It is mostly the patterns that mix =w and =r alternatives.
Comment 24 Bernd Edlinger 2016-10-27 17:20:19 UTC
(In reply to Bernd Edlinger from comment #23)
> @@ -5020,7 +5020,7 @@
>  (define_insn_and_split "one_cmpldi2"
>    [(set (match_operand:DI 0 "s_register_operand"	 "=w,&r,&r,?w")
>  	(not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))]

BTW: who knows what the point is of having
=w
 w

on the first alternative, and on the 4th alternative
?w
 w

"?" only makes the alternative less attractive, but it is otherwise
exactly the same as the first one, right?
Comment 25 Wilco 2016-10-27 17:26:59 UTC
(In reply to Bernd Edlinger from comment #24)
> (In reply to Bernd Edlinger from comment #23)
> > @@ -5020,7 +5020,7 @@
> >  (define_insn_and_split "one_cmpldi2"
> >    [(set (match_operand:DI 0 "s_register_operand"	 "=w,&r,&r,?w")
> >  	(not:DI (match_operand:DI 1 "s_register_operand" " w, 0, r, w")))]
> 
> BTW: who knows what it is good for to have
> =w
>  w
> 
> on the first alternative, and on the 4th alternative
> ?w
>  w
> 
> "?" does only make the alternative less attractive, but it is otherwise
> exactly the same as the first one, right?

Alternatives can be disabled, there are flags, eg:

(set_attr "arch" "neon_for_64bits,*,*,avoid_neon_for_64bits")

I think the NEON alternatives not only do not help at all, they actually cause CQ to be far worse even when no NEON instructions are ever generated.
Comment 26 Bernd Edlinger 2016-10-27 18:00:24 UTC
(In reply to wilco from comment #25)
> 
> Alternatives can be disabled, there are flags, eg:
> 
> (set_attr "arch" "neon_for_64bits,*,*,avoid_neon_for_64bits")
> 

Ok I see, thanks.

Still, lots of insns could be simplified into fewer alternatives,
when the attributes are identical of course.  Wouldn't that create
smaller tables and use less memory to compile?

> I think the NEON alternatives not only do not help at all, they actually
> generate cause CQ to be far worse even when no NEON instructions are ever
> generated.

Yes.  At least when no fpu is available, there can't possibly be
any benefit from these insns.  Sure there will be cases when
the fpu instructions can improve something, just not here.
Comment 27 Wilco 2016-10-28 12:08:22 UTC
(In reply to Bernd Edlinger from comment #26)
> (In reply to wilco from comment #25)
> > 
> > Alternatives can be disabled, there are flags, eg:
> > 
> > (set_attr "arch" "neon_for_64bits,*,*,avoid_neon_for_64bits")
> > 
> 
> Ok I see, thanks.
> 
> Still lots of insns could be simplified into less alternatives,
> when the attribs are identical of course.  Doesn't that create
> smaller tables and use less memory to compile?

Quite likely yes, but more importantly it will generate better code by not accidentally allocating to FP registers. There are many integer variants that have a '?' which seems incorrect.

> > I think the NEON alternatives not only do not help at all, they actually
> > generate cause CQ to be far worse even when no NEON instructions are ever
> > generated.
> 
> Yes.  At least when not fpu is available, there can't possibly be
> any benefit from these insns.  Sure there will be cases when
> the fpu instructions can improve something, just not here.

It's hard to imagine cases where using NEON for 64-bit operations is beneficial - the extra moves to and from NEON registers are expensive on many cores, and NEON operations have higher latencies than equivalent integer instructions.

Note I tried setting SLOW_UNALIGNED_ACCESS to false in addition to your patch. This emits the rev instruction as expected, and size goes down to ~4400 instructions with -mthumb -march=armv7 -mfloat-abi=soft -Os. With -O2 -mthumb -mfpu=vfp the reduction is even more dramatic from around 7300 instructions to just 5000...
Comment 28 Bernd Edlinger 2016-10-31 13:02:40 UTC
With my latest patch I bootstrapped a configuration with
--with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16 --with-float=hard

I noticed a single regression in gcc.target/arm/pr53447-*.c

That is caused by disabling the adddi3 expansion.

void t0p(long long * p)
{
  *p += 0x100000001;
}

used to get compiled to this at -O2:

	ldrd	r2, [r0]
	adds	r2, r2, #1
	adc	r3, r3, #1
	strd	r2, [r0]
	bx	lr

but without the adddi3 pattern I have at -O2:

	ldr	r3, [r0]
	ldr	r1, [r0, #4]
	cmn	r3, #1
	add	r3, r3, #1
	movcc	r2, #0
	movcs	r2, #1
	add	r1, r1, #1
	str	r3, [r0]
	add	r3, r2, r1
	str	r3, [r0, #4]
	bx	lr

Note that the ldrd instructions are also gone.

Unfortunately the other di3 patterns also make the ldrd go away:

void t0p(long long * p)
{
  *p |= 0x100000001;
}

was compiled with iordi3 like this:

	ldrd	r2, [r0]
	orr	r2, r2, #1
	orr	r3, r3, #1
	strd	r2, [r0]
	bx	lr

and without iordi3:

	ldm	r0, {r2, r3}
	orr	r2, r2, #1
	orr	r3, r3, #1
	stm	r0, {r2, r3}
	bx	lr

but 

void t0p(long long * p)
{
  p[1] |= 0x100000001;
}


gets two loads instead:

	ldr	r2, [r0, #8]
	ldr	r3, [r0, #12]
	orr	r2, r2, #1
	orr	r3, r3, #1
	str	r2, [r0, #8]
	str	r3, [r0, #12]
	bx	lr

however:

void t0p(long long * p)
{
  p[1] <<= 11;
}

gets compiled into this:

	ldr	r3, [r0, #12]
	ldr	r2, [r0, #8]
	lsl	r3, r3, #11
	lsl	r1, r2, #11
	orr	r3, r3, r2, lsr #21
	str	r1, [r0, #8]
	str	r3, [r0, #12]
	bx	lr


already before my patch.

I think this is the effect on the ldrd that you already mentioned,
and it gets worse when the expansion breaks the di registers up
into two si registers.
Comment 29 Wilco 2016-10-31 14:00:42 UTC
(In reply to Bernd Edlinger from comment #28)
> With my latest patch I bootstrapped a configuration with
> --with-arch=armv7-a --with-tune=cortex-a9 --with-fpu=vfpv3-d16
> --with-float=hard
> 
> I noticed a single regression in gcc.target/arm/pr53447-*.c
> 
> That is caused by disabling the adddi3 expansion.
> 
> void t0p(long long * p)
> {
>   *p += 0x100000001;
> }
> 
> used to get compiled to this at -O2:
> 
> 	ldrd	r2, [r0]
> 	adds	r2, r2, #1
> 	adc	r3, r3, #1
> 	strd	r2, [r0]
> 	bx	lr
> 
> but without the adddi3 pattern I have at -O2:
> 
> 	ldr	r3, [r0]
> 	ldr	r1, [r0, #4]
> 	cmn	r3, #1
> 	add	r3, r3, #1
> 	movcc	r2, #0
> 	movcs	r2, #1
> 	add	r1, r1, #1
> 	str	r3, [r0]
> 	add	r3, r2, r1
> 	str	r3, [r0, #4]
> 	bx	lr

That's because your patch disables adddi3 completely, which is not correct. We want to use the existing integer sequence, just expanded earlier. Instead of your change, removing the "&& reload_completed" from the arm_adddi3 instruction means we expand before register allocation:

	ldr	r3, [r0]
	ldr	r2, [r0, #4]
	adds	r3, r3, #1
	str	r3, [r0]
	adc	r2, r2, #16
	str	r2, [r0, #4]
	bx	lr

> Note that also the ldrd instructions are not there.

Yes that's yet another bug...

> I think this is the effect on the ldrd that you already mentioned,
> and it gets worse when the expansion breaks the di registers up
> into two si registers.

Indeed, splitting early means we end up with 2 loads. However in most cases we should be able to gather the loads and emit LDRD/STRD on Thumb-2 (ARM's LDRD/STRD is far more limited so not as useful). Combine could help with merging 2 loads/stores into a single instruction.
Comment 30 Richard Earnshaw 2016-10-31 17:41:57 UTC
(In reply to wilco from comment #29)
>  Combine could help with
> merging 2 loads/stores into a single instruction.

No, combine works strictly on dataflow dependencies.  Two stores cannot be dataflow related so won't be combined.  Loads would only be dataflow related if both loads fed into *exactly* one data-processing instruction after the split.  That's unlikely to happen so I very much doubt it would happen there either.
Comment 31 Bernd Edlinger 2016-10-31 19:48:01 UTC
Sure, combine can't help, especially because it runs before split1.

But I wondered why this peephole2 is not enabled:

(define_peephole2 ; ldrd
  [(set (match_operand:SI 0 "arm_general_register_operand" "")
        (match_operand:SI 2 "memory_operand" ""))
   (set (match_operand:SI 1 "arm_general_register_operand" "")
        (match_operand:SI 3 "memory_operand" ""))]
  "TARGET_LDRD
     && current_tune->prefer_ldrd_strd
     && !optimize_function_for_size_p (cfun)"
  [(const_int 0)]


I have -march=armv7-a / -mcpu=cortex-a9, and thus for me
current_tune->prefer_ldrd_strd is FALSE.

Furthermore, if I want to do -Os the third condition is FALSE too.
But one ldrd must be shorter than two ldr?

That seems wrong...
Comment 32 Wilco 2016-10-31 20:55:47 UTC
(In reply to Bernd Edlinger from comment #31)
> Sure, combine cant help, especially because it runs before split1.
> 
> But I wondered why this peephole2 is not enabled:
> 
> (define_peephole2 ; ldrd
>   [(set (match_operand:SI 0 "arm_general_register_operand" "")
>         (match_operand:SI 2 "memory_operand" ""))
>    (set (match_operand:SI 1 "arm_general_register_operand" "")
>         (match_operand:SI 3 "memory_operand" ""))]
>   "TARGET_LDRD
>      && current_tune->prefer_ldrd_strd
>      && !optimize_function_for_size_p (cfun)"
>   [(const_int 0)]
> 
> 
> I have -march=armv7-a / -mcpu=cortex-a9 and thus for me
> current_tune-> prefer_ldrd_strd is FALSE.
> 
> Furthermore, if I want to do -Os the third condition is FALSE too.
> But one ldrd must be shorter than two ldr ?
> 
> That seems wrong...

Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM should only be tried on Thumb-1. Emitting LDRD from a peephole when the offset is in range will never increase code size so should always be enabled.
Comment 33 Richard Earnshaw 2016-11-01 09:49:12 UTC
(In reply to Wilco from comment #32)
> (In reply to Bernd Edlinger from comment #31)
> > Furthermore, if I want to do -Os the third condition is FALSE too.
> > But one ldrd must be shorter than two ldr ?
> > 
> > That seems wrong...
> 
> Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
> should only be tried on Thumb-1. Emitting LDRD from a peephole when the
> offset is in range will never increase code size so should always be enabled.

The logic is certainly strange.  Some cores run LDRD less quickly than they can do LDM, or even two independent loads.  I suspect the logic is meant to be: use LDRD if available and not (optimizing for speed on a slow LDRD-device).
Comment 34 Bernd Edlinger 2016-11-01 11:22:23 UTC
(In reply to Richard Earnshaw from comment #33)
> (In reply to Wilco from comment #32)
> > (In reply to Bernd Edlinger from comment #31)
> > > Furthermore, if I want to do -Os the third condition is FALSE too.
> > > But one ldrd must be shorter than two ldr ?
> > > 
> > > That seems wrong...
> > 
> > Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
> > should only be tried on Thumb-1. Emitting LDRD from a peephole when the
> > offset is in range will never increase code size so should always be enabled.
> 
> The logic is certainly strange.  Some cores run LDRD less quickly than they
> can do LDM, or even two independent loads.  I suspect the logic is meant to
> be: use LDRD if available and not (optimizing for speed on a slow
> LDRD-device).

Ok, so instead of removing this completely I should change it to:
   TARGET_LDRD
   && (current_tune->prefer_ldrd_strd
       || optimize_function_for_size_p (cfun))
Comment 35 Wilco 2016-11-01 11:33:09 UTC
(In reply to Richard Earnshaw from comment #30)
> (In reply to wilco from comment #29)
> >  Combine could help with
> > merging 2 loads/stores into a single instruction.
> 
> No, combine works strictly on dataflow dependencies.  Two stores cannot be
> dataflow related so won't be combined.  Loads would only be dataflow related
> if both loads fed into *exactly* one data-processing instruction after the
> split.  That's unlikely to happen so I very much dobut it would happen there
> either.

Right, so then either we need to look further when creating ldm/ldrd, or when splitting use a parallel of 2 SI mode loads.

(In reply to Richard Earnshaw from comment #33)
> (In reply to Wilco from comment #32)
> > (In reply to Bernd Edlinger from comment #31)
> > > Furthermore, if I want to do -Os the third condition is FALSE too.
> > > But one ldrd must be shorter than two ldr ?
> > > 
> > > That seems wrong...
> > 
> > Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
> > should only be tried on Thumb-1. Emitting LDRD from a peephole when the
> > offset is in range will never increase code size so should always be enabled.
> 
> The logic is certainly strange.  Some cores run LDRD less quickly than they
> can do LDM, or even two independent loads.  I suspect the logic is meant to
> be: use LDRD if available and not (optimizing for speed on a slow
> LDRD-device).

The issue is that the behaviour is not consistent. If DI mode accesses are split early, LDRD is not used, but if not split, LDRD is used even on cores where LDRD is not preferred or slow.

Selecting -mcpu=cortex-a57 while splitting early gives:

t0p:
	ldrd	r3, r2, [r0]
	adds	r3, r3, #1
	adc	r2, r2, #0
	strd	r3, r2, [r0]
	bx	lr

But with -mcpu=cortex-a53 (with -O2 or -Os):

t0p:
	ldr	r3, [r0]
	ldr	r2, [r0, #4]
	adds	r3, r3, #1
	str	r3, [r0]
	adc	r2, r2, #0
	str	r2, [r0, #4]
	bx	lr

GCC currently emits LDRD for both cases - so clearly LDRD was preferred...
Comment 36 Wilco 2016-11-01 11:43:29 UTC
(In reply to Bernd Edlinger from comment #34)
> (In reply to Richard Earnshaw from comment #33)
> > (In reply to Wilco from comment #32)
> > > (In reply to Bernd Edlinger from comment #31)
> > > > Furthermore, if I want to do -Os the third condition is FALSE too.
> > > > But one ldrd must be shorter than two ldr ?
> > > > 
> > > > That seems wrong...
> > > 
> > > Indeed, on a target that supports LDRD you want to use LDRD if legal. LDM
> > > should only be tried on Thumb-1. Emitting LDRD from a peephole when the
> > > offset is in range will never increase code size so should always be enabled.
> > 
> > The logic is certainly strange.  Some cores run LDRD less quickly than they
> > can do LDM, or even two independent loads.  I suspect the logic is meant to
> > be: use LDRD if available and not (optimizing for speed on a slow
> > LDRD-device).
> 
> Ok, so instead of removing this completely I should change it to:
>    TARGET_LDRD
>    && (current_tune->prefer_ldrd_strd
>        || optimize_function_for_size_p (cfun))

That's better but still won't emit LDRD as it seems most cores have prefer_ldrd_strd disabled... Given that we currently always emit LDRD/STRD for DI mode accesses, this should just check TARGET_LDRD.
Comment 37 Richard Earnshaw 2016-11-01 11:44:28 UTC
(In reply to Bernd Edlinger from comment #34)
> (In reply to Richard Earnshaw from comment #33)

> > The logic is certainly strange.  Some cores run LDRD less quickly than they
> > can do LDM, or even two independent loads.  I suspect the logic is meant to
> > be: use LDRD if available and not (optimizing for speed on a slow
> > LDRD-device).
> 
> Ok, so instead of removing this completely I should change it to:
>    TARGET_LDRD
>    && (current_tune->prefer_ldrd_strd
>        || optimize_function_for_size_p (cfun))

That sounds about right.  Note that the original patch, back in 2013, said:

"* does not attempt to generate LDRD/STRD when optimizing for size and non of
the LDM/STM patterns match (but it would be easy to add),"
(https://gcc.gnu.org/ml/gcc-patches/2013-02/msg00604.html)

So it appears that this case was not attempted at the time.

I think when LDRD is not preferred we'd want to try to use LDM by preference if the address offsets support it, even when optimizing for size.  Otherwise, use LDRD if it supports the operation.
Comment 38 Bernd Edlinger 2016-11-01 14:31:02 UTC
Created attachment 39939 [details]
proposed patch, v2

Hi,

this is a new version that tries to fix the fallout of
the previous attempt.

I will attempt a bootstrap and reg-test later this week.

It splits the logical di3 patterns right at expansion
when !TARGET_HARD_FLOAT or !TARGET_IWMMXT, in order not to
break the neon/iwmmxt patterns that seem to depend on them.

Simply disabling the logical di3 pattern made it impossible
to merge the ldrd/strd later because the ldr/str got expanded
too far away from each other.

It splits adddi3/subdi3 in the split1 pass, but only when
!TARGET_HARD_FLOAT, because other hard-float patterns seem
to depend on them.

Note that the setting of the out register in the shift
expansion is only necessary in the case of -mfpu=vfp -mhard-float;
in all other configurations it is now unnecessary.

So far I have only benchmarked with the sha512 test case
and a modified sha512 with the Sigma blocks decorated with bit-not (~).

Checked that the pr53447-*.c test cases work again.

Checked that this test case emits all ldrd/strd where expected:

cat test.c
void foo(long long* p)
{
  p[1] |= 0x100000001;
  p[2] &= 0x100000001;
  p[3] ^= 0x100000001;
  p[4] += 0x100000001;
  p[5] -= 0x100000001;
  p[6] = ~p[6];
  p[7] <<= 5;
  p[8] >>= 5;
  p[9] -= p[10];
}

At -Os -mthumb -march=armv7-a with -msoft-float / -mhard-float,
this patch improves the number of ldrd/strd to 100%.

I wonder if it is OK to emit ldrd at all when optimizing
for speed, given that they are considered slower than ldm / 2x ldr?

With -Os -mfpu=neon / -mfpu=vfp / -march=iwmmxt: checked that the stack usage
is still the same, around 2328 bytes.

With -Os -marm / thumb2: made sure that the stack usage is still 272 bytes.

Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
because thumb1 cannot split the adddi3 pattern, once it is emitted.
Comment 39 Bernd Edlinger 2016-11-01 14:35:54 UTC
Created attachment 39940 [details]
proposed patch, v2

The last upload was accidentally truncated.
I have uploaded the right patch.

Sorry.
Comment 40 Bernd Edlinger 2016-11-01 14:52:10 UTC
BTW: I found something strange in this pattern in neon.md:

(define_insn_and_split "orndi3_neon"
  [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?&r")
        (ior:DI (not:DI (match_operand:DI 2 "s_register_operand" "w,0,0,r"))
                (match_operand:DI 1 "s_register_operand" "w,r,r,0")))]
  "TARGET_NEON"
  "@
   vorn\t%P0, %P1, %P2
   #
   #
   #"
  "reload_completed && 
   (TARGET_NEON && !(IS_VFP_REGNUM (REGNO (operands[0]))))"
  [(set (match_dup 0) (ior:SI (not:SI (match_dup 2)) (match_dup 1)))
   (set (match_dup 3) (ior:SI (not:SI (match_dup 4)) (match_dup 5)))]
  "
  {
    if (TARGET_THUMB2)
      {
        operands[3] = gen_highpart (SImode, operands[0]);
        operands[0] = gen_lowpart (SImode, operands[0]);
        operands[4] = gen_highpart (SImode, operands[2]);
        operands[2] = gen_lowpart (SImode, operands[2]);
        operands[5] = gen_highpart (SImode, operands[1]);
        operands[1] = gen_lowpart (SImode, operands[1]);
      }
    else
      {
        emit_insn (gen_one_cmpldi2 (operands[0], operands[2]));
        emit_insn (gen_iordi3 (operands[0], operands[1], operands[0]));
        DONE;
      }
  }"
  [(set_attr "type" "neon_logic,multiple,multiple,multiple")
   (set_attr "length" "*,16,8,8")
   (set_attr "arch" "any,a,t2,t2")]
)


I think in alternative #4 we have operands[0] == operands[1]
and operands[2] != operands[0],

and then gen_one_cmpldi2 (operands[0], operands[2])
will overwrite operands[1] before it is used in
gen_iordi3 (operands[0], operands[1], operands[0])??
Comment 41 Wilco 2016-11-01 15:47:29 UTC
(In reply to Bernd Edlinger from comment #40)
> BTW: I found something strange in this pattern in neon.md:
> 
> (define_insn_and_split "orndi3_neon"
>   [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?&r")
>         (ior:DI (not:DI (match_operand:DI 2 "s_register_operand" "w,0,0,r"))
>                 (match_operand:DI 1 "s_register_operand" "w,r,r,0")))]
>   "TARGET_NEON"
>   "@
>    vorn\t%P0, %P1, %P2
>    #
>    #
>    #"
>   "reload_completed && 
>    (TARGET_NEON && !(IS_VFP_REGNUM (REGNO (operands[0]))))"
>   [(set (match_dup 0) (ior:SI (not:SI (match_dup 2)) (match_dup 1)))
>    (set (match_dup 3) (ior:SI (not:SI (match_dup 4)) (match_dup 5)))]
>   "
>   {
>     if (TARGET_THUMB2)
>       {
>         operands[3] = gen_highpart (SImode, operands[0]);
>         operands[0] = gen_lowpart (SImode, operands[0]);
>         operands[4] = gen_highpart (SImode, operands[2]);
>         operands[2] = gen_lowpart (SImode, operands[2]);
>         operands[5] = gen_highpart (SImode, operands[1]);
>         operands[1] = gen_lowpart (SImode, operands[1]);
>       }
>     else
>       {
>         emit_insn (gen_one_cmpldi2 (operands[0], operands[2]));
>         emit_insn (gen_iordi3 (operands[0], operands[1], operands[0]));
>         DONE;
>       }
>   }"
>   [(set_attr "type" "neon_logic,multiple,multiple,multiple")
>    (set_attr "length" "*,16,8,8")
>    (set_attr "arch" "any,a,t2,t2")]
> )
> 
> 
> I think in alternative#4 we have operands[0] == operands[1]
> and operands[2] != operands[0]
> 
> and then gen_one_cmpldi2 (operands[0], operands[2])
> will overwrite the operand[1] before it is used in
> gen_iordi3 (operands[0], operands[1], operands[0]) ??

ARM only uses the 2nd alternative (set_attr "arch" "any,a,t2,t2"), so this is correct. There is no need to support this pattern for ARM as ARM doesn't have ORN, and once we expand early the whole pattern becomes redundant.
Comment 42 Wilco 2016-11-01 15:56:00 UTC
(In reply to Bernd Edlinger from comment #40)
> BTW: I found something strange in this pattern in neon.md:
> 
> (define_insn_and_split "orndi3_neon"
>   [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?&r")
>         (ior:DI (not:DI (match_operand:DI 2 "s_register_operand" "w,0,0,r"))
>                 (match_operand:DI 1 "s_register_operand" "w,r,r,0")))]

Also it would be easy to support "&r,r,0" by doing op0 = ~(op0 = op2 & ~op0) so there was no need to have ARM and Thumb-2 specific alternatives...
Comment 43 Bernd Edlinger 2016-11-01 16:11:05 UTC
(In reply to wilco from comment #41)
> 
> ARM only uses the 2nd alternative (set_attr "arch" "any,a,t2,t2"), so this
> is correct. There is no need to support this pattern for ARM as ARM doesn't
> have ORN, and we expand early the whole pattern becomes redundant.

Oh I see.  Thanks for clarifying that.
Comment 44 Wilco 2016-11-01 16:26:46 UTC
(In reply to Bernd Edlinger from comment #38)
> Created attachment 39939 [details]
> proposed patch, v2
> 

> Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> because thumb1 cannot split the adddi3 pattern, once it is emitted.

We can split into a new pattern that contains adds/adc together. Splitting should help Thumb-1 the most as it has just 3 allocatable DI mode registers...
Comment 45 Bernd Edlinger 2016-11-01 16:49:32 UTC
(In reply to wilco from comment #44)
> (In reply to Bernd Edlinger from comment #38)
> > Created attachment 39939 [details]
> > proposed patch, v2
> > 
> 
> > Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> > because thumb1 cannot split the adddi3 pattern, once it is emitted.
> 
> We can split into a new pattern that contains adds/adc together. Splitting
> should help Thumb-1 the most as it has just 3 allocatable DI mode
> registers...

But we need to split the adds and the adc into two separate
patterns; then it can happen that the adc instruction's result
is unused, and that propagates to the inputs.

But since I read this comment in thumb1.md I have doubts:

;; Beware of splitting Thumb1 patterns that output multiple
;; assembly instructions, in particular instruction such as SBC and
;; ADC which consume flags.  For example, in the pattern thumb_subdi3
;; below, the output SUB implicitly sets the flags (assembled to SUBS)
;; and then the Carry flag is used by SBC to compute the correct
;; result.  If we split thumb_subdi3 pattern into two separate RTL
;; insns (using define_insn_and_split), the scheduler might place
;; other RTL insns between SUB and SBC, possibly modifying the Carry
;; flag used by SBC.  This might happen because most Thumb1 patterns
;; for flag-setting instructions do not have explicit RTL for setting
;; or clobbering the flags.  Instead, they have the attribute "conds"
;; with value "set" or "clob".  However, this attribute is not used to
;; identify dependencies and therefore the scheduler might reorder
;; these instruction.  Currenly, this problem cannot happen because
;; there are no separate Thumb1 patterns for individual instruction
;; that consume flags (except conditional execution, which is treated
;; differently).  In particular there is no Thumb1 armv6-m pattern for
;; sbc or adc.


Disabling the adddi3 pattern worked with control flow instead of
passing the Carry flag from the adds to the adc pattern.
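
A sketch (hypothetical C, not the actual RTL expansion) of what a 64-bit add looks like without an adds/adc pair: the carry is recomputed from a comparison, which is what the movcc/movcs sequence in comment 28 corresponds to.

void add64 (unsigned int d[2], const unsigned int a[2], const unsigned int b[2])
{
  unsigned int lo = a[0] + b[0];
  unsigned int carry = lo < a[0];   /* compare instead of the carry flag */
  d[0] = lo;
  d[1] = a[1] + b[1] + carry;
}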

In the sha512 test case that was still profitable, but I think
that will not be the case in general.

I can live with the state of thumb1 at the moment.

I am more interested in early expansion of di patterns
in vfp / avoid_neon_for_64bits and so on.

Maybe if the user explicitly wants neon_for_64bits, so be it.
Comment 46 Richard Earnshaw 2016-11-01 16:57:11 UTC
(In reply to wilco from comment #44)
> (In reply to Bernd Edlinger from comment #38)
> > Created attachment 39939 [details]
> > proposed patch, v2
> > 
> 
> > Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> > because thumb1 cannot split the adddi3 pattern, once it is emitted.
> 
> We can split into a new pattern that contains adds/adc together. Splitting
> should help Thumb-1 the most as it has just 3 allocatable DI mode
> registers...

Not on Thumb-1 we can't.  Because of register allocation limitations, we cannot expose the flags until after register allocation has completed.  (The register allocator needs to be able to insert loads, adds and copy instructions between any two insns; the add and copy instructions clobber the flags, making early splitting impossible.)
Comment 47 Wilco 2016-11-01 17:17:10 UTC
(In reply to Richard Earnshaw from comment #46)
> (In reply to wilco from comment #44)
> > (In reply to Bernd Edlinger from comment #38)
> > > Created attachment 39939 [details]
> > > proposed patch, v2
> > > 
> > 
> > > Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> > > because thumb1 cannot split the adddi3 pattern, once it is emitted.
> > 
> > We can split into a new pattern that contains adds/adc together. Splitting
> > should help Thumb-1 the most as it has just 3 allocatable DI mode
> > registers...
> 
> Not on Thumb-1 we can't.  Because of register allocation limitations, we
> cannot expose the flags until after register allocation has completed. 
> (Since the register allocator needs to be able to insert loads, adds and
> copy instructions between any two insns.  The add and copy instructions
> clobber the flags, making early splitting impossible.

What I meant is splitting into a single new instruction using SI mode registers rather than DI mode registers so that register allocation is more efficient.
Comment 48 Bernd Edlinger 2016-11-02 09:24:51 UTC
(In reply to wilco from comment #22)
> 
> Anyway, there is another bug: on AArch64 we correctly recognize there are 8
> 1-byte loads, shifts and orrs which can be replaced by a single 8-byte load
> and a byte reverse. Although it is recognized on ARM and works correctly if
> it is a little endian load, it doesn't perform the optimization if a byte
> reverse is needed. As a result there are lots of 64-bit shifts and orrs
> which create huge register pressure if not expanded early.

Hmm...

I think the test case does something invalid here:

const SHA_LONG64 *W = in;

T1 = X[0] = PULL64(W[0]);


in is not aligned, but it is cast to an 8-byte aligned type.

If, with your proposed patch, the bswap pass assumes
it is OK to merge 4-byte accesses into an aligned word access,
it is likely to break openssl on -mno-unaligned-access targets.
Even on our cortex-a9 the OS will trap on unaligned accesses.
I have checked that openssl still works on arm-none-eabi 
with my patch, but I am not sure about your patch.
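
For comparison, a minimal sketch of the two access styles on a little-endian
host (the names are illustrative, not taken from openssl): the cast-based
load asserts an 8-byte alignment that the pointer does not have, while a
memcpy-based load stays alignment-safe and is still recognizable as a load
plus byte swap.

#include <stdint.h>
#include <string.h>

/* Invalid on strict-alignment targets if 'in' is not 8-byte aligned:
   the cast claims an alignment the pointer may not have.  */
static uint64_t load_be64_cast(const void *in)
{
    const uint64_t *w = (const uint64_t *)in;
    return __builtin_bswap64(*w);
}

/* Alignment-safe variant: memcpy lets the compiler choose an access
   sequence the target supports, and the bswap pass can still merge it
   into a wide load plus byte reverse where that is legal.  */
static uint64_t load_be64_memcpy(const void *in)
{
    uint64_t w;
    memcpy(&w, in, sizeof w);
    return __builtin_bswap64(w);
}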
Comment 49 Bernd Edlinger 2016-11-02 10:10:15 UTC
(In reply to Bernd Edlinger from comment #48)
> (In reply to wilco from comment #22)
> > 
> > Anyway, there is another bug: on AArch64 we correctly recognize there are 8
> > 1-byte loads, shifts and orrs which can be replaced by a single 8-byte load
> > and a byte reverse. Although it is recognized on ARM and works correctly if
> > it is a little endian load, it doesn't perform the optimization if a byte
> > reverse is needed. As a result there are lots of 64-bit shifts and orrs
> > which create huge register pressure if not expanded early.
> 
> Hmm...
> 
> I think the test case does something invalid here:
> 
> const SHA_LONG64 *W = in;
> 
> T1 = X[0] = PULL64(W[0]);
> 
> 
> in is not aligned, but it is cast to a 8-byte aligned type.
> 
> If the bswap pass assumes with your proposed patch
> it is OK to merge 4 byte accesses into an aligned word access,
> it may likely break openssl on -mno-unaligned targets.
> Even on our cortex-a9 the O/S will trap on unaligned accesses.
> I have checked that openssl still works on arm-none-eabi 
> with my patch, but I am not sure about your patch.

I tried it out.  Although the code is bogus, the generated code
does not rely on the wrong alignment.

With -mno-unaligned-access the ldr is split into 4 ldrb instructions
and the result is fed into the rev.

At least in this configuration that is not profitable though.
Comment 50 Richard Earnshaw 2016-11-02 10:21:38 UTC
(In reply to wilco from comment #47)
> (In reply to Richard Earnshaw from comment #46)
> > (In reply to wilco from comment #44)
> > > (In reply to Bernd Edlinger from comment #38)
> > > > Created attachment 39939 [details]
> > > > proposed patch, v2
> > > > 
> > > 
> > > > Unlike the previous patch, thumb1 stack usage stays at 1588 bytes,
> > > > because thumb1 cannot split the adddi3 pattern, once it is emitted.
> > > 
> > > We can split into a new pattern that contains adds/adc together. Splitting
> > > should help Thumb-1 the most as it has just 3 allocatable DI mode
> > > registers...
> > 
> > Not on Thumb-1 we can't.  Because of register allocation limitations, we
> > cannot expose the flags until after register allocation has completed. 
> > (Since the register allocator needs to be able to insert loads, adds and
> > copy instructions between any two insns.  The add and copy instructions
> > clobber the flags, making early splitting impossible.
> 
> What I meant is splitting into a single new instruction using SI mode
> registers rather than DI mode registers so that register allocation is more
> efficient.

You couldn't do that before combine, since the pattern would have to describe setting both 'result' registers independently.  That would create a pattern that combine couldn't handle (more than one non-flag result), which in turn would stop the compiler from optimizing such a pattern properly.  Note the pattern would probably end up looking like a parallel that sets the high and low parts to the result of the 64-bit operation.

It might help to rewrite the pattern that way after combine, but before register allocation.
Comment 51 Wilco 2016-11-02 11:12:06 UTC
(In reply to Bernd Edlinger from comment #49)
> (In reply to Bernd Edlinger from comment #48)
> > (In reply to wilco from comment #22)
> > > 
> > > Anyway, there is another bug: on AArch64 we correctly recognize there are 8
> > > 1-byte loads, shifts and orrs which can be replaced by a single 8-byte load
> > > and a byte reverse. Although it is recognized on ARM and works correctly if
> > > it is a little endian load, it doesn't perform the optimization if a byte
> > > reverse is needed. As a result there are lots of 64-bit shifts and orrs
> > > which create huge register pressure if not expanded early.
> > 
> > Hmm...
> > 
> > I think the test case does something invalid here:
> > 
> > const SHA_LONG64 *W = in;
> > 
> > T1 = X[0] = PULL64(W[0]);
> > 
> > 
> > in is not aligned, but it is cast to a 8-byte aligned type.
> > 
> > If the bswap pass assumes with your proposed patch
> > it is OK to merge 4 byte accesses into an aligned word access,
> > it may likely break openssl on -mno-unaligned targets.
> > Even on our cortex-a9 the O/S will trap on unaligned accesses.
> > I have checked that openssl still works on arm-none-eabi 
> > with my patch, but I am not sure about your patch.
> 
> I tried it out.  Although the code is bogus the code generation
> does not use the wrong alignment.
> 
> With -mno-unaligned-access the ldr is split out into 4 ldb and
> the result is fed into the rev.
>
> At least in this configuration that is not profitable though.

Indeed, that's the reason behind the existing check.  However, it disables all profitable bswap cases while still generating unaligned accesses if no bswap is needed.  So I am looking for a callback that gives the correct answer.  It would need to check -mno-unaligned-access and the target capabilities (e.g. if unaligned accesses are supported in hardware but really expensive, we want to avoid them).
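
A hypothetical sketch of the kind of check such a callback would perform;
this is not an existing GCC hook, and the names below are made up purely
for illustration:

#include <stdbool.h>

/* Hypothetical target description, invented for illustration only.  */
struct unaligned_caps {
    bool supported;   /* e.g. false under -mno-unaligned-access on ARM  */
    bool slow;        /* supported by the hardware but really expensive */
};

/* Should the bswap pass merge byte loads into one wider, possibly
   unaligned access?  */
static bool merge_unaligned_load_profitable(const struct unaligned_caps *t)
{
    if (!t->supported)
        return false;   /* would fault or trap at run time */
    return !t->slow;    /* legal, but only worth it when reasonably fast */
}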
Comment 52 Bernd Edlinger 2016-11-02 11:57:04 UTC
(In reply to wilco from comment #51)
> 
> Indeed, that's the reason behind the existing check. However it disables all
> profitable bswap cases while still generating unaligned accesses if no bswap
> is needed. So I am looking for a callback that gives the correct answer. It
> would need to check -mno-unaligned-access and the target capabilities (eg.
> if unaligned accesses are supported in hardware but really expensive we want
> to avoid them).

Yes.  I think ARM is becoming a non-strict-alignment platform,
while x86_64 is moving in the opposite direction.

Would it be possible to make STRICT_ALIGNMENT switchable,
like in the rs6000 backend?  In that case you would also have more
flexibility in the handling of the SLOW_UNALIGNED_ACCESS macro.
Comment 53 richard.earnshaw 2016-11-02 14:23:30 UTC
On 02/11/16 11:57, bernd.edlinger at hotmail dot de wrote:
> (In reply to wilco from comment #51)
>>
>> Indeed, that's the reason behind the existing check. However it disables all
>> profitable bswap cases while still generating unaligned accesses if no bswap
>> is needed. So I am looking for a callback that gives the correct answer. It
>> would need to check -mno-unaligned-access and the target capabilities (eg.
>> if unaligned accesses are supported in hardware but really expensive we want
>> to avoid them).
> 
> Yes.  I think ARM is becoming a non-strict-alignment platform.
> While x86_64 is moving in the opposite direction.

It can never be a non-strict alignment platform while some memory access
instructions do not support unaligned accesses.

However, it is progressively becoming a less slow unaligned access platform.

R.
Comment 54 Bernd Edlinger 2016-11-02 15:04:53 UTC
(In reply to richard.earnshaw from comment #53)
> On 02/11/16 11:57, bernd.edlinger at hotmail dot de wrote:
> > (In reply to wilco from comment #51)
> >>
> >> Indeed, that's the reason behind the existing check. However it disables all
> >> profitable bswap cases while still generating unaligned accesses if no bswap
> >> is needed. So I am looking for a callback that gives the correct answer. It
> >> would need to check -mno-unaligned-access and the target capabilities (eg.
> >> if unaligned accesses are supported in hardware but really expensive we want
> >> to avoid them).
> > 
> > Yes.  I think ARM is becoming a non-strict-alignment platform.
> > While x86_64 is moving in the opposite direction.
> 
> It can never be a non-strict alignment platform while some memory access
> instructions do not support unaligned accesses.
> 
> However, it is progressively becoming a less slow unaligned access platform.
> 


But isn't that exactly the same situation as on x86_64:
most instructions support unaligned memory accesses,
and a few data types need a movmisalign_optab?
Comment 55 Wilco 2016-11-03 14:26:20 UTC
(In reply to Bernd Edlinger from comment #39)
> Created attachment 39940 [details]
> proposed patch, v2
> 
> last upload was accidentally truncated.
> uploaded the right patch.

Right, so looking at your patch, I think we should make the LDRD peephole change in a separate patch.  I tried your foo example on all combinations of ARM, Thumb-2, VFP and NEON on various CPUs with both settings of prefer_ldrd_strd.

In all cases the current GCC generates LDRD/STRD, even for zero offsets.  CPUs where prefer_ldrd_strd=false emit LDR/STR for the shifts with -msoft-float or -mfpu=vfp (but not -mfpu=neon).  This is clearly incorrect given that LDRD/STRD is used in all other cases, and prefer_ldrd_strd seems to indicate whether to prefer LDRD/STRD in prolog/epilog and inlined memcpy.

So that means we should remove the odd checks for codesize and current_tune->prefer_ldrd_strd from all the peepholes.
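
For reference, a small C function of the shape being discussed (the original
foo test lives in an earlier attachment and is not reproduced here; this is
only an illustrative stand-in): a 64-bit load, shift and store is exactly
where the ldrd/strd peepholes can combine the two 32-bit halves of each
access.

#include <stdint.h>

/* Illustrative stand-in: with the peepholes enabled, the two 32-bit
   halves of the load and of the store can be combined into LDRD/STRD
   on ARM/Thumb-2 targets that allow it.  */
void shift64(uint64_t *dst, const uint64_t *src)
{
    dst[0] = src[0] >> 1;
}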
Comment 56 Bernd Edlinger 2016-11-03 14:50:43 UTC
(In reply to wilco from comment #55)
> (In reply to Bernd Edlinger from comment #39)
> > Created attachment 39940 [details]
> > proposed patch, v2
> > 
> > last upload was accidentally truncated.
> > uploaded the right patch.
> 
> Right so looking at your patch, I think we should make the LDRD peephole
> change in a separate patch. I tried your foo example on all combinations of
> ARM, Thumb-2, VFP, NEON on various CPUs with both settings of
> prefer_ldrd_strd.
> 
> In all cases the current GCC generates LDRD/STRD, even for zero offsets.
> CPUs where prefer_ldrd_strd=false emit LDR/STR for the shifts with
> -msoft-float or -mfpu=vfp (but not -mfpu=neon). This is clearly incorrect
> given that LDRD/STRD is used in all other cases, and prefer_ldrd_strd seems
> to imply whether to prefer using LDRD/STRD in prolog/epilog and inlined
> memcpy.
> 
> So that means we should remove the odd checks for codesize and
> current_tune->prefer_ldrd_strd from all the peepholes.

Agreed, I can split the patch.

From what I understand, we should never emit ldrd/strd out of
the memmovdi2 pattern when optimizing for speed, and should disable
the peephole in the way I proposed in the patch.
Comment 57 Wilco 2016-11-03 15:21:03 UTC
(In reply to Bernd Edlinger from comment #56)
> (In reply to wilco from comment #55)
> > (In reply to Bernd Edlinger from comment #39)
> > > Created attachment 39940 [details]
> > > proposed patch, v2
> > > 
> > > last upload was accidentally truncated.
> > > uploaded the right patch.
> > 
> > Right so looking at your patch, I think we should make the LDRD peephole
> > change in a separate patch. I tried your foo example on all combinations of
> > ARM, Thumb-2, VFP, NEON on various CPUs with both settings of
> > prefer_ldrd_strd.
> > 
> > In all cases the current GCC generates LDRD/STRD, even for zero offsets.
> > CPUs where prefer_ldrd_strd=false emit LDR/STR for the shifts with
> > -msoft-float or -mfpu=vfp (but not -mfpu=neon). This is clearly incorrect
> > given that LDRD/STRD is used in all other cases, and prefer_ldrd_strd seems
> > to imply whether to prefer using LDRD/STRD in prolog/epilog and inlined
> > memcpy.
> > 
> > So that means we should remove the odd checks for codesize and
> > current_tune->prefer_ldrd_strd from all the peepholes.
> 
> Agreed, I can split the patch.
> 
> From what I understand, we should never emit ldrd/strd out of
> the memmovdi2 pattern when optimizing for speed and disable
> the peephole in the way I proposed it in the patch.

No, that's incorrect.  Not generating LDRD when optimizing for speed means a slowdown on most cores, so it is essential we keep generating LDRD whenever possible.
Comment 58 Bernd Edlinger 2016-11-03 15:26:28 UTC
(In reply to wilco from comment #57)
> (In reply to Bernd Edlinger from comment #56)
> > Agreed, I can split the patch.
> > 
> > From what I understand, we should never emit ldrd/strd out of
> > the memmovdi2 pattern when optimizing for speed and disable
> > the peephole in the way I proposed it in the patch.
> 
> No that's incorrect. Not generating LDRD when optimizing for speed means a
> slowdown on most cores, so it is essential we keep generating LDRD whenever
> possible.

But if that is true, the current setting of prefer_ldrd_strd is wrong
for most cores and should be fixed.
Comment 59 Wilco 2016-11-03 15:42:06 UTC
(In reply to Bernd Edlinger from comment #58)
> (In reply to wilco from comment #57)
> > (In reply to Bernd Edlinger from comment #56)
> > > Agreed, I can split the patch.
> > > 
> > > From what I understand, we should never emit ldrd/strd out of
> > > the memmovdi2 pattern when optimizing for speed and disable
> > > the peephole in the way I proposed it in the patch.
> > 
> > No that's incorrect. Not generating LDRD when optimizing for speed means a
> > slowdown on most cores, so it is essential we keep generating LDRD whenever
> > possible.
> 
> But if that is true, the current setting of prefer_lrdr_strd is wrong
> in most cores, and should be fixed.

The meaning is really: "prefer using ldrd/strd over ldm/stm in function prolog/epilog and inlined memcpy".  So it says something about the performance of large LDMs vs multiple LDRDs, rather than about the performance of a single LDRD vs 2x LDR (basically LDRD doubles the available memory bandwidth, so it is pretty much always a good idea).  And yes, I wouldn't be surprised if the setting is non-optimal for some cores.
Comment 60 Bernd Edlinger 2016-11-03 16:58:00 UTC
Created attachment 39957 [details]
patch for di-mode early splitting

This is the di-mode early splitting patch, which is the same as the
previous one with the ldrdstrd part removed.

So I am currently bootstrapping with this patch.

It tries to reduce the stack usage with -msoft-float
to 272 bytes, and does nothing in thumb1 and vfp/neon modes.
Comment 61 Bernd Edlinger 2016-11-03 17:02:03 UTC
Created attachment 39958 [details]
patch for enabling ldrdstrd peephole

And this is what I will bootstrap in the next cycle.

It will enable all ldrd/strd peephole rules, as suggested
by Wilco.
Comment 62 Bernd Edlinger 2016-11-10 07:20:04 UTC
Both parts of the patch are now posted for review:

https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00523.html

https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00830.html
Comment 63 Bernd Edlinger 2016-11-17 13:47:57 UTC
Author: edlinger
Date: Thu Nov 17 13:47:24 2016
New Revision: 242549

URL: https://gcc.gnu.org/viewcvs?rev=242549&root=gcc&view=rev
Log:
2016-11-17  Bernd Edlinger  <bernd.edlinger@hotmail.de>

        PR target/77308
        * config/arm/arm.md (*thumb2_ldrd, *thumb2_ldrd_base,
        *thumb2_ldrd_base_neg, *thumb2_strd, *thumb2_strd_base,
        *thumb2_strd_base_neg): Recognize insn regardless of
        current_tune->prefer_ldrd_strd.
        * config/arm/ldrdstrd.md: Enable all ldrd/strd peephole rules
        whenever possible.

testsuite:
2016-11-17  Bernd Edlinger  <bernd.edlinger@hotmail.de>

        PR target/77308
        * gcc.target/arm/pr53447-5.c: New test.
        * lib/target-supports.exp
        (check_effective_target_arm_prefer_ldrd_strd): Adjust.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr53447-5.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.md
    trunk/gcc/config/arm/ldrdstrd.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/lib/target-supports.exp
Comment 64 Bernd Edlinger 2017-09-04 15:26:31 UTC
Author: edlinger
Date: Mon Sep  4 15:25:59 2017
New Revision: 251663

URL: https://gcc.gnu.org/viewcvs?rev=251663&root=gcc&view=rev
Log:
2017-09-04  Bernd Edlinger  <bernd.edlinger@hotmail.de>

        PR target/77308
        * config/arm/arm.md (*arm_adddi3, *arm_subdi3): Split early except for
        TARGET_NEON and TARGET_IWMMXT.
        (anddi3, iordi3, xordi3, one_cmpldi2): Split while expanding except for
        TARGET_NEON and TARGET_IWMMXT.
        (*one_cmpldi2_insn): Moved the body of one_cmpldi2 here.

testsuite:
2017-09-04  Bernd Edlinger  <bernd.edlinger@hotmail.de>

        PR target/77308
        * gcc.target/arm/pr77308-1.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr77308-1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.md
    trunk/gcc/testsuite/ChangeLog
Comment 65 Bernd Edlinger 2017-09-06 07:48:27 UTC
Author: edlinger
Date: Wed Sep  6 07:47:52 2017
New Revision: 251752

URL: https://gcc.gnu.org/viewcvs?rev=251752&root=gcc&view=rev
Log:
2017-09-06  Bernd Edlinger  <bernd.edlinger@hotmail.de>

        PR target/77308
        * config/arm/predicates.md (arm_general_adddi_operand): Create new
        non-vfp predicate.
        * config/arm/arm.md (*arm_adddi3, *arm_subdi3): Use new predicates.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.md
    trunk/gcc/config/arm/predicates.md
Comment 66 Martin Liška 2018-11-19 14:32:24 UTC
Bernd: Can you please update Known to work?
Comment 67 Bernd Edlinger 2018-11-19 14:53:59 UTC
Okay, with the patches I installed for gcc-8, the stack usage
is at least a bit less "surprising" than before.
Comment 68 Wilco 2019-08-23 16:44:24 UTC
Now also fixed when Neon is enabled (r274823, r274824, r274825). Softfp, vfp and neon all generate similar instruction counts and stack size, all below 300 bytes with -O3.