This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)

From: Bernd Edlinger <bernd dot edlinger at hotmail dot de>
To: Wilco Dijkstra <Wilco dot Dijkstra at arm dot com>, Ramana Radhakrishnan <ramana dot gcc at googlemail dot com>
Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>, Kyrill Tkachov <kyrylo dot tkachov at foss dot arm dot com>, Richard Earnshaw <Richard dot Earnshaw at arm dot com>, nd <nd at arm dot com>
Date: Wed, 30 Nov 2016 17:01:24 +0000
Subject: Re: [PATCH, ARM] Further improve stack usage on sha512 (PR 77308)

On 11/30/16 13:01, Wilco Dijkstra wrote:
> Bernd Edlinger wrote:
>> On 11/29/16 16:06, Wilco Dijkstra wrote:
>>> Bernd Edlinger wrote:
>>>
>>> -  "TARGET_32BIT && reload_completed
>>> +  "TARGET_32BIT && ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)
>>>      && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"
>>>
>>> This is equivalent to "&& (!TARGET_IWMMXT || reload_completed)" since we're
>>> already excluding NEON.
>>
>> Aehm, no.  This would split the addi_neon insn before it is clear
>> if the reload pass will assign a VFP register.
>
> Hmm that's strange... This instruction shouldn't be used to also split some random
> Neon pattern - for example arm_subdi3 doesn't do the same. To understand and
> reason about any of these complex patterns they should all work in the same way...
>
I was a bit surprised as well, when I saw that happen.
But subdi3 is different:
   "TARGET_32BIT && !TARGET_NEON"
   "#"  ; "subs\\t%Q0, %Q1, %Q2\;sbc\\t%R0, %R1, %R2"
   "&& reload_completed"

So this never splits anything if TARGET_NEON.
But *arm_adddi3 cannot match if TARGET_NEON, yet its pattern simply
looks exactly like adddi3_neon:

(define_insn_and_split "*arm_adddi3"
   [(set (match_operand:DI          0 "s_register_operand" "=&r,&r,&r,&r,&r")
         (plus:DI (match_operand:DI 1 "s_register_operand" "%0, 0, r, 0, r")
                  (match_operand:DI 2 "arm_adddi_operand"  "r,  0, r, Dd, Dd")))
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_32BIT && !TARGET_NEON"
   "#"
   "TARGET_32BIT && reload_completed
    && ! (TARGET_NEON && IS_VFP_REGNUM (REGNO (operands[0])))"

(define_insn "adddi3_neon"
   [(set (match_operand:DI 0 "s_register_operand" "=w,?&r,?&r,?w,?&r,?&r,?&r")
         (plus:DI (match_operand:DI 1 "s_register_operand" "%w,0,0,w,r,0,r")
                  (match_operand:DI 2 "arm_adddi_operand" "w,r,0,w,r,Dd,Dd")))
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_NEON"
{
   switch (which_alternative)
     {
     case 0: /* fall through */
     case 3: return "vadd.i64\t%P0, %P1, %P2";
     case 1: return "#";
     case 2: return "#";
     case 4: return "#";
     case 5: return "#";
     case 6: return "#";
     default: gcc_unreachable ();
     }

Even the return "#" explicitly invokes the former pattern.
So I think the author knew that, and did it on purpose.


>> But when I make *arm_cmpdi_insn split early, it ICEs:
>
> (insn 4870 4869 1636 87 (set (scratch:SI)
>          (minus:SI (minus:SI (subreg:SI (reg:DI 2261) 4)
>                  (subreg:SI (reg:DI 473 [ X$14 ]) 4))
>              (ltu:SI (reg:CC_C 100 cc)
>                  (const_int 0 [0])))) "pr77308-2.c":140 -1
>       (nil))
>
> That's easy, we don't have a sbcs <scratch>, r1, r2 pattern. A quick workaround is
> to create a temporary for operand[2] (if before reload) so it will match the standard
> sbcs pattern, and then the split works fine.
>
>> So it is certainly possible, but not really simple to improve the
>> stack size even further.  But I would prefer to do that in a
>> separate patch.
>
> Yes separate patches would be fine. However there is a lot of scope to improve this
> further. For example after your patch shifts and logical operations are expanded in
> expand, add/sub are in split1 after combine runs and everything else is split after
> reload. It doesn't make sense to split different operations at different times - it means
> you're still going to get the bad DImode subregs and miss lots of optimization
> opportunities due to the mix of partly split and partly not-yet-split operations.
>

Yes.  I did the add/sub differently because it was easier this way,
and it was sufficient to make the existing test cases happy.

Also, the biggest benefit was IIRC from the very early splitting
of the anddi/iordi/xordi patterns, because they have completely
separate data flow in the low and high parts.  That is not
the case for the arithmetic patterns, but they can still be
optimized, preferably once a new test case is found that
demonstrates an improvement.
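
To illustrate the data-flow difference described above, here is a
minimal C sketch (not GCC code; the di_pair type and the di_* helpers
are hypothetical names made up for this example).  It models a DImode
value as two 32-bit halves: the AND touches each half independently,
while the add has to pass a carry from the low half to the high half,
roughly what an adds/adc pair does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a DImode value held in two 32-bit registers. */
typedef struct { uint32_t lo, hi; } di_pair;

static di_pair di_split(uint64_t x)
{
    di_pair p = { (uint32_t)x, (uint32_t)(x >> 32) };
    return p;
}

static uint64_t di_join(di_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* Logical op: the two halves never interact, so each half can be
   optimized and register-allocated on its own. */
static di_pair di_and(di_pair a, di_pair b)
{
    di_pair r = { a.lo & b.lo, a.hi & b.hi };
    return r;
}

/* Arithmetic op: the high half depends on the carry out of the low
   half, so the halves are coupled (compare adds followed by adc). */
static di_pair di_add(di_pair a, di_pair b)
{
    di_pair r;
    uint32_t carry;

    r.lo = a.lo + b.lo;
    carry = r.lo < a.lo;          /* unsigned wrap-around means carry */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

The coupling through the carry is why splitting add/sub early gives
less freedom than splitting the logical patterns.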

I am not sure why the cmpdi pattern has an influence at all,
because from the data flow you need all 64 bits of both sides.
Nevertheless it is a fact: with the modified test case I
get a frame size of 264 bytes, and that was 1920 before.
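
As a rough illustration of why a 64-bit compare needs all four halves,
here is a small C sketch (again not GCC code; di_ltu is a hypothetical
name).  It mimics the shape of the cmp/sbcs sequence: the low halves
produce a borrow, and the high-half subtraction consumes it.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned "a < b" for 64-bit values given as 32-bit halves.
   Step 1 (like cmp lo, lo): compute the borrow of the low halves.
   Step 2 (like sbcs hi, hi): a_hi - b_hi - borrow borrows out
   exactly when a < b as full 64-bit values. */
static int di_ltu(uint32_t a_lo, uint32_t a_hi,
                  uint32_t b_lo, uint32_t b_hi)
{
    uint32_t borrow_lo = a_lo < b_lo;
    /* Widen so that b_hi + borrow_lo cannot wrap around. */
    uint64_t rhs = (uint64_t)b_hi + borrow_lo;

    return (uint64_t)a_hi < rhs;
}
```

Every input half feeds the final result, which is the data-flow
property mentioned above.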

I attached the completely untested follow-up patch now, but I would
like to post that one again for review, after I applied my current
patch, which is still waiting for final review (please feel pinged!).


This is really exciting...


Thanks
Bernd.

--- gcc/config/arm/arm.md.orig	2016-11-27 09:22:41.794790123 +0100
+++ gcc/config/arm/arm.md	2016-11-30 16:40:30.140532737 +0100
@@ -4738,7 +4738,7 @@
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_ARM"
   "#"   ; "rsbs\\t%Q0, %Q1, #0\;rsc\\t%R0, %R1, #0"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
@@ -7432,7 +7432,7 @@
    (clobber (match_scratch:SI 2 "=r"))]
   "TARGET_32BIT"
   "#"   ; "cmp\\t%Q0, %Q1\;sbcs\\t%2, %R0, %R1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
         (compare:CC (match_dup 0) (match_dup 1)))
    (parallel [(set (reg:CC CC_REGNUM)
@@ -7456,7 +7456,10 @@
         operands[5] = gen_rtx_MINUS (SImode, operands[3], operands[4]);
       }
     operands[1] = gen_lowpart (SImode, operands[1]);
-    operands[2] = gen_lowpart (SImode, operands[2]);
+    if (can_create_pseudo_p ())
+      operands[2] = gen_reg_rtx (SImode);
+    else
+      operands[2] = gen_lowpart (SImode, operands[2]);
   }
   [(set_attr "conds" "set")
    (set_attr "length" "8")
@@ -7470,7 +7473,7 @@
 
   "TARGET_32BIT"
   "#"   ; "cmp\\t%R0, %R1\;it eq\;cmpeq\\t%Q0, %Q1"
-  "&& reload_completed"
+  "&& ((!TARGET_NEON && !TARGET_IWMMXT) || reload_completed)"
   [(set (reg:CC CC_REGNUM)
         (compare:CC (match_dup 2) (match_dup 3)))
    (cond_exec (eq:SI (reg:CC CC_REGNUM) (const_int 0))
--- gcc/config/arm/thumb2.md.orig	2016-11-30 16:57:44.760589624 +0100
+++ gcc/config/arm/thumb2.md	2016-11-30 16:58:05.310590754 +0100
@@ -132,7 +132,7 @@
    (clobber (reg:CC CC_REGNUM))]
   "TARGET_THUMB2"
   "#" ; negs\\t%Q0, %Q1\;sbc\\t%R0, %R1, %R1, lsl #1
-  "&& reload_completed"
+  "&& (!TARGET_NEON || reload_completed)"
   [(parallel [(set (reg:CC CC_REGNUM)
 		   (compare:CC (const_int 0) (match_dup 1)))
 	      (set (match_dup 0) (minus:SI (const_int 0) (match_dup 1)))])
--- /dev/null	2016-11-30 15:23:46.779473644 +0100
+++ gcc/testsuite/gcc.target/arm/pr77308-2.c	2016-11-30 17:05:21.021614711 +0100
@@ -0,0 +1,169 @@
+/* { dg-do compile } */
+/* { dg-options "-Os -Wstack-usage=2500" } */
+
+/* This is a modified algorithm with 64bit cmp and neg at the Sigma-blocks.
+   It improves the test coverage of cmpdi and negdi2 patterns.
+   Unlike the original test case these insns can reach the reload pass,
+   which may result in large stack usage.  */
+
+#define SHA_LONG64 unsigned long long
+#define U64(C)     C##ULL
+
+#define SHA_LBLOCK      16
+#define SHA512_CBLOCK   (SHA_LBLOCK*8)
+
+typedef struct SHA512state_st {
+    SHA_LONG64 h[8];
+    SHA_LONG64 Nl, Nh;
+    union {
+        SHA_LONG64 d[SHA_LBLOCK];
+        unsigned char p[SHA512_CBLOCK];
+    } u;
+    unsigned int num, md_len;
+} SHA512_CTX;
+
+static const SHA_LONG64 K512[80] = {
+    U64(0x428a2f98d728ae22), U64(0x7137449123ef65cd),
+    U64(0xb5c0fbcfec4d3b2f), U64(0xe9b5dba58189dbbc),
+    U64(0x3956c25bf348b538), U64(0x59f111f1b605d019),
+    U64(0x923f82a4af194f9b), U64(0xab1c5ed5da6d8118),
+    U64(0xd807aa98a3030242), U64(0x12835b0145706fbe),
+    U64(0x243185be4ee4b28c), U64(0x550c7dc3d5ffb4e2),
+    U64(0x72be5d74f27b896f), U64(0x80deb1fe3b1696b1),
+    U64(0x9bdc06a725c71235), U64(0xc19bf174cf692694),
+    U64(0xe49b69c19ef14ad2), U64(0xefbe4786384f25e3),
+    U64(0x0fc19dc68b8cd5b5), U64(0x240ca1cc77ac9c65),
+    U64(0x2de92c6f592b0275), U64(0x4a7484aa6ea6e483),
+    U64(0x5cb0a9dcbd41fbd4), U64(0x76f988da831153b5),
+    U64(0x983e5152ee66dfab), U64(0xa831c66d2db43210),
+    U64(0xb00327c898fb213f), U64(0xbf597fc7beef0ee4),
+    U64(0xc6e00bf33da88fc2), U64(0xd5a79147930aa725),
+    U64(0x06ca6351e003826f), U64(0x142929670a0e6e70),
+    U64(0x27b70a8546d22ffc), U64(0x2e1b21385c26c926),
+    U64(0x4d2c6dfc5ac42aed), U64(0x53380d139d95b3df),
+    U64(0x650a73548baf63de), U64(0x766a0abb3c77b2a8),
+    U64(0x81c2c92e47edaee6), U64(0x92722c851482353b),
+    U64(0xa2bfe8a14cf10364), U64(0xa81a664bbc423001),
+    U64(0xc24b8b70d0f89791), U64(0xc76c51a30654be30),
+    U64(0xd192e819d6ef5218), U64(0xd69906245565a910),
+    U64(0xf40e35855771202a), U64(0x106aa07032bbd1b8),
+    U64(0x19a4c116b8d2d0c8), U64(0x1e376c085141ab53),
+    U64(0x2748774cdf8eeb99), U64(0x34b0bcb5e19b48a8),
+    U64(0x391c0cb3c5c95a63), U64(0x4ed8aa4ae3418acb),
+    U64(0x5b9cca4f7763e373), U64(0x682e6ff3d6b2b8a3),
+    U64(0x748f82ee5defb2fc), U64(0x78a5636f43172f60),
+    U64(0x84c87814a1f0ab72), U64(0x8cc702081a6439ec),
+    U64(0x90befffa23631e28), U64(0xa4506cebde82bde9),
+    U64(0xbef9a3f7b2c67915), U64(0xc67178f2e372532b),
+    U64(0xca273eceea26619c), U64(0xd186b8c721c0c207),
+    U64(0xeada7dd6cde0eb1e), U64(0xf57d4f7fee6ed178),
+    U64(0x06f067aa72176fba), U64(0x0a637dc5a2c898a6),
+    U64(0x113f9804bef90dae), U64(0x1b710b35131c471b),
+    U64(0x28db77f523047d84), U64(0x32caab7b40c72493),
+    U64(0x3c9ebe0a15c9bebc), U64(0x431d67c49c100d4c),
+    U64(0x4cc5d4becb3e42b6), U64(0x597f299cfc657e2a),
+    U64(0x5fcb6fab3ad6faec), U64(0x6c44198c4a475817)
+};
+
+#define B(x,j)    (((SHA_LONG64)(*(((const unsigned char *)(&x))+j)))<<((7-j)*8))
+#define PULL64(x) (B(x,0)|B(x,1)|B(x,2)|B(x,3)|B(x,4)|B(x,5)|B(x,6)|B(x,7))
+#define ROTR(x,s)       (((x)>>s) | (x)<<(64-s))
+#define Sigma0(x)       (ROTR((x),28) ^ ROTR((x),34) ^ (ROTR((x),39) == (x)) ? -(x) : (x))
+#define Sigma1(x)       (ROTR((x),14) ^ ROTR(-(x),18) ^ ((long long)ROTR((x),41) < (long long)(x)) ? -(x) : (x))
+#define sigma0(x)       (ROTR((x),1)  ^ ROTR((x),8)  ^ (((x)>>7) > (x)) ? -(x) : (x))
+#define sigma1(x)       (ROTR((x),19) ^ ROTR((x),61) ^ ((long long)((x)>>6) < (long long)(x)) ? -(x) : (x))
+#define Ch(x,y,z)       (((x) & (y)) ^ ((~(x)) & (z)))
+#define Maj(x,y,z)      (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
+
+#define ROUND_00_15(i,a,b,c,d,e,f,g,h)          do {    \
+        T1 += h + Sigma1(e) + Ch(e,f,g) + K512[i];      \
+        h = Sigma0(a) + Maj(a,b,c);                     \
+        d += T1;        h += T1;                } while (0)
+#define ROUND_16_80(i,j,a,b,c,d,e,f,g,h,X)      do {    \
+        s0 = X[(j+1)&0x0f];     s0 = sigma0(s0);        \
+        s1 = X[(j+14)&0x0f];    s1 = sigma1(s1);        \
+        T1 = X[(j)&0x0f] += s0 + s1 + X[(j+9)&0x0f];    \
+        ROUND_00_15(i+j,a,b,c,d,e,f,g,h);               } while (0)
+void sha512_block_data_order(SHA512_CTX *ctx, const void *in,
+                                    unsigned int num)
+{
+    const SHA_LONG64 *W = in;
+    SHA_LONG64 a, b, c, d, e, f, g, h, s0, s1, T1;
+    SHA_LONG64 X[16];
+    int i;
+
+    while (num--) {
+
+        a = ctx->h[0];
+        b = ctx->h[1];
+        c = ctx->h[2];
+        d = ctx->h[3];
+        e = ctx->h[4];
+        f = ctx->h[5];
+        g = ctx->h[6];
+        h = ctx->h[7];
+
+        T1 = X[0] = PULL64(W[0]);
+        ROUND_00_15(0, a, b, c, d, e, f, g, h);
+        T1 = X[1] = PULL64(W[1]);
+        ROUND_00_15(1, h, a, b, c, d, e, f, g);
+        T1 = X[2] = PULL64(W[2]);
+        ROUND_00_15(2, g, h, a, b, c, d, e, f);
+        T1 = X[3] = PULL64(W[3]);
+        ROUND_00_15(3, f, g, h, a, b, c, d, e);
+        T1 = X[4] = PULL64(W[4]);
+        ROUND_00_15(4, e, f, g, h, a, b, c, d);
+        T1 = X[5] = PULL64(W[5]);
+        ROUND_00_15(5, d, e, f, g, h, a, b, c);
+        T1 = X[6] = PULL64(W[6]);
+        ROUND_00_15(6, c, d, e, f, g, h, a, b);
+        T1 = X[7] = PULL64(W[7]);
+        ROUND_00_15(7, b, c, d, e, f, g, h, a);
+        T1 = X[8] = PULL64(W[8]);
+        ROUND_00_15(8, a, b, c, d, e, f, g, h);
+        T1 = X[9] = PULL64(W[9]);
+        ROUND_00_15(9, h, a, b, c, d, e, f, g);
+        T1 = X[10] = PULL64(W[10]);
+        ROUND_00_15(10, g, h, a, b, c, d, e, f);
+        T1 = X[11] = PULL64(W[11]);
+        ROUND_00_15(11, f, g, h, a, b, c, d, e);
+        T1 = X[12] = PULL64(W[12]);
+        ROUND_00_15(12, e, f, g, h, a, b, c, d);
+        T1 = X[13] = PULL64(W[13]);
+        ROUND_00_15(13, d, e, f, g, h, a, b, c);
+        T1 = X[14] = PULL64(W[14]);
+        ROUND_00_15(14, c, d, e, f, g, h, a, b);
+        T1 = X[15] = PULL64(W[15]);
+        ROUND_00_15(15, b, c, d, e, f, g, h, a);
+
+        for (i = 16; i < 80; i += 16) {
+            ROUND_16_80(i, 0, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 1, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 2, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 3, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 4, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 5, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 6, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 7, b, c, d, e, f, g, h, a, X);
+            ROUND_16_80(i, 8, a, b, c, d, e, f, g, h, X);
+            ROUND_16_80(i, 9, h, a, b, c, d, e, f, g, X);
+            ROUND_16_80(i, 10, g, h, a, b, c, d, e, f, X);
+            ROUND_16_80(i, 11, f, g, h, a, b, c, d, e, X);
+            ROUND_16_80(i, 12, e, f, g, h, a, b, c, d, X);
+            ROUND_16_80(i, 13, d, e, f, g, h, a, b, c, X);
+            ROUND_16_80(i, 14, c, d, e, f, g, h, a, b, X);
+            ROUND_16_80(i, 15, b, c, d, e, f, g, h, a, X);
+        }
+
+        ctx->h[0] += a;
+        ctx->h[1] += b;
+        ctx->h[2] += c;
+        ctx->h[3] += d;
+        ctx->h[4] += e;
+        ctx->h[5] += f;
+        ctx->h[6] += g;
+        ctx->h[7] += h;
+
+        W += SHA_LBLOCK;
+    }
+}
