s390: Avoid CAS boolean output inefficiency

Richard Henderson <rth@redhat.com>
Mon Aug 6 22:40:00 GMT 2012


On 08/06/2012 11:34 AM, Ulrich Weigand wrote:
> There is one particular inefficiency I have noticed.  This code:
> 
>   if (!__atomic_compare_exchange_n (&v, &expected, max, 0, 0, 0))
>     abort ();
> 
> from atomic-compare-exchange-3.c gets compiled into:
> 
>         l       %r3,0(%r2)
>         larl    %r1,v
>         cs      %r3,%r4,0(%r1)
>         ipm     %r1
>         sra     %r1,28
>         st      %r3,0(%r2)
>         ltr     %r1,%r1
>         jne     .L3
> 
> which is extremely inefficient; it converts the condition code into
> an integer using the slow ipm, sra sequence, just so that it can
> convert the integer back into a condition code via ltr and branch
> on it ...
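
For reference, a minimal standalone reproducer of the same shape (the
global declarations here are assumptions, modeled on the testcase):

  #include <stdlib.h>

  int v, expected, max = 1;

  void f (void)
  {
    /* Strong CAS with relaxed (0) memory orders, as quoted above.  */
    if (!__atomic_compare_exchange_n (&v, &expected, max, 0, 0, 0))
      abort ();
  }

Compiled at -O2 for s390, this should exhibit the same ipm/sra/ltr
sequence.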

This was caused by (or perhaps abetted by) the representation of EQ
as NE ^ 1.  With the subsequent truncation and zero-extension, I
think combine ran into its insn limit of 3 before seeing everything
it needed to see.
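
In C terms, the old cstore expander (the lines removed in the patch
below) computed EQ roughly like this sketch, where cc stands for the
condition code produced by CS (0 on success):

  static int old_eq_lowering (int cc)
  {
    int ne = (cc != 0);   /* gen_sne: ipm + sra materializes NE */
    return ne ^ 1;        /* gen_xorsi3: an extra xor turns NE into EQ */
  }

Combine then has to merge the sne sequence, the xor, and the final
compare-and-branch, which is where the insn limit bites.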

I'm able to fix this problem by representing EQ as EQ before reload.
For extimm targets this results in identical code; for older targets
it requires avoiding the constant pool, i.e. LHI+XR instead of X.

        l       %r2,0(%r3)
        larl    %r1,v
        cs      %r2,%r5,0(%r1)
        st      %r2,0(%r3)
        jne     .L3

That fixed, we see the second CAS in that file:

        .loc 1 27 0
        cs      %r2,%r2,0(%r1)
        ipm     %r5
        sll     %r5,28
        lhi     %r0,1
        xr      %r5,%r0
        st      %r2,0(%r3)
        ltr     %r5,%r5
        je      .L20

This happens because CSE notices the cbranch against 0 and records r116
as zero along the path to

     32   if (!__atomic_compare_exchange_n (&v, &expected, 0, STRONG,
                                            __ATOMIC_RELEASE, __ATOMIC_ACQUIRE))

at which point CSE decides that it would be cheaper to "re-use"
the zero already in r116 instead of loading another constant 0 here.
After that, combine is hamstrung because r116 is not dead.
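
A hedged sketch of the overall shape (modeled loosely on the testcase,
with the STRONG macro spelled as a literal 0; the first test is assumed
to expect failure, so the fallthrough path has a zero result):

  #include <stdbool.h>
  #include <stdlib.h>

  int v, expected, max;

  void g (void)
  {
    /* First CAS: on the fallthrough path of the branch, CSE knows
       the boolean result (r116) is zero.  */
    bool ok = __atomic_compare_exchange_n (&v, &expected, max, 0, 0, 0);
    if (ok)
      abort ();

    /* Second CAS: the desired value is the constant 0, and CSE
       substitutes the "equivalent" r116 for it, keeping r116 live.  */
    if (!__atomic_compare_exchange_n (&v, &expected, 0, 0,
                                      __ATOMIC_RELEASE, __ATOMIC_ACQUIRE))
      abort ();
  }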

I'm not quite sure of the best way to fix this, since rtx_costs already
gives all constants cost 0.  CSE ought not to believe that r116 is
better than a plain constant, and it also shouldn't be extending the
lifetimes of pseudos this way.
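
For context, the relevant shape of the backend cost hook -- a
simplified sketch, not the actual s390_rtx_costs code -- is:

  static bool
  sketch_rtx_costs (rtx x, int code, int outer_code, int opno,
                    int *total, bool speed)
  {
    switch (code)
      {
      case CONST:
      case CONST_INT:
      case LABEL_REF:
      case SYMBOL_REF:
        *total = 0;        /* every constant is reported as free */
        return true;
      default:
        return false;      /* defer to generic costing */
      }
  }

With both the constant and the live pseudo costing 0, nothing tips CSE
toward rematerializing the constant.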

A short-term possibility is to have the CAS insns accept general_operand,
so that the 0 gets merged.  With reload inheritance and post-reload cse,
that might produce code that is "good enough".  Certainly it's effective
for the atomic-compare-exchange-3.c testcase.  I'm less than happy with
that, though, since whether the CAS gets optimized then depends on
totally unrelated code that follows it.

This patch ought to be independent of any other patch so far.


r~
diff --git a/gcc/config/s390/s390.md b/gcc/config/s390/s390.md
index 0e43e51..bed6b79 100644
--- a/gcc/config/s390/s390.md
+++ b/gcc/config/s390/s390.md
@@ -5325,12 +5325,15 @@
 	    (match_operand 3 "const0_operand")]))
      (clobber (reg:CC CC_REGNUM))])]
   ""
-  "emit_insn (gen_sne (operands[0], operands[2]));
-   if (GET_CODE (operands[1]) == EQ)
-     emit_insn (gen_xorsi3 (operands[0], operands[0], const1_rtx));
-   DONE;")
+{
+  if (!TARGET_EXTIMM && GET_CODE (operands[1]) == EQ)
+    {
+      emit_insn (gen_seq_neimm (operands[0], operands[2]));
+      DONE;
+    }
+})
 
-(define_insn_and_split "sne"
+(define_insn_and_split "*sne"
   [(set (match_operand:SI 0 "register_operand" "=d")
 	(ne:SI (match_operand:CCZ1 1 "register_operand" "0")
 	       (const_int 0)))
@@ -5342,6 +5345,48 @@
     [(set (match_dup 0) (ashiftrt:SI (match_dup 0) (const_int 28)))
      (clobber (reg:CC CC_REGNUM))])])
 
+(define_insn_and_split "*seq"
+  [(set (match_operand:SI 0 "register_operand" "=d")
+	(eq:SI (match_operand:CCZ1 1 "register_operand" "0")
+	       (const_int 0)))
+   (clobber (reg:CC CC_REGNUM))]
+  "TARGET_EXTIMM"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx op0 = operands[0];
+  emit_insn (gen_lshrsi3 (op0, op0, GEN_INT (28)));
+  emit_insn (gen_xorsi3 (op0, op0, const1_rtx));
+  DONE;
+})
+
+;; ??? Ideally we'd USE a const1_rtx, properly reloaded, but that makes
+;; things more difficult for combine (which can only insert clobbers).
+;; But perhaps it would be better still to have simply used a branch around
+;; constant load instead of beginning with the IPM?
+;;
+;; What about LOCR for Z196?  That's a more general question about cstore
+;; being decomposed into movcc...
+
+(define_insn_and_split "seq_neimm"
+  [(set (match_operand:SI 0 "register_operand" "=d")
+	(eq:SI (match_operand:CCZ1 1 "register_operand" "0")
+	       (const_int 0)))
+   (clobber (match_scratch:SI 2 "=&d"))
+   (clobber (reg:CC CC_REGNUM))]
+  "!TARGET_EXTIMM"
+  "#"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rtx op0 = operands[0];
+  rtx op2 = operands[2];
+  emit_insn (gen_ashlsi3 (op0, op0, GEN_INT (28)));
+  emit_move_insn (op2, const1_rtx);
+  emit_insn (gen_xorsi3 (op0, op0, op2));
+  DONE;
+})
 
 ;;
 ;; - Conditional move instructions (introduced with z196)

