
Re: [PATCH, SPU] generated better code for loads and stores


Trevor Smigiel wrote:

> This patch generates better code for loads and stores on SPU.

Unfortunately, in the current version this patch introduces a number
of regressions for me:
- wrong code generation due to exposed REGNO_POINTER_ALIGN bug
- ICE with -fnon-call-exceptions 
- code size regression with -O0
- -march=celledp code quality regression

I've seen those both on mainline and with a 4.3 backport of your
patch.


1. Wrong code generation due to exposed REGNO_POINTER_ALIGN bug

On mainline (only) the following test case now fails:
FAIL: gcc.c-torture/execute/960521-1.c execution,  -O1

This is because when compiling this code:

int *b;

foo ()
{
  int i;
  for (i = 0; i < BLOCK_SIZE - 1; i++)
    b[i] = -1;
}

GCC assumes (incorrectly) that the *value* of b must be 16-byte aligned,
because it has (correctly) inferred that the *address* of b is 16-byte
aligned!  The two properties are unrelated: the global variable b may well
sit on a 16-byte boundary while the pointer value stored in it points
anywhere.

This happens because of an independent bug in computing REGNO_POINTER_ALIGN
which is present both in force_reg and set_reg_attrs_from_value:

      if (MEM_POINTER (x))
        mark_reg_pointer (reg, MEM_ALIGN (x));

This is broken, because the *value* of the MEM x was just copied into reg.
MEM_ALIGN is the alignment of the memory address, not the alignment of
the pointer that is stored there.
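
A minimal sketch of what a conservative fix might look like (illustrative
and untested, and not part of the patch below):

      if (MEM_POINTER (x))
        /* The register does hold a pointer, but only the *address* of x
           is known to be MEM_ALIGN (x) aligned; nothing is known about
           the alignment of the value just loaded, so record no more
           than the minimum.  */
        mark_reg_pointer (reg, BITS_PER_UNIT);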

(Note that in GCC 4.3, REGNO_POINTER_ALIGN is incorrect just the same, but
the problem does not show up there because the broken register ends up as
the second operand of the PLUS used for address generation -- therefore
the optimization in spu_split_load does not trigger.)


2. ICE with -fnon-call-exceptions

Several test cases now fail on both mainline and 4.3:

FAIL: g++.dg/eh/subreg-1.C (internal compiler error)
FAIL: g++.dg/opt/cfg5.C (internal compiler error)
FAIL: g++.dg/opt/pr34036.C (internal compiler error)
FAIL: g++.dg/opt/reg-stack2.C (internal compiler error)
FAIL: g++.dg/other/profile1.C (internal compiler error)

all with an ICE along the following lines:

/home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:41: error: in basic block 5:
/home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:41: error: flow control insn inside a basic block
(insn 141 21 142 5 /home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:35 (set (reg:TI 185)
        (mem/s:TI (reg/v/f:SI 138 [ sp.19 ]) [0 S16 A128])) -1 (expr_list:REG_EH_REGION (const_int 1 [0x1])
        (nil)))
/home/uweigand/fsf/gcc-4_3/gcc/testsuite/g++.dg/eh/subreg-1.C:41: internal compiler error: in rtl_verify_flow_info_1, at cfgrtl.c:1923

The reason for this is that with -fnon-call-exceptions, the memory store
gets tagged with a REG_EH_REGION note.  The split0 pass generates a memory
load and a memory store insn from it, and the splitter logic copies that
note to *both* instructions.  This is a problem because the load is now
itself considered a "flow control" insn (it supposedly may throw an
exception), and such insns are not allowed in the middle of a basic block.

I'm not sure how to fix this properly.  However, memory accesses on the
SPU in fact *never* trap -- it's just that the may_trap_p logic is not
aware of this.  The SDK compiler has a target macro ADDRESSES_NEVER_TRAP,
defined on the SPU, which causes rtx_addr_can_trap_p_1 to always return 0.
If we port that feature to mainline, it will fix this ICE as well.
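
(Just for illustration, an alternative would be to restrict the note when
the splitter distributes it to the new insns.  The sketch below is not
part of the patch; "note" and "new_insn" are placeholder names for the
note being copied and each insn produced by the split:

      /* Attach the REG_EH_REGION note only to insns that can still
         trap, i.e. the store but not the load.  */
      if (REG_NOTE_KIND (note) != REG_EH_REGION
          || may_trap_p (PATTERN (new_insn)))
        REG_NOTES (new_insn)
          = gen_rtx_EXPR_LIST (REG_NOTE_KIND (note), XEXP (note, 0),
                               REG_NOTES (new_insn));

With ADDRESSES_NEVER_TRAP ported, the store should not receive a
REG_EH_REGION note in the first place.)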


3. Code size regression with -O0

I'm seeing one more test suite regression on 4.3 only:
FAIL: g++.dg/opt/longbranch1.C (test for excess errors)

This is caused by the resulting code now slightly exceeding the local
store size, while it just fit into local store before this patch.  The
reason for the code size regression at -O0 is that the new code always
generates a rotate instruction to perform the following load:

(insn 17 16 18 4 longbranch1.ii:6 (set (reg:SI 145)
        (const_int 29936 [0x74f0])) 6 {_movsi} (nil))

(insn 18 17 19 4 longbranch1.ii:6 (set (reg:SI 146)
        (mem/c/i:SI (plus:SI (reg/f:SI 128 $vfp)
                (reg:SI 145)) [0 i+0 S4 A128])) 6 {_movsi} (nil))

even though the memory access is clearly 16-byte aligned.

The code in spu_split_load doesn't recognize this, however, because
REGNO_POINTER_ALIGN of register 145 provides no alignment information --
this register is in fact not a pointer, but holds a plain integral value.

However, even so, the memory access is itself marked as 16-byte
aligned via the MEM_ALIGN flag.  Unfortunately, spu_split_load
never actually looks at that flag.  In the old code, spu_valid_move
used to check aligned_mem_p, which did look at MEM_ALIGN.

It would appear that simply checking MEM_ALIGN and omitting the
rotate in spu_split_load if possible should be sufficient to fix
this regression.

Unfortunately, the resulting code is still bigger, because of the
additional _spu_convert insn that is being generated.  With -O0,
local-alloc does not attempt to use the same register for both
operands of the _spu_convert, and thus it becomes an explicit
copy in the final output.  This still causes the test case to fail ...
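
Schematically, the pair of insns left behind by the split looks like this
(simplified RTL; register numbers and the exact unspec shape are invented
here for illustration):

    (set (reg:TI 147) (mem:TI (plus:SI (reg/f:SI 128 $vfp)
                                       (reg:SI 145))))        ; TImode load
    (set (reg:SI 146) (unspec [(reg:TI 147)] UNSPEC_CONVERT)) ; _spu_convert

At -O1 and above the allocator usually assigns both operands of the
_spu_convert to the same register and the copy vanishes; at -O0 it
survives as a real instruction.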

Now, we don't actually *need* to split the insn into a TImode load
plus a _spu_convert.  However, as your comment before address_needs_split
says, you do it deliberately to enable more merging of accesses.  Without
optimization, that merging won't happen anyway ... so I think we should
not perform the split in this case.

With this change in place as well, the code size regression is fixed.


4. -march=celledp code quality regression

The PowerXCell 8i processor has double-precision magnitude compare
instructions.  These used to be matched by combine against patterns like:

(define_insn "cmeq_<mode>_celledp"
  [(set (match_operand:<DF2I> 0 "spu_reg_operand" "=r")
        (eq:<DF2I> (abs:VDF (match_operand:VDF 1 "spu_reg_operand" "r"))
                   (abs:VDF (match_operand:VDF 2 "spu_reg_operand" "r"))))]
  "spu_arch == PROCESSOR_CELLEDP"
  "dfcmeq\t%0,%1,%2"
  [(set_attr "type" "fpd")])

However, after your patch these instructions never match, because there
is no ABS RTX left by the time combine runs.  It has already been split
into an AND by:

(define_insn_and_split "_abs<mode>2"
  [(set (match_operand:VSDF 0 "spu_reg_operand" "=r")
        (abs:VSDF (match_operand:VSDF 1 "spu_reg_operand" "r")))
   (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
  ""
  "#"
  ""
  [(set (match_dup:<F2I> 3)
        (and:<F2I> (match_dup:<F2I> 4)
                   (match_dup:<F2I> 2)))]

which now already runs in the split0 pass (before combine).

I think the proper fix is to simply add "split0_completed" to the 
condition of this splitter -- there doesn't seem to be any gain from
running this splitter early.  (I suspect there are some other splitters
that probably should be treated likewise, but I don't have a specific
regression except the ABS case.)


The following patch contains my suggested fixes as discussed above,
except for the REGNO_POINTER_ALIGN breakage.  Tested on mainline and
4.3 with no new regressions; it fixes all regressions introduced by the
original patch except the REGNO_POINTER_ALIGN breakage just mentioned.
What do you think?

Bye,
Ulrich


ChangeLog:

	* config/spu/spu.h (ADDRESSES_NEVER_TRAP): Define.
	* rtlanal.c (rtx_addr_can_trap_p_1): Respect ADDRESSES_NEVER_TRAP macro.
	* doc/tm.texi (ADDRESSES_NEVER_TRAP): Document.

	* config/spu/spu.c (spu_split_load): Trust MEM_ALIGN.  When not
	optimizing, do not split load unless necessary.

	* config/spu/spu.md ("_abs<mode>2"): Do not split in split0 pass.


diff -crNp -x .svn gcc-4_3-orig/gcc/config/spu/spu.c gcc-4_3/gcc/config/spu/spu.c
*** gcc-4_3-orig/gcc/config/spu/spu.c	2008-09-10 22:09:24.000000000 +0200
--- gcc-4_3/gcc/config/spu/spu.c	2008-09-11 00:40:35.000000000 +0200
*************** spu_split_load (rtx * ops)
*** 3596,3602 ****
  
    rot = 0;
    rot_amt = 0;
!   if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
           aligned reg   + aligned reg     => lqx
--- 3596,3605 ----
  
    rot = 0;
    rot_amt = 0;
! 
!   if (MEM_ALIGN (ops[1]) >= 128)
!     /* Address is already aligned; simply perform a TImode load.  */;
!   else if (GET_CODE (addr) == PLUS)
      {
        /* 8 cases:
           aligned reg   + aligned reg     => lqx
*************** spu_split_load (rtx * ops)
*** 3707,3712 ****
--- 3710,3723 ----
        rot_amt = 0;
      }
  
+   /* If the source is properly aligned, we don't need to split this insn into
+      a TImode load plus a _spu_convert.  However, we want to perform the split
+      anyway when optimizing to make the MEMs look the same as those used for
+      stores so they are more easily merged.  When *not* optimizing, that will
+      not happen anyway, so we prefer to avoid generating the _spu_convert.  */
+   if (!rot && !rot_amt && !optimize)
+     return 0;
+ 
    load = gen_reg_rtx (TImode);
  
    mem = change_address (ops[1], TImode, copy_rtx (addr));
diff -crNp -x .svn gcc-4_3-orig/gcc/config/spu/spu.h gcc-4_3/gcc/config/spu/spu.h
*** gcc-4_3-orig/gcc/config/spu/spu.h	2008-09-10 22:09:24.000000000 +0200
--- gcc-4_3/gcc/config/spu/spu.h	2008-09-10 21:19:30.000000000 +0200
*************** extern GTY(()) rtx spu_compare_op1;
*** 640,642 ****
--- 640,644 ----
  
  #define SPLIT_BEFORE_CSE2 1
  
+ #define ADDRESSES_NEVER_TRAP 1
+ 
diff -crNp -x .svn gcc-4_3-orig/gcc/config/spu/spu.md gcc-4_3/gcc/config/spu/spu.md
*** gcc-4_3-orig/gcc/config/spu/spu.md	2008-09-10 22:09:32.000000000 +0200
--- gcc-4_3/gcc/config/spu/spu.md	2008-09-10 20:09:59.000000000 +0200
***************
*** 1246,1252 ****
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   ""
    [(set (match_dup:<F2I> 3)
  	(and:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]
--- 1246,1252 ----
     (use (match_operand:<F2I> 2 "spu_reg_operand" "r"))]
    ""
    "#"
!   "split0_completed"
    [(set (match_dup:<F2I> 3)
  	(and:<F2I> (match_dup:<F2I> 4)
  		   (match_dup:<F2I> 2)))]
diff -crNp -x .svn gcc-4_3-orig/gcc/doc/tm.texi gcc-4_3/gcc/doc/tm.texi
*** gcc-4_3-orig/gcc/doc/tm.texi	2008-09-10 22:09:25.000000000 +0200
--- gcc-4_3/gcc/doc/tm.texi	2008-09-10 21:43:46.000000000 +0200
*************** optimizations before this pass work bett
*** 10384,10386 ****
--- 10384,10392 ----
  instructions, and the optimizations right after this pass (e.g., CSE and
  combine) are be able to optimize the split instructions.
  @end defmac
+ 
+ @defmac ADDRESSES_NEVER_TRAP
+ Define this macro if memory accesses will never cause a trap.
+ This is the case for example on the Cell SPU processor.
+ @end defmac
+ 
diff -crNp -x .svn gcc-4_3-orig/gcc/rtlanal.c gcc-4_3/gcc/rtlanal.c
*** gcc-4_3-orig/gcc/rtlanal.c	2008-03-05 19:44:55.000000000 +0100
--- gcc-4_3/gcc/rtlanal.c	2008-09-10 21:18:53.000000000 +0200
*************** rtx_varies_p (const_rtx x, bool for_alia
*** 265,270 ****
--- 265,274 ----
  static int
  rtx_addr_can_trap_p_1 (const_rtx x, enum machine_mode mode, bool unaligned_mems)
  {
+ #ifdef ADDRESSES_NEVER_TRAP
+   /* On some processors, like the SPU, memory accesses never trap.  */
+   return 0;
+ #else
    enum rtx_code code = GET_CODE (x);
  
    switch (code)
*************** rtx_addr_can_trap_p_1 (const_rtx x, enum
*** 344,349 ****
--- 348,354 ----
  
    /* If it isn't one of the case above, it can cause a trap.  */
    return 1;
+ #endif
  }
  
  /* Return nonzero if the use of X as an address in a MEM can cause a trap.  */

-- 
  Dr. Ulrich Weigand
  GNU Toolchain for Linux on System z and Cell BE
  Ulrich.Weigand@de.ibm.com

