[PATCH] Allow fwprop to undo vectorization harm (PR68961)

Tue Jul 5 20:50:00 GMT 2016

Richard Biener <rguenther@suse.de> writes:
> On Sun, 3 Jul 2016, Richard Sandiford wrote:
>
>> Richard Biener <rguenther@suse.de> writes:
>> > On Wed, 15 Jun 2016, Richard Sandiford wrote:
>> >
>> >> Richard Biener <rguenther@suse.de> writes:
>> >> > With the proposed cost change for vector construction we will end up
>> >> > vectorizing the testcase in PR68961 again (on x86_64 and likely
>> >> > on ppc64le as well after that target gets adjustments).  Currently
>> >> > we can't optimize that away again noticing the direct overlap of
>> >> > argument and return registers.  The obstackle is
>> >> >
>> >> > (insn 7 4 8 2 (set (reg:V2DF 93)
>> >> >         (vec_concat:V2DF (reg/v:DF 91 [ a ])
>> >> >             (reg/v:DF 92 [ aa ]))) 
>> >> > ...
>> >> > (insn 21 8 24 2 (set (reg:DI 97 [ D.1756 ])
>> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 0))
>> >> > (insn 24 21 11 2 (set (reg:DI 100 [+8 ])
>> >> >         (subreg:DI (reg:TI 88 [ D.1756 ]) 8))
>> >> >
>> >> > which we eventually optimize to DFmode subregs of (reg:V2DF 93).
>> >> >
>> >> > First of all simplify_subreg doesn't handle the subregs of a vec_concat
>> >> > (easy fix below).
>> >> >
>> >> > Then combine doesn't like to simplify the multi-use (it tries some
>> >> > parallel it seems).  So I went to forwprop which eventually manages
>> >> > to do this but throws away the result (reg:DF 91) or (reg:DF 92)
>> >> > because it is not a constant.  Thus I allow arbitrary simplification
>> >> > results for SUBREGs of [VEC_]CONCAT operations.  There doesn't seem
>> >> > to be a magic flag to tell it to restrict to the case where all
>> >> > uses can be simplified or so, nor to restrict simplifications to a REG.
>> >> > But I don't see any undesirable simplifications of (subreg 
>> >> > ([vec_]concat)).
>> >> 
>> >> Adding that as a special case to propgate_rtx feels like a hack though :-)
>> >> I think:
>> >> 
>> >> > Index: gcc/fwprop.c
>> >> > ===================================================================
>> >> > *** gcc/fwprop.c	(revision 237286)
>> >> > --- gcc/fwprop.c	(working copy)
>> >> > *************** propagate_rtx (rtx x, machine_mode mode,
>> >> > *** 664,670 ****
>> >> >         || (GET_CODE (new_rtx) == SUBREG
>> >> >   	  && REG_P (SUBREG_REG (new_rtx))
>> >> >   	  && (GET_MODE_SIZE (mode)
>> >> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
>> >> >       flags |= PR_CAN_APPEAR;
>> >> >     if (!varying_mem_p (new_rtx))
>> >> >       flags |= PR_HANDLE_MEM;
>> >> > --- 664,673 ----
>> >> >         || (GET_CODE (new_rtx) == SUBREG
>> >> >   	  && REG_P (SUBREG_REG (new_rtx))
>> >> >   	  && (GET_MODE_SIZE (mode)
>> >> > ! 	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx)))))
>> >> > !       || ((GET_CODE (new_rtx) == VEC_CONCAT
>> >> > ! 	   || GET_CODE (new_rtx) == CONCAT)
>> >> > ! 	  && GET_CODE (x) == SUBREG))
>> >> >       flags |= PR_CAN_APPEAR;
>> >> >     if (!varying_mem_p (new_rtx))
>> >> >       flags |= PR_HANDLE_MEM;
>> >> 
>> >> ...this if statement should fundamentally only test new_rtx.
>> >> E.g. we'd want the same thing for any SUBREG inside X.
>> >> 
>> >> How about changing:
>> >> 
>> >>   /* The replacement we made so far is valid, if all of the recursive
>> >>      replacements were valid, or we could simplify everything to
>> >>      a constant.  */
>> >>   return valid_ops || can_appear || CONSTANT_P (tem);
>> >> 
>> >> so that (REG_P (tem) && !HARD_REGISTER_P (tem)) is also valid?
>> >> I suppose that's likely to increase register pressure though,
>> >> if only some uses of new_rtx simplify.  (There again, requiring all
>> >> uses to be replacable could make hot code the hostage of cold code.)
>> >
>> > Yes, my fear was about register presure increase for the case not all
>> > uses can be replaced (fwprop doesn't seem to have code to verify or
>> > require that).
>> >
>> > I can avoid checking for GET_CODE (x) == SUBREG and add a PR_REG
>> > case to restrict REG_P (tem) && !HARD_REGISTER_P (tem) to the
>> > new_rtx == [VEC_]CONCAT case for example.
>> 
>> I don't think that helps though.  There might be other uses of a
>> VEC_CONCAT that aren't SUBREGs, in which case we'd have the same
>> problem of keeping both values live at once.
>> 
>> How about restricting the REG_P (tem) && !HARD_REGISTER_P (tem)
>> to cases where new_rtx has more words than tem?
>
> So would you really make a simple mode-size check here?

I thought a mode check would be worth trying since (for better or worse)
words are special for subregs.  But...

> I wonder which cases are there other than the subreg of [vec_concat]
> that would end up with this case.  That is,
>
>   if (REG_P (tem) && !HARD_REGISTER_P (tem)
>       && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx))
>       && (VECTOR_MODE_P (GET_MODE (new_rtx))
>           || COMPLEX_MODE_P (GET_MODE (new_rtx))))
>     return true;

this looks good to me too FWIW.  I think it reads more naturally if
you do the GET_MODE_INNER check last.

> works as would constraining new_rtx to [VEC_]CONCAT instead of
> vector/complex modes.  I'm worried about relaxing it further
> as partial int modes also have a GET_MODE_INNER - relaxing to
>
>       && GET_MODE_INNER (GET_MODE (new_rtx)) != GET_MODE (new_rtx)
>
> I mean.  If we restrict it to [VEC_]CONCAT we can even allow
> any REG_P && !HARD_REGISTER_P as I can't see any outer (non-noop)
> operation that will simplify into a REG_P besides a subreg.

I think the simplification could also trigger for VEC_DUPLICATE and
VEC_MERGE, so the patch seems better than a check for specific codes.

> But the above form would capture the intent to allow simplifying
> operations on vector / complex to a scalar component.
>
> Which means I can test the following if that looks better
> than what I had originally.  There's Seghers 2->2 combine
> patch but that has wider impact as well.
>
> Thanks,
> Richard.
>
> 2016-07-04  Richard Biener  <rguenther@suse.de>
>
> 	PR rtl-optimization/68961
> 	* fwprop.c (propagate_rtx): Allow SUBREGs of VEC_CONCAT and CONCAT
> 	to simplify to a non-constant.
>
> 	* gcc.target/i386/pr68961.c: New testcase.

Thanks, LGTM.

Richard

>
> Index: gcc/testsuite/gcc.target/i386/pr68961.c
> ===================================================================
> *** gcc/testsuite/gcc.target/i386/pr68961.c	(revision 0)
> --- gcc/testsuite/gcc.target/i386/pr68961.c	(working copy)
> ***************
> *** 0 ****
> --- 1,19 ----
> + /* { dg-do compile { target lp64 } } */
> + /* { dg-options "-O3 -fno-vect-cost-model -fdump-tree-slp2-details" } */
> + 
> + struct x { double d[2]; };
> + 
> + struct x
> + pack (double a, double aa)
> + {
> +   struct x u;
> +   u.d[0] = a;
> +   u.d[1] = aa;
> +   return u;
> + }
> + 
> + /* The function should be optimized to just return as arguments and
> +    result exactly overlap even when previously vectorized.  */
> + 
> + /* { dg-final { scan-tree-dump "basic block vectorized" "slp2" } } */
> + /* { dg-final { scan-assembler-not "mov" } } */
> Index: gcc/fwprop.c
> ===================================================================
> *** gcc/fwprop.c	(revision 237955)
> --- gcc/fwprop.c	(working copy)
> *************** propagate_rtx_1 (rtx *px, rtx old_rtx, r
> *** 619,624 ****
> --- 619,633 ----
>   
>     *px = tem;
>   
> +   /* Allow replacements that simplify operations on a vector or complex
> +      value to a component.  The most prominent case is
> +      (subreg ([vec_]concat ...)).   */
> +   if (REG_P (tem) && !HARD_REGISTER_P (tem)
> +       && GET_MODE (tem) == GET_MODE_INNER (GET_MODE (new_rtx))
> +       && (VECTOR_MODE_P (GET_MODE (new_rtx))
> + 	  || COMPLEX_MODE_P (GET_MODE (new_rtx))))
> +     return true;
> + 
>     /* The replacement we made so far is valid, if all of the recursive
>        replacements were valid, or we could simplify everything to
>        a constant.  */