[patch] Improve detection of widening multiplication in the vectorizer

Thu Jun 2 15:35:00 GMT 2011

On Thu, Jun 2, 2011 at 1:08 PM, Ira Rosen <ira.rosen@linaro.org> wrote:
> On 2 June 2011 12:59, Richard Guenther <richard.guenther@gmail.com> wrote:
>> On Thu, Jun 2, 2011 at 10:46 AM, Ira Rosen <ira.rosen@linaro.org> wrote:
>>> On 1 June 2011 15:14, Richard Guenther <richard.guenther@gmail.com> wrote:
>>>> On Wed, Jun 1, 2011 at 1:37 PM, Ira Rosen <ira.rosen@linaro.org> wrote:
>>>>> On 1 June 2011 12:42, Richard Guenther <richard.guenther@gmail.com> wrote:
>>>>>
>>>>>> Did you think about moving pass_optimize_widening_mul before
>>>>>> loop optimizations?  Does that pass catch the cases you are
>>>>>> teaching the pattern recognizer?  I think we should try to expose
>>>>>> these more complicated instructions to loop optimizers.
>>>>>>
>>>>>
>>>>> pass_optimize_widening_mul doesn't catch these cases, but I can try to
>>>>> teach it instead of the vectorizer.
>>>>> I am now testing
>>>>>
>>>>> Index: passes.c
>>>>> ===================================================================
>>>>> --- passes.c    (revision 174391)
>>>>> +++ passes.c    (working copy)
>>>>> @@ -870,6 +870,7 @@
>>>>>       NEXT_PASS (pass_split_crit_edges);
>>>>>       NEXT_PASS (pass_pre);
>>>>>       NEXT_PASS (pass_sink_code);
>>>>> +      NEXT_PASS (pass_optimize_widening_mul);
>>>>>       NEXT_PASS (pass_tree_loop);
>>>>>        {
>>>>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>>>>> @@ -934,7 +935,6 @@
>>>>>       NEXT_PASS (pass_forwprop);
>>>>>       NEXT_PASS (pass_phiopt);
>>>>>       NEXT_PASS (pass_fold_builtins);
>>>>> -      NEXT_PASS (pass_optimize_widening_mul);
>>>>>       NEXT_PASS (pass_tail_calls);
>>>>>       NEXT_PASS (pass_rename_ssa_copies);
>>>>>       NEXT_PASS (pass_uncprop);
>>>>>
>>>>> to see how it affects other loop optimizations (vectorizer pattern
>>>>> tests obviously fail).
>>>
>>> Looks like it needs copy_prop and dce as well:
>>>
>>> Index: passes.c
>>> ===================================================================
>>> --- passes.c    (revision 174391)
>>> +++ passes.c    (working copy)
>>> @@ -870,6 +870,9 @@
>>>       NEXT_PASS (pass_split_crit_edges);
>>>       NEXT_PASS (pass_pre);
>>>       NEXT_PASS (pass_sink_code);
>>> +      NEXT_PASS (pass_copy_prop);
>>> +      NEXT_PASS (pass_dce);
>>> +      NEXT_PASS (pass_optimize_widening_mul);
>>>       NEXT_PASS (pass_tree_loop);
>>>        {
>>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>>> @@ -934,7 +937,6 @@
>>>       NEXT_PASS (pass_forwprop);
>>>       NEXT_PASS (pass_phiopt);
>>>       NEXT_PASS (pass_fold_builtins);
>>> -      NEXT_PASS (pass_optimize_widening_mul);
>>>       NEXT_PASS (pass_tail_calls);
>>>       NEXT_PASS (pass_rename_ssa_copies);
>>>       NEXT_PASS (pass_uncprop);
>>>
>>> otherwise I get (on x86_64-suse-linux)
>>>
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddss
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddsd
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubss
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubsd
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddss
>>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddsd
>>
>> Hmm.  I would have put the pass next to the sincos pass, but yes,
>> in principle a copyprop & dce pass after PRE makes sense
>> (the loop passes likely don't run because there are no loops in
>> those testcases - both copyprop and dce should be scheduled
>> more like TODOs, or even automatically by the pass manager
>> via PROPs ...).  Dead code can indeed confuse those matching
>> passes that look for single-use vars.
>>
>> I'll think about a more elegant solution for this problem.
>>
>> Would you mind checking if the next-to-sincos position makes
>> any difference?
>
> Before sincos we have
>
>  D.2747_2 = __builtin_powf (a_1(D), 2.0e+0);
>  D.2746_4 = D.2747_2 + c_3(D);
>
> which is transformed by sincos to
>
>  powmult.8_7 = a_1(D) * a_1(D);
>  D.2747_2 = powmult.8_7;
>  D.2746_4 = D.2747_2 + c_3(D);
>
> but widening_mul  is confused by D.2747_2 = powmult.8_7; and it needs
> both copy_prop and dce to remove it:
>
>  powmult.8_7 = a_1(D) * a_1(D);
>  D.2746_4 = c_3(D) + powmult.8_7;
>
> So moving widening_mul next to sincos doesn't help.
> Maybe gimple_expand_builtin_pow() can be changed to generate the last
> version by itself?

Yeah, I guess so.  I'll have a look.

Richard.

> Ira
>
>>
>> Thanks,
>> Richard.
>>
>>> Ira
>>>
>>>>
>>>> Thanks.  I would hope that we eventually can get rid of the
>>>> pattern recognizer ... at least for SSE there is also always
>>>> a scalar variant instruction for each vectorized one.
>>>>
>>>> Richard.
>>>>
>>>
>>
>