[patch] Improve detection of widening multiplication in the vectorizer

Ira Rosen ira.rosen@linaro.org
Thu Jun 2 11:08:00 GMT 2011


On 2 June 2011 12:59, Richard Guenther <richard.guenther@gmail.com> wrote:
> On Thu, Jun 2, 2011 at 10:46 AM, Ira Rosen <ira.rosen@linaro.org> wrote:
>> On 1 June 2011 15:14, Richard Guenther <richard.guenther@gmail.com> wrote:
>>> On Wed, Jun 1, 2011 at 1:37 PM, Ira Rosen <ira.rosen@linaro.org> wrote:
>>>> On 1 June 2011 12:42, Richard Guenther <richard.guenther@gmail.com> wrote:
>>>>
>>>>> Did you think about moving pass_optimize_widening_mul before
>>>>> loop optimizations?  Does that pass catch the cases you are
>>>>> teaching the pattern recognizer?  I think we should try to expose
>>>>> these more complicated instructions to loop optimizers.
>>>>>
>>>>
>>>> pass_optimize_widening_mul doesn't catch these cases, but I can try to
>>>> teach it instead of the vectorizer.
>>>> I am now testing
>>>>
>>>> Index: passes.c
>>>> ===================================================================
>>>> --- passes.c    (revision 174391)
>>>> +++ passes.c    (working copy)
>>>> @@ -870,6 +870,7 @@
>>>>       NEXT_PASS (pass_split_crit_edges);
>>>>       NEXT_PASS (pass_pre);
>>>>       NEXT_PASS (pass_sink_code);
>>>> +      NEXT_PASS (pass_optimize_widening_mul);
>>>>       NEXT_PASS (pass_tree_loop);
>>>>        {
>>>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>>>> @@ -934,7 +935,6 @@
>>>>       NEXT_PASS (pass_forwprop);
>>>>       NEXT_PASS (pass_phiopt);
>>>>       NEXT_PASS (pass_fold_builtins);
>>>> -      NEXT_PASS (pass_optimize_widening_mul);
>>>>       NEXT_PASS (pass_tail_calls);
>>>>       NEXT_PASS (pass_rename_ssa_copies);
>>>>       NEXT_PASS (pass_uncprop);
>>>>
>>>> to see how it affects other loop optimizations (vectorizer pattern
>>>> tests obviously fail).
>>
>> Looks like it needs copy_prop and dce as well:
>>
>> Index: passes.c
>> ===================================================================
>> --- passes.c    (revision 174391)
>> +++ passes.c    (working copy)
>> @@ -870,6 +870,9 @@
>>       NEXT_PASS (pass_split_crit_edges);
>>       NEXT_PASS (pass_pre);
>>       NEXT_PASS (pass_sink_code);
>> +      NEXT_PASS (pass_copy_prop);
>> +      NEXT_PASS (pass_dce);
>> +      NEXT_PASS (pass_optimize_widening_mul);
>>       NEXT_PASS (pass_tree_loop);
>>        {
>>          struct opt_pass **p = &pass_tree_loop.pass.sub;
>> @@ -934,7 +937,6 @@
>>       NEXT_PASS (pass_forwprop);
>>       NEXT_PASS (pass_phiopt);
>>       NEXT_PASS (pass_fold_builtins);
>> -      NEXT_PASS (pass_optimize_widening_mul);
>>       NEXT_PASS (pass_tail_calls);
>>       NEXT_PASS (pass_rename_ssa_copies);
>>       NEXT_PASS (pass_uncprop);
>>
>> otherwise I get (on x86_64-suse-linux)
>>
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddss
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmaddsd
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubss
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfmsubsd
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddss
>> FAIL: gcc.target/i386/fma4-fma-2.c scan-assembler vfnmaddsd
>
> Hmm.  I would have put the pass next to the sincos pass, but yes,
> in principle a copyprop & dce pass after PRE makes sense
> (the loop passes likely don't run because there are no loops in
> those testcases - both copyprop and dce should be scheduled
> more like TODOs, or even automatically by the pass manager
> via PROPs ...).  Dead code can indeed confuse those matching
> passes that look for single-use vars.
>
> I'll think about a more elegant solution for this problem.
>
> Would you mind checking if the next-to-sincos position makes
> any difference?

Before sincos we have

  D.2747_2 = __builtin_powf (a_1(D), 2.0e+0);
  D.2746_4 = D.2747_2 + c_3(D);

which is transformed by sincos to

  powmult.8_7 = a_1(D) * a_1(D);
  D.2747_2 = powmult.8_7;
  D.2746_4 = D.2747_2 + c_3(D);

but widening_mul  is confused by D.2747_2 = powmult.8_7; and it needs
both copy_prop and dce to remove it:

  powmult.8_7 = a_1(D) * a_1(D);
  D.2746_4 = c_3(D) + powmult.8_7;

So moving widening_mul next to sincos doesn't help.
Maybe gimple_expand_builtin_pow() can be changed to generate the last
version by itself?

Ira

>
> Thanks,
> Richard.
>
>> Ira
>>
>>>
>>> Thanks.  I would hope that we eventually can get rid of the
>>> pattern recognizer ... at least for SSE there is also always
>>> a scalar variant instruction for each vectorized one.
>>>
>>> Richard.
>>>
>>
>



More information about the Gcc-patches mailing list