This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: LEA-splitting improvement patch.


Hi Uros,

Thanks  a lot forr your comments.

I prepared new patch and ChangeLog. Testing of x32 is in progress.

It it OK for trunk?

2012-08-14  Yuri Rumyantsev  <ysrumyan@gmail.com>

	* config/i386/i386-protos.h (ix86_split_lea_for_addr) : Add
	additional argument.
	* config/i386/i386.md (ix86_split_lea_for_addr) : Add
	additional argument curr_insn.
	* config/i386/i386.c (ix86_split_lea_for_addr): Do instructions
	reodering to get opportunities for better scheduling.
	(ix86_lea_outperforms): Do more aggressive lea splitting.
	(find_nearest_reg-def): New function. Find nearest register
	definition used in address.

2012/8/14 Uros Bizjak <ubizjak@gmail.com>:
> Hello!
>
>> It is known that LEA splitting is one of the most critical problems
>> for Atom processors and changes try to improve it through:
>> 1.       More aggressive Lea splitting – do not perform splitting if
>> only split cost exceeds AGU stall .
>> 2.       Reordering splitting instructions to get better scheduling –
>> use the farthest defined register for SET instruction, then add
>> constant offset if any and finally generate add instruction.This gives
>> +0.5% speedup in geomean for eembc2.0 suite on Atom.
>> All required testing was done – bootstraps for Atom & Core2, make check.
>> Note that this fix affects only on Atom processors.
>
> IMO, you should test LEA handling changes on x32 atom, too. With
> recent changes, you will get lots of zero_extended addresses through
> these functions, so I think it is worth to benchmark on x32 target.
>
>> ChangeLog:
>> 2012-08-08  Yuri Rumyantsev  Yuri.S.Rumyantsev@intel.com
>>
>> * config/i386/i386-protos.h (ix86_split_lea_for_addr) : Add additional argument.
>> * config/i386/i386.md (ix86_splitt_lea_for_addr) : Add additional
>> argument curr_insn.
>> * config/i386/i386.c (find_nearest_reg-def): New function. Find
>> nearest register definition used in address.
>
> (find_nearest_reg_def)
>
>> (ix86_split_lea_for_addr) : Do more aggressive lea splitting and
>> instructions reodering to get opportunities for better scheduling.
>
> Please merge entries for ix86_split_lea_for_addr.
>
> @@ -16954,7 +16954,7 @@ ix86_lea_outperforms (rtx insn, unsigned int
> regno0, unsigned int regno1,
>    /* If there is no use in memory addess then we just check
>       that split cost exceeds AGU stall.  */
>    if (dist_use < 0)
> -    return dist_define >= LEA_MAX_STALL;
> +    return dist_define > LEA_MAX_STALL;
>
> You didn't described this change in ChangeLog. Does this affect also
> affect benchmark speedup?
>
> +/* Return 0 if regno1 def is nearest to insn and 1 otherwise. */
>
> Watch comment formatting and vertical spaces!
>
> +static int
> +find_nearest_reg_def (rtx insn, int regno1, int regno2)
>
> IMO, you can return 0, 1 and 2; with 0 when no definitions are found
> in the BB, 1 and 2 when either of two regnos are found. Otherwise,
> please use bool for function type.
>
> +       if (insn_defines_reg (regno1, regno1, prev))
> +         return 0;
> +       else if (insn_defines_reg (regno2, regno2, prev))
>
> Please use INVALID_REGNUM as the second argument in the call to
> insn_defines_reg when looking for only one regno definition.
>
>             {
> -             emit_insn (gen_rtx_SET (VOIDmode, target, parts.base));
> -             tmp = parts.index;
> +              rtx tmp1;
> +              /* Try to give more opportunities to scheduler -
> +                 choose operand for move instruction with longer
> +                 distance from its definition to insn. */
>
> (Hm, I don't think you mean gcc insn scheduler here.)
>
> +              if (find_nearest_reg_def (insn, regno1, regno2) == 0)
> +                {
> +                  tmp = parts.index;  /* choose index for move. */
> +                  tmp1 = parts.base;
> +                }
> +              else
> +               {
> +                 tmp = parts.base;
> +                 tmp1 = parts.index;
> +               }
> +             emit_insn (gen_rtx_SET (VOIDmode, target, tmp));
> +              if (parts.disp && parts.disp != const0_rtx)
> +                ix86_emit_binop (PLUS, mode, target, parts.disp);
> +             ix86_emit_binop (PLUS, mode, target, tmp1);
> +              return;
>             }
>
> (Please use tabs instead of spaces in the added code.)
>
> However, this whole new part can be written simply as following (untested):
>
>             {
>               rtx tmp1;
>
>               if (find_nearest_reg_def (insn, regno1, regno2) == 0)
>                 tmp1 = parts.base, tmp = parts.index;
>               else
>                 tmp1 = parts.index, tmp = parts.base;
>
>               emit_insn (gen_rtx_SET (VOIDmode, target, tmp1);
>             }
>
> Please see how tmp is handled further down in the function.
>
> -         ix86_emit_binop (PLUS, mode, target, tmp);
> +         ix86_emit_binop (PLUS, mode, target, tmp);
>
> Please watch accidental whitespace changes.
>
> Uros.

Attachment: lea_split_improve.diff
Description: Binary data


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]