This is the mail archive of the
mailing list for the GCC project.
Re: LEA-splitting improvement patch.
- From: Uros Bizjak <ubizjak at gmail dot com>
- To: gcc-patches at gcc dot gnu dot org
- Cc: Yuri Rumyantsev <ysrumyan at gmail dot com>, Igor Zamyatin <izamyatin at gmail dot com>
- Date: Mon, 13 Aug 2012 23:26:25 +0200
- Subject: Re: LEA-splitting improvement patch.
> It is known that LEA splitting is one of the most critical problems
> for Atom processors and changes try to improve it through:
> 1. More aggressive Lea splitting – do not perform splitting if
> only split cost exceeds AGU stall .
> 2. Reordering splitting instructions to get better scheduling –
> use the farthest defined register for SET instruction, then add
> constant offset if any and finally generate add instruction.This gives
> +0.5% speedup in geomean for eembc2.0 suite on Atom.
> All required testing was done – bootstraps for Atom & Core2, make check.
> Note that this fix affects only on Atom processors.
IMO, you should test LEA handling changes on x32 atom, too. With
recent changes, you will get lots of zero_extended addresses through
these functions, so I think it is worth to benchmark on x32 target.
> 2012-08-08 Yuri Rumyantsev Yuri.S.Rumyantsev@intel.com
> * config/i386/i386-protos.h (ix86_split_lea_for_addr) : Add additional argument.
> * config/i386/i386.md (ix86_splitt_lea_for_addr) : Add additional
> argument curr_insn.
> * config/i386/i386.c (find_nearest_reg-def): New function. Find
> nearest register definition used in address.
> (ix86_split_lea_for_addr) : Do more aggressive lea splitting and
> instructions reodering to get opportunities for better scheduling.
Please merge entries for ix86_split_lea_for_addr.
@@ -16954,7 +16954,7 @@ ix86_lea_outperforms (rtx insn, unsigned int
regno0, unsigned int regno1,
/* If there is no use in memory addess then we just check
that split cost exceeds AGU stall. */
if (dist_use < 0)
- return dist_define >= LEA_MAX_STALL;
+ return dist_define > LEA_MAX_STALL;
You didn't described this change in ChangeLog. Does this affect also
affect benchmark speedup?
+/* Return 0 if regno1 def is nearest to insn and 1 otherwise. */
Watch comment formatting and vertical spaces!
+find_nearest_reg_def (rtx insn, int regno1, int regno2)
IMO, you can return 0, 1 and 2; with 0 when no definitions are found
in the BB, 1 and 2 when either of two regnos are found. Otherwise,
please use bool for function type.
+ if (insn_defines_reg (regno1, regno1, prev))
+ return 0;
+ else if (insn_defines_reg (regno2, regno2, prev))
Please use INVALID_REGNUM as the second argument in the call to
insn_defines_reg when looking for only one regno definition.
- emit_insn (gen_rtx_SET (VOIDmode, target, parts.base));
- tmp = parts.index;
+ rtx tmp1;
+ /* Try to give more opportunities to scheduler -
+ choose operand for move instruction with longer
+ distance from its definition to insn. */
(Hm, I don't think you mean gcc insn scheduler here.)
+ if (find_nearest_reg_def (insn, regno1, regno2) == 0)
+ tmp = parts.index; /* choose index for move. */
+ tmp1 = parts.base;
+ tmp = parts.base;
+ tmp1 = parts.index;
+ emit_insn (gen_rtx_SET (VOIDmode, target, tmp));
+ if (parts.disp && parts.disp != const0_rtx)
+ ix86_emit_binop (PLUS, mode, target, parts.disp);
+ ix86_emit_binop (PLUS, mode, target, tmp1);
(Please use tabs instead of spaces in the added code.)
However, this whole new part can be written simply as following (untested):
if (find_nearest_reg_def (insn, regno1, regno2) == 0)
tmp1 = parts.base, tmp = parts.index;
tmp1 = parts.index, tmp = parts.base;
emit_insn (gen_rtx_SET (VOIDmode, target, tmp1);
Please see how tmp is handled further down in the function.
- ix86_emit_binop (PLUS, mode, target, tmp);
+ ix86_emit_binop (PLUS, mode, target, tmp);
Please watch accidental whitespace changes.