This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [4.0 and mainline] Fix multiplication by constant expansion


Hi Jan,

On Sun, 1 Jan 2006, Roger Sayle wrote:
> I'm currently investigating a better solution, but it looks like
> the underlying cause is the high cost of "lea" in i386.c:athlon_cost.

I've managed to come up with two small non-intrusive patches to fix
the athlon code generation issue that could be suitable for 4.0,
4.1 and mainline.  For mainline, I still think we want to move to
using COSTS_N_INSNS in the declarations of the processor_costs tables,
but even this might be a bit too invasive for release branches.

In the meantime, I have tested both of the following:

patch #1:
Index: i386.c
===================================================================
*** i386.c      (revision 109198)
--- i386.c      (working copy)
*************** struct processor_costs k6_cost = {
*** 331,337 ****
  static const
  struct processor_costs athlon_cost = {
    1,                                  /* cost of an add instruction */
!   2,                                  /* cost of a lea instruction */
    1,                                  /* variable shift costs */
    1,                                  /* constant shift costs */
    {5, 5, 5, 5, 5},                    /* cost of starting a multiply */
--- 331,337 ----
  static const
  struct processor_costs athlon_cost = {
    1,                                  /* cost of an add instruction */
!   1,                                  /* cost of a lea instruction */
    1,                                  /* variable shift costs */
    1,                                  /* constant shift costs */
    {5, 5, 5, 5, 5},                    /* cost of starting a multiply */


and

patch #2:
Index: i386.c
===================================================================
*** i386.c      (revision 109198)
--- i386.c      (working copy)
*************** static const
*** 332,339 ****
  struct processor_costs athlon_cost = {
    1,                                  /* cost of an add instruction */
    2,                                  /* cost of a lea instruction */
!   1,                                  /* variable shift costs */
!   1,                                  /* constant shift costs */
    {5, 5, 5, 5, 5},                    /* cost of starting a multiply */
    0,                                  /* cost of multiply per each bit set */
    {18, 26, 42, 74, 74},                       /* cost of a divide/mod */
--- 332,339 ----
  struct processor_costs athlon_cost = {
    1,                                  /* cost of an add instruction */
    2,                                  /* cost of a lea instruction */
!   2,                                  /* variable shift costs */
!   2,                                  /* constant shift costs */
    {5, 5, 5, 5, 5},                    /* cost of starting a multiply */
    0,                                  /* cost of multiply per each bit set */
    {18, 26, 42, 74, 74},                       /* cost of a divide/mod */


Both resolve the multiplication by 11 issue with -march=athlon.  The
first reduces the cost of lea, making it cheaper than a shift plus an
addition; the alternative second patch increases the cost of a shift,
so that a shift plus an addition is more expensive than a lea.  At
some point in the future, the COSTS_N_INSNS change should give us
slightly finer-grained tuning.
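
To make the arithmetic concrete, here is a rough sketch of how the two
candidate sequences for x*11 compare under each set of costs.  It is a toy
model only, not the expander's actual search: the struct toy_costs type and
all function names are hypothetical, and it assumes the choice is between
two lea instructions and two shift-plus-add pairs.  Only the three cost
values are taken from the hunks above.

```c
#include <assert.h>

/* Toy cost table (hypothetical; only the three entries the patches touch). */
struct toy_costs {
  int add;    /* cost of an add instruction */
  int lea;    /* cost of a lea instruction */
  int shift;  /* constant shift costs */
};

/* Values copied from the athlon_cost hunks above. */
const struct toy_costs athlon_orig   = {1, 2, 1};
const struct toy_costs athlon_patch1 = {1, 1, 1};  /* lea: 2 -> 1 */
const struct toy_costs athlon_patch2 = {1, 2, 2};  /* shift: 1 -> 2 */

/* x*11 as two lea-style operations: t = x + x*4, then x + t*2. */
unsigned times11_lea(unsigned x)
{
  unsigned t = x + x * 4;   /* 5x  */
  return x + t * 2;         /* 11x */
}

/* x*11 as two shift-plus-add pairs: t = (x<<2) + x, then (t<<1) + x. */
unsigned times11_shift_add(unsigned x)
{
  unsigned t = (x << 2) + x;  /* 5x  */
  return (t << 1) + x;        /* 11x */
}

/* Modelled cost of each sequence: a plain sum of instruction costs. */
int cost_lea_sequence(const struct toy_costs *c)
{
  return 2 * c->lea;
}

int cost_shift_add_sequence(const struct toy_costs *c)
{
  return 2 * (c->shift + c->add);
}

/* Returns 1 only when the lea sequence is strictly cheaper. */
int prefers_lea(const struct toy_costs *c)
{
  return cost_lea_sequence(c) < cost_shift_add_sequence(c);
}

/* Sanity check that both sequences really compute x*11. */
void check_sequences(unsigned x)
{
  assert(times11_lea(x) == x * 11u);
  assert(times11_shift_add(x) == x * 11u);
}
```

In this model the original table gives a tie (4 against 4), so nothing
favours lea.  Patch #1 gives 2 against 4 and patch #2 gives 4 against 6,
so prefers_lea() returns 1 for both.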

Could I ask you and/or Andreas and/or H.J. and/or Evandro to benchmark
the above patches against SPEC with -march=athlon and report whether
they are net wins or losses?  In addition to making athlon-tuned code
run faster, these tweaks bring the cost table closer to the Intel
timings, which should mean that code compiled with -march=athlon is
also less pessimized on non-athlon processor families.


I hope this compromise is acceptable.  With luck, one of the above
refinements will outperform mainline, providing a "quick fix".  The
problem is that on the athlon, you want "lea" to be cheaper than a
shift-and-add or a shift-and-sub sequence.  The obvious way to do
this is to reflect it in the target's rtx_costs.

In an ideal world, retaining the latency optimization will even
benefit athlon: among sequences of identical cost, it prefers those
in which shifts and/or additions can be issued concurrently over
those without parallelism.

Many thanks in advance,

Roger
--


