This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: inadequate multiply-by-const expansion for pentium4


Luchezar Belev wrote:
I played a bit to see how gcc expands multiplication by various constants in
leas/adds/subs/shifts and found that it does not do the job in the best possible
way it could. I tracked for the the reason and found 3 different problems:

We don't track messages sent to the gcc mailing list. Maybe someone will look at this, maybe not. You might want to submit a bug report for this, so the ideas don't get lost. Or you can try fixing them yourself.


We may not be able to solve all such problems. We have to use heuristics (guesses) here so that the compiler runs in a reasonable amount of time, and sometimes it isn't possible to get optimal results in all cases.

 1) the cost of an add insn for p4 (in config/i386/i386.c) is 1 while
    it should be 0.5 (or probably 1 with all other costs doubled)

The costs are relative to other costs. Doubling the costs may not work. And since the cost has to be an integer here, using 0.5 won't work either.


 2) that x86_decompose_lea flag is causing invalid cost evaluation for
    the lea insn (in gcc/expmed.c, init_expmed())

I don't have a P4 optimization guide, so I can't answer this. There may be some non-obvious reason why lea is bad even though it has a shorter latency. For instance, maybe it uses more function units, or maybe it stalls other pipelines, etc.


Checking the mailing lists, the answer seems to be that lea expands to multiple uops on Pentium4, though this isn't clearly stated in the patch.

 3) look at the followind code in gcc/expmed.c, expand_mult():
      mult_cost = rtx_cost (gen_rtx_MULT (mode, op0, op1), SET);
      mult_cost = MIN (12 * add_cost, mult_cost);

This looks like a pragmatic code expansion limit. If given a choice between emitting 13 adds or a single multiply, we choose to emit a single multiply even if the latency is longer, to avoid unreasonable code size expansion. 13 adds may be slower due to cache effects for instance, even if the latency figures show that they may be faster. It might be reasonable to change this limit depending on optimization options. Or maybe compute the limit differently, for instance 6 * add_cost + 6 * shift_cost instead of just 12 * add_cost.


In general, performance on a testsuite (like SPEC) will be more convincing than simple multiply sequences in isoation. In real programs, a multiply usually does not occur by itself, so you have to consider how the sequence performs when scheduled with other instructions. Sequences that look optimal in isolation may cause real programs to run slower because of scheduling, cache, or other effects.
--
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]