Yet another attempt to solve multiplication by 11 and many (most?) other constants problem

Jan Hubicka hubicka@ucw.cz
Thu Jan 26 12:09:00 GMT 2006


> 
> >+ 
> >+   /* alg_add_t_m2 is useful for eliminating bits set in t.
> >+      It also allow shit to happen in parallel on superscalar CPU so it is
>                        ^^^^
> 
> shifts :-)
> 
> The other corrections to the doc are more cosmetic.  Feel free to reject 
> them of course.
> 
> >+      essential to get reasonable mult sequences on those.  Unfortunately 
> >it is
>                    ^^^^^^
> 
> ... so it gives much better multiplication sequences on those machines.
> 
> >+      too expensive to try each possible bit set and it seems difficult 
> >to prune
> >+      the possibilities somehow because number of special cases we deal 
> >with.
> 
> to prune the search because the right choice at one step is often not 
> evident
> 
> >+      Thus we try the highest bit set that is often useful to make t 
> >small enough
> >+      for LEA style operations and also few lowest bits, skipping the 
> >others.  */
> 
> We try using alg_add_t_m2 for the highest bit, hoping to make T small 
> enough for LEA-style operations.  We also try on the low four bits, 
> hoping that a small perturbation of T makes it possible to use alg_*_factor.
> 
> >+ 	     FIXME: This should be probably controlled by target hook as well
> >+ 	     as the latency code should be disabled on non-superscalar CPUs. 
> >*/
> 
> Do you think it would be ok to parameterize this by flag_schedule_insns2?

We probably can assume that CPUs without flag_scedule_insns2 set are not
superscalar, but that would rule out Pentium4 as well.  The problem I am
solving here is however bit different.  I need to know that:

reg <<= 2 is faster than
reg = reg2<<2

ie, the rtx_cost will give both sequences same cost (since we hardly can
guess whether the instruction will end up with different source than
destination) but on 2 address architecture the parallel shift is the
second case.  Since we are really counting cycles here to get the
sequences right, it sort of matters more than in rest of compiler.

i386 is probably worst example of what we can get in this respect - it
is mostly 2 address architecture, but some targets, slike P4, behaves
essentially as 3 address when the shift operand is small enought to fit
in lea instruction...

Thank you for the documentation fixes, I will incorporate them to my
copy of patch!

Honza
> 
> Paolo



More information about the Gcc-patches mailing list