This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH 14851] Final fix for suboptimal fp division with -ffast-math


Roger Sayle wrote:

> PR rtl-optimization/14851
> * combine.c (combine_instructions): Move 'three linked insns'
> pass before 'two linked insns' pass.
>
> However, your benchmarking clearly shows a significant benefit (for x87
> mathematics at least) for doing three instruction combinations before
> two instruction combinations.  My worry now is how this change affects
> GCC compile-times and what impact it may have on other platforms.  The
> three linked passes are slower than the two-linked pass, so there's a
> strong possibility that this may slow down combine, which is already a
> significant fraction of the time in the RTL passes when optimizing.

Roger, I think that compile time won't increase, as nothing is added. The passes are just rearranged, and the number of two-linked and three-linked combinations stays the same. There may even be somewhat fewer two-linked combinations, because the three-linked pass is more likely to combine instructions that would otherwise be candidates for two-linked combinations. A build benchmark can prove this:
building povray-3.50c with 'time gmake > /dev/null' results in:


three-linked pass first:
real  2m4.493s
user  1m32.100s
sys   0m3.178s

two-linked pass first:
real  2m9.391s
user  1m32.076s
sys   0m3.396s

> Is there any chance, I could ask you (or some kind volunteer) to measure
> the impact of this patch on compile-time and generated code on some
> other targets/benchmarks? I'm also uneasy about how deceptively simple
> but significant this change is, potentially affecting lots of code
> unrelated to the original PR, so this solution might not be suitable for
> stage3 :<


It is very hard to measure the performance impact of this change. Although it can clearly be demonstrated that the number of combined instructions increases, the impact of these combinations depends on a number of other factors. As the probability that a simplified instruction sits in a tight loop increases with the number of combinations, we have a higher chance of hitting this simplification. I can present some statistics from the povray-3.50c compilation:

                 three-first  two-first  gain

(reg)/(reg):         886         852      +34
(reg)/(mem):          47          29      +18
(mem)/(reg):          38          41       -3
(reg)/(float):        44          35       +9

(const)/(reg):         3           6       -3   #invalid, but could be fixed
(mem)/(mem):           0          12      -12   #invalid
(float)/(reg):         6          12       -6   #invalid
other:                41          48       -7   #invalid

total:              1065        1035      +30

With the three-linked pass first, 30 more mult/div combinations were discovered, and the number of valid divisions generated increased by 58. In this compilation we hit 3 cases where (CONST_DOUBLE)/(reg) was returned, and these cases could be fixed by patch (b) in my partial fix. It is interesting that the three-linked pass didn't generate any invalid (mem)/(mem) divisions, so your proposed patch would have no job here ;) . There are still 41 invalid cases where we try to generate (float_extend)/(float), etc. These would be greatly reduced if the a/b -> a * (1.0/b) transformation were moved to tree-ssa.
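
For readers unfamiliar with the a/b -> a * (1.0/b) transformation discussed above, here is a minimal C sketch of what it amounts to at source level when the divisor is loop-invariant. The function names scale_div and scale_recip are made up for illustration and come from neither GCC nor povray. For most divisors the rewritten form can differ from true division in the last bit, which is why such a rewrite is only acceptable under -ffast-math.

```c
#include <assert.h>

/* Hypothetical example: divide every element by the same divisor.
   As written, this performs n floating-point divisions. */
void scale_div(double *out, const double *in, int n, double b)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = in[i] / b;
}

/* The same loop with a/b -> a * (1.0/b) applied by hand:
   one division up front, then n multiplications.
   Results may differ from scale_div in the last bit unless
   1.0/b is exactly representable (e.g. b is a power of two). */
void scale_recip(double *out, const double *in, int n, double b)
{
    assert(n >= 0);
    double r = 1.0 / b;   /* reciprocal computed once */
    for (int i = 0; i < n; i++)
        out[i] = in[i] * r;
}
```

With a power-of-two divisor such as 4.0 both functions give identical results; with other divisors they may not, so comparing them exactly is not a valid check in general.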

It should be noted that a 'three-linked first' pass could do other transforms, too. As the three-linked pass can have a greater impact on the produced code, it should operate first, and after that the two-linked pass converts what remains. As it is now, the two-linked pass gets all the low-hanging fruit, blocking the three-linked pass from doing more complex transforms.
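
The ordering argument above can be illustrated with a toy model. This is only a sketch under strong assumptions and is not how combine.c works: instructions are single characters in a string, the hypothetical two-insn rule rewrites "ab" to 'X', the hypothetical three-insn rule rewrites "abc" to 'Y', and nothing can combine 'X' with a following 'c'. The names combine and run_passes are invented for this example.

```c
#include <assert.h>
#include <string.h>

/* Toy combiner: replace every occurrence of pat in s with the single
   character repl, in place.  Returns the number of replacements. */
static int combine(char *s, const char *pat, char repl)
{
    size_t plen = strlen(pat);
    int hits = 0;
    char *p = s;

    assert(plen >= 1);
    while ((p = strstr(p, pat)) != NULL) {
        *p = repl;
        /* Close the gap left by the consumed insns (including the NUL). */
        memmove(p + 1, p + plen, strlen(p + plen) + 1);
        p++;
        hits++;
    }
    return hits;
}

/* Run both toy passes in the requested order and return how many
   insns remain afterwards. */
size_t run_passes(char *insns, int three_first)
{
    if (three_first) {
        combine(insns, "abc", 'Y');
        combine(insns, "ab", 'X');
    } else {
        combine(insns, "ab", 'X');   /* consumes the "ab" of every "abc" */
        combine(insns, "abc", 'Y');  /* finds nothing left to match */
    }
    return strlen(insns);
}
```

On the input "abcab", running the three-insn rule first leaves "YX", while running the two-insn rule first leaves "XcX": the early two-insn match has consumed the prefix the three-insn rule needed.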

Uros.

