This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug tree-optimization/78348] [7 REGRESSION] 15% performance drop for coremark-pro/nnet-test after r242038


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78348

--- Comment #2 from Jim Wilson <wilson at gcc dot gnu.org> ---
The testcase doesn't produce runnable code, and I'm not sure if I have access to
any Haswell parts, but I can make a few comments.

The testcase requires -O3 -ftree-loop-distribution to reproduce.

Without my patch, the loop distribution pass thinks that there is a
bi-directional (backward and forward) dependence between the first and second
lines of the loop, and that prevents optimization.  With my patch, the loop
distribution pass correctly computes that there is only the forward
anti-dependence, which allows optimization.

Without optimization, the inner loop is fully unrolled and vectorized using
128-bit vectors holding pairs of doubles.  With optimization, we get calls to
the memmove and memset builtins.

There is a problem, already mentioned by Richard Biener in my patch review:
the unoptimized code has 2 memory streams, but the optimized code has 3
memory streams.  This might be causing some of the performance loss.

There is another problem here with load/store sizes.  The memmove builtin does
not get expanded inline, and we end up in libc, which appears to use 128-bit
loads and stores.  However, the memset is expanded inline, and we only get
64-bit stores.  The extra stores needed here may be causing some of the
performance loss.

For short term solutions, we could look at adding a heuristic that tries to
determine whether code is stream limited, and prevent optimizations that would
increase the number of streams in that case.  Maybe something like
PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO used in the prefetch code.

Another short term solution is to get the memset expander to use 128-bit
stores.

Long term, loop distribution should only be performed when this enables some
other optimization, like vectorization, which suggests that loop distribution
should be a library called by other passes, instead of its own pass.
