This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: Vectorization regression on s390x GCC6 vs GCC5


On Thu, Jan 26, 2017 at 10:18 AM, Robin Dapp <rdapp@linux.vnet.ibm.com> wrote:
> Hi,
>
> while analyzing a test case with a lot of nested loops (>7) and double
> floating point operations I noticed a performance regression of GCC 6/7
> vs GCC 5 on s390x. It seems due to GCC 6 vectorizing something GCC 5
> couldn't.
>  Basically, each loop iterates over three dimensions, we fully unroll
> some of the inner loops until we have straight-line code of roughly 2000
> insns that are being executed three times in GCC 5. GCC 6 vectorizes two
> iterations and adds a scalar epilogue for the third iteration. The
> epilogue code is so bad that it slows down the execution by at least
> 50%, using only two hard registers and lots of spill slots.
> Although my analysis is not completed, I believe this is because
> register pressure is high in the epilogue and the live ranges span the
> vectorized code as well as the epilogue.
>
> Even reduced, the test case is huge, therefore I didn't include it. Some
> high-level questions instead:
>
> - Has anybody else observed similar problems and got around them?
Yes, I think so.  We have also seen cases where GCC vectorizes with a
larger vectorization factor, which causes a regression too.

>
> - Is there some way around the register pressure/long live ranges?
I am experimenting with computing coarse-grained register pressure for
GIMPLE loops, but the motivation is not the vectorizer; it comes from
predcom/PRE, as in PR77498.
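
As a rough illustration of what "coarse-grained register pressure" can
mean, here is a minimal sketch in C.  It is not GCC code and not the
experiment mentioned above: the names live_range and max_pressure are
hypothetical, and it assumes each value is live from its defining
statement to its last use, counting the largest number of values live
at any one statement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a value is live from the statement that defines it
   up to and including the statement of its last use.  */
struct live_range {
    int def;       /* index of the defining statement */
    int last_use;  /* index of the last statement that reads the value */
};

/* Return the largest number of ranges live at any single statement.
   This is a coarse estimate only; it ignores register classes, spilling
   and rematerialization.  */
int max_pressure(const struct live_range *r, size_t n, int n_stmts)
{
    assert(n_stmts >= 0);
    int max = 0;
    for (int s = 0; s < n_stmts; s++) {
        int live = 0;
        for (size_t i = 0; i < n; i++)
            if (r[i].def <= s && s <= r[i].last_use)
                live++;
        if (live > max)
            max = live;
    }
    return max;
}
```

If such an estimate exceeds the number of available hard registers, a
transformation that lengthens live ranges is more likely to cause
spilling.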

> Perhaps something we could/should fix in the s390 backend? (Probably
> hard to tell without source)
>
> - Would it make sense to allow a backend to specify the minimal number
> of loop iterations considered for vectorization? Is this
> perhaps already possible somehow? I added a check to disable
> vectorization for loops with <= 3 iterations that shows no regressions
> and improves two SPEC benchmarks noticeably. I'm even considering <=5,
> since a vectorization factor of 4 should exhibit the same problematic
> pattern.
Is the number of iterations (niter) known at compile time?  If yes, I am
surprised by GCC's behavior on loops with so few iterations.  Could it be
the cost model?
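
To make the pattern under discussion concrete, here is a minimal sketch
in C.  It is not derived from Robin's test case: axpy_scalar, axpy_split,
epilogue_iters, NITER and VF are hypothetical names, and VF = 2 assumes
two doubles fit in one vector register.  The second function models by
hand the shape described above: one vector-width step followed by a
scalar epilogue for the leftover iteration.

```c
#include <assert.h>
#include <stddef.h>

#define NITER 3  /* trip count known at compile time, as in the report */
#define VF    2  /* assumed vectorization factor: two doubles per vector */

/* Scalar reference: the loop body simply runs NITER times.  */
void axpy_scalar(double *y, const double *x, double a)
{
    for (size_t i = 0; i < NITER; i++)
        y[i] += a * x[i];
}

/* Hand-written model of the vectorized shape.  Each trip of the first
   loop stands for one VF-wide vector operation (two lanes written out
   explicitly); the second loop is the scalar epilogue.  */
void axpy_split(double *y, const double *x, double a)
{
    size_t i = 0;
    for (; i + VF <= NITER; i += VF) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
    for (; i < NITER; i++)  /* leftover NITER % VF iterations */
        y[i] += a * x[i];
}

/* Number of iterations left for the scalar epilogue.  */
size_t epilogue_iters(size_t niter, size_t vf)
{
    assert(vf > 0);
    return niter % vf;
}
```

With niter = 3 and a factor of 2, or niter = 5 and a factor of 4,
epilogue_iters returns 1, so in both cases a single scalar iteration is
left over after the vector part.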

Thanks,
bin
>
> Regards
>  Robin
>

