[Bug target/90204] [8/9/10 Regression] C code is optimized worse than C++
rguenther at suse dot de
gcc-bugzilla@gcc.gnu.org
Fri Apr 26 07:13:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
--- Comment #14 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 26 Apr 2019, crazylht at gmail dot com wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
>
> --- Comment #13 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to rguenther@suse.de from comment #10)
> > On Thu, 25 Apr 2019, crazylht at gmail dot com wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90204
> > >
> > > --- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
> > > Also what's better between aligned load/store of smaller size VS unaligned
> > > load/store of bigger size?
> > >
> > > aligned load/store of smaller size:
> > >
> > > movq %rdx, (%rdi)
> > > movq -56(%rsp), %rdx
> > > movq %rdx, 8(%rdi)
> > > movq -48(%rsp), %rdx
> > > movq %rdx, 16(%rdi)
> > > movq -40(%rsp), %rdx
> > > movq %rdx, 24(%rdi)
> > > vmovq %xmm0, 32(%rax)
> > > movq -24(%rsp), %rdx
> > > movq %rdx, 40(%rdi)
> > > movq -16(%rsp), %rdx
> > > movq %rdx, 48(%rdi)
> > > movq -8(%rsp), %rdx
> > > movq %rdx, 56(%rdi)
> > >
> > > unaligned load/store of bigger size:
> > >
> > > vmovups %xmm2, (%rdi)
> > > vmovups %xmm3, 16(%rdi)
> > > vmovups %xmm4, 32(%rdi)
> > > vmovups %xmm5, 48(%rdi)
> >
> > bigger stores are almost always a win while bigger loads have
> > the possibility to run into store-to-load forwarding issues
> > (and bigger stores eventually mitigate them). Based on
> > CPU tuning we'd also eventually end up with mov[lh]ps splitting
> > unaligned loads/stores.
>
> From
> https://software.intel.com/en-us/download/intel-64-and-ia-32-architectures-optimization-reference-manual
>
> 14.6.3 Prefer Aligned Stores Over Aligned Loads
>
> Unaligned stores are likely to cause greater performance degradation
> than unaligned loads, since there is a very high penalty on stores to
> a split cache-line that crosses pages. This penalty is estimated at
> 150 cycles. Loads that cross a page boundary are executed at
> retirement.
That's something to keep in mind when peeling for alignment, but as a
general rule, in straight-line code the chance of an unaligned store
hitting a page boundary is small, while running into an STLF failure
is much more likely.
More information about the Gcc-bugs
mailing list