This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
On Sun, Nov 17, 2013 at 04:42:18PM +0100, Richard Biener wrote:
> "Ondřej Bílka" <neleai@seznam.cz> wrote:
> >On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
> >> "Ondřej Bílka" <neleai@seznam.cz> wrote:
> >> >On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
> >> 
> >> IIRC what can still be seen is store-buffer related slowdowns when
> >> you have a big unaligned store/load in your loop. Thus aligning
> >> stores still paid off the last time I measured this.
> >
> >Then send your benchmark. What I did is a loop that stores 512 bytes.
> >Unaligned stores there are faster than aligned ones, so tell me when
> >aligning stores pays off. Note that when filling the store buffer you
> >must take into account the extra stores needed to make the loop
> >aligned.
> 
> The issue is that the effective write bandwidth can be limited by the
> store buffer if you have multiple write streams. IIRC at least some
> AMD CPUs have to use two entries for stores crossing a cache-line
> boundary.
> 
So can performance be limited by branch misprediction. You need to show
that the likely bottleneck is too many writes and not some other factor.

> Anyway, a look into the optimization manuals will tell you what to do
> and the cost model should follow these recommendations.
> 
These tend to be quite out of date; you typically need to recheck
everything.

Take the Intel 64 and IA-32 Architectures Optimization Reference Manual
from April 2012. A suggestion on store-to-load forwarding there is to
align loads and stores to make it work (alongside P4- and Core 2-specific
suggestions). However this has been false since Nehalem: when I test a
variant of memcpy that is unaligned by one byte, the code is the
following (full benchmark attached):

set:
.LFB0:
	.cfi_startproc
	xor	%rdx, %rdx
	addq	$1, %rsi
	lea	144(%rsi), %rdi
.L:
	movdqu	0(%rsi,%rdx), %xmm0
	movdqu	16(%rsi,%rdx), %xmm1
	...
	movdqu	112(%rsi,%rdx), %xmm7
	movdqu	%xmm0, 0(%rdi,%rdx)
	...
	movdqu	%xmm7, 112(%rdi,%rdx)
	addq	$128, %rdx
	cmp	$5120, %rdx
	jle	.L
	ret

Then there is around a 10% slowdown versus the non-forwarding one:

real	0m2.098s
user	0m2.083s
sys	0m0.003s

However, when in 'lea 144(%rsi), %rdi' I set 143 or another non-multiple
of 16, performance degrades:

real	0m3.495s
user	0m3.480s
sys	0m0.000s

And other suggestions are similarly flimsy unless your target is a
Pentium 4.

> >Also what do you do with loops that contain no store? If I modify the
> >test to
> >
> >int set(int *p, int *q){
> >  int i;
> >  int sum = 0;
> >  for (i=0; i < 128; i++)
> >    sum += 42 * p[i];
> >  return sum;
> >}
> >
> >then it still does aligning.
> 
> Because the cost model simply does not exist for the decision whether
> to peel or not. Patches welcome.
> 
> >There may be a threshold after which aligning the buffer makes sense;
> >then you need to show that the loop spends most of its time on sizes
> >above that threshold.
> >
> >Also, do you have data on how common store-buffer slowdowns are?
> >Without knowing that, you risk making a few loops faster at the
> >expense of the majority, which could well slow the whole application
> >down. It would not surprise me, as these loops may be run mostly on
> >L1-cache data (an assumption on about the same level as assuming that
> >the increased code size fits into the instruction cache).
> >
> >Actually these questions could be answered by a test: first compile
> >SPEC2006 with vanilla gcc -O3, and then with a gcc patched to use
> >unaligned loads. Then the results will tell whether peeling is also
> >good in practice or not.
> 
> It should not be an on or off decision but rather a decision based on
> a cost model.
> 
You cannot decide that from a cost model alone, as performance is
decided by the runtime usage pattern. If you do profiling then you could
do that. Alternatively, you can add a branch that enables peeling only
above a preset threshold.
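To make that threshold-branch idea concrete, here is a minimal C sketch;
it is not from the thread. PEEL_THRESHOLD is a made-up cutoff (the
thread names no number), sum_scaled is a hypothetical stand-in for the
quoted set() reduction, and SSE2 plus an int-aligned pointer are
assumed. Short inputs take the plain scalar loop; long ones pay a few
peeled iterations to reach a 16-byte boundary and then use aligned
vector loads.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical cutoff; would have to come from measurement.  */
enum { PEEL_THRESHOLD = 256 };

int
sum_scaled (const int *p, int n)
{
  int i = 0, sum = 0;

  if (n >= PEEL_THRESHOLD)
    {
      /* Peel scalar iterations until p + i sits on a 16-byte boundary;
         these are the extra iterations the thread says must be counted
         against any win from alignment.  Assumes p is int-aligned.  */
      while (i < n && ((uintptr_t) (p + i) & 15))
        sum += p[i++];

      /* Aligned main loop: _mm_load_si128 can become movdqa here.  */
      __m128i vsum = _mm_setzero_si128 ();
      for (; i + 4 <= n; i += 4)
        vsum = _mm_add_epi32 (vsum,
                              _mm_load_si128 ((const __m128i *) (p + i)));

      int lane[4];
      _mm_storeu_si128 ((__m128i *) lane, vsum);
      sum += lane[0] + lane[1] + lane[2] + lane[3];
    }

  /* Scalar tail, and the whole loop for short inputs where peeling
     would not pay for itself.  */
  for (; i < n; i++)
    sum += p[i];

  return 42 * sum;   /* 42 * sum(p[i]) == sum(42 * p[i]) */
}

Whether the peel ever pays for itself is exactly the open question in
this thread; the cutoff would have to come from profiling or SPEC-style
measurement, not from the cost model alone.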
Attachment: test.c (Description: Text document)
Attachment: set2.s (Description: Text document)