I notice a nasty load-hit-store in the following test case built with -O2 on ppc64le. The load is larger than the store, which means on many CPUs the store cannot be forwarded and the load has to wait until the store completes in the cache.

struct sk_buff a;

struct sk_buff {
        long blah;
        char csum_level:2;
};

void __skb_decr_checksum_unnecessary(struct sk_buff *p1)
{
        p1->csum_level--;
}

void func(void);

void tcp_v4_rcv(void)
{
        struct sk_buff *b = &a;

        func();
        __skb_decr_checksum_unnecessary(b);
        func();
        __skb_decr_checksum_unnecessary(b);
}

We end up with this odd situation where we do both a byte and a doubleword load:

# objdump -d testcase.o | grep r31
  58:   08 00 5f e9     ld      r10,8(r31)
  5c:   08 00 3f 89     lbz     r9,8(r31)
  70:   08 00 3f 99     stb     r9,8(r31)
  7c:   08 00 5f e9     ld      r10,8(r31)
  84:   08 00 3f 89     lbz     r9,8(r31)
  a0:   08 00 3f 99     stb     r9,8(r31)
Bitfield extraction on ppc64le goes the

  /* Try loading part of OP0 into a register and extracting the bitfield
     from that.  */
  unsigned HOST_WIDE_INT bitpos;
  rtx xop0 = adjust_bit_field_mem_for_reg (pattern, op0, bitsize, bitnum,
                                           0, 0, tmode, &bitpos);

way, which ends up generating the DImode load using the fact that the struct alignment adds padding after csum_level. The store path, OTOH, ends up honoring the C++ memory model, which says to access the bitfield in its declared type (IIRC?), and the bit region via DECL_BIT_FIELD_REPRESENTATIVE is of size 8 (because with C++ inheritance that tail padding can be re-used).

It looks like we didn't adjust the bitfield read paths for the memory model because in practice it doesn't matter for correctness, and it may generate larger/slower code not to do loads in larger types on some archs. This leads to the observed load-store / store-load issues.

Note that we conservatively compute the extent for DECL_BIT_FIELD_REPRESENTATIVE by preferring smaller modes. There's some ??? in finish_bitfield_representative, and the above remark about tail padding re-use is only implemented via preferring smaller modes. Thus when adding a 'long foo' after csum_level the representative doesn't change to 64-bit width but stays at 8 bits (both are valid under the C++ memory model).

Note that the proposed simple lowering of bitfield accesses on GIMPLE would do accesses of DECL_BIT_FIELD_REPRESENTATIVE and thus in this case use byte accesses.

I suppose we want to be less conservative about DECL_BIT_FIELD_REPRESENTATIVE and leave it up to the target how to do the actual accesses.

Widening the representative generates

__skb_decr_checksum_unnecessary:
        ld 9,8(3)
        addi 10,9,3
        rldicr 9,9,0,61
        rldicl 10,10,0,62
        or 9,9,10
        std 9,8(3)
        blr
(In reply to Richard Biener from comment #1)
> It looks like we didn't adjust the bitfield read paths for the mem model
> because in practice it doesn't matter and it may generate larger/slower code
> not to do loads in larger types on some archs.

It still generates correct (or at least working) code, yeah.

> Note that we conservatively compute the extent for
> DECL_BIT_FIELD_REPRESENTATIVE
> by preferring smaller modes. There's some ??? in
> finish_bitfield_representative
> and the above remark about tail padding re-use is only implemented via
> preferring smaller modes. Thus when adding a 'long foo' after csum_level the
> representative doesn't change to 64-bit width but stays at 8 bits (both are
> valid from the C++ memory model).

Smaller modes are slightly better here: they allow more efficient insn sequences to manipulate the data, esp. on big endian. For example on PowerPC, on BE, our 16-bit immediates (in e.g. ior) can handle QImode and HImode, but not DImode; and e.g. the rlwinm insn can handle SImode but not DImode. In general, with smaller modes you do not need to worry about preserving the other bits.

> Note that the proposed simple lowering of bitfield accesses on GIMPLE would
> do accesses of DECL_BIT_FIELD_REPRESENTATIVE and thus in this case use byte
> accesses.

That would be lovely :-)

> I suppose we want to be less conservative about DECL_BIT_FIELD_REPRESENTATIVE
> and leave it up to the target how to do the actual accesses.

Maybe. Always smallest is a good choice for PowerPC at least.
> Widening the representative generates
>
> __skb_decr_checksum_unnecessary:
>         ld 9,8(3)
>         addi 10,9,3
>         rldicr 9,9,0,61
>         rldicl 10,10,0,62
>         or 9,9,10
>         std 9,8(3)
>         blr

Making all accesses use QImode yields

__skb_decr_checksum_unnecessary:
        lbz 9,8(3)
        addi 10,9,3
        rlwinm 9,9,0,0,29
        rlwinm 10,10,0,30,31
        or 9,9,10
        stb 9,8(3)
        blr

(the rlwinm's and the or could be a rlwimi together; not sure why they aren't merged) and on BE it gives

.L.__skb_decr_checksum_unnecessary:
        lbz 9,8(3)
        srwi 10,9,6
        rlwinm 9,9,0,26,31
        addi 10,10,3
        rlwinm 10,10,6,24,25
        or 9,9,10
        stb 9,8(3)
        blr

which could be just

        lbz 9,8(3)
        addi 9,9,192
        stb 9,8(3)
        blr
On Tue, 14 Jun 2016, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509
>
> --- Comment #2 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #1)
> > It looks like we didn't adjust the bitfield read paths for the mem model
> > because in practice it doesn't matter and it may generate larger/slower code
> > not to do loads in larger types on some archs.
>
> It still generates correct (or at least working) code, yeah.
>
> > Note that we conservatively compute the extent for
> > DECL_BIT_FIELD_REPRESENTATIVE
> > by preferring smaller modes. There's some ??? in
> > finish_bitfield_representative
> > and the above remark about tail padding re-use is only implemented via
> > preferring smaller modes. Thus when adding a 'long foo' after csum_level the
> > representative doesn't change to 64-bit width but stays at 8 bits (both are
> > valid from the C++ memory model).
>
> Smaller modes are slightly better here, they allow more efficient insn
> sequences to manipulate the data, esp. on big endian. For example on
> PowerPC, on BE, our 16-bit immediates (in e.g. ior) can handle QImode
> and HImode, but not DImode; and e.g. the rlwinm insn can handle SImode
> but not DImode. In general, with smaller modes you do not need to
> worry about preserving the other bits.
>
> > Note that the proposed simple lowering of bitfield accesses on GIMPLE would
> > do accesses of DECL_BIT_FIELD_REPRESENTATIVE and thus in this case use byte
> > accesses.
>
> That would be lovely :-)

On my list for GCC 7 ;-)

> > I suppose we want to be less conservative about DECL_BIT_FIELD_REPRESENTATIVE
> > and leave it up to the target how to do the actual accesses.
>
> Maybe. Always smallest is a good choice for PowerPC at least.

Note the middle-end in extract_bit_field says using a larger mode is always better because it is better for CSE (not sure about that argument if you only look at a single stmt).
Created attachment 39683 [details]
Another bitop LHS test case

Here's another issue found in the Linux kernel. Seems like this should be a single lwz/stw, since the union of counter and the bitops completely overlap.

The halfword store followed by a word load is going to prevent it from store forwarding.

0000000000000000 <set_page_slub_counters>:
   0:   00 00 03 81     lwz     r8,0(r3)
   4:   20 00 89 78     clrldi  r9,r4,32
   8:   c2 0f 2a 79     rldicl  r10,r9,33,31
   c:   00 f8 48 51     rlwimi  r8,r10,31,0,0
  10:   5e 00 2a 55     rlwinm  r10,r9,0,1,15
  14:   00 00 03 91     stw     r8,0(r3)
  18:   00 00 83 b0     sth     r4,0(r3)
  1c:   00 00 42 60     ori     r2,r2,0
  20:   00 00 23 81     lwz     r9,0(r3)
  24:   00 04 29 55     rlwinm  r9,r9,0,16,0
  28:   78 53 29 7d     or      r9,r9,r10
  2c:   00 00 23 91     stw     r9,0(r3)
  30:   20 00 80 4e     blr
This test case seems like it may be related. It does the right thing and uses all 4-byte ops when the 8-byte alignment is removed. I post it here because it may not always be the case that the smallest op is best.

struct s {
        unsigned long align1;
        union {
                unsigned int blah;
                unsigned int a:1;
        };
};

void test2(struct s *s)
{
        s->blah = 100;
        if (s->a)
                asm volatile("#blah");
}

Generates (gcc 7.0.1)

test2:
        li 9,100
        stw 9,8(3)
        ld 9,8(3)
        andi. 9,9,0x1
        beqlr 0
#APP
 # 29 "a.c" 1
        #blah
 # 0 "" 2
#NO_APP
        blr

There is a more general case of mismatched load and store sizes in unions with different-size types, but in this case the sizes could be matched, I think.
Doing an 8-byte load of something that was stored as 4 bytes immediately before will cause flushes and stalls... Yeah, it could use a 4-byte load here AFAICS.
Hi Richard, I have been trying to figure out this issue recently, but I have some questions that need your help. What is the status of the "proposed simple lowering of bitfield accesses on GIMPLE", please?

For "less conservative about DECL_BIT_FIELD_REPRESENTATIVE", do you mean we choose the larger mode at the GIMPLE stage, and leave the decision to the target? Thanks.

PS: As a newbie, can you tell me how you did the "Widening the representative" :)? I am a bit confused about which mode is best and where to process it; sometimes a bigger mode is better and sometimes a smaller mode is better (from Segher's comments).
On Fri, 15 Mar 2019, luoxhu at cn dot ibm.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509
>
> Xiong Hu XS Luo <luoxhu at cn dot ibm.com> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |guojiufu at gcc dot gnu.org,
>                    |                            |rguenth at gcc dot gnu.org
>
> --- Comment #7 from Xiong Hu XS Luo <luoxhu at cn dot ibm.com> ---
> Hi Richard, trying to figure out the issue recently, but get some questions
> need your help. How is the status of the "proposed simple lowering of
> bitfield accesses on GIMPLE", please?

There are finished patches in a few variants, but all of them show issues in the testsuite, mostly around missed optimizations IIRC. It has been quite some time since I last looked at this, but IIRC Andrew Pinski said he's got sth for GCC 10, so you might want to ask him.

> for "less conservative about DECL_BIT_FIELD_REPRESENTATIVE", do you mean we
> choose large mode in GIMPLE stage, and make the decision when in target?
> Thanks.

DECL_BIT_FIELD_REPRESENTATIVE was mainly added for strict volatile bitfield accesses as well as to avoid data races as specified by the C++ memory model. So in some cases we might be able to widen the accesses (and we can shorten them in any case if we compute this is a good idea).

> PS: As a newbie, can you tell how did you do to "Widening the representative"
> :), I am a bit confused about the best mode and where to process it, sometimes
> big mode is better and sometimes smaller mode is better (from Segher's
> comments).

Yeah, get_best_mode is a big pile of *** and this issue is very hard to compute locally. I suppose for the best decision you'd need to look at the whole program (or at least the whole function) in a flow-sensitive manner. One of the lowering attempts I did was to integrate the lowering with the SRA pass, which is said to eventually get flow-sensitive analysis at some point (but at least does whole-function analysis already).
(In reply to rguenther@suse.de from comment #8)
> On Fri, 15 Mar 2019, luoxhu at cn dot ibm.com wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509
> >
> > Xiong Hu XS Luo <luoxhu at cn dot ibm.com> changed:
> >
> >            What    |Removed                     |Added
> > ----------------------------------------------------------------------------
> >                  CC|                            |guojiufu at gcc dot gnu.org,
> >                    |                            |rguenth at gcc dot gnu.org
> >
> > --- Comment #7 from Xiong Hu XS Luo <luoxhu at cn dot ibm.com> ---
> > Hi Richard, trying to figure out the issue recently, but get some questions
> > need your help. How is the status of the "proposed simple lowering of
> > bitfield accesses on GIMPLE", please?
>
> There are finished patches in a few variants but all of them show issues
> in the testsuite, mostly around missed optimizations IIRC. It has been
> quite some time since I last looked at this but IIRC Andrew Pinski said
> he's got sth for GCC 10 so you might want to ask him.

Yes I do, I am planning on submitting the new pass once GCC 10 has opened up.
(In reply to Andrew Pinski from comment #9)
> (In reply to rguenther@suse.de from comment #8)
> > On Fri, 15 Mar 2019, luoxhu at cn dot ibm.com wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509
> > >
> > > Xiong Hu XS Luo <luoxhu at cn dot ibm.com> changed:
> > >
> > >            What    |Removed                     |Added
> > > ----------------------------------------------------------------------------
> > >                  CC|                            |guojiufu at gcc dot gnu.org,
> > >                    |                            |rguenth at gcc dot gnu.org
> > >
> > > --- Comment #7 from Xiong Hu XS Luo <luoxhu at cn dot ibm.com> ---
> > > Hi Richard, trying to figure out the issue recently, but get some questions
> > > need your help. How is the status of the "proposed simple lowering of
> > > bitfield accesses on GIMPLE", please?
> >
> > There are finished patches in a few variants but all of them show issues
> > in the testsuite, mostly around missed optimizations IIRC. It has been
> > quite some time since I last looked at this but IIRC Andrew Pinski said
> > he's got sth for GCC 10 so you might want to ask him.
>
> Yes I do, I am planing on submitting the new pass once GCC 10 has opened up.

Hi Pinski, has your patch been merged into GCC 10 yet? Thanks.
(In reply to Anton Blanchard from comment #4)
> Created attachment 39683 [details]
> Another bitop LHS test case
>
> Here's another issue found in the Linux kernel. Seems like this should be a
> single lwz/stw since the union of counter and the bitops completely overlap.
>
> The halfword store followed by a word load is going to prevent it from store
> forwarding.
>
> 0000000000000000 <set_page_slub_counters>:
>    0:   00 00 03 81     lwz     r8,0(r3)
>    4:   20 00 89 78     clrldi  r9,r4,32
>    8:   c2 0f 2a 79     rldicl  r10,r9,33,31
>    c:   00 f8 48 51     rlwimi  r8,r10,31,0,0
>   10:   5e 00 2a 55     rlwinm  r10,r9,0,1,15
>   14:   00 00 03 91     stw     r8,0(r3)
>   18:   00 00 83 b0     sth     r4,0(r3)
>   1c:   00 00 42 60     ori     r2,r2,0
>   20:   00 00 23 81     lwz     r9,0(r3)
>   24:   00 04 29 55     rlwinm  r9,r9,0,16,0
>   28:   78 53 29 7d     or      r9,r9,r10
>   2c:   00 00 23 91     stw     r9,0(r3)
>   30:   20 00 80 4e     blr

This case is already fixed on the latest gcc 10 (the issues in __skb_decr_checksum_unnecessary from Anton Blanchard and test2 from Nicholas Piggin still exist).

gcc version 10.0.1 20200210

objdump -d set_page_slub_counters.o

set_page_slub_counters.o:     file format elf64-powerpcle

Disassembly of section .text:

0000000000000000 <set_page_slub_counters>:
   0:   22 84 89 78     rldicl  r9,r4,48,48
   4:   00 00 83 b0     sth     r4,0(r3)
   8:   02 00 23 b1     sth     r9,2(r3)
   c:   20 00 80 4e     blr
But it could do just

        stw r4,0(r3)

(on LE; and with a rotate first, on BE).
(In reply to Segher Boessenkool from comment #12)
> But it could do just
>
>         stw r4,0(r3)
>
> (on LE; and with a rotate first, on BE).

Thanks for catching this; that missed optimization is not related to load-hit-store. I will investigate why the store-merging pass failed to merge the two halfword stores.
Taking this for GCC 13. I am going to finally submit the bit-field lowering next year.