gcc (tried 4.5.1 and 4.6.0) generates absolutely horrid code for some common bitfield accesses due to minimizing the access size.

Trivial test case:

  struct bad_gcc_code_generation {
        unsigned type:6,
                 pos:16,
                 stream:10;
  };

  int show_bug(struct bad_gcc_code_generation *a)
  {
        a->type = 0;
        return a->pos;
  }

will generate code like this on x86-64 with -O2:

        andb    $-64, (%rdi)
        movl    (%rdi), %eax
        shrl    $6, %eax
        movzwl  %ax, %eax
        ret

where the problem is the byte-sized write followed by the word-sized read. Most (all?) modern x86 CPUs will come to a screeching halt when they see a read that hits a store buffer entry but cannot be fully forwarded from it. The penalty can be quite severe, and this is _very_ noticeable in profiles.

This code would be _much_ faster either using an "andl" (making the store size match the next load, and thus forwarding through the store buffer), or by having the load be done first.

(The above code snippet is not the real code I noticed it on, obviously, but real code definitely sees this, and profiling shows very clearly how the 32-bit load from memory basically stops cold due to the partial store buffer hit.)

Using non-native accesses to memory is fine for loads (so narrowing the access for a pure load is fine), but for store or read-modify-write instructions it's almost always a serious performance problem to try to "optimize" the memory operand size to something smaller. Yes, the constants often shrink, but the code becomes *much* slower unless you can guarantee that there are no loads of the original access size that follow the write.
Side note: this is discussed in the Intel optimization manual, section 3.6.4 "Store Forwarding" (under "General Optimization Guidelines" -> "Optimizing Memory Accesses"). I'm sure people are aware of it, but I'm not certain people realize how horrible the performance hit can be. I have to admit to having been a bit surprised myself just _how_ clearly it shows up in profiles.
First of all, confirmed. We already expand to a byte-sized read-modify-write and later do not realize that with a word-sized one we could have combined the later read. Thus we optimize the bitfield store on its own, without looking at its surroundings. I'm not sure where best to address this, other than throwing in again the idea of lowering bitfield accesses early, on trees. Eventually, simply always doing a whole-bitfield read-modify-write at expansion time will be beneficial, at least when not optimizing for size.
So something changed between 4.0.3 and 4.0.4? Or maybe a typo?
(In reply to comment #3)
> So something changed between 4.0.3 and 4.0.4? Or maybe a typo?

I only have 32bit compilers for both and see, for 4.0.3:

show_bug:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movl    (%edx), %eax
        andl    $-64, %eax
        movl    %eax, (%edx)
        shrl    $6, %eax
        popl    %ebp
        movzwl  %ax, %eax
        ret

and for 4.0.4:

show_bug:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        andb    $-64, (%eax)
        movl    (%eax), %eax
        leave
        shrl    $6, %eax
        movzwl  %ax, %eax
        ret
(In reply to comment #4)
> (In reply to comment #3)
> > So something changed between 4.0.3 and 4.0.4? Or maybe a typo?
>
> I only have 32bit compilers for both and see, for 4.0.3:
>
> show_bug:
>         pushl   %ebp
>         movl    %esp, %ebp
>         movl    8(%ebp), %edx
>         movl    (%edx), %eax
>         andl    $-64, %eax
>         movl    %eax, (%edx)
>         shrl    $6, %eax
>         popl    %ebp
>         movzwl  %ax, %eax
>         ret
>
> and for 4.0.4:
>
> show_bug:
>         pushl   %ebp
>         movl    %esp, %ebp
>         movl    8(%ebp), %eax
>         andb    $-64, (%eax)
>         movl    (%eax), %eax
>         leave
>         shrl    $6, %eax
>         movzwl  %ax, %eax
>         ret

Actually the 4.0.4 compiler is x86_64, the code with -m32. The 4.0.3 compiler is i586.

/space/rguenther/install/gcc-4.0.3/libexec/gcc/i686-pc-linux-gnu/4.0.3/cc1 -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=pentiumpro -auxbase t -O2 -version -o t.s

/space/rguenther/install/gcc-4.0.4/libexec/gcc/x86_64-unknown-linux-gnu/4.0.4/cc1 -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=k8 -auxbase t -O2 -version -o t.s

but no -march/-mtune combination makes the bug vanish for the 4.0.4 compiler (maybe a HWI-dependent "optimization").
Very likely PR rtl-optimization/22563 for the regression in 4.0.x and 4.1.x.
(In reply to comment #2)
>
> I'm not sure where to best address this, rather than throwing in again
> the idea of lowering bitfield accesses early on trees.

So my gut feel is that getting rid of the bitfield as early as possible, and turning all bitfield accesses into regular load/shift/mask/store operations, is always the right thing to do.

I also think that doing it with the size that the user specified is generally a good idea, ie I sincerely hope that gcc hasn't thrown away the "unsigned int" part of the type when it does the lowering of the bitfield op.

If gcc has forgotten the underlying type, and only looks at the bitfield size and offset, gcc will likely never do a good job at it unless gcc gets _really_ smart and looks at all the accesses around it and decides "I need to do these as 'int'" just because. (Ie in the example, the "unsigned" base type is as important as the "bits 0..5" range information.)

So I suspect it's better to just do a totally mindless expansion of bitfield accesses early, and then use all the regular optimizations on them, rather than keep them around as bitfields and try to optimize at some higher level.

In an ironic twist, the real program that shows this optimization problem is "sparse" (the kernel source code checker), which can actually "linearize and optimize" the test case itself, and in this case does all of this better than gcc (using its "dump the linearized IR" test program):

  [torvalds@i5 ~]$ ./src/sparse/test-linearize test.c
  test.c:7:5: warning: symbol 'show_bug' was not declared. Should it be static?
  show_bug:
  .L0x7f4cf7b93010:
        <entry-point>
        load.32     %r2 <- 0[%arg1]
        and.32      %r3 <- %r2, $-64
        store.32    %r3 -> 0[%arg1]
        lsr.32      %r7 <- %r3, $6
        cast.32     %r8 <- (16) %r7
        ret.32      %r8

Heh. Sparse may get a lot of other things wrong, but it got this particular case right.
(In reply to comment #7)
> (In reply to comment #2)
> >
> > I'm not sure where to best address this, rather than throwing in again
> > the idea of lowering bitfield accesses early on trees.
>
> So my gut feel is that getting rid of the bitfield as early as possible, and
> turning all bitfield accesses into regular load/shift/mask/store operations is
> always the right thing to do.
>
> I also think that doing it with the size that the user specified is generally a
> good idea, ie I sincerely hope that gcc hasn't thrown away the "unsigned int"
> part of the type when it does the lowering of the bitfield op.

Yeah, I was working on this some time ago.

> If gcc has forgotten the underlying type, and only looks at the bitfield size
> and offset, gcc will likely never do a good job at it unless gcc gets _really_
> smart and looks at all the accesses around it and decides "I need to do these
> as 'int' just because (ie in the example, the "unsigned" base type is as
> important as is the "bits 0..5" range information).

Unfortunately the underlying type isn't easily available (at least I didn't yet find it ...). But I suppose we have to guess anyway, considering targets that don't handle unaligned accesses well, or packed bitfields. Thus, an idea was to use aligned word-size loads/stores and only at the start/end of a structure fall back to smaller accesses (for strict-align targets). I still hope to eventually find that underlying type info somewhere ...

> So I suspect it's better to just do a totally mindless expansion of bitfield
> accesses early, and then use all the regular optimizations on them. Rather than
> keep them around as bitfields and try to optimize at some higher level.

Yep. Same for how we currently deal with unaligned loads on targets that do not support them - the generated code is very similar.
Btw, the branch from the work "some time ago" created

show_bug:
.LFB2:
        movl    (%rdi), %eax
        andl    $-64, %eax
        movl    %eax, (%rdi)
        shrl    $6, %eax
        movzwl  %ax, %eax
        ret

for your testcase.
> Actually the 4.0.4 compiler is x86_64, the code with -m32. The 4.0.3
> compiler is i586.
>
> /space/rguenther/install/gcc-4.0.3/libexec/gcc/i686-pc-linux-gnu/4.0.3/cc1
> -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=pentiumpro -auxbase t -O2
> -version -o t.s
>
> /space/rguenther/install/gcc-4.0.4/libexec/gcc/x86_64-unknown-linux-gnu/4.0.4/cc1
> -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=k8 -auxbase t -O2 -version -o
> t.s
>
> but no -march/tune combination makes the bug vanish for the 4.0.4 compiler
> (maybe a HWI dependent "optimization")

We do have an i386 flag to disable instruction choices that change the memory access size, X86_TUNE_MEMORY_MISMATCH_STALL, and I remember adding code to check that we don't do instruction selection that changes the memory access size (i.e. andl->andb promotion) except when optimizing for size. This code seems to have evaporated from the backend, so I guess it needs re-adding ;) It did not, however, work quite reliably, since the combiner actually does this kind of transform itself from time to time, and I did not get approval for a target hook for that back then (in 2000 or so).

Honza
(In reply to comment #8)
>
> Unfortunately the underlying type isn't easily available (at least I didn't
> yet find it ...). But I suppose we have to guess anyway considering
> targets that don't handle unaligned accesses well or packed bitfields.
> Thus, an idea was to use aligned word-size loads/stores and only at the
> start/end of a structure fall back to smaller accesses (for strict align
> targets).

That sounds fine. The only reason to bother with the "underlying type" is that I suspect it could be possible for educated programmers to use it as a code generation hint. IOW, if all the individual fields end up fitting nicely in "char", using that as a base type (even if the _total_ fields don't fit in a single byte) might be a good hint for the compiler that it can/should use byte accesses and small constants.

But using the biggest aligned word size is probably equally good in practice.

And if you end up narrowing the types on _reads_, I think that's fine on x86. I forget the exact store buffer forwarding rules (and they probably vary a bit between different microarchitectures anyway), but I think almost all of them support forwarding a larger store into a smaller (aligned) load. It's just the writes that should normally not be narrowed.

(Of course, sometimes you may really want to narrow it. Replacing a

  andl $0xffffff00,(%rax)

with a simple

  movb $0,(%rax)

is certainly a very tempting optimization, but it really only works if there are no subsequent word-sized loads that would get fouled by the write buffer entry.)
Well, there is also the expander, which can and often does increase the size of the accesses; see e.g. PR48124 for more details. And e.g. for the C++0x memory model, as well as for -fopenmp or, I guess, kernel SMP, the access size must never grow into following non-bitfield fields, and the bitfield access lowering has to take that into account.
On Wed, 20 Apr 2011, rguenth at gcc dot gnu.org wrote:

> > If gcc has forgotten the underlying type, and only looks at the bitfield size
> > and offset, gcc will likely never do a good job at it unless gcc gets _really_
> > smart and looks at all the accesses around it and decides "I need to do these
> > as 'int' just because (ie in the example, the "unsigned" base type is as
> > important as is the "bits 0..5" range information).
>
> Unfortunately the underlying type isn't easily available (at least I didn't
> yet find it ...). But I suppose we have to guess anyway considering
> targets that don't handle unaligned accesses well or packed bitfields.
> Thus, an idea was to use aligned word-size loads/stores and only at the
> start/end of a structure fall back to smaller accesses (for strict align
> targets). I still hope to eventually find that underlying type info
> somewhere ...

The underlying type is used for -fstrict-volatile-bitfields, so it must be available in the cases needed for that option.
On Thu, 21 Apr 2011, joseph at codesourcery dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696
>
> --- Comment #13 from joseph at codesourcery dot com <joseph at codesourcery dot com> 2011-04-21 15:43:27 UTC ---
> On Wed, 20 Apr 2011, rguenth at gcc dot gnu.org wrote:
>
> > > If gcc has forgotten the underlying type, and only looks at the bitfield size
> > > and offset, gcc will likely never do a good job at it unless gcc gets _really_
> > > smart and looks at all the accesses around it and decides "I need to do these
> > > as 'int' just because (ie in the example, the "unsigned" base type is as
> > > important as is the "bits 0..5" range information).
> >
> > Unfortunately the underlying type isn't easily available (at least I didn't
> > yet find it ...). But I suppose we have to guess anyway considering
> > targets that don't handle unaligned accesses well or packed bitfields.
> > Thus, an idea was to use aligned word-size loads/stores and only at the
> > start/end of a structure fall back to smaller accesses (for strict align
> > targets). I still hope to eventually find that underlying type info
> > somewhere ...
>
> The underlying type is used for -fstrict-volatile-bitfields, so it must be
> available in the cases needed for that option.

It relies on the mode, thus the start of the bitfield is at bitpos % mode-size and its size is mode-size, if it is aligned. I'm not sure -fstrict-volatile-bitfields works for the unaligned/packed case.
Looking into bitfield lowering again.
Re-confirmed for GCC 6. Related to PR71509.
I think some of this is due to SLOW_BYTE_ACCESS being set to 0.
And re-confirmed a few years later. Sorry :/

So far we've improved the middle-end representation for bitfields: we have an underlying field covering what memory models require to be seen as a single entity, and we've gained simpler IL for bitfield inserts. But the actual lowering hasn't materialized yet, and we'd still need to deal with optimizations turning larger into smaller stores when we know we can do it (which is difficult - we've seen STLF failures across TUs in benchmarks in the context of vectorization). IIRC Andrew was most recently working on something like bitfield lowering.