| Summary: | Horrible bitfield code generation on x86 | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Linus Torvalds <torvalds> |
| Component: | rtl-optimization | Assignee: | Richard Biener <rguenth> |
| Status: | ASSIGNED | | |
| Severity: | normal | CC: | dimhen, ebotcazou, pinskia, rguenth, zsojka |
| Priority: | P3 | Keywords: | missed-optimization |
| Version: | 4.5.2 | | |
| Target Milestone: | --- | | |
| See Also: | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107601, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114289 | | |
| Host: | | Target: | x86_64-*-*, i?86-*-* |
| Build: | | Known to work: | 3.3.3, 3.4.6, 4.0.3, 4.1.0 |
| Known to fail: | 4.0.4, 4.1.2, 4.3.5, 4.5.2, 4.6.0, 4.7.0 | Last reconfirmed: | 2021-02-11 00:00:00 |
| Bug Depends on: | | | |
| Bug Blocks: | 19466, 71509 | | |
Description
Linus Torvalds, 2011-04-20 04:25:05 UTC
Side note: this is discussed in the Intel optimization manual, section 3.6.4 "Store Forwarding", under "General Optimization Guidelines" -> "Optimizing Memory Accesses". I'm sure people are aware of it, but I'm not certain people realize how severe the performance hit can be. I have to admit to having been a bit surprised myself by just _how_ clearly it shows up in profiles.

---

Comment #2:

First of all, confirmed. We already expand to a byte-sized read-modify-write and later do not realize that with a word-sized one we could combine the later read. Thus we optimize the bitfield store on its own, without looking at its surroundings.

I'm not sure where best to address this, other than raising again the idea of lowering bitfield accesses early, on trees. Eventually, simply always doing a whole-bitfield read-modify-write at expansion time will be beneficial, at least when not optimizing for size.

---

Comment #3:

So something changed between 4.0.3 and 4.0.4? Or maybe a typo?

---

Comment #4:

(In reply to comment #3)
> So something changed between 4.0.3 and 4.0.4? Or maybe a typo?

I only have 32-bit compilers for both. For 4.0.3 I see:

    show_bug:
            pushl   %ebp
            movl    %esp, %ebp
            movl    8(%ebp), %edx
            movl    (%edx), %eax
            andl    $-64, %eax
            movl    %eax, (%edx)
            shrl    $6, %eax
            popl    %ebp
            movzwl  %ax, %eax
            ret

and for 4.0.4:

    show_bug:
            pushl   %ebp
            movl    %esp, %ebp
            movl    8(%ebp), %eax
            andb    $-64, (%eax)
            movl    (%eax), %eax
            leave
            shrl    $6, %eax
            movzwl  %ax, %eax
            ret

---

Comment #5:

(In reply to comment #4)

Actually the 4.0.4 compiler is x86_64 (the code above is with -m32).
The 4.0.3 compiler is i586.

    /space/rguenther/install/gcc-4.0.3/libexec/gcc/i686-pc-linux-gnu/4.0.3/cc1
      -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=pentiumpro -auxbase t
      -O2 -version -o t.s

    /space/rguenther/install/gcc-4.0.4/libexec/gcc/x86_64-unknown-linux-gnu/4.0.4/cc1
      -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=k8 -auxbase t
      -O2 -version -o t.s

but no -march/-mtune combination makes the bug vanish for the 4.0.4 compiler (maybe a HWI-dependent "optimization").

---

Comment #6:

Very likely PR rtl-optimization/22563 for the regression in 4.0.x and 4.1.x.

---

Comment #7:

(In reply to comment #2)
> I'm not sure where best to address this, other than raising again
> the idea of lowering bitfield accesses early, on trees.

So my gut feeling is that getting rid of the bitfield as early as possible, turning all bitfield accesses into regular load/shift/mask/store operations, is always the right thing to do.

I also think that doing it with the size the user specified is generally a good idea, i.e. I sincerely hope that gcc hasn't thrown away the "unsigned int" part of the type when it lowers the bitfield op.

If gcc has forgotten the underlying type and only looks at the bitfield size and offset, it will likely never do a good job unless it gets _really_ smart, looks at all the accesses around it, and decides "I need to do these as 'int'" just from that context (i.e. in the example, the "unsigned" base type is as important as the "bits 0..5" range information).

So I suspect it's better to do a totally mindless expansion of bitfield accesses early, and then use all the regular optimizations on them, rather than keep them around as bitfields and try to optimize at some higher level.
In an ironic twist, the real program that shows this optimization problem is "sparse" (the kernel source-code checker), which can itself linearize and optimize the test case, and in this case does it all better than gcc (using its "dump the linearized IR" test program):

    [torvalds@i5 ~]$ ./src/sparse/test-linearize test.c
    test.c:7:5: warning: symbol 'show_bug' was not declared. Should it be static?
    show_bug:
    .L0x7f4cf7b93010:
            <entry-point>
            load.32     %r2 <- 0[%arg1]
            and.32      %r3 <- %r2, $-64
            store.32    %r3 -> 0[%arg1]
            lsr.32      %r7 <- %r3, $6
            cast.32     %r8 <- (16) %r7
            ret.32      %r8

Heh. Sparse may get a lot of other things wrong, but it got this particular case right.

---

Comment #8:

(In reply to comment #7)
> I also think that doing it with the size that the user specified is
> generally a good idea ...

Yeah, I was working on this some time ago.

> If gcc has forgotten the underlying type, and only looks at the bitfield
> size and offset, gcc will likely never do a good job at it ...

Unfortunately the underlying type isn't easily available (at least I haven't found it yet). But I suppose we have to guess anyway, considering targets that don't handle unaligned accesses well, or packed bitfields.
Thus, an idea was to use aligned word-size loads/stores, and only fall back to smaller accesses at the start/end of a structure (for strict-alignment targets). I still hope to eventually find that underlying-type info somewhere ...

> So I suspect it's better to just do a totally mindless expansion of bitfield
> accesses early, and then use all the regular optimizations on them.

Yep. The same goes for how we currently deal with unaligned loads on targets that do not support them; the generated code is very similar.

Btw, the branch from the work "some time ago" produced

    show_bug:
    .LFB2:
            movl    (%rdi), %eax
            andl    $-64, %eax
            movl    %eax, (%rdi)
            shrl    $6, %eax
            movzwl  %ax, %eax
            ret

for your testcase.

---

Comment from Jan Hubicka (replying to comment #5):

> but no -march/tune combination makes the bug vanish for the 4.0.4 compiler
> (maybe a HWI dependent "optimization")
We do have an i386 tuning flag to disable instruction choices that change the memory access size, X86_TUNE_MEMORY_MISMATCH_STALL, and I remember adding code to check that we don't do instruction selection that changes the memory access size (i.e. the andl -> andb promotion) except when optimizing for size. This code seems to have evaporated from the backend, so I guess it needs re-adding ;) It did not, however, work entirely reliably, since the combiner itself does this kind of transform from time to time, and I did not get approval for a target hook for that back then (in 2000 or so).

Honza
(In reply to comment #8)
> Unfortunately the underlying type isn't easily available (at least I didn't
> yet find it ...). But I suppose we have to guess anyway considering
> targets that don't handle unaligned accesses well or packed bitfields.
> Thus, an idea was to use aligned word-size loads/stores and only at the
> start/end of a structure fall back to smaller accesses (for strict align
> targets).

That sounds fine. The only reason to bother with the "underlying type" is that I suspect educated programmers could use it as a code-generation hint. IOW, if all the individual fields fit nicely in "char", using that as the base type (even if the fields in _total_ don't fit in a single byte) might be a good hint to the compiler that it can/should use byte accesses and small constants. But using the biggest aligned word size is probably equally good in practice.

And if you end up narrowing the types on _reads_, I think that's fine on x86. I forget the exact store-buffer forwarding rules (and they probably vary a bit between microarchitectures anyway), but I think almost all of them support forwarding a larger store into a smaller (aligned) load. It's just the writes that should normally not be narrowed.

(Of course, sometimes you may really want to narrow a write. Replacing

    andl $0xffffff00,(%rax)

with a simple

    movb $0,(%rax)

is certainly a very tempting optimization, but it only works if there are no subsequent word-sized loads that would get fouled by the write-buffer entry.)

---

Well, there is also the expander, which can and often does increase the size of the accesses; see e.g. PR48124 for more details. And for the C++0x memory model, as well as -fopenmp (or, I guess, kernel SMP), the access size must never grow into following non-bitfield fields, and the bitfield-access lowering has to take that into account.

---

On Wed, 20 Apr 2011, rguenth at gcc dot gnu.org wrote:
> > If gcc has forgotten the underlying type, and only looks at the bitfield size
> > and offset, gcc will likely never do a good job at it unless gcc gets _really_
> > smart and looks at all the accesses around it and decides "I need to do these
> > as 'int' just because (ie in the example, the "unsigned" base type is as
> > important as is the "bits 0..5" range information).
>
> Unfortunately the underlying type isn't easily available (at least I didn't
> yet find it ...). But I suppose we have to guess anyway considering
> targets that don't handle unaligned accesses well or packed bitfields.
> Thus, an idea was to use aligned word-size loads/stores and only at the
> start/end of a structure fall back to smaller accesses (for strict align
> targets). I still hope to eventually find that underlying type info
> somewhere ...
The underlying type is used for -fstrict-volatile-bitfields, so it must be
available in the cases needed for that option.
On Thu, 21 Apr 2011, joseph at codesourcery dot com wrote:

> --- Comment #13 from joseph at codesourcery dot com 2011-04-21 15:43:27 UTC ---
> The underlying type is used for -fstrict-volatile-bitfields, so it must be
> available in the cases needed for that option.
It relies on the mode, thus the start of the bitfield is at
bitpos % mode-size and its size is mode-size if it is aligned.
I'm not sure -fstrict-volatile-bitfields works for the unaligned/packed
case.
Looking into bitfield lowering again.

---

Re-confirmed for GCC 6. Related to PR71509.

---

I think some of this is due to SLOW_BYTE_ACCESS being set to 0.

---

And re-confirmed a few years later. Sorry :/

So far we've improved the middle-end representation for bitfields: we have an underlying field covering what memory models require to be seen as a single entity, and we've gained simpler IL for bitfield inserts. But the actual lowering hasn't materialized yet, and we'd still need to deal with optimizations turning larger stores into smaller ones when we know we can (which is difficult; we've seen STLF failures across TUs in benchmarks, in the context of vectorization).

IIRC Andrew was most recently working on something like bitfield lowering.