Bug 48696

Summary: Horrible bitfield code generation on x86
Product: gcc
Component: rtl-optimization
Status: ASSIGNED
Severity: normal
Priority: P3
Version: 4.5.2
Target Milestone: ---
Reporter: Linus Torvalds <torvalds>
Assignee: Richard Biener <rguenth>
CC: dimhen, ebotcazou, pinskia, rguenth, zsojka
Keywords: missed-optimization
Target: x86_64-*-*, i?86-*-*
Known to work: 3.3.3, 3.4.6, 4.0.3, 4.1.0
Known to fail: 4.0.4, 4.1.2, 4.3.5, 4.5.2, 4.6.0, 4.7.0
Last reconfirmed: 2021-02-11 00:00:00
See Also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107601
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114289
Bug Blocks: 19466, 71509

Description Linus Torvalds 2011-04-20 04:25:05 UTC
gcc (tried 4.5.1 and 4.6.0) generates absolutely horrid code for some common bitfield accesses due to minimizing the access size.

Trivial test case:

  struct bad_gcc_code_generation {
	unsigned type:6,
		 pos:16,
		 stream:10;
  };

  int show_bug(struct bad_gcc_code_generation *a)
  {
	a->type = 0;
	return a->pos;
  }

will generate code like this on x86-64 with -O2:

	andb	$-64, (%rdi)
	movl	(%rdi), %eax
	shrl	$6, %eax
	movzwl	%ax, %eax
	ret

where the problem is the byte-sized write followed by a word-sized read.

Most (all?) modern x86 CPUs will come to a screeching halt when they see a read that hits a store buffer entry but cannot be fully forwarded from it. The penalty can be quite severe, and this is _very_ noticeable in profiles.

This code would be _much_ faster either by using an "andl" (making the store size match the following load, so it can forward through the store buffer), or by having the load be done first.

(The above code snippet is not the real code I noticed this on, obviously, but real code definitely sees this, and profiling shows very clearly how the 32-bit load from memory basically stops cold due to the partial store buffer hit.)

Using non-native accesses to memory is fine for loads (so narrowing the access for a pure load is fine), but for store or read-modify-write instructions it's almost always a serious performance problem to try to "optimize" the memory operand size to something smaller.

Yes, the constants often shrink, but the code becomes *much* slower unless you can guarantee that there are no loads of the original access size that follow the write.
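
For contrast, here is a minimal sketch (not from the original report; the function name is made up) of a pure load from the same structure, where narrowing the access is harmless since there is no preceding store to forward from:

  int read_type_only(struct bad_gcc_code_generation *a)
  {
        return a->type;   /* a byte-sized load is fine here */
  }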
Comment 1 Linus Torvalds 2011-04-20 04:28:18 UTC
Side note, this is talked about in the Intel optimization manual

  3.6.4 Store Forwarding

Under "General Optimization Guidelines" -> "Optimizing Memory Accesses" -> "Store Forwarding".

I'm sure people are aware of it, but I'm not certain people realize how horrible the performance hit can be. I have to admit to having been a bit surprised myself just _how_ clearly it shows up in profiles.
Comment 2 Richard Biener 2011-04-20 09:43:28 UTC
First of all, confirmed.

We already expand this to a byte-sized read-modify-write and later do not realize
that with a word-sized one we could combine it with the later read.  Thus, we
optimize the bitfield store on its own, without looking at the surrounding accesses.

I'm not sure where best to address this, other than throwing in again
the idea of lowering bitfield accesses early, on trees.

Eventually, simply always doing a whole-bitfield read-modify-write at expansion
time will be beneficial, at least when not optimizing for size.
Comment 3 Eric Botcazou 2011-04-20 11:47:00 UTC
So something changed between 4.0.3 and 4.0.4?  Or maybe a typo?
Comment 4 Richard Biener 2011-04-20 12:11:56 UTC
(In reply to comment #3)
> So something changed between 4.0.3 and 4.0.4?  Or maybe a typo?

I only have 32bit compilers for both and see, for 4.0.3:

show_bug:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movl    (%edx), %eax
        andl    $-64, %eax
        movl    %eax, (%edx)
        shrl    $6, %eax
        popl    %ebp
        movzwl  %ax, %eax
        ret

and for 4.0.4:

show_bug:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        andb    $-64, (%eax)
        movl    (%eax), %eax
        leave
        shrl    $6, %eax
        movzwl  %ax, %eax
        ret
Comment 5 Richard Biener 2011-04-20 12:15:14 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > So something changed between 4.0.3 and 4.0.4?  Or maybe a typo?
> 
> I only have 32bit compilers for both and see, for 4.0.3:
> 
> show_bug:
>         pushl   %ebp
>         movl    %esp, %ebp
>         movl    8(%ebp), %edx
>         movl    (%edx), %eax
>         andl    $-64, %eax
>         movl    %eax, (%edx)
>         shrl    $6, %eax
>         popl    %ebp
>         movzwl  %ax, %eax
>         ret
> 
> and for 4.0.4:
> 
> show_bug:
>         pushl   %ebp
>         movl    %esp, %ebp
>         movl    8(%ebp), %eax
>         andb    $-64, (%eax)
>         movl    (%eax), %eax
>         leave
>         shrl    $6, %eax
>         movzwl  %ax, %eax
>         ret

Actually the 4.0.4 compiler is x86_64 (the code above is with -m32), while the
4.0.3 compiler is i586.

 /space/rguenther/install/gcc-4.0.3/libexec/gcc/i686-pc-linux-gnu/4.0.3/cc1 -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=pentiumpro -auxbase t -O2 -version -o t.s

 /space/rguenther/install/gcc-4.0.4/libexec/gcc/x86_64-unknown-linux-gnu/4.0.4/cc1 -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=k8 -auxbase t -O2 -version -o t.s

but no -march/-mtune combination makes the bug vanish for the 4.0.4 compiler
(maybe a HOST_WIDE_INT-dependent "optimization").
Comment 6 Eric Botcazou 2011-04-20 12:59:40 UTC
Very likely PR rtl-optimization/22563 for the regression in 4.0.x and 4.1.x.
Comment 7 Linus Torvalds 2011-04-20 15:30:17 UTC
(In reply to comment #2)
> 
> I'm not sure where to best address this, rather than throwing in again
> the idea of lowering bitfield accesses early on trees.

So my gut feeling is that getting rid of the bitfield as early as possible and turning all bitfield accesses into regular load/shift/mask/store operations is always the right thing to do.

I also think that doing it with the size that the user specified is generally a good idea, i.e. I sincerely hope that gcc hasn't thrown away the "unsigned int" part of the type when it does the lowering of the bitfield op.

If gcc has forgotten the underlying type and only looks at the bitfield size and offset, gcc will likely never do a good job at it unless it gets _really_ smart and looks at all the accesses around it and decides "I need to do these as 'int'" just because of the surroundings (i.e. in the example, the "unsigned" base type is as important as the "bits 0..5" range information).

So I suspect it's better to just do a totally mindless expansion of bitfield accesses early, and then use all the regular optimizations on them, rather than keep them around as bitfields and try to optimize at some higher level.
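
For illustration, a hand-written C sketch (not compiler output; the function name is made up and strict-aliasing niceties are ignored) of what such a mindless expansion of the original testcase could look like when the declared "unsigned int" is kept as the access unit:

  int show_bug_lowered(struct bad_gcc_code_generation *a)
  {
        unsigned int *word = (unsigned int *)a;  /* the whole 32-bit storage unit */
        unsigned int w = *word;                  /* one full-width load */
        w &= ~0x3fu;                             /* clear bits 0..5 ("type") */
        *word = w;                               /* full-width store, matching later loads */
        return (w >> 6) & 0xffff;                /* extract bits 6..21 ("pos") */
  }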

In an ironic twist, the real program that shows this optimization problem is "sparse" (the kernel source code checker), which can actually linearize and optimize the test-case itself, and in this case does it all better than gcc (using its "dump the linearized IR" test program):

  [torvalds@i5 ~]$ ./src/sparse/test-linearize test.c 
  test.c:7:5: warning: symbol 'show_bug' was not declared. Should it be static?
  show_bug:
  .L0x7f4cf7b93010:
	<entry-point>
	load.32     %r2 <- 0[%arg1]
	and.32      %r3 <- %r2, $-64
	store.32    %r3 -> 0[%arg1]
	lsr.32      %r7 <- %r3, $6
	cast.32     %r8 <- (16) %r7
	ret.32      %r8

Heh. Sparse may get a lot of other things wrong, but it got this particular case right.
Comment 8 Richard Biener 2011-04-20 15:39:38 UTC
(In reply to comment #7)
> (In reply to comment #2)
> > 
> > I'm not sure where to best address this, rather than throwing in again
> > the idea of lowering bitfield accesses early on trees.
> 
> So my gut feel is that getting rid of the bitfield as early as possible, and
> turning all bitfield accesses into regular load/shift/mask/store operations is
> always the right thing to do.
> 
> I also think that doing it with the size that the user specified is generally a
> good idea, ie I sincerely hope that gcc hasn't thrown away the "unsigned int"
> part of the type when it does the lowering of the bitfield op.

Yeah, I was working on this some time ago.

> If gcc has forgotten the underlying type, and only looks at the bitfield size
> and offset, gcc will likely never do a good job at it unless gcc gets _really_
> smart and looks at all the accesses around it and decides "I need to do these
> as 'int' just because (ie in the example, the "unsigned" base type is as
> important as is the "bits 0..5" range information).

Unfortunately the underlying type isn't easily available (at least I haven't
found it yet ...).  But I suppose we have to guess anyway, considering
targets that don't handle unaligned accesses well or packed bitfields.
Thus, an idea was to use aligned word-size loads/stores and only at the
start/end of a structure fall back to smaller accesses (for strict-alignment
targets).  I still hope to eventually find that underlying type info
somewhere ...
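
As a hedged illustration (made-up example, typical x86-64 layout assumed) of why the fallback near the end of a structure matters: for a packed layout like the one below, an aligned word-sized access covering the last bitfield would touch memory past the end of the object.

  struct __attribute__((packed)) tail {
        char c[5];
        unsigned bits:16;   /* occupies bytes 5..6 of a 7-byte object */
  };
  /* An aligned 4-byte read-modify-write covering "bits" spans bytes 4..7,
     i.e. one byte past the end of the object. */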

> So I suspect it's better to just do a totally mindless expansion of bitfield
> accesses early, and then use all the regular optimizations on them. Rather than
> keep them around as bitfields and try to optimize at some higher level.

Yep.  Same for how we currently deal with unaligned loads on targets
that do not support them - the generated code is very similar.
Comment 9 Richard Biener 2011-04-20 15:41:09 UTC
Btw, the branch from the work "some time ago" created

show_bug:
.LFB2:
        movl    (%rdi), %eax
        andl    $-64, %eax
        movl    %eax, (%rdi)
        shrl    $6, %eax
        movzwl  %ax, %eax
        ret

for your testcase.
Comment 10 Jan Hubicka 2011-04-20 15:54:17 UTC
> Actually the 4.0.4 compiler is x86_64, the code with -m32.  The 4.0.3
> compiler is i586.
> 
>  /space/rguenther/install/gcc-4.0.3/libexec/gcc/i686-pc-linux-gnu/4.0.3/cc1
> -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=pentiumpro -auxbase t -O2
> -version -o t.s
> 
> 
> /space/rguenther/install/gcc-4.0.4/libexec/gcc/x86_64-unknown-linux-gnu/4.0.4/cc1
> -quiet -v t.c -quiet -dumpbase t.c -m32 -mtune=k8 -auxbase t -O2 -version -o
> t.s
> 
> but no -march/tune combination makes the bug vanish for the 4.0.4 compiler
> (maybe a HWI dependent "optimization")

We do have an i386 tuning flag to disable instruction choices that change the memory
access size, X86_TUNE_MEMORY_MISMATCH_STALL, and I remember adding code to check
that we don't do instruction selection that changes the memory access size (i.e.
andl->andb promotion) except when optimizing for size.

This code seems to have evaporated from the backend, so I guess it needs re-adding ;)
It did not, however, work quite reliably, since the combiner actually does this kind of
transform itself from time to time and I did not get approval for a target
hook for that back then (in 2000 or so).

Honza
Comment 11 Linus Torvalds 2011-04-20 16:16:52 UTC
(In reply to comment #8)
> 
> Unfortunately the underlying type isn't easily available (at least I didn't
> yet find it ...).  But I suppose we have to guess anyway considering
> targets that don't handle unaligned accesses well or packed bitfields.
> Thus, an idea was to use aligned word-size loads/stores and only at the
> start/end of a structure fall back to smaller accesses (for strict align
> targets).  

That sounds fine.

The only reason to bother with the "underlying type" is that I suspect it could be possible for educated programmers to use it as a code generation hint. IOW, if all the individual fields end up fitting nicely in "char", using that as a base type (even if the _total_ fields don't fit in a single byte) might be a good hint for the compiler that it can/should use byte accesses and small constants.

But using the biggest aligned word-size is probably equally good in practice.

And if you end up narrowing the types on _reads_, I think that's fine on x86. I forget the exact store buffer forwarding rules (and they probably vary a bit between different microarchitectures anyway), but I think almost all of them support forwarding a larger store into a smaller (aligned) load.

It's just the writes that should normally not be narrowed.

(Of course, sometimes you may really want to narrow it. Replacing a

   andl $0xffffff00,(%rax)

with a simple

   movb $0,(%rax)

is certainly a very tempting optimization, but it really only works if there are no subsequent word-sized loads that would get fouled by the write buffer entry.)
Comment 12 Jakub Jelinek 2011-04-20 16:19:05 UTC
Well, there is also the expander, which can and often does increase the size of the accesses; see e.g. PR48124 for more details.  And for the C++0x memory model, as well as for -fopenmp or, I guess, kernel SMP, the access size should never grow into following non-bitfield fields, and the bitfield access lowering has to take this into account.
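
A hedged sketch of that constraint (struct and member names invented, typical x86-64 layout assumed): under the C++0x/C11 memory model the bitfield "f" and the plain member "flag" below are distinct memory locations, so widening the store to "f" into a 4-byte read-modify-write would also rewrite "flag" and could race with another thread updating "flag" concurrently:

  struct mixed {
        unsigned f:8;        /* bitfield, stored in the first byte of its unit */
        unsigned char flag;  /* separate non-bitfield member, typically the next byte */
  };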
Comment 13 jsm-csl@polyomino.org.uk 2011-04-21 15:43:27 UTC
On Wed, 20 Apr 2011, rguenth at gcc dot gnu.org wrote:

> > If gcc has forgotten the underlying type, and only looks at the bitfield size
> > and offset, gcc will likely never do a good job at it unless gcc gets _really_
> > smart and looks at all the accesses around it and decides "I need to do these
> > as 'int' just because (ie in the example, the "unsigned" base type is as
> > important as is the "bits 0..5" range information).
> 
> Unfortunately the underlying type isn't easily available (at least I didn't
> yet find it ...).  But I suppose we have to guess anyway considering
> targets that don't handle unaligned accesses well or packed bitfields.
> Thus, an idea was to use aligned word-size loads/stores and only at the
> start/end of a structure fall back to smaller accesses (for strict align
> targets).  I still hope to eventually find that underlying type info
> somewhere ...

The underlying type is used for -fstrict-volatile-bitfields, so it must be 
available in the cases needed for that option.
Comment 14 rguenther@suse.de 2011-04-21 15:48:10 UTC
On Thu, 21 Apr 2011, joseph at codesourcery dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696
> 
> --- Comment #13 from joseph at codesourcery dot com <joseph at codesourcery dot com> 2011-04-21 15:43:27 UTC ---
> On Wed, 20 Apr 2011, rguenth at gcc dot gnu.org wrote:
> 
> > > If gcc has forgotten the underlying type, and only looks at the bitfield size
> > > and offset, gcc will likely never do a good job at it unless gcc gets _really_
> > > smart and looks at all the accesses around it and decides "I need to do these
> > > as 'int' just because (ie in the example, the "unsigned" base type is as
> > > important as is the "bits 0..5" range information).
> > 
> > Unfortunately the underlying type isn't easily available (at least I didn't
> > yet find it ...).  But I suppose we have to guess anyway considering
> > targets that don't handle unaligned accesses well or packed bitfields.
> > Thus, an idea was to use aligned word-size loads/stores and only at the
> > start/end of a structure fall back to smaller accesses (for strict align
> > targets).  I still hope to eventually find that underlying type info
> > somewhere ...
> 
> The underlying type is used for -fstrict-volatile-bitfields, so it must be 
> available in the cases needed for that option.

It relies on the mode; thus the start of the bitfield is at
bitpos % mode-size and its size is mode-size, if it is aligned.
I'm not sure -fstrict-volatile-bitfields works for the unaligned/packed
case.
Comment 15 Richard Biener 2011-05-03 11:20:34 UTC
Looking into bitfield lowering again.
Comment 16 Richard Biener 2016-06-29 13:11:26 UTC
Re-confirmed for GCC 6.  Related to PR71509.
Comment 17 Andrew Pinski 2018-04-21 21:38:31 UTC
I think some of this is due to SLOW_BYTE_ACCESS being set to 0.
Comment 18 Richard Biener 2021-02-11 11:22:01 UTC
And re-confirmed a few years later.  Sorry :/

So far we've improved the middle-end representation for bitfields: we have an underlying field covering what the memory model requires to be treated as a single entity, and we've gained simpler IL for bitfield inserts.

But the actual lowering hasn't materialized yet, and we'd still need to deal
with optimizations turning larger stores into smaller ones when we know we can do so
(which is difficult - we've seen store-to-load-forwarding (STLF) failures across TUs
in benchmarks, in the context of vectorization).

IIRC Andrew was most recently working on something like bitfield lowering.