I notice a nasty load-hit-store in the following test case built with -O2 on ppc64le. The load is larger than the store, which means on many CPUs the store cannot be forwarded and the load has to wait until the store completes in the cache.

struct sk_buff a;

struct sk_buff {
        long blah;
        char csum_level:2;
};

void __skb_decr_checksum_unnecessary(struct sk_buff *p1)
{
        p1->csum_level--;
}

void func(void);

void tcp_v4_rcv(void)
{
        struct sk_buff *b = &a;

        func();
        __skb_decr_checksum_unnecessary(b);
        func();
        __skb_decr_checksum_unnecessary(b);
}

We end up with this odd situation where we do both a byte and a doubleword load:

# objdump -d testcase.o | grep r31
  58:   08 00 5f e9     ld      r10,8(r31)
  5c:   08 00 3f 89     lbz     r9,8(r31)
  70:   08 00 3f 99     stb     r9,8(r31)
  7c:   08 00 5f e9     ld      r10,8(r31)
  84:   08 00 3f 89     lbz     r9,8(r31)
  a0:   08 00 3f 99     stb     r9,8(r31)
Bitfield extraction on ppc64le goes the

  /* Try loading part of OP0 into a register and extracting the bitfield
     from that.  */
  unsigned HOST_WIDE_INT bitpos;
  rtx xop0 = adjust_bit_field_mem_for_reg (pattern, op0, bitsize, bitnum,
                                           0, 0, tmode, &bitpos);

way, which ends up generating the DImode load using the fact that the struct alignment adds padding after csum_level. The store path, OTOH, ends up honoring the C++ memory model, which says to access the bitfield in its declared type (IIRC?), and the bit region via DECL_BIT_FIELD_REPRESENTATIVE is of size 8 (because with C++ inheritance that tail padding can be re-used).

It looks like we didn't adjust the bitfield read paths for the memory model because in practice it doesn't matter for correctness, and it may generate larger/slower code not to do loads in larger types on some archs. This leads to the observed load-store / store-load issues.

Note that we conservatively compute the extent for DECL_BIT_FIELD_REPRESENTATIVE by preferring smaller modes. There's some ??? in finish_bitfield_representative, and the above remark about tail padding re-use is only implemented via preferring smaller modes. Thus when adding a 'long foo' after csum_level the representative doesn't change to 64-bit width but stays at 8 bits (both are valid under the C++ memory model).

Note that the proposed simple lowering of bitfield accesses on GIMPLE would do accesses of DECL_BIT_FIELD_REPRESENTATIVE and thus in this case use byte accesses.

I suppose we want to be less conservative about DECL_BIT_FIELD_REPRESENTATIVE and leave it up to the target how to do the actual accesses.

Widening the representative generates

__skb_decr_checksum_unnecessary:
        ld 9,8(3)
        addi 10,9,3
        rldicr 9,9,0,61
        rldicl 10,10,0,62
        or 9,9,10
        std 9,8(3)
        blr
(In reply to Richard Biener from comment #1)
> It looks like we didn't adjust the bitfield read paths for the mem model
> because in practice it doesn't matter and it may generate larger/slower code
> not to do loads in larger types on some archs.

It still generates correct (or at least working) code, yeah.

> Note that we conservatively compute the extent for
> DECL_BIT_FIELD_REPRESENTATIVE
> by preferring smaller modes. There's some ??? in
> finish_bitfield_representative
> and the above remark about tail padding re-use is only implemented via
> preferring smaller modes. Thus when adding a 'long foo' after csum_level the
> representative doesn't change to 64-bit width but stays at 8 bits (both are
> valid from the C++ memory model).

Smaller modes are slightly better here: they allow more efficient insn sequences to manipulate the data, esp. on big endian. For example on PowerPC, on BE, our 16-bit immediates (in e.g. ior) can handle QImode and HImode, but not DImode; and e.g. the rlwinm insn can handle SImode but not DImode. In general, with smaller modes you do not need to worry about preserving the other bits.

> Note that the proposed simple lowering of bitfield accesses on GIMPLE would
> do accesses of DECL_BIT_FIELD_REPRESENTATIVE and thus in this case use byte
> accesses.

That would be lovely :-)

> I suppose we want to be less conservative about DECL_BIT_FIELD_REPRESENTATIVE
> and leave it up to the target how to do the actual accesses.

Maybe. Always smallest is a good choice for PowerPC at least.
> Widening the representative generates
>
> __skb_decr_checksum_unnecessary:
>         ld 9,8(3)
>         addi 10,9,3
>         rldicr 9,9,0,61
>         rldicl 10,10,0,62
>         or 9,9,10
>         std 9,8(3)
>         blr

Making all accesses use QImode yields

__skb_decr_checksum_unnecessary:
        lbz 9,8(3)
        addi 10,9,3
        rlwinm 9,9,0,0,29
        rlwinm 10,10,0,30,31
        or 9,9,10
        stb 9,8(3)
        blr

(the rlwinm's and the or could be a rlwimi together; not sure why they aren't merged) and on BE it gives

.L.__skb_decr_checksum_unnecessary:
        lbz 9,8(3)
        srwi 10,9,6
        rlwinm 9,9,0,26,31
        addi 10,10,3
        rlwinm 10,10,6,24,25
        or 9,9,10
        stb 9,8(3)
        blr

which could be just

        lbz 9,8(3)
        addi 9,9,192
        stb 9,8(3)
        blr
On Tue, 14 Jun 2016, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509
>
> --- Comment #2 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #1)
> > It looks like we didn't adjust the bitfield read paths for the mem model
> > because in practice it doesn't matter and it may generate larger/slower code
> > not to do loads in larger types on some archs.
>
> It still generates correct (or at least working) code, yeah.
>
> > Note that we conservatively compute the extent for
> > DECL_BIT_FIELD_REPRESENTATIVE
> > by preferring smaller modes. There's some ??? in
> > finish_bitfield_representative
> > and the above remark about tail padding re-use is only implemented via
> > preferring smaller modes. Thus when adding a 'long foo' after csum_level the
> > representative doesn't change to 64-bit width but stays at 8 bits (both are
> > valid from the C++ memory model).
>
> Smaller modes are slightly better here, they allow more efficient insn
> sequences to manipulate the data, esp. on big endian. For example on
> PowerPC, on BE, our 16-bit immediates (in e.g. ior) can handle QImode
> and HImode, but not DImode; and e.g. the rlwinm insn can handle SImode
> but not DImode. In general, with smaller modes you do not need to
> worry about preserving the other bits.
>
> > Note that the proposed simple lowering of bitfield accesses on GIMPLE would
> > do accesses of DECL_BIT_FIELD_REPRESENTATIVE and thus in this case use byte
> > accesses.
>
> That would be lovely :-)

On my list for GCC 7 ;-)

> > I suppose we want to be less conservative about DECL_BIT_FIELD_REPRESENTATIVE
> > and leave it up to the target how to do the actual accesses.
>
> Maybe. Always smallest is a good choice for PowerPC at least.

Note the middle-end in extract_bit_field says using a larger mode is always better because it is better for CSE (not sure about that argument if you only look at a single stmt).
Created attachment 39683 [details]
Another bitop LHS test case

Here's another issue found in the Linux kernel. Seems like this should be a single lwz/stw, since the union of counter and the bitops completely overlap.

The halfword store followed by a word load is going to prevent it from store forwarding.

0000000000000000 <set_page_slub_counters>:
   0:   00 00 03 81     lwz     r8,0(r3)
   4:   20 00 89 78     clrldi  r9,r4,32
   8:   c2 0f 2a 79     rldicl  r10,r9,33,31
   c:   00 f8 48 51     rlwimi  r8,r10,31,0,0
  10:   5e 00 2a 55     rlwinm  r10,r9,0,1,15
  14:   00 00 03 91     stw     r8,0(r3)
  18:   00 00 83 b0     sth     r4,0(r3)
  1c:   00 00 42 60     ori     r2,r2,0
  20:   00 00 23 81     lwz     r9,0(r3)
  24:   00 04 29 55     rlwinm  r9,r9,0,16,0
  28:   78 53 29 7d     or      r9,r9,r10
  2c:   00 00 23 91     stw     r9,0(r3)
  30:   20 00 80 4e     blr
This test case seems like it may be related. It does the right thing and uses all 4-byte ops when the 8-byte alignment is removed. I post it here because it may not always be the case that the smallest op is best.

struct s {
        unsigned long align1;
        union {
                unsigned int blah;
                unsigned int a:1;
        };
};

void test2(struct s *s)
{
        s->blah = 100;
        if (s->a)
                asm volatile("#blah");
}

Generates (gcc 7.0.1)

test2:
        li 9,100
        stw 9,8(3)
        ld 9,8(3)
        andi. 9,9,0x1
        beqlr 0
#APP
 # 29 "a.c" 1
        #blah
 # 0 "" 2
#NO_APP
        blr

There is a more general case of mismatched load and store sizes in unions with different-size types, but in this case the sizes could be matched, I think.
Doing an 8-byte load of something that was stored as 4 bytes immediately before will cause flushes and stalls... Yeah, it could use a 4-byte load here AFAICS.
Hi Richard, I have been trying to figure out this issue recently, but I have some questions that need your help. What is the status of the "proposed simple lowering of bitfield accesses on GIMPLE", please?

For "less conservative about DECL_BIT_FIELD_REPRESENTATIVE", do you mean we choose the larger mode at the GIMPLE stage, and leave the decision to the target? Thanks.

PS: As a newbie, can you tell me how you did the "Widening the representative" :)? I am a bit confused about which mode is best and where to process it; sometimes a bigger mode is better and sometimes a smaller mode is better (from Segher's comments).
On Fri, 15 Mar 2019, luoxhu at cn dot ibm.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509
>
> Xiong Hu XS Luo <luoxhu at cn dot ibm.com> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |guojiufu at gcc dot gnu.org,
>                    |                            |rguenth at gcc dot gnu.org
>
> --- Comment #7 from Xiong Hu XS Luo <luoxhu at cn dot ibm.com> ---
> Hi Richard, trying to figure out the issue recently, but get some questions
> need your help. How is the status of the "proposed simple lowering of
> bitfield accesses on GIMPLE", please?

There are finished patches in a few variants, but all of them show issues in the testsuite, mostly around missed optimizations IIRC. It has been quite some time since I last looked at this, but IIRC Andrew Pinski said he's got sth for GCC 10, so you might want to ask him.

> for "less conservative about DECL_BIT_FIELD_REPRESENTATIVE", do you mean we
> choose large mode in GIMPLE stage, and make the decision when in target?
> Thanks.

DECL_BIT_FIELD_REPRESENTATIVE was mainly added for strict volatile bitfield accesses as well as to avoid data races as specified by the C++ memory model. So in some cases we might be able to widen the accesses (and we can shorten them in any case if we compute this is a good idea).

> PS: As a newbie, can you tell how did you do to "Widening the representative"
> :), I am a bit confused about the best mode and where to process it, sometimes
> big mode is better and sometimes smaller mode is better (from Segher's
> comments).

Yeah, get_best_mode is a big pile of *** and this issue is very hard to compute locally. I suppose for the best decision you'd need to look at the whole program (or at least the whole function) in a flow-sensitive manner. One of the lowering attempts I did was to integrate the lowering with the SRA pass, which is said to eventually get flow-sensitive analysis at some point (but at least does whole-function analysis already).
(In reply to rguenther@suse.de from comment #8)
> On Fri, 15 Mar 2019, luoxhu at cn dot ibm.com wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509
> >
> > Xiong Hu XS Luo <luoxhu at cn dot ibm.com> changed:
> >
> >            What    |Removed                     |Added
> > ----------------------------------------------------------------------------
> >                  CC|                            |guojiufu at gcc dot gnu.org,
> >                    |                            |rguenth at gcc dot gnu.org
> >
> > --- Comment #7 from Xiong Hu XS Luo <luoxhu at cn dot ibm.com> ---
> > Hi Richard, trying to figure out the issue recently, but get some questions
> > need your help. How is the status of the "proposed simple lowering of
> > bitfield accesses on GIMPLE", please?
>
> There are finished patches in a few variants but all of them show issues
> in the testsuite, mostly around missed optimizations IIRC. It has been
> quite some time since I last looked at this but IIRC Andrew Pinski said
> he's got sth for GCC 10 so you might want to ask him.

Yes I do, I am planning on submitting the new pass once GCC 10 has opened up.
(In reply to Andrew Pinski from comment #9)
> (In reply to rguenther@suse.de from comment #8)
> > On Fri, 15 Mar 2019, luoxhu at cn dot ibm.com wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509
> > >
> > > Xiong Hu XS Luo <luoxhu at cn dot ibm.com> changed:
> > >
> > >            What    |Removed                     |Added
> > > ----------------------------------------------------------------------------
> > >                  CC|                            |guojiufu at gcc dot gnu.org,
> > >                    |                            |rguenth at gcc dot gnu.org
> > >
> > > --- Comment #7 from Xiong Hu XS Luo <luoxhu at cn dot ibm.com> ---
> > > Hi Richard, trying to figure out the issue recently, but get some questions
> > > need your help. How is the status of the "proposed simple lowering of
> > > bitfield accesses on GIMPLE", please?
> >
> > There are finished patches in a few variants but all of them show issues
> > in the testsuite, mostly around missed optimizations IIRC. It has been
> > quite some time since I last looked at this but IIRC Andrew Pinski said
> > he's got sth for GCC 10 so you might want to ask him.
>
> Yes I do, I am planing on submitting the new pass once GCC 10 has opened up.

Hi Pinski, has your patch been merged into GCC 10 yet? Thanks.
(In reply to Anton Blanchard from comment #4)
> Created attachment 39683 [details]
> Another bitop LHS test case
>
> Here's another issue found in the Linux kernel. Seems like this should be a
> single lwz/stw since the union of counter and the bitops completely overlap.
>
> The halfword store followed by a word load is going to prevent it from store
> forwarding.
>
> 0000000000000000 <set_page_slub_counters>:
>    0:   00 00 03 81     lwz     r8,0(r3)
>    4:   20 00 89 78     clrldi  r9,r4,32
>    8:   c2 0f 2a 79     rldicl  r10,r9,33,31
>    c:   00 f8 48 51     rlwimi  r8,r10,31,0,0
>   10:   5e 00 2a 55     rlwinm  r10,r9,0,1,15
>   14:   00 00 03 91     stw     r8,0(r3)
>   18:   00 00 83 b0     sth     r4,0(r3)
>   1c:   00 00 42 60     ori     r2,r2,0
>   20:   00 00 23 81     lwz     r9,0(r3)
>   24:   00 04 29 55     rlwinm  r9,r9,0,16,0
>   28:   78 53 29 7d     or      r9,r9,r10
>   2c:   00 00 23 91     stw     r9,0(r3)
>   30:   20 00 80 4e     blr

This case is already fixed on the latest gcc 10 (the issues in __skb_decr_checksum_unnecessary from Anton Blanchard and test2 from Nicholas Piggin still exist).

gcc version 10.0.1 20200210

objdump -d set_page_slub_counters.o

set_page_slub_counters.o:     file format elf64-powerpcle

Disassembly of section .text:

0000000000000000 <set_page_slub_counters>:
   0:   22 84 89 78     rldicl  r9,r4,48,48
   4:   00 00 83 b0     sth     r4,0(r3)
   8:   02 00 23 b1     sth     r9,2(r3)
   c:   20 00 80 4e     blr
But it could do just

        stw r4,0(r3)

(on LE; and with a rotate first, on BE).
(In reply to Segher Boessenkool from comment #12)
> But it could do just
>
>         stw r4,0(r3)
>
> (on LE; and with a rotate first, on BE).

Thanks for catching this; that missed optimization is not related to load-hit-store. I will investigate why the store-merging pass failed to merge the two halfword stores.
Taking this for GCC 13. I am going to finally submit the bit-field lowering next year.