Bug 82940 - Suboptimal code for (a & 0x7f) | (b & 0x80) on powerpc
Summary: Suboptimal code for (a & 0x7f) | (b & 0x80) on powerpc
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 5.4.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Ajit Kumar Agarwal
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-11-10 15:39 UTC by Christophe Leroy
Modified: 2023-04-11 17:05 UTC
CC List: 4 users

See Also:
Host:
Target: powerpc*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-11-17 00:00:00


Description Christophe Leroy 2017-11-10 15:39:41 UTC
unsigned char g(unsigned char t[], unsigned char v)
{
	return (t[v & 0x7f] & 0x7f) | (v & 0x80);
}

00000008 <g>:
   8:	54 89 06 7e 	clrlwi  r9,r4,25
   c:	7c 63 48 ae 	lbzx    r3,r3,r9
  10:	54 84 00 30 	rlwinm  r4,r4,0,0,24
  14:	54 63 06 7e 	clrlwi  r3,r3,25
  18:	7c 63 23 78 	or      r3,r3,r4
  1c:	4e 80 00 20 	blr


I would expect

00000008 <g>:
   8:	54 89 06 7e 	clrlwi  r9,r4,25
   c:	7c 63 48 ae 	lbzx    r3,r3,r9
  10:	54 84 00 30 	rlwimi  r3,r4,0,24,24
  14:	4e 80 00 20 	blr
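
For reference, a C model of what the expected four-instruction sequence computes (an editorial sketch; the function name g_expected and the register comments are illustrative, not compiler output):

/* The rlwimi r3,r4,0,24,24 both clears bit 0x80 of the loaded byte
 * and inserts v's bit 0x80 (mask bits 24..24 in IBM numbering are
 * value bit 0x80), so no separate clrlwi of r3 is needed.
 */
unsigned char g_expected(unsigned char t[], unsigned char v)
{
	unsigned int idx = v & 0x7f;            /* clrlwi r9,r4,25      */
	unsigned int r = t[idx];                /* lbzx   r3,r3,r9      */
	r = (r & ~0x80u) | (v & 0x80u);         /* rlwimi r3,r4,0,24,24 */
	return (unsigned char) r;
}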
Comment 1 Richard Biener 2017-11-13 09:04:25 UTC
Note GCC 5 is no longer supported; you might want to try GCC 7.
Comment 2 Segher Boessenkool 2017-11-17 08:56:45 UTC
On trunk it generates (with -m32 -O2):

g:
        rlwinm 9,4,0,25,31
        rlwinm 4,4,0,0,24
        lbzx 3,3,9
        rlwinm 3,3,0,25,31
        or 3,3,4
        blr
Comment 3 Segher Boessenkool 2017-11-17 09:14:20 UTC
In combine, we start with


insn_cost 4 for    10: r137:SI=r136:QI#0&0x7f
      REG_DEAD r136:QI
insn_cost 4 for    13: r140:SI=r132:SI&0xffffffffffffff80
      REG_DEAD r132:SI
insn_cost 4 for    16: r143:SI=r137:SI|r140:SI
      REG_DEAD r140:SI
      REG_DEAD r137:SI
insn_cost 4 for    17: r144:SI=zero_extend(r143:SI#3)
      REG_DEAD r143:SI


It then tries:


Trying 13, 10 -> 16:
   13: r140:SI=r132:SI&0xffffffffffffff80
      REG_DEAD r132:SI
   10: r137:SI=r136:QI#0&0x7f
      REG_DEAD r136:QI
   16: r143:SI=r137:SI|r140:SI
      REG_DEAD r140:SI
      REG_DEAD r137:SI
Failed to match this instruction:
(set (reg:SI 143)
    (ior:SI (and:SI (reg/v:SI 132 [ v+-3 ])
            (const_int 128 [0x80]))
        (and:SI (subreg:SI (reg:QI 136 [ *_2 ]) 0)
            (const_int 127 [0x7f]))))


which fails to match because no such instruction exists (combine had
already simplified the -0x80 mask to 0x80 using the known value range
of r132).

Then 16->17 is done (throwing away the useless extend), and of course
10,13->17 won't work after that.


So this is an instance of the general problem that combine's use of
known register values often hurts instead of helps.
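
A small C illustration of that simplification (editorial sketch; the value range follows from r132 originating as an 8-bit argument, and mask_demo is a hypothetical name):

unsigned int mask_demo(unsigned char v)
{
	unsigned int x = v;                   /* known: x <= 0xff           */
	unsigned int wide = x & 0xffffff80u;  /* what insn 13 computes      */
	unsigned int narrow = x & 0x80u;      /* what combine rewrote it to */
	return wide == narrow;                /* always 1 for an 8-bit x    */
}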
Comment 4 Andrew Pinski 2021-08-01 07:16:01 UTC
I have a set of patches, which I am going to clean up next year for GCC 13, with which GCC produces the following at the gimple level:
  _13 = v_9(D) & 127;
  _1 = (sizetype) _13;
  _2 = t_10(D) + _1;
  _3 = *_2;
  _14 = (<unnamed-unsigned:7>) _3;
  _12 = BIT_INSERT_EXPR <v_9(D), _14, 0 (7 bits)>;
  return _12;
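
In C terms, that gimple computes roughly the following (editorial sketch; BIT_INSERT_EXPR <v, _14, 0 (7 bits)> replaces bits 0..6 of v with the 7-bit value _14, and g_gimple is a hypothetical name):

unsigned char g_gimple(unsigned char t[], unsigned char v)
{
	unsigned char x = t[v & 127];               /* _13, _1, _2, _3          */
	unsigned char low7 = x & 0x7f;              /* _14: truncate to 7 bits  */
	return (unsigned char) ((v & 0x80) | low7); /* BIT_INSERT_EXPR at bit 0 */
}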

As long as nothing on the rtl level (combine) messes this up, it should produce the best code.
Comment 5 Segher Boessenkool 2021-08-01 09:57:57 UTC
(In reply to Andrew Pinski from comment #4)
> As long as nothing on the rtl level (combine) messes this up, it
> should produce the best code.

combine cannot ever create worse code than it had as input :-)
Comment 6 Peter Cordes 2021-08-22 22:16:32 UTC
For a simpler test case, GCC 4.8.5 did redundantly mask before using bitfield-insert, but GCC 9.2.1 doesn't.


unsigned merge2(unsigned a, unsigned b){
    return (a&0xFFFFFF00u) | (b&0xFFu);
}

https://godbolt.org/z/froExaPxe
# PowerPC (32-bit) GCC 4.8.5
        rlwinm 4,4,0,0xff     # b &= 0xFF is totally redundant
        rlwimi 3,4,0,24,31
        blr

# power64 GCC 9.2.1 (ATI13.0)
        rlwimi 3,4,0,255    # bit-blend according to mask, rotate count=0
        rldicl 3,3,0,32     # Is this zero-extension to 64-bit redundant?
        blr

But ppc64 GCC zero-extends the result from 32 to 64 bits, which is probably not needed unless the calling convention has different requirements for return values than for incoming args.  (I don't know PPC well enough.)

So for at least some cases, modern GCC does ok.

Also, when the blend isn't split at a byte boundary, even GCC 4.8.5 manages to avoid redundant masking before the bitfield-insert.

unsigned merge2(unsigned a, unsigned b){
    return (a & 0xFFFFFF80u) | (b & 0x7Fu);
}

        rlwimi 3,4,0,25,31   # GCC 4.8.5, 32-bit so no zero-extension
        blr
Comment 7 Segher Boessenkool 2021-08-23 15:40:43 UTC
(In reply to Peter Cordes from comment #6)
> # power64 GCC 9.2.1 (ATI13.0)
>         rlwimi 3,4,0,255    # bit-blend according to mask, rotate count=0
>         rldicl 3,3,0,32     # Is this zero-extension to 64-bit redundant?

It is: the rlwinm does an AND with 0xff already, so that clears the top 32
bits for sure.

> But ppc64 GCC does zero-extension of the result from 32 to 64-bit, which is
> probably not needed unless the calling convention has different requirements
> for return values than for incoming args.  (I don't know PPC well enough.)

Return values have to be properly (sign- or zero-) extended for its type,
just like function arguments.
Comment 8 Segher Boessenkool 2021-08-23 15:45:40 UTC
(In reply to Segher Boessenkool from comment #7)
> (In reply to Peter Cordes from comment #6)
> > # power64 GCC 9.2.1 (ATI13.0)
> >         rlwimi 3,4,0,255    # bit-blend according to mask, rotate count=0
> >         rldicl 3,3,0,32     # Is this zero-extension to 64-bit redundant?
> 
> It is: the rlwinm does an AND with 0xff already, so that clears the top 32
> bits for sure.

Wow I cannot read, it is an rlwimi, so scratch that.

The rlwimi here keeps the top 56 bits intact, and they were already 0, so the
rldicl is still redundant.
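
A C model of why (editorial sketch; it assumes both 32-bit arguments arrive zero-extended in 64-bit registers, per the extension rule from comment 7, and merge2_model is a hypothetical name):

/* rlwimi 3,4,0,255 computes r3 = (r3 & ~0xff) | (r4 & 0xff) over the
 * full 64-bit register, preserving bits 8..63 of r3.  Since a arrived
 * zero-extended, bits 32..63 were already 0, so the following
 * rldicl 3,3,0,32 (r3 &= 0xffffffff) changes nothing.
 */
unsigned long long merge2_model(unsigned long long a, unsigned long long b)
{
	unsigned long long r3 = (a & ~0xffULL) | (b & 0xffULL); /* rlwimi 3,4,0,255 */
	return r3 & 0xffffffffULL;                              /* rldicl 3,3,0,32  */
}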