Bug 77729 - aarch64 inserts unneeded uxtb after ldrb, orr ...32
Summary: aarch64 inserts unneeded uxtb after ldrb, orr ...32
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 6.1.1
: P3 enhancement
Target Milestone: 8.0
Assignee: Segher Boessenkool
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2016-09-25 14:49 UTC by Julian Andres Klode
Modified: 2017-10-05 12:34 UTC (History)
3 users (show)

See Also:
Host:
Target: aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2016-09-25 00:00:00


Attachments
C reproducer (125 bytes, text/plain)
2016-09-25 14:51 UTC, Julian Andres Klode
Details
Aarch64 output at -O2 (278 bytes, text/plain)
2016-09-25 14:51 UTC, Julian Andres Klode
Details
arm (thumb2) output at -O2 (429 bytes, text/plain)
2016-09-25 14:52 UTC, Julian Andres Klode
Details
ppc64le (308 bytes, text/plain)
2016-09-25 15:04 UTC, Julian Andres Klode
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Julian Andres Klode 2016-09-25 14:49:48 UTC
In the attached test case, the aarch64 target inserts a uxtb instruction that is not needed:

	ldrb	w1, [x0]
	orr	w1, w1, 32
	uxtb	w1, w1
	cmp	w1, 116

The arm backend handles that just fine:

	ldrb	r3, [r0]	@ zero_extendqisi2
	orr	r3, r3, #32
	cmp	r3, #116

It also works with 33 or 34 instead of 32 for whatever reasons.
Comment 1 Julian Andres Klode 2016-09-25 14:51:10 UTC
Created attachment 39675 [details]
C reproducer
Comment 2 Julian Andres Klode 2016-09-25 14:51:31 UTC
Created attachment 39676 [details]
Aarch64 output at -O2
Comment 3 Julian Andres Klode 2016-09-25 14:52:01 UTC
Created attachment 39677 [details]
arm (thumb2) output at -O2
Comment 4 Andrew Pinski 2016-09-25 14:54:40 UTC
Note this testcase needs to be improved as I have a patch which converts a switch with just one case and a default into anew if statement.
Comment 5 Julian Andres Klode 2016-09-25 15:04:37 UTC
Created attachment 39678 [details]
ppc64le

Hmm, AFAICT the same seems to happen on powerpc64le:

	lbz 9,0(3)                      # Load zero
	ori 9,9,0x20                    # ors in 32
	rlwinm 9,9,0,0xff               # zero extend value AFAICT
	cmpwi 7,9,116


So far tested:

good: mipsel, x86_64, armhf (thumb2)
bad: aarch64, powerpc64le
Comment 6 Julian Andres Klode 2016-09-25 15:09:33 UTC
(In reply to Andrew Pinski from comment #4)
> Note this testcase needs to be improved as I have a patch which converts a
> switch with just one case and a default into anew if statement.

FWIW, The issue is exactly the same with if statements, like in:

int TrieCase3(const char *string)
{
    if((string[0] | 32) == 't') {
        if((string[1] | 32) == 'a') {
            if((string[2] | 32) == 'g') {
                return 42;
            }
        }
    }
    return -1;
}
Comment 7 Florian Weimer 2016-09-25 15:11:54 UTC
Something similar happens on x86_64 with -Os.  The OR instruction operates on %eax instead of %al:

0000000000000000 <TrieCase3>:
   0:	8a 07                	mov    (%rdi),%al
   2:	83 c8 20             	or     $0x20,%eax
   5:	3c 74                	cmp    $0x74,%al

(Byte-wise OR would be 0x0c 0x20, one byte shorter.)
Comment 8 Andrew Pinski 2016-09-25 15:15:59 UTC
This looks like missing removal of casts.

Note in C, char_var|32 is really the same as ((int)char_var)|32
Comment 9 Julian Andres Klode 2016-09-25 15:19:01 UTC
(In reply to Andrew Pinski from comment #8)
> This looks like missing removal of casts.
> 
> Note in C, char_var|32 is really the same as ((int)char_var)|32

Well. The loads of the byte are already zero-extend loads. The current code is like saying:

   (int) ((int)char|32)

:)
Comment 10 Andrew Pinski 2016-09-25 15:42:04 UTC
This is how we expand it on aarch64:
(insn 10 9 11 (set (reg:QI 81)
        (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1 A8])) t.c:4 -1
     (nil))

(insn 11 10 12 (set (reg:SI 82)
        (ior:SI (subreg:SI (reg:QI 81) 0)
            (const_int 32 [0x20]))) t.c:4 -1
     (nil))

(insn 12 11 13 (set (reg:SI 83)
        (zero_extend:SI (subreg:QI (reg:SI 82) 0))) t.c:4 -1
     (nil))

(insn 13 12 14 (set (reg:CC 66 cc)
        (compare:CC (reg:SI 83)
            (const_int 116 [0x74]))) t.c:4 -1
     (nil))


-----
(set (reg:SI 83)
    (ior:SI (and:SI (subreg:SI (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1 A8]) 0)
            (const_int 223 [0xdf]))
        (const_int 32 [0x20])))

Notice how the and there is 223, but really that can be still a zero_extend.  Basically combine is going funny.

---- CUT ----

Note for 33, orr does not accept 33 so combine does not see 33 and does not change the and part around the subreg.
Comment 11 Segher Boessenkool 2017-09-19 05:55:43 UTC
> Basically combine is going funny.

Hey, don't blame combine, it's all simplify-rtx's doing.

If in simplify_binary_operation_1 you disable the blocks after

          /* Minimize the number of bits set in C1, i.e. C1 := C1 & ~C2.  */

and after

      /* If we have (ior (and (X C1) C2)), simplify this by making
         C1 as small as possible if C1 actually changes.  */

all works as you want (this is on rs6000, pretty much the same thing):

===
Trying 10, 11 -> 12:
Failed to match this instruction:
(set (reg:SI 131)
    (ior:SI (zero_extend:SI (mem:QI (reg/v/f:DI 128 [ string ]) [0 *string_9(D)+0 S1 A8]))
        (const_int 96 [0x60])))
Successfully matched this instruction:
(set (reg:SI 130)
    (zero_extend:SI (mem:QI (reg/v/f:DI 128 [ string ]) [0 *string_9(D)+0 S1 A8])))
Successfully matched this instruction:
(set (reg:SI 131)
    (ior:SI (reg:SI 130)
        (const_int 96 [0x60])))
allowing combination of insns 10, 11 and 12
original costs 8 + 4 + 4 = 16
replacement costs 8 + 4 = 12
deferring deletion of insn with uid = 10.
modifying insn i2    11: r130:SI=zero_extend([r128:DI])
deferring rescan insn with uid = 11.
modifying insn i3    12: r131:SI=r130:SI|0x60
      REG_DEAD r130:SI
deferring rescan insn with uid = 12.
===

giving

        lbz 9,0(3)
        ori 9,9,0x60
        cmpwi 7,9,116
        bne 7,.L5

etc.
Comment 12 Segher Boessenkool 2017-09-21 14:19:53 UTC
I have a patch.
Comment 13 Segher Boessenkool 2017-10-03 16:03:10 UTC
Author: segher
Date: Tue Oct  3 16:02:38 2017
New Revision: 253384

URL: https://gcc.gnu.org/viewcvs?rev=253384&root=gcc&view=rev
Log:
simplify-rtx: Remove non-simplifying simplification (PR77729)

If we have (X&C1)|C2 simplify_binary_operation_1 makes C1 as small as
possible.  This makes worse code in common cases like when the AND with
C1 is from a zero-extension.  This patch fixes it by removing this
transformation (twice).


	PR rtl-optimization/77729
	* simplify-rtx.c (simplify_binary_operation_1): Delete the (X&C1)|C2
	to (X&(C1&~C2))|C2 transformations.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/simplify-rtx.c
Comment 14 Segher Boessenkool 2017-10-05 12:34:16 UTC
Fixed on trunk; no backports planned.