Bug 91202 - Unnecessary promotion of shift operands
Summary: Unnecessary promotion of shift operands
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 9.1.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2019-07-18 17:58 UTC by Uroš Bizjak
Modified: 2020-01-28 14:43 UTC
CC List: 4 users

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-01-28 00:00:00


Description Uroš Bizjak 2019-07-18 17:58:15 UTC
The following testcase:

--cut here--
unsigned char
foo (unsigned char a, unsigned char b)
{
  return a >> b;
}
--cut here--

does not need its operands to be promoted to int on targets that have shift instructions in QImode and HImode.

Currently, gcc compiles the testcase (-O2) via the .original dump:

--cut here--
;; Function foo (null)
;; enabled by -tree-original


{
  return (unsigned char) ((int) a >> (int) b);
}
--cut here--

and the .optimized dump:

--cut here--
;; Function foo (foo, funcdef_no=0, decl_uid=1907, cgraph_uid=1, symbol_order=0)

foo (unsigned char a, unsigned char b)
{
  int _1;
  int _2;
  int _3;
  unsigned char _6;

  <bb 2> [local count: 1073741824]:
  _1 = (int) a_4(D);
  _2 = (int) b_5(D);
  _3 = _1 >> _2;
  _6 = (unsigned char) _3;
  return _6;

}
--cut here--

to:
        movzbl  %dil, %eax
        movl    %esi, %ecx
        sarl    %cl, %eax
        ret

However, on targets that provide a QImode shift instruction, the above code could be emitted as:

        movzbl  %dil, %eax
        movl    %esi, %ecx
        shrb    %cl, %al
        ret

This optimization would help the following testcase:

--cut here--
struct S1
{
  unsigned char val;
  unsigned char pad1;
  unsigned short pad2;
};

struct S1
test_shrb (struct S1 a, unsigned char b)
{
  a.val >>= b;

  return a;
}
--cut here--

that currently compiles to:

        movzbl  %dil, %edx
        movl    %esi, %ecx
        movl    %edi, %eax
        sarl    %cl, %edx
        movb    %dl, %al
        ret

but after PR91188 was fixed, the testcase could be compiled with the proposed optimization to:

        movl    %edi, %eax
        movl    %esi, %ecx
        shrb    %cl, %al
        ret

Please note that MSVC v19.21 implements the proposed optimization and generates:

        movzx   eax, cl
        movzx   ecx, dl
        shr     al, cl
        ret     0
Comment 1 Andrew Pinski 2019-07-18 18:14:37 UTC
>does not need its operands to be promoted to int on targets that have shift instructions

Yes it does as that is what C says it does ...
Comment 2 Uroš Bizjak 2019-07-18 19:31:23 UTC
(In reply to Andrew Pinski from comment #1)
> >does not need its operands to be promoted to int on targets that have shift instructions
> 
> Yes it does as that is what C says it does ...

The standard also says that all narrow operations have to be promoted to int, but we live just fine with:

unsigned char
bar (unsigned char a, unsigned char b)
{
  return a + b;
}

that dumps to:

--cut here--
bar (unsigned char a, unsigned char b)
{
  unsigned char _3;

  <bb 2> [local count: 1073741824]:
  _3 = a_1(D) + b_2(D);
  return _3;

}
--cut here--

I fail to see the difference...
Comment 3 Richard Biener 2019-07-19 08:08:15 UTC
If you use signed adds it makes a semantic difference and we narrow it to
(unsigned char)a + (unsigned char)b to make it well-defined in case of
overflow.  I think the same issue applies for the shift where
C says that a >> b is really (int)a >> b and thus it is well-defined if
b == 24.  If we write it as gimple (char)a >> b then we conclude b
has to be in the range [0, 7] which would be wrong.

So on GIMPLE we'd need to do (char)a >> (b&7) which I think wouldn't
be nicely canonical.
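
For illustration, a minimal C sketch of the b == 24 case (the function name here is made up, not from the report):

unsigned char
shr_sketch (unsigned char a, unsigned char b)
{
  /* Per C, this is (int) a >> (int) b, so b == 24 is well defined and
     yields 0 for any a (a is at most 255).  A QImode shift that assumed
     b is in [0, 7] would give a different answer; under C semantics
     shr_sketch (0xff, 24) must be 0.  */
  return a >> b;
}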

So this is probably best fixed at RTL expansion time, or sometime
shortly before RTL expansion in a "narrowing/widening" pass taking
into account target capabilities (and ABIs).

I think the vectorizer has similar tricks in the pattern recognizer.
Comment 4 Jakub Jelinek 2019-07-19 08:24:24 UTC
Looking at the x86 shl/shr instructions, it seems they don't do the SHIFT_COUNT_TRUNCATED masking, but actually always mask the shift count with & 31 (unless it is a 64-bit shift, in which case it is indeed SHIFT_COUNT_TRUNCATED).
So, I'd think we want alternate patterns for the QImode and HImode x86 shifts that represent the operation as a zero extension (for shift left or logical shift right) or sign extension (maybe for shift left too, or for arithmetic shift right) of the QI/HImode operand to SImode, a shift performed in SImode, and a SUBREG back to QI/HImode, and keep it expressed like that in the IL to say explicitly what the instruction does.
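
As a rough C model of what such a pattern would express for the logical right shift case (an illustrative sketch, not GCC code; the function name is made up):

unsigned char
shrb_model (unsigned char a, unsigned char b)
{
  unsigned int wide = (unsigned int) a;   /* zero_extend:SI of the QImode operand */
  unsigned int res = wide >> (b & 31);    /* SImode shift; the hardware masks the count with & 31 */
  return (unsigned char) res;             /* SUBREG back to QImode */
}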
Comment 5 Jakub Jelinek 2019-07-19 08:34:07 UTC
Though, note the combiner doesn't try to match that, nor with the
void
foo (unsigned char a, unsigned char b, unsigned char *c)
{
  *c = a >> b;
}
case; the final subreg is in some other instruction (e.g. the set of a hard register to it, or a memory store) and the combiner doesn't try to split it at the point of the subreg but at the subreg operand.
Dunno what options we have: hand recognition in some machine-specific pass, a new optab for that kind of thing, or something else?
Comment 6 Uroš Bizjak 2019-07-19 08:35:42 UTC
(In reply to Jakub Jelinek from comment #4)
> Looking at x86 shl/shr instructions, it seems they don't do the
> SHIFT_COUNT_TRUNCATED masking, but actually mask always the shift count with
> & 31 (unless 64-bit shift, then it is indeed SHIFT_COUNT_TRUNCATED).

Yes, this is the reason x86 doesn't define SHIFT_COUNT_TRUNCATED (also, the documentation is confusing to me, talking about some "real or pretended" bit field operations). OTOH, x86 has several combined patterns that remove masking from SImode and DImode shift/rotate instructions.

It is possible to define TARGET_SHIFT_TRUNCATION_MASK, but when experimenting with this hook, I didn't find any effect on shifts.
Comment 7 Jakub Jelinek 2019-07-19 08:55:23 UTC
Perhaps we could define patterns for combine like:
        (set (match_operand:SI 0 "register_operand" "=q")
            (ashiftrt:SI (zero_extend:SI (match_operand:QI 1 "register_operand" "q"))
                (match_operand:QI 2 "nonmemory_operand" "cI")))
        (clobber (reg:CC 17 flags))
etc. and split them before reload into an insn that does that with a subreg:QI followed by zero extension (or sign extension in certain cases) and then hope before reload the zero extension will be cancelled with the following subreg use (I guess we can't look at immediate uses during splitting).
The disadvantage would be that this would match even if we aren't using a subreg at the end, so would match even the case of:
unsigned int foo (unsigned char a, unsigned char b)
{
  return a >> b;
}
and would change the movzbl %dil, %edi; shrq %cl, %edi into shrq %cl, %dil; movzbl %dil, %edi.  Dunno if that is acceptable.
Comment 8 Uroš Bizjak 2019-07-19 10:02:34 UTC
(In reply to Jakub Jelinek from comment #7)
> Perhaps we could define patterns for combine like:
>         (set (match_operand:SI 0 "register_operand" "=q")
>             (ashiftrt:SI (zero_extend:SI (match_operand:QI 1
> "register_operand" "q"))
>                 (match_operand:QI 2 "nonmemory_operand" "cI")))
>         (clobber (reg:CC 17 flags))
> etc. and split them before reload into an insn that does that with a
> subreg:QI followed by zero extension (or sign extension in certain cases)
> and then hope before reload the zero extension will be cancelled with the
> following subreg use (I guess we can't look at immediate uses during
> splitting).

The split pass runs quite late in the game, and the narrowed insn would miss the combine pass (cf. Comment #0 for the optimization opportunity).

> The disadvantage would be that this would match even if we aren't using a
> subreg at the end, so would match even the case of:
> unsigned int foo (unsigned char a, unsigned char b)
> {
>   return a >> b;
> }
> and would change the movzbl %dil, %edi; shrq %cl, %edi into shrq %cl, %dil;
> movzbl %dil, %edi.  Dunno if that is acceptable.

I think that a much cleaner solution would be to do the transformation in a kind of "narrowing/widening" pass, as proposed by Richard in Comment #3. There, all type data can be easily determined and the operation can be narrowed if the target provides an adequate set of instructions.
Comment 9 Jakub Jelinek 2019-07-19 10:07:35 UTC
We have several PRs for a narrowing/widening pass on late gimple, but I'm afraid this exact thing is not something that can be done there, because the semantics of what the x86 instructions do are quite weird and hard to express in a single gimple statement (and it is hard to explain to the middle-end that x86 has something like that).
As I said, we could add a new optab for that kind of thing and pattern match it during expansion, or it can be done the above way to see how well that works, or in some machine-specific pass.
Comment 10 Jakub Jelinek 2019-07-19 10:48:56 UTC
I've tried:
--- gcc/config/i386/i386.md.jj	2019-07-19 11:56:10.475964435 +0200
+++ gcc/config/i386/i386.md	2019-07-19 12:43:52.461469500 +0200
@@ -10661,6 +10661,43 @@
   "ix86_split_<shift_insn> (operands, NULL_RTX, <MODE>mode); DONE;"
   [(set_attr "type" "multi")])
 
+(define_insn_and_split "*shrqi3_3"
+  [(set (match_operand:SI 0 "register_operand")
+	(ashiftrt:SI
+	  (zero_extend:SI (match_operand:QI 1 "register_operand"))
+	  (match_operand:QI 2 "nonmemory_operand")))
+   (clobber (reg:CC FLAGS_REG))]
+  ""
+  "#"
+  "&& 1"
+  [(parallel
+     [(set (match_dup 3)
+	   (subreg:QI (ashiftrt:SI
+			(zero_extend:SI (match_dup 1))
+			(match_dup 2)) 0))
+      (clobber (reg:CC FLAGS_REG))])
+   (set (match_dup 0) (zero_extend:SI (match_dup 3)))]
+{
+  operands[3] = (can_create_pseudo_p ()
+		 ? gen_reg_rtx (QImode) : gen_lowpart (QImode, operands[0]));
+})
+
+(define_insn "*shrqi3_4"
+  [(set (match_operand:QI 0 "register_operand" "=q")
+	(subreg:QI
+	  (ashiftrt:SI
+	    (zero_extend:SI (match_operand:QI 1 "register_operand" "0"))
+	    (match_operand:QI 2 "nonmemory_operand" "cI")) 0))
+   (clobber (reg:CC FLAGS_REG))]
+  ""
+{
+  if (operands[2] == const1_rtx
+      && (TARGET_SHIFT1 || optimize_function_for_size_p (cfun)))
+    return "shr{q}\t%0";
+  else
+    return "shr{q}\t{%2, %0|%0, %2}";
+})
+
 ;; By default we don't ask for a scratch register, because when DWImode
 ;; values are manipulated, registers are already at a premium.  But if
 ;; we have one handy, we won't turn it away.
and, surprisingly, not just before RA but even after it, nothing will optimize away the extra zero extend.
unsigned char foo (unsigned char a, unsigned char b) { return a >> b; }
void bar (unsigned char a, unsigned char b, unsigned char *c) { *c = a >> b; }
change with the patch as follows:
-	movzbl	%dil, %eax
 	movl	%esi, %ecx
-	sarl	%cl, %eax
+	shrq	%cl, %dil
+	movzbl	%dil, %eax
and:
-	movzbl	%dil, %edi
 	movl	%esi, %ecx
-	sarl	%cl, %edi
+	shrq	%cl, %dil
+	movzbl	%dil, %edi
 	movb	%dil, (%rdx)
Comment 11 Jakub Jelinek 2019-07-19 11:18:53 UTC
As for TARGET_SHIFT_TRUNCATION_MASK, I'm not sure it can be safely used, because different instructions on x86 work differently.  The old scalar shifts do the & 31 masking for QImode/HImode, but e.g. vector shifts don't do any masking; out-of-bounds shift counts basically act as an infinite shift count.  So, if we wanted to use, say, a truncation mask of 31 for QI/HI/SImode, we'd need to make sure we never do for those what, say, STV does and use vector instructions instead, because those have different behavior.
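
A small sketch of that difference (an illustration assuming an x86-64 target with SSE2; the program is not part of the original comment):

#include <stdio.h>
#include <emmintrin.h>

int
main (void)
{
  unsigned int x = 0x12345678u;
  unsigned int count = 32;
  unsigned int scalar = x;

  /* Scalar shift: the CPU masks the count with & 31, so a count of 32
     behaves like 0 and leaves the value unchanged.  Done in inline asm
     because a shift by 32 is undefined in C itself.  */
  __asm__ ("shll %%cl, %0" : "+r" (scalar) : "c" (count) : "cc");

  /* Vector shift: no masking; any count above 31 shifts everything out
     and yields 0.  */
  __m128i r = _mm_sll_epi32 (_mm_set1_epi32 (0x12345678),
                             _mm_set_epi64x (0, 32));

  printf ("scalar: 0x%x  vector lane 0: 0x%x\n",
          scalar, (unsigned int) _mm_cvtsi128_si32 (r));
  /* prints: scalar: 0x12345678  vector lane 0: 0x0 */
  return 0;
}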