Bug 49263 - SH Target: underutilized "TST #imm, R0" instruction
Summary: SH Target: underutilized "TST #imm, R0" instruction
Status: REOPENED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.6.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Oleg Endo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-06-01 20:17 UTC by Oleg Endo
Modified: 2023-06-16 13:57 UTC
6 users

See Also:
Host:
Target: sh*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-08-27 00:00:00


Attachments
test for various integer types and constant values 0...255 (978 bytes, text/plain)
2011-06-01 20:17 UTC, Oleg Endo
Proposed Patch (1.25 KB, patch)
2011-06-01 20:42 UTC, Oleg Endo
CSiBE comparisons (3.36 KB, text/plain)
2011-10-10 23:48 UTC, Oleg Endo
Proposed patch including test case (2.69 KB, patch)
2011-10-13 22:54 UTC, Oleg Endo
Disable dynamic shift instructions patch (795 bytes, patch)
2023-05-23 12:35 UTC, Alexander Klepikov

Description Oleg Endo 2011-06-01 20:17:19 UTC
Created attachment 24411 [details]
test for various integer types and constant values 0...255

The "TST #imm, R0" instruction is a bit underutilized on SH targets. For some bit patterns of the immediate constant, the compiler extracts the bits in question by various means and tests the result against zero/non-zero, missing the straightforward instruction.
In particular: 
* one single bit 
* n contiguous bits starting at bit 0

When testing a byte against 0x80 it uses "CMP/PZ", which is as good as a "TST" instruction (in terms of costs), so this is OK. 
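A minimal pair of reproducers for the two missed shapes (function names are mine, not from the attached test):

```c
/* Both predicates fit the 8-bit immediate of "TST #imm, R0" (masks 0x20
   and 0x0F), but currently produce an extract/shift-and-compare
   sequence instead.  */

int test_single_bit (int x)
{
  return (x & (1 << 5)) ? 20 : 40;   /* one single bit */
}

int test_low_bits (int x)
{
  return (x & 0x0F) ? 20 : 40;       /* 4 contiguous bits starting at bit 0 */
}
```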

I first spotted this around version 4.4. It still happens with 4.6 and the following config:

Using built-in specs.
COLLECT_GCC=sh-elf-gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.6.1/lto-wrapper
Target: sh-elf
Configured with: ../gcc-4.6-20110527/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --without-headers --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib
Thread model: single
gcc version 4.6.1 20110527 (prerelease) (GCC)
Comment 1 Oleg Endo 2011-06-01 20:42:00 UTC
Created attachment 24412 [details]
Proposed Patch

Although the patch gets the job done, my programmer's sense tells me it is fishy, or at least a pretty much brute-forced cure of the symptoms, not the cause. It's my first GCC patch, so any feedback is highly appreciated.

What I did was look at the RTL, in particular the combine pass, identify patterns for which it failed to find a "shortcut" (tst insn), and add them to the insn descriptions. 

I also had to expand the cost calculation for the AND instruction to cover AND, OR and XOR (on SH they are the same anyway), or else the cost of a matched replacement insn would cause a rejection in the combine pass.
Comment 2 Kazumoto Kojima 2011-06-12 23:11:19 UTC
It looks like playing with the internal behavior of the combine
pass is a bit fragile.  Perhaps an alternative way is to define
a peephole for tst #imm8,r0, something like:

;; A peephole for the TST immediate instruction.

(define_peephole2
  [(set (match_operand:SI 2 "arith_reg_operand" "r")
	(and:SI (match_operand:SI 0 "arith_reg_operand" "z")
		(match_operand:SI 1 "const_int_operand" "i")))
   (set (reg:SI T_REG) (eq:SI (match_dup 2) (const_int 0)))]
  "TARGET_SH1
   && peep2_reg_dead_p (2, operands[2])
   && (satisfies_constraint_I08 (operands[1])
       || satisfies_constraint_K08 (operands[1]))"
  [(set (reg:SI T_REG)
	(eq:SI (and:SI (match_dup 0) (match_dup 1)) (const_int 0)))]
  "
{
  operands[1] = GEN_INT (INTVAL (operands[1]) & 0xff);
}")

which will work at -O2 or with -fpeephole2, though there are pros
and cons to the peephole approach.
Comment 3 Oleg Endo 2011-06-19 16:42:01 UTC
Thanks for having a look at it, Kaz.
Yes, fiddling with the combine pass is fragile. I've tried out your peephole idea. It works in most cases, but not all the time. E.g. it doesn't catch the following example:

int test_imm (int i)
{
  return (i & 3) ? 20 : 40;
}


	mov	#3,r1
	and	r1,r4
	tst	r4,r4
	bt/s	.L5
	mov	#40,r0
	mov	#20,r0
.L5:
	rts	
	nop


Also the following... 

int test_imm (int* i)
{
   return (*i & 255) ? 20 : 40;
}

becomes on little endian (OK):
	mov.b	@r4,r1
	tst	r1,r1
	bt/s	.L10
	mov	#40,r0
	mov	#20,r0
.L10:
	rts	
	nop

but on big endian (NG):
	mov.l	@r4,r1
	extu.b	r1,r1
	tst	r1,r1
	bt/s	.L10
	mov	#40,r0
	mov	#20,r0
.L10:
	rts	
	nop


I'll give the thing another try.

Regarding the change to the andcosts function, the following should be better:

--- sh.c	(revision 175188)
+++ sh.c	(working copy)
@@ -243,7 +243,7 @@
 static int flow_dependent_p (rtx, rtx);
 static void flow_dependent_p_1 (rtx, const_rtx, void *);
 static int shiftcosts (rtx);
-static int andcosts (rtx);
+static int and_xor_ior_costs (rtx, int code);
 static int addsubcosts (rtx);
 static int multcosts (rtx);
 static bool unspec_caller_rtx_p (rtx);
@@ -2841,14 +2841,14 @@
     return shift_insns[value];
 }
 
-/* Return the cost of an AND operation.  */
+/* Return the cost of an AND/XOR/IOR operation.  */
 
 static inline int
-andcosts (rtx x)
+and_xor_ior_costs (rtx x, int code)
 {
   int i;
 
-  /* Anding with a register is a single cycle and instruction.  */
+  /* register x register is a single cycle instruction.  */
   if (!CONST_INT_P (XEXP (x, 1)))
     return 1;
 
@@ -2864,17 +2864,17 @@
     }
 
   /* These constants are single cycle extu.[bw] instructions.  */
-  if (i == 0xff || i == 0xffff)
+  if ((i == 0xff || i == 0xffff) && code == AND)
     return 1;
-  /* Constants that can be used in an and immediate instruction in a single
+  /* Constants that can be used in an instruction with an immediate are a single
      cycle, but this requires r0, so make it a little more expensive.  */
   if (CONST_OK_FOR_K08 (i))
     return 2;
-  /* Constants that can be loaded with a mov immediate and an and.
+  /* Constants that can be loaded with a mov immediate need one more cycle.
      This case is probably unnecessary.  */
   if (CONST_OK_FOR_I08 (i))
     return 2;
-  /* Any other constants requires a 2 cycle pc-relative load plus an and.
+  /* Any other constant requires an additional 2 cycle pc-relative load.
      This case is probably unnecessary.  */
   return 3;
 }
@@ -3043,7 +3043,9 @@
       return true;
 
     case AND:
-      *total = COSTS_N_INSNS (andcosts (x));
+	case XOR:
+	case IOR:
+      *total = COSTS_N_INSNS (and_xor_ior_costs (x, code));
       return true;
 
     case MULT:
Comment 4 Kazumoto Kojima 2011-06-22 22:34:04 UTC
Yes, that peephole doesn't catch all the patterns that could
make use of tst #imm8,r0.  Perhaps it would be a good idea to get
numbers for something like the CSiBE test with the vanilla compiler
and one patched with the new insns/peepholes.  Something that covers
80% of the possible cases in the usual working set would be successful
enough for such a micro-optimization, I guess.

The cost patch looks fine to me.  Could you propose it as a separate
patch on the gcc-patches list with an appropriate ChangeLog entry?
When proposing it, please mention how you've tested it.  Also,
the numbers obtained with the patch are highly welcome.

BTW, do you have FSF copyright assignment for your GCC work?
Although the cost patch itself is essentially several lines which
doesn't require copyright assignment, the other changes you've
proposed clearly require the paper work, I think.
Comment 5 Oleg Endo 2011-06-26 22:30:05 UTC
> Yes, that peephole doesn't catch all the patterns that could
> make use of tst #imm8,r0.  Perhaps it would be a good idea to get
> numbers for something like the CSiBE test with the vanilla compiler
> and one patched with the new insns/peepholes.  Something that covers
> 80% of the possible cases in the usual working set would be successful
> enough for such a micro-optimization, I guess.

I'd also like to see some numbers on those micro-improvements. I'll have a look at CSiBE.
Anyway, why not just add all the currently known-to-work cases? What are your concerns regarding that? I can imagine that it is a maintenance burden to keep all those definitions and special cases in the MD up to date (bit rot etc.). Do you have anything other than that in mind? 

It seems that others have been trying to solve the same problem in a very similar way: http://patchwork.ozlabs.org/patch/58832/ ;)

I've figured that the following pattern works quite well for this particular case. It seems to catch all the bit patterns. (sh_tst_const_ok and sh_zero_extract_val are added functions in sh.c)

(define_insn_and_split "*tstsi3"
  [(set (reg:SI T_REG)
    (zero_extract:SI (match_operand:SI 0 "arith_reg_operand" "")
	        (match_operand:SI 1 "const_int_operand")
	        (match_operand:SI 2 "const_int_operand")))]
  "TARGET_SH1 && sh_tst_const_ok (sh_zero_extract_val (operands[1], operands[2]))"
  "#"
  "&& 1"
  [(const_int 0)]
  "{
	int tstval = sh_zero_extract_val (operands[1], operands[2]);
	emit_insn (gen_tstsi (operands[0], GEN_INT (tstval)));
	DONE;
  }"
)
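For illustration, the mask reduction that sh_zero_extract_val presumably performs can be sketched in plain C (a hypothetical helper; the actual function in the patch operates on rtx operands):

```c
/* A zero_extract of LEN bits starting at POS tests exactly the same
   bits as an AND with this mask, which is what the split above feeds
   to gen_tstsi via sh_tst_const_ok.  */
static unsigned int zero_extract_mask (unsigned int len, unsigned int pos)
{
  return ((1u << len) - 1u) << pos;
}
```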

...however, the problem with that one is that the T bit is inverted afterwards, and thus the following branch logic (or whatever) gets inverted as well. One option could be to emit a T bit inversion insn after gen_tstsi and then optimize it away later, e.g. with a peephole (?), but I haven't tried that yet.

The actual "problem" is that the combine pass first sees the andsi, eq, ... insns and correctly matches them to those in the MD. But instead of continuing to match patterns from the MD it expands the and insn into something like zero_extract, which in turn will never be combined back into the tst insn. Isn't there a way to tell the combine pass not to do so, but instead first look deeper at what is in the MD?


Regarding the peephole:

> (satisfies_constraint_I08 (operands[1])
>       || satisfies_constraint_K08 (operands[1])

I guess this might generate wrong code for e.g. "if (x & -2)". When x has any bits[31:1] set this must return true. The code after the peephole optimization will look only at the lower 8 bits and would possibly return false for x = 0xFF000000, which is wrong. So it should be satisfies_constraint_K08 only, shouldn't it?


> 
> Cost patch looks fine to me.  Could you propose it as a separate
> patch on gcc-patches list with an appropriate ChangeLog entry?
> When proposing it, please refer how you've tested it.  Also
> the numbers got with the patch are highly welcome.
> 
> BTW, do you have FSF copyright assignment for your GCC work?
> Although the cost patch itself is essentially several lines which
> doesn't require copyright assignment, the other changes you've
> proposed clearly require the paper work, I think.

I'll contact you directly regarding that.
Comment 6 Kazumoto Kojima 2011-06-27 05:14:36 UTC
(In reply to comment #5)
> Anyway, why not just add all the currently known-to-work cases? What are your
> concerns regarding that? I can imagine that it is a maintenance burden to keep
> all those definitions and special cases in the MD up-to-date (bit rot etc). Do
> you have anything other than that in mind? 

Yep, the maintenance burden, but I don't mean ack/nak for anything.
If it's fruitful enough, we should take that route.  If it
gives a 5% improvement in the usual working set, like CSiBE,
hundreds of lines would be OK, but if it's ~0.5% or less, it doesn't
look worth adding many patterns for that.

> Isn't there a way to tell the combine pass not to do so, but instead first look
> deeper at what is in the MD?

I don't know how to do it cleanly.

> I guess this might generate wrong code for e.g. "if (x & -2)". When x has any
> bits[31:1] set this must return true. The code after the peephole optimization
> will look only at the lower 8 bits and would possibly return false for x =
> 0xFF000000, which is wrong. So it should be satisfies_constraint_K08 only,
> shouldn't it?

You are right.  That peephole was simply 'something like this'.
Comment 7 Oleg Endo 2011-10-09 23:34:45 UTC
(In reply to comment #6)

> Yep, the maintenance burden, but I don't mean ack/nak for anything.
> If it's fruitful enough, we should take that route.  If it
> gives a 5% improvement in the usual working set, like CSiBE,
> hundreds of lines would be OK, but if it's ~0.5% or less, it doesn't
> look worth adding many patterns for that.
> 
> > Isn't there a way to tell the combine pass not to do so, but instead first look
> > deeper at what is in the MD?
> 
> I don't know how to do it cleanly.
> 

I've tried out a couple of things and got some CSiBE numbers based on 
trunk rev 179430. Unfortunately only code size comparisons, no run time 
performance numbers. The tests were compiled with 
-ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd -freg-struct-return

Option 1)
  Use many (~10) patterns in the MD and some cost calculation tuning.
  The last patch required some adaptation, because the combine pass 
  started trying to match things slightly differently. I've also 
  noticed that it requires a special case for one pattern on SH4A...
  
  size of all modules: 2916390 -> 2909026    -7364 / -0.252504 %
  avg difference over all modules: -409.111111 / -0.273887 %
  max: compiler       22808 -> 22812           +4 / +0.017538 %
  min: libpng-1.2.5   99120 -> 97804        -1316 / -1.327684 %

Option 2)
  I've added another combine pass which has the make_compound_operation
  function turned off. The make_compound_operation function is used to
  produce zero_extract patterns. If the resulting "simplified" pattern does
  not match anything in the MD, combine reverts the transformation and 
  proceeds with the next insn. That way, it never tries to match the 
  tst #imm pattern in the MD. 
  With this option only ~5 patterns seem to be required and a small
  extension of the costs calculation.

  size of all modules:  2916390 -> 2909170    -7220 / -0.247566 %
  avg difference over all modules: -401.111111 / -0.254423 %
  max: compiler       22808 -> 22812           +4 / +0.017538 %
  min: libpng-1.2.5   99120 -> 97836        -1284 / -1.295400 %

Not so spectacular on average. It highly depends on the type of SW being
compiled, but it hits quite a lot of files in CSiBE.

Option 2 seems more robust even if it seems less effective; what do you think?
Comment 8 Kazumoto Kojima 2011-10-10 01:31:42 UTC
(In reply to comment #7)
> Option 2 seems more robust even if it seems less effective; what do you think?

Another combine pass to reduce size by less than 0.3% on one target
would not be acceptable, I guess.  ~10 new patterns would be
overkill for that result, though I'm still expecting that a few
of those patterns are dominant.  Could you get numbers on which
patterns were used in the former option?
Comment 9 Oleg Endo 2011-10-10 23:48:17 UTC
Created attachment 25461 [details]
CSiBE comparisons

(In reply to comment #8)
> 
> Another combine pass to reduce size by less than 0.3% on one target
> would not be acceptable, I guess.  

I'm sorry, I forgot to mention that it was just a proof of concept hack
of mine, just to see whether it has any chance to work at all.
I think it would be better to change/fix the behavior of the combine pass
in this regard, so that it tries matching combined patterns without
sophisticated transformations. I will try asking on the gcc list about that.

> ~10 new patterns would be
> overkill for that result, though I'm still expecting that a few
> of those patterns are dominant. 

Yep, even though it turned out to actually be only 8 patterns in total, but
still... I haven't looked at the issue on SH4A, but most likely it would add
another one or two patterns... so basically ~10 :)

> Could you get numbers on which
> patterns were used in the former option?

I think it would be a bit too much to check each individual pattern.
Instead I grouped them into what they are effectively doing.
While I was at it, I also added your peephole idea, and a top 10 listing of
the individual files.
Comment 10 Kazumoto Kojima 2011-10-11 01:47:03 UTC
(In reply to comment #9)
> 3) only zero_extract special cases

looks to be dominant.

> I'm sorry, I forgot to mention that it was just a proof of concept hack
> of mine, just to see whether it has any chance to work at all.
> I think it would be better to change/fix the behavior of the combine pass
> in this regard, so that it tries matching combined patterns without
> sophisticated transformations. I will try asking on the gcc list about that.

I see.  I also expect that the experts have some idea for
this issue.

> I think it would be a bit too much to check each individual pattern.

I don't think that it's too much.  Those numbers can be easily
collected for CSiBE.  If your patterns are named, you could
simply add "-dap -save-temps" to the compiler options
specified when running CSiBE's create-config, and then get
the occurrences of testsi_6, for example, with something like
  grep "testsi_6" `find . -name "*.s" -print` | wc -l
after running the CSiBE size test.
Comment 11 Oleg Endo 2011-10-13 22:54:54 UTC
Created attachment 25491 [details]
Proposed patch including test case


> > 3) only zero_extract special cases
> 
> looks to be dominant.

Yes.  I've briefly looked through the test sources.  A popular use case
for bit test instructions seems to be single-bit tests, which is what the
patch is basically adding. 

> I see.  I also expect that the experts have some idea for
> this issue.

Hm .. http://gcc.gnu.org/ml/gcc/2011-10/msg00189.html
Eric pointed me to the i386 back-end.  Unfortunately, what I found there is
where I originally started...

;; Combine likes to form bit extractions for some tests.  Humor it.

I.e. it is also coded against the behavior of the combine pass, with a bunch
of pattern variations.  I guess that's the way it's supposed to be done then :T

> I don't think that it's too much.  Those numbers can be easily
> collected for CSiBE.  If your patterns are named, you could
> simply add "-dap -save-temps" to the compiler options
> specified when running CSiBE's create-config, and then get
> the occurrences of testsi_6, for example, with something like
>   grep "testsi_6" `find . -name "*.s" -print` | wc -l
> after running the CSiBE size test.

Ah, right!  With the attached latest patch applied to trunk rev 179778
the numbers for 
  "-ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd -freg-struct-return"
look something like this:

tstsi_t: 1391
tsthi_t: 4
tstqi_t: 23
tstqi_t_zero: 667
tstsi_t_and_not: 598
tstsi_t_zero_extract_eq: 70
tstsi_t_zero_extract_xor: 923

Notice that the split contributes to the tstsi_t number.
Also, the 3 patterns 
  tstsi_t_zero_extract_xor
  tstsi_t_zero_extract_subreg_xor_little
  tstsi_t_zero_extract_subreg_xor_big

are basically one and the same. On SH4A the subreg variants are required,
because tstsi_t_zero_extract_xor will never match.

I've also added a special case to sh_rtx_costs to detect at least the tstsi_t
pattern.  However, the other patterns are not really covered by that, and the
combine pass calculates the cost as the sum of all the operations in the pattern.
I guess the selection of the test instruction could be encouraged a bit more
by a more accurate cost calculation, but my feeling is that it won't do a lot.


Cheers,
Oleg
Comment 12 Kazumoto Kojima 2011-10-14 23:06:06 UTC
(In reply to comment #11)
> Created attachment 25491 [details]
> Proposed patch including test case

Looks fine.  Some very minor style nits:

> +      if (GET_CODE (XEXP (x, 0)) == AND		/* tst instruction.  */

This comment looks a bit bogus.  A full sentence comment would
be better.

> +
> +

There are some extra empty lines.  GNU/GCC coding style says
that only one empty line is needed.  I know that there are
extra empty lines already, but we should not add new ones :-)
Comment 13 Kazumoto Kojima 2011-10-15 02:32:56 UTC
Author: kkojima
Date: Sat Oct 15 02:32:53 2011
New Revision: 180020

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=180020
Log:
	PR target/49263
	* config/sh/sh.h (ZERO_EXTRACT_ANDMASK): New macro.
	* config/sh/sh.c (sh_rtx_costs): Add test instruction case.
	* config/sh/sh.md (tstsi_t): Name existing insn.  Make inner
	and instruction commutative.
	(tsthi_t, tstqi_t, tstqi_t_zero, tstsi_t_and_not,
	tstsi_t_zero_extract_eq, tstsi_t_zero_extract_xor,
	tstsi_t_zero_extract_subreg_xor_little,
	tstsi_t_zero_extract_subreg_xor_big): New insns.
	(*movsicc_t_false, *movsicc_t_true): Replace space with tab in
	asm output.
	(*andsi_compact): Reorder alternatives so that K08 is considered
	first.
	* gcc.target/sh/pr49263.c: New.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.h
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog
Comment 14 Oleg Endo 2011-11-20 13:04:39 UTC
With trunk rev 181517 I have observed the following problem, which happens when compiling for -m2*, -m3*, -m4* and -Os:

The function compiled with -m2 -Os

int test_int64_lowword (long long* x)
{
  return *x & 0xFFFFFFFF ? -20 : -40;
}

results in the following code:

	mov	#0,r2	! 42	movsi_i/3	[length = 2]
	tst	r2,r2	! 44	cmpeqsi_t/1	[length = 2]
	bf.s	.L4	! 45	branch_false	[length = 2]
	mov.l	@(4,r4),r3	! 12	movsi_i/5	[length = 2]
	tst	r3,r3	! 46	cmpeqsi_t/1	[length = 2]
.L4:
	bt	.L3	! 14	branch_true	[length = 2]
	mov	#-20,r0	! 4	movsi_i/3	[length = 2]
	rts	
	nop	! 52	*return_i	[length = 4]
.L3:
	rts		! 54	*return_i	[length = 2]
	mov	#-40,r0	! 5	movsi_i/3	[length = 2]

When compiled with -O1 or -O2, the high-word test is correctly eliminated:

	mov.l	@(4,r4),r1	! 10	movsi_i/5	[length = 2]
	tst	r1,r1	! 17	cmpeqsi_t/1	[length = 2]
	bt	.L4	! 18	branch_true	[length = 2]
	mov	#-20,r0	! 4	movsi_i/3	[length = 2]
	rts	
	nop	! 50	*return_i	[length = 4]
.L4:
	rts		! 52	*return_i	[length = 2]
	mov	#-40,r0	! 5	movsi_i/3	[length = 2]

I'm not sure whether this is actually related to this PR, but I noticed it with the test cases of this PR.  It seems to only happen for comparison-like insns.  If I remember correctly, this problem did not exist when I started working on this PR.
Comment 15 Oleg Endo 2011-12-29 00:34:53 UTC
(In reply to comment #14)
> With trunk rev 181517 I have observed the following problem, which happens when
> compiling for -m2*, -m3*, -m4* and -Os:
> 

This is still present as of rev 182713 and seems to be a different issue.
I've created PR51697 for it.
Comment 16 Oleg Endo 2012-02-26 13:31:36 UTC
Author: olegendo
Date: Sun Feb 26 13:31:32 2012
New Revision: 184585

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=184585
Log:
	PR target/49263
	* gcc.target/sh/pr49263.c: New.


Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263.c
Modified:
    trunk/gcc/testsuite/ChangeLog
Comment 17 Oleg Endo 2012-02-26 23:26:28 UTC
I guess this one is done now.
Comment 18 Oleg Endo 2012-08-27 19:51:42 UTC
Not quite so done, actually.  The following example case does not work properly:


void test00 (int* x, int xb)
{
  if (! (xb & 128))
    x[0] = 0;
}

-O2 -m4:
        mov     r5,r0
        and     #128,r0
        tst     r0,r0
        bf      .L3
        mov.l   r0,@r4
.L3:
        rts	
        nop



void test01 (int* x, int xb)
{
  if (! (xb & 0x55))
    x[0] = 0;
}

-O2 -m4:
        mov     r5,r0
        and     #85,r0
        tst     r0,r0
        bf      .L7
        mov.l   r0,@r4
.L7:
        rts
        nop


It seems that combine is trying to look for the following patterns:

Failed to match this instruction:
(set (pc)
    (if_then_else (ne (zero_extract:SI (reg:SI 5 r5 [ xb ])
                (const_int 1 [0x1])
                (const_int 7 [0x7]))
            (const_int 0 [0]))
        (label_ref:SI 15)
        (pc)))


Failed to match this instruction:
(set (pc)
    (if_then_else (ne (and:SI (reg:SI 5 r5 [ xb ])
                (const_int 85 [0x55]))
            (const_int 0 [0]))
        (label_ref:SI 15)
        (pc)))



Another case that could see some improvement ...

bool test04 (int* x, int xb)
{
  return ! (xb & 0x55);
}

-O2 -m4 (OK):
        mov     r5,r0
        tst     #85,r0
        rts
        movt    r0


bool test02 (int* x, int xb)
{
  return ! (xb & (1 << 6));
}

-O2 -m4 (NG):
        mov     r5,r0
        mov     #-6,r1
        shld    r1,r0
        xor     #1,r0
        rts
        and     #1,r0
Comment 19 Oleg Endo 2012-10-28 22:01:47 UTC
Another thing I've noticed...
Cases such as:

        mov.l   r0,@r2        ! LS
        mov     r13,r0        ! MT
        and     #7,r0         ! EX
        tst     r0,r0         ! MT
        bt/s    .L8           ! BR
        mov.l   r0,@(16,r1)

where the result of the and op is re-used would be slightly better as:

        mov.l   r0,@r2       ! LS
        mov     r13,r0       ! MT
        tst     #7,r0        ! MT
        and     #7,r0        ! EX
        bt/s    .L8          ! BR
        mov.l   r0,@(16,r1)

because it reduces the dependency on the result of the and op and thus has a higher chance of being executed in parallel.
Comment 20 Oleg Endo 2012-10-31 13:47:07 UTC
(In reply to comment #19)
> Another thing I've noticed...
> Cases such as:
> 
>         mov.l   r0,@r2        ! LS
>         mov     r13,r0        ! MT
>         and     #7,r0         ! EX
>         tst     r0,r0         ! MT
>         bt/s    .L8           ! BR
>         mov.l   r0,@(16,r1)
> 
> where the result of the and op is re-used would be slightly better as:
> 
>         mov.l   r0,@r2       ! LS
>         mov     r13,r0       ! MT
>         tst     #7,r0        ! MT
>         and     #7,r0        ! EX
>         bt/s    .L8          ! BR
>         mov.l   r0,@(16,r1)
> 
> because it reduces dependency on the result of the and op and thus has a higher
> chance to be executed in parallel.

Other similar cases where hoisting the tst insn would make sense:
        mov.b   @(13,r2),r0
        extu.b  r0,r0
        tst     #1,r0
        bt      .L53

        mov.b   @(13,r1),r0
        extu.b  r0,r0
        tst     r0,r0
        bt/s    .L37
Comment 21 Oleg Endo 2013-12-08 13:46:53 UTC
(In reply to Oleg Endo from comment #18)
> 
> It seems that combine is trying to look for the following patterns:
> 
> Failed to match this instruction:
> (set (pc)
>     (if_then_else (ne (and:SI (reg:SI 5 r5 [ xb ])
>                 (const_int 85 [0x55]))
>             (const_int 0 [0]))
>         (label_ref:SI 15)
>         (pc)))

Implementing such a combine pattern like ...
(define_insn_and_split "*tst_cbranch"
  [(set (pc)
	(if_then_else (ne (and:SI (match_operand:SI 0 "logical_operand")
				  (match_operand:SI 1 "const_int_operand"))
			  (const_int 0))
		      (label_ref (match_operand 2))
		      (pc)))
   (clobber (reg:SI T_REG))]
  "TARGET_SH1"
  "#"
  "&& 1"
  [(set (reg:SI T_REG) (eq:SI (and:SI (match_dup 0) (match_dup 1))
			      (const_int 0)))
   (set (pc) (if_then_else (eq (reg:SI T_REG) (const_int 0))
			   (label_ref (match_dup 2))
			   (pc)))])


results in code such as following code:
        mov     #33,r1
        mov     r5,r0
        tst     #33,r0
        bf/s    .L3
        and     r5,r1
        mov.l   r1,@r4
.L3:
        rts
        nop

which is worse.
What happens is that the sequence is expanded to RTL as follows:

(insn 7 4 8 2 (set (reg:SI 163 [ D.1856 ])
        (and:SI (reg/v:SI 162 [ xb ])
            (const_int 33 [0x21]))) sh_tmp.cpp:17 -1
     (nil))
(insn 8 7 9 2 (set (reg:SI 147 t)
        (eq:SI (reg:SI 163 [ D.1856 ])
            (const_int 0 [0]))) sh_tmp.cpp:17 -1
     (nil))
(jump_insn 9 8 10 2 (set (pc)
        (if_then_else (eq (reg:SI 147 t)
                (const_int 0 [0]))
            (label_ref:SI 15)
            (pc))) sh_tmp.cpp:17 301 {*cbranch_t}
     (int_list:REG_BR_PROB 3900 (nil))
 -> 15)
(note 10 9 11 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 11 10 12 4 (set (reg:SI 164)
        (const_int 0 [0])) sh_tmp.cpp:18 -1
     (nil))
(insn 12 11 15 4 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
        (reg:SI 164)) sh_tmp.cpp:18 -1
     (nil))


and the cse1 pass decides that the result of the and operation can be shared and replaces the operand in insn 12 with reg:SI 163:

(insn 12 11 15 3 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
        (reg:SI 163 [ D.1856 ])) sh_tmp.cpp:18 258 {movsi_ie}
     (expr_list:REG_DEAD (reg:SI 164)
        (expr_list:REG_DEAD (reg/v/f:SI 161 [ x ])
            (nil))))

and insn 11 becomes dead code and is eliminated.
All of that happens a long time before combine, so the tst combine patterns have no chance to reconstruct the original code.

A sequence such as

        mov     r5,r0
        mov     #0,r1
        tst     #33,r0
        bf      .L3
        mov.l   r1,@r4
.L3:
        rts
        nop

could probably be achieved by combining insn 7 and insn 8 shortly after RTL expansion, or even during the expansion of insn 8 (by looking at previously expanded insns and emitting a tst insn directly).
The idea would be to reduce dependencies on the tested register, which allows better scheduling.  In addition, on SH4A "mov #imm8,Rn" is an MT-group instruction, which has a higher probability of being executed in parallel.
Comment 22 Oleg Endo 2013-12-17 12:37:48 UTC
(In reply to Oleg Endo from comment #21)
> What happens is that the sequence is expanded to RTL as follows:
> 
> (insn 7 4 8 2 (set (reg:SI 163 [ D.1856 ])
>         (and:SI (reg/v:SI 162 [ xb ])
>             (const_int 33 [0x21]))) sh_tmp.cpp:17 -1
>      (nil))
> (insn 8 7 9 2 (set (reg:SI 147 t)
>         (eq:SI (reg:SI 163 [ D.1856 ])
>             (const_int 0 [0]))) sh_tmp.cpp:17 -1
>      (nil))
> (jump_insn 9 8 10 2 (set (pc)
>         (if_then_else (eq (reg:SI 147 t)
>                 (const_int 0 [0]))
>             (label_ref:SI 15)
>             (pc))) sh_tmp.cpp:17 301 {*cbranch_t}
>      (int_list:REG_BR_PROB 3900 (nil))
>  -> 15)
> (note 10 9 11 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 11 10 12 4 (set (reg:SI 164)
>         (const_int 0 [0])) sh_tmp.cpp:18 -1
>      (nil))
> (insn 12 11 15 4 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
>         (reg:SI 164)) sh_tmp.cpp:18 -1
>      (nil))
> 
> 
> and insn 11 becomes dead code and is eliminated.
> All of that happens a long time before combine, so the tst combine patterns
> have no chance to reconstruct the original code.
> 

Adding an early peephole pass as described in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59533#c2 and then adding the following peephole:

;; Peephole after initial expansion.
(define_peephole2
  [(set (match_operand:SI 0 "arith_reg_dest")
	(and:SI (match_operand:SI 1 "arith_reg_operand")
		(match_operand:SI 2 "logical_operand")))
   (set (reg:SI T_REG) (eq:SI (match_dup 0) (const_int 0)))]
  "TARGET_SH1 && can_create_pseudo_p ()"
  [(set (reg:SI T_REG) (eq:SI (and:SI (match_dup 1) (match_dup 2))
			      (const_int 0)))
   (set (match_dup 0) (and:SI (match_dup 1) (match_dup 2)))])

... fixes the problem and results in more uses of the tst #imm,r0 insn, according to the CSiBE set.  On the other hand, there is a total code size increase of 792 bytes on the whole set.  Below are some things that get worse in the Linux sources (mm/filemap.c):

        mov.b   @(15,r1),r0    ->    mov.b   @(15,r1),r0
        cmp/pz  r0                   tst     #128,r0     // cmp/pz has less
        bf      .L1016               bf      .L1001      // pressure on r0


        mov.b   @(15,r0),r0     ->   mov.b   @(15,r0),r0
        tst     #4,r0                shar    r0
        bf      .L107                shar    r0
                                     tst     #1,r0


        add     #16,r0          ->   add     #16,r0
        mov.b   @(15,r0),r0          mov.b   @(15,r0),r0
        tst     #16,r0               mov     #-4,r1
        bf/s    .L509                shad    r1,r0
                                     tst     #1,r0
                                     bf/s    .L509
Comment 23 Oleg Endo 2014-12-30 18:45:00 UTC
Author: olegendo
Date: Tue Dec 30 18:44:27 2014
New Revision: 219111

URL: https://gcc.gnu.org/viewcvs?rev=219111&root=gcc&view=rev
Log:
gcc/testsuite/
	PR target/49263
	* gcc.target/sh/pr49263-1.c: New.
	* gcc.target/sh/pr49263-2.c: New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263-2.c
Modified:
    trunk/gcc/testsuite/ChangeLog
Comment 24 Oleg Endo 2014-12-30 19:12:16 UTC
Author: olegendo
Date: Tue Dec 30 19:11:42 2014
New Revision: 219113

URL: https://gcc.gnu.org/viewcvs?rev=219113&root=gcc&view=rev
Log:
gcc/testsuite/
	PR target/49263
	* gcc.target/sh/sh.exp (check_effective_target_sh2a): New.
	* gcc.target/sh/pr49263-3.c: New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263-3.c
Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/sh/sh.exp
Comment 25 Oleg Endo 2015-01-24 13:05:25 UTC
Author: olegendo
Date: Sat Jan 24 13:04:53 2015
New Revision: 220081

URL: https://gcc.gnu.org/viewcvs?rev=220081&root=gcc&view=rev
Log:
gcc/
	PR target/49263
	PR target/53987
	PR target/64345
	PR target/59533
	PR target/52933
	PR target/54236
	PR target/51244
	* config/sh/sh-protos.h
	(sh_extending_set_of_reg::can_use_as_unextended_reg,
	sh_extending_set_of_reg::use_as_unextended_reg,
	sh_is_nott_insn, sh_movt_set_dest, sh_movrt_set_dest, sh_is_movt_insn,
	sh_is_movrt_insn, sh_insn_operands_modified_between_p,
	sh_reg_dead_or_unused_after_insn, sh_in_recog_treg_set_expr,
	sh_recog_treg_set_expr, sh_split_treg_set_expr): New functions.
	(sh_treg_insns): New class.
	* config/sh/sh.c (TARGET_LEGITIMATE_COMBINED_INSN): Define target hook.
	(scope_counter): New class.
	(sh_legitimate_combined_insn, sh_is_nott_insn, sh_movt_set_dest,
	sh_movrt_set_dest, sh_reg_dead_or_unused_after_insn,
	sh_extending_set_of_reg::can_use_as_unextended_reg,
	sh_extending_set_of_reg::use_as_unextended_reg, sh_recog_treg_set_expr,
	sh_in_recog_treg_set_expr, sh_try_split_insn_simple,
	sh_split_treg_set_expr): New functions.
	(addsubcosts): Handle treg_set_expr.
	(sh_rtx_costs): Handle IF_THEN_ELSE and ZERO_EXTRACT.
	(sh_rtx_costs): Use arith_reg_operand in SIGN_EXTEND and ZERO_EXTEND.
	(sh_rtx_costs): Handle additional bit test patterns in EQ and AND cases.
	(sh_insn_operands_modified_between_p): Make non-static.
	* config/sh/predicates.md (zero_extend_movu_operand): Allow
	simple_mem_operand in addition to displacement_mem_operand.
	(zero_extend_operand): Don't allow zero_extend_movu_operand.
	(treg_set_expr, treg_set_expr_not_const01,
	arith_reg_or_treg_set_expr): New predicates.
	* config/sh/sh.md (tstsi_t): Use arith_reg_operand and
	arith_or_int_operand instead of logical_operand.  Convert to
	insn_and_split.  Try to optimize constant operand in splitter.
	(tsthi_t, tstqi_t): Fold into *tst<mode>_t.  Convert to insn_and_split.
	(*tstqi_t_zero): Delete.
	(*tst<mode>_t_subregs): Add !sh_in_recog_treg_set_expr split condition.
	(tstsi_t_and_not): Delete.
	(tst<mode>_t_zero_extract_eq): Rename to *tst<mode>_t_zero_extract.
	Convert to insn_and_split.
	(unnamed split, tstsi_t_zero_extract_xor,
	tstsi_t_zero_extract_subreg_xor_little,
	tstsi_t_zero_extract_subreg_xor_big): Delete.
	(*tstsi_t_shift_mask): New insn_and_split.
	(cmpeqsi_t, cmpgesi_t): Add new split for const_int 0 operands and try
	to recombine with surrounding insns when splitting.
	(*negtstsi): Add !sh_in_recog_treg_set_expr condition.
	(cmp_div0s_0, cmp_div0s_1, *cmp_div0s_0, *cmp_div0s_1): Rewrite as ...
	(cmp_div0s, *cmp_div0s_1, *cmp_div0s_2, *cmp_div0s_3, *cmp_div0s_4,
	*cmp_div0s_5, *cmp_div0s_6): ... these new insn_and_split patterns.
	(*cbranch_div0s): Delete.
	(*addc): Convert to insn_and_split.  Use treg_set_expr as 3rd operand.
	Try to recombine with surrounding insns when splitting.  Add operand
	order variants.
	(*addc_t_r, *addc_r_t): Use treg_set_expr_not_const01.
	(*addc_r_r_1, *addc_r_lsb, *addc_r_r_lsb, *addc_r_lsb_r, *addc_r_msb,
	*addc_r_r_msb, *addc_2r_msb): Delete.
	(*addc_2r_lsb): Rename to *addc_2r_t.  Use treg_set_expr.  Add operand
	order variant.
	(*addc_negreg_t): New insn_and_split.
	(*subc): Convert to insn_and_split.  Use treg_set_expr as 3rd operand.
	Try to recombine with surrounding insns when splitting.
	Add operand order variants.  
	(*subc_negt_reg, *subc_negreg_t, *reg_lsb_t, *reg_msb_t): New
	insn_and_split patterns.
	(*rotcr): Use arith_reg_or_treg_set_expr.  Try to recombine with
	surrounding insns when splitting.
	(unnamed rotcr split): Use arith_reg_or_treg_set_expr.
	(*rotcl): Likewise.  Add zero_extract variant.
	(*ashrsi2_31): New insn_and_split.
	(*negc): Convert to insn_and_split.  Use treg_set_expr.
	(*zero_extend<mode>si2_disp_mem): Update comment.
	(movrt_negc, *movrt_negc, nott): Add !sh_in_recog_treg_set_expr split
	condition.
	(*mov_t_msb_neg, mov_neg_si_t): Use treg_set_expr.  Try to recombine
	with surrounding insns when splitting.
	(any_treg_expr_to_reg): New insn_and_split.
	(*neg_zero_extract_0, *neg_zero_extract_1, *neg_zero_extract_2,
	*neg_zero_extract_3, *neg_zero_extract_4, *neg_zero_extract_5,
	*neg_zero_extract_6, *zero_extract_0, *zero_extract_1,
	*zero_extract_2): New single bit zero extract patterns.
	(bld_reg, *bld_regqi): Fold into bld<mode>_reg.
	(*get_thread_pointersi, store_gbr, *mov<mode>_gbr_load,
	*mov<mode>_gbr_load, *mov<mode>_gbr_load, *mov<mode>_gbr_load,
	*movdi_gbr_load): Use arith_reg_dest instead of register_operand for
	set destination.
	(set_thread_pointersi, load_gbr): Use arith_reg_operand instead of
	register_operand for set source.

gcc/testsuite/
	PR target/49263
	PR target/53987
	PR target/64345
	PR target/59533
	PR target/52933
	PR target/54236
	PR target/51244
	* gcc.target/sh/pr64345-1.c: New.
	* gcc.target/sh/pr64345-2.c: New.
	* gcc.target/sh/pr59533-1.c: New.
	* gcc.target/sh/pr49263.c: Adjust matching of expected insns.
	* gcc.target/sh/pr52933-2.c: Likewise.
	* gcc.target/sh/pr54089-1.c: Likewise.
	* gcc.target/sh/pr54236-1.c: Likewise.
	* gcc.target/sh/pr51244-20-sh2a.c: Likewise.
	* gcc.target/sh/pr49263-1.c: Remove xfails.
	* gcc.target/sh/pr49263-2.c: Likewise.
	* gcc.target/sh/pr49263-3.c: Likewise.
	* gcc.target/sh/pr53987-1.c: Likewise.
	* gcc.target/sh/pr52933-1.c: Adjust matching of expected insns.
	(test_24, test_25, test_26, test_27, test_28, test_29, test_30): New.
	* gcc.target/sh/pr51244-12.c: Adjust matching of expected insns.
	(test05, test06, test07, test08, test09, test10, test11, test12): New.
	* gcc.target/sh/pr54236-3.c: Adjust matching of expected insns.
	(test_002, test_003, test_004, test_005, test_006, test_007, test_008,
	test_009): New.
	* gcc.target/sh/pr51244-4.c: Adjust matching of expected insns.
	(test_02): New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr59533-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr64345-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr64345-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/predicates.md
    trunk/gcc/config/sh/sh-protos.h
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/sh/pr49263-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263.c
    trunk/gcc/testsuite/gcc.target/sh/pr51244-12.c
    trunk/gcc/testsuite/gcc.target/sh/pr51244-20-sh2a.c
    trunk/gcc/testsuite/gcc.target/sh/pr51244-4.c
    trunk/gcc/testsuite/gcc.target/sh/pr52933-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr52933-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr53987-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr54089-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr54236-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr54236-3.c
Comment 26 Oleg Endo 2015-01-26 23:56:38 UTC
Author: olegendo
Date: Mon Jan 26 23:56:05 2015
New Revision: 220144

URL: https://gcc.gnu.org/viewcvs?rev=220144&root=gcc&view=rev
Log:
gcc/
	PR target/49263
	* config/sh/sh.c (sh_split_treg_set_expr): Invoke emit_insn before
	remove_insn.
	* config/sh/sh.md (tstsi_t): Don't try to optimize constant with right
	shifts if it already fits into K08.

gcc/testsuite/
	PR target/49263
	* gcc.target/sh/pr49263-4.c: New.


Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263-4.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog
Comment 27 Eric Gallager 2018-07-08 21:33:28 UTC
(In reply to Oleg Endo from comment #26)
> Author: olegendo
> Date: Mon Jan 26 23:56:05 2015
> New Revision: 220144
> 
> URL: https://gcc.gnu.org/viewcvs?rev=220144&root=gcc&view=rev
> Log:
> gcc/
> 	PR target/49263
> 	* config/sh/sh.c (sh_split_treg_set_expr): Invoke emit_insn before
> 	remove_insn.
> 	* config/sh/sh.md (tstsi_t): Don't try to optimize constant with right
> 	shifts if it already fits into K08.
> 
> gcc/testsuite/
> 	PR target/49263
> 	* gcc.target/sh/pr49263-4.c: New.
> 
> 
> Added:
>     trunk/gcc/testsuite/gcc.target/sh/pr49263-4.c
> Modified:
>     trunk/gcc/ChangeLog
>     trunk/gcc/config/sh/sh.c
>     trunk/gcc/config/sh/sh.md
>     trunk/gcc/testsuite/ChangeLog

Did this fix it?
Comment 28 Oleg Endo 2018-07-09 12:10:36 UTC
(In reply to Eric Gallager from comment #27)
> (In reply to Oleg Endo from comment #26)
> > Author: olegendo
> > Date: Mon Jan 26 23:56:05 2015
> > New Revision: 220144

Well, it fixed some of the cases mentioned in this PR, but not all.  It's quite a broad issue, actually.
Comment 29 Eric Gallager 2018-10-09 01:10:22 UTC
(In reply to Oleg Endo from comment #28)
> (In reply to Eric Gallager from comment #27)
> > (In reply to Oleg Endo from comment #26)
> > > Author: olegendo
> > > Date: Mon Jan 26 23:56:05 2015
> > > New Revision: 220144
> 
> Well, it fixed some of the cases mentioned in this PR, but not all.  It's
> quite a  broad issue actually.

So maybe it's worth splitting up into sub-issues?
Comment 30 Oleg Endo 2018-10-09 14:34:47 UTC
(In reply to Eric Gallager from comment #29)
> 
> So maybe it's worth splitting up into sub-issues?

It'd be better to, yes.  But at the moment I don't have a lot of time to go through all the cases and factor out the individual cases.  Please leave this open.  It will be useful if I (or others) get back to active SH development in the future.
Comment 31 Alexander Klepikov 2023-05-12 11:46:44 UTC
I've found new cases that affect only the SH2 and SH2E CPUs:

#define FLAG 0x40

unsigned int f(char v){
    return (v & FLAG) == FLAG;
}

For both big and little endian this translates to a dynamic shift library call:

-O -m2 (or -m2e)

_f:
        sts.l   pr,@-r15
        mov.l   .L3,r1
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts     
        nop
.L3:
        .long   ___ashiftrt_r4_6

And

#define FLAG 0x40
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)

unsigned int f(void){
    return (*P & FLAG) == FLAG;
}

Translates to

_f:
        sts.l   pr,@-r15
        mov.l   .L3,r1
        mov.b   @r1,r4
        mov.l   .L4,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts     
        nop
.L3:
        .long   -65536
.L4:
        .long   ___ashiftrt_r4_6

The assembler output does not depend on the ADDR value, but it does depend on the variable (or pointer) type. When the type is an integer, the assembler code uses 'tst' for all of the options '-m4', '-m2' and '-m2e':

#define FLAG 0x40

unsigned int f(unsigned int v){
    return (v & FLAG) == FLAG;
}

translates to 

_f:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts     
        negc    r0,r0

and

#define FLAG 0x40
#define ADDR 0xFFFF0000
#define P ((unsigned int *)ADDR)

unsigned int f(void){
    return (*P & FLAG) == FLAG;
}

translates to 

_f:
        mov.l   .L2,r1
        mov.l   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts     
        negc    r0,r0
.L2:
        .long   -65536

Interestingly, when the '-m4' flag is specified, later GCC versions always emit code with the 'tst' instruction. I played with godbolt and this is what I found: GCC 4 uses the dynamic shift 'shad' with '-m4' and a library call with '-m2' or '-m2e'. GCC 9.5 and later uses 'tst' with '-m4', a library call with both '-m2' and '-m2e' when FLAG == 0x40, and the 'shll' instruction with both '-m2' and '-m2e' when FLAG == 0x80. I remind you that this happens only when the char type is used.
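
For a single-bit mask like 0x40, the '== FLAG' comparison is equivalent to '!= 0', which is the form the tst patterns match. A minimal C sketch of the rewrite (FLAG and the function shape follow the examples above; the two function names here are made up for illustration):

```c
#include <assert.h>

#define FLAG 0x40

/* Original form: triggers the arithmetic-shift library call on SH2/SH2E.  */
unsigned int
f_eq (char v)
{
  return (v & FLAG) == FLAG;
}

/* Equivalent form for a single-bit mask: matches the tst #imm,r0 patterns.  */
unsigned int
f_ne (char v)
{
  return (v & FLAG) != 0;
}
```

Both functions return the same value for every input, since a single-bit AND result can only be 0 or FLAG.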

Maybe the SH4 solution can be extended to support SH2/SH2E?
Comment 32 Alexander Klepikov 2023-05-23 12:34:45 UTC
I'm not sure whether I should write here or open a new discussion, but these topics are very closely related. I've been writing a patch to completely eliminate the generation of the dynamic shift instructions 'shad' and 'shld', at least for the SH4 CPU. And then I got a surprising result: in all the examples I gave earlier, the library call was converted to 'tst' instructions!

Here is the patch itself (I will also attach it as a file):

--- ../gcc-12.3.0.orig/gcc/config/sh/sh.cc      2023-05-08 15:14:39.681161695 +0300
+++ ./gcc/config/sh/sh.cc       2023-05-23 12:23:25.964375731 +0300
@@ -3061,7 +3061,7 @@
   else
     insn_count = ashl_lshr_seq[shift_amount_i].insn_count;

-  return TARGET_DYNSHIFT && (insn_count > 1 + SH_DYNAMIC_SHIFT_COST);
+  return TARGET_DYNSHIFT && (insn_count > 1 + SH_DYNAMIC_SHIFT_COST) && ! disable_dynshift;
 }

 /* Assuming we have a value that has been sign-extended by at least one bit,
@@ -3812,8 +3812,10 @@
   rtx wrk;
   char func[18];
   int value;
+  int long_shift  = disable_dynshift ? 30 : 19;
+  int short_shift = disable_dynshift ? 15 : 5;

-  if (TARGET_DYNSHIFT)
+  if (TARGET_DYNSHIFT && ! disable_dynshift)
     {
       if (!CONST_INT_P (operands[2]))
        {
@@ -3851,7 +3853,7 @@
       emit_insn (gen_ashrsi2_31 (operands[0], operands[1]));
       return true;
     }
-  else if (value >= 16 && value <= 19)
+  else if (value >= 16 && value <= long_shift)
     {
       wrk = gen_reg_rtx (SImode);
       emit_insn (gen_ashrsi2_16 (wrk, operands[1]));
@@ -3862,7 +3864,7 @@
       return true;
     }
   /* Expand a short sequence inline, longer call a magic routine.  */
-  else if (value <= 5)
+  else if (value <= short_shift)
     {
       wrk = gen_reg_rtx (SImode);
       emit_move_insn (wrk, operands[1]);
diff -ur ../gcc-12.3.0.orig/gcc/config/sh/sh.opt ./gcc/config/sh/sh.opt
--- ../gcc-12.3.0.orig/gcc/config/sh/sh.opt     2023-05-08 15:14:39.689161810 +0300
+++ ./gcc/config/sh/sh.opt      2023-05-23 10:45:36.814371159 +0300
@@ -301,3 +301,7 @@
 mlra
 Target Var(sh_lra_flag) Init(0) Save
 Use LRA instead of reload (transitional).
+
+mdisable-dynshift
+Target Var(disable_dynshift) Init(0)
+Disable dynamic shift 'shad' and 'shld' instructions

And here are my tests:
$ cat f.c
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f(char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_(unsigned char v){
    return (v & FLAG) == FLAG;
}

unsigned char f1(void){
    return (*P & FLAG) == FLAG;
}

int f_signed_rshift(int v){
    return v >> S;
}

int f_signed_lshift(int v){
    return v << S;
}

unsigned int f_unsigned_rshift(unsigned int v){
    return v >> S;
}

unsigned int f_unsigned_lshift(unsigned int v){
    return v << S;
}

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -c -mrenesas -m2e -mb -O -fno-toplevel-reorder -mdisable-dynshift -S f.c
$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .global _f
        .type   _f, @function
_f:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f, .-_f
        .align 1
        .global _f_
        .type   _f_, @function
_f_:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_, .-_f_
        .align 1
        .global _f1
        .type   _f1, @function
_f1:
        mov.l   .L4,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L5:
        .align 2
.L4:
        .long   -65536
        .size   _f1, .-_f1
        .align 1
        .global _f_signed_rshift
        .type   _f_signed_rshift, @function
_f_signed_rshift:
        mov     r4,r0
        shar    r0
        shar    r0
        shar    r0
        shar    r0
        shar    r0
        shar    r0
        rts
        shar    r0
        .size   _f_signed_rshift, .-_f_signed_rshift
        .align 1
        .global _f_signed_lshift
        .type   _f_signed_lshift, @function
_f_signed_lshift:
        mov     r4,r0
        shll2   r0
        shll2   r0
        add     r0,r0
        rts
        shll2   r0
        .size   _f_signed_lshift, .-_f_signed_lshift
        .align 1
        .global _f_unsigned_rshift
        .type   _f_unsigned_rshift, @function
_f_unsigned_rshift:
        mov     r4,r0
        shlr2   r0
        shlr2   r0
        shlr    r0
        rts
        shlr2   r0
        .size   _f_unsigned_rshift, .-_f_unsigned_rshift
        .align 1
        .global _f_unsigned_lshift
        .type   _f_unsigned_lshift, @function
_f_unsigned_lshift:
        mov     r4,r0
        shll2   r0
        shll2   r0
        add     r0,r0
        rts
        shll2   r0
        .size   _f_unsigned_lshift, .-_f_unsigned_lshift
        .ident  "GCC: (GNU) 12.3.0"

I also compiled my project with the '-m2e' and new '-mdisable-dynshift' options and tested it in SH-2E mode on Renesas's emulator that comes with High-performance Embedded Workshop, and all unit tests ran as expected.

If this patch is useful, let's include it in GCC.
Comment 33 Alexander Klepikov 2023-05-23 12:35:49 UTC
Created attachment 55142 [details]
Disable dynamic shift instructions patch
Comment 34 Oleg Endo 2023-05-23 19:05:52 UTC
(In reply to Alexander Klepikov from comment #33)
> Created attachment 55142 [details]
> Disable dynamic shift instructions patch

First of all, thanks for digging into this.  This issue has been a can of worms, for all sorts of reasons.

As you have discovered, some code patterns take the shift-instruction route, which is basically decided earlier by the various middle-end optimizations.  There have also been some changes to those parts recently, but I haven't been watching what they do for SH.

> unsigned int f(char v){
>     return (v & FLAG) == FLAG;
> }

Bit-tests of char and unsigned char should be covered by the test-suite and should work -- at least originally.  However, what might be triggering this problem is the '== FLAG' comparison.  When I was working on this issue I only used '== 0' or '!= 0' comparisons.  I can imagine that your test code triggers some other middle-end optimizations and hence we get this.

Can you try to rewrite your test code to something like this?

unsigned int f(char v){
    return (v & FLAG) != 0;
}

... and see if it generates the tst instruction as expected?


> I also compiled my project with '-m2e' and new '-mdisable-dynshift'
> options and tested it in SH-2E mone on Renesas's emulator that comes
> with High-performance Embedded Workshop and all unit tests run as expected.

I'm not sure what the purpose of the '-mdisable-dynshift' option would be here though.  For '-m2e' TARGET_DYNSHIFT is already 'false'.  So the option seems misnamed.
Comment 35 Alexander Klepikov 2023-05-24 11:40:45 UTC
(In reply to Oleg Endo from comment #34)
> Bit-tests of char and unsigned char should be covered by the test-suite and
> should work -- at least originally.  However, what might be triggering this
> problem is the '== FLAG' comparison.  When I was working on this issue I
> only used '== 0' or '!= 0' comparison.  I can imagine that your test code
> triggers some other middle end optimizations and hence we get this.

Yes, I am sure that the problem is the '== FLAG' comparison. Before I reported this bug, I tried to work around it, and this macro does not produce shift instructions even on GCC 4.7:

#define BIT_MASK_IS_SET_(VALUE, BITMASK)\
	({int _value = VALUE & BITMASK,\
	_result;\
	if (_value == BITMASK){\
		_result = 1;\
	}\
	else {\
		_result = 0;\
	}\
	_result;})

So this is definitely the comparison.
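
As a sanity check, the workaround macro can be exercised like this (a sketch only; the macro uses GNU C statement expressions, so it requires gcc, and the 'has_flag' wrapper is made up for illustration):

```c
#include <assert.h>

/* Workaround macro from above: compares the masked value against the
   full mask without writing the '== BITMASK' comparison directly.  */
#define BIT_MASK_IS_SET_(VALUE, BITMASK)\
	({int _value = VALUE & BITMASK,\
	_result;\
	if (_value == BITMASK){\
		_result = 1;\
	}\
	else {\
		_result = 0;\
	}\
	_result;})

/* Hypothetical wrapper testing bit 6 (mask 0x40), as in the examples.  */
static int
has_flag (int v)
{
  return BIT_MASK_IS_SET_ (v, 0x40);
}
```

Note that for a multi-bit mask the macro reports 1 only when *all* mask bits are set, matching the semantics of the original '(v & FLAG) == FLAG' expression.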

> 
> Can you try to rewrite your test code to something like this?
> 
> unsigned int f(char v){
>     return (v & FLAG) != 0;
> }
> 
> ... and see if it generates the tst instruction as expected?
> 

As I understand it, you meant the following (I added the new functions at the end of the file):

$ cat f.c
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f_char_var(char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_unsigned_char_var(unsigned char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_symbol(void){
    return (*P & FLAG) == FLAG;
}

unsigned char f_symbol_zero(void){
    return (*P & FLAG) == 0;
}

unsigned char f_symbol_non_zero(void){
    return (*P & FLAG) != 0;
}

Compiler flags: -c -mrenesas -m2e -mb -O -fno-toplevel-reorder

With patch disabled:

$ cat f_clean.s
        .file   "f.c"
        .text
        .text
        .align 1
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        sts.l   pr,@-r15
        mov.l   .L3,r1
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L4:
        .align 2
.L3:
        .long   ___ashiftrt_r4_6
        .size   _f_char_var, .-_f_char_var
        .align 1
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        sts.l   pr,@-r15
        mov.l   .L7,r1
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L8:
        .align 2
.L7:
        .long   ___ashiftrt_r4_6
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        sts.l   pr,@-r15
        mov.l   .L11,r1
        mov.b   @r1,r4
        mov.l   .L12,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L13:
        .align 2
.L11:
        .long   -65536
.L12:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol, .-_f_symbol
        .align 1
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L15,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L16:
        .align 2
.L15:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        sts.l   pr,@-r15
        mov.l   .L19,r1
        mov.b   @r1,r4
        mov.l   .L20,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L21:
        .align 2
.L19:
        .long   -65536
.L20:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .ident  "GCC: (GNU) 12.3.0"

With patch enabled:

$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_char_var, .-_f_char_var
        .align 1
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L4,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L5:
        .align 2
.L4:
        .long   -65536
        .size   _f_symbol, .-_f_symbol
        .align 1
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L7,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L8:
        .align 2
.L7:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L10,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L11:
        .align 2
.L10:
        .long   -65536
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .ident  "GCC: (GNU) 12.3.0"
 
> I'm not sure what the purpose of the '-mdisable-dynshift' option would be
> here though.  For '-m2e' TARGET_DYNSHIFT is already 'false'.  So the option
> seems misnamed.

I chose that name because I wanted to disable dynamic shift instructions for all CPUs. I did not expect that it would affect SH-2E code in such a way.

I can rewrite the patch so that it only affects CPUs that do not support dynamic shifts and only disables the library call for dynamic shifts. I'll do it anyway because I need it badly. What do you think, which option name would be better: '-mdisable-dynshift-libcall' or '-mhw-shift'? Or, if you want, please suggest another one. Thank you!
Comment 36 Oleg Endo 2023-05-24 11:57:20 UTC
(In reply to Alexander Klepikov from comment #35)
> 
> As I understand, you meant the following (I added new functions at the end
> of file):
> 
> $ cat f.c
> #define ADDR 0xFFFF0000
> #define P ((unsigned char *)ADDR)
> #define FLAG 0x40
> #define S 7
> ....

Yes, that's what I meant, thanks.

Can you also compile for little endian and, most of all, use the -O2 optimization level?  Some optimizations are not done below -O2.

> 
> I choose that name because I wanted to disable dynamic shift instructions
> for all CPUs. I did not hope that it will affect SH-2E code in such way.
> 
> I can rewrite the patch so that it only affects CPUs that do not support
> dynamic shifts and disables library call for dynamic shifts. I'll do it
> anyway because I need it badly. How do you think, what name of option would
> be better: '-mdisable-dynshift-libcall' or '-mhw-shift'? Or if you want,
> please suggest another one. Thank you!

'-mdisable-dynshift-libcall' would be more appropriate for what it tries to do, I think.  Although that is a whole different issue ... what is it going to do for real dynamic shifts on SH2?

What kind of code is it supposed to emit for things like

unsigned int dyn_shift (unsigned int x, unsigned int y)
{
  return x << (y & 31);
}
Comment 37 Alexander Klepikov 2023-05-24 13:34:00 UTC
> Can you also compile for little endian, and most of all, use -O2
> optimization level.  Some optimizations are not done below -O2.

Here's the source file; I added functions with non-constant shifts:

$ cat f.c
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f_char_var(char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_unsigned_char_var(unsigned char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_symbol(void){
    return (*P & FLAG) == FLAG;
}

unsigned char f_symbol_zero(void){
    return (*P & FLAG) == 0;
}

unsigned char f_symbol_non_zero(void){
    return (*P & FLAG) != 0;
}

unsigned int dyn_lshift (unsigned int x, unsigned int y)
{
  return x << (y & 31);
}

unsigned int dyn_rshift (unsigned int x, unsigned int y)
{
  return x >> (y & 31);
}

unsigned int really_dyn_lshift (unsigned int x, unsigned int y)
{
  return x << y;
}

unsigned int really_dyn_rshift (unsigned int x, unsigned int y)
{
  return x >> y;
}

With patch disabled, -O2 -mb:

$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .align 2
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov.l   .L4,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L5:
        .align 2
.L4:
        .long   ___ashiftrt_r4_6
        .size   _f_char_var, .-_f_char_var
        .align 1
        .align 2
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov.l   .L8,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L9:
        .align 2
.L8:
        .long   ___ashiftrt_r4_6
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .align 2
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L12,r1
        sts.l   pr,@-r15
        mov.b   @r1,r4
        mov.l   .L13,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L14:
        .align 2
.L12:
        .long   -65536
.L13:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol, .-_f_symbol
        .align 1
        .align 2
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L16,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L17:
        .align 2
.L16:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .align 2
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L20,r1
        sts.l   pr,@-r15
        mov.b   @r1,r4
        mov.l   .L21,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L22:
        .align 2
.L20:
        .long   -65536
.L21:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .align 1
        .align 2
        .global _dyn_lshift
        .type   _dyn_lshift, @function
_dyn_lshift:
        mov.l   .L25,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L26:
        .align 2
.L25:
        .long   ___ashlsi3_r0
        .size   _dyn_lshift, .-_dyn_lshift
        .align 1
        .align 2
        .global _dyn_rshift
        .type   _dyn_rshift, @function
_dyn_rshift:
        mov.l   .L29,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L30:
        .align 2
.L29:
        .long   ___lshrsi3_r0
        .size   _dyn_rshift, .-_dyn_rshift
        .align 1
        .align 2
        .global _really_dyn_lshift
        .type   _really_dyn_lshift, @function
_really_dyn_lshift:
        mov.l   .L33,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L34:
        .align 2
.L33:
        .long   ___ashlsi3_r0
        .size   _really_dyn_lshift, .-_really_dyn_lshift
        .align 1
        .align 2
        .global _really_dyn_rshift
        .type   _really_dyn_rshift, @function
_really_dyn_rshift:
        mov.l   .L37,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L38:
        .align 2
.L37:
        .long   ___lshrsi3_r0
        .size   _really_dyn_rshift, .-_really_dyn_rshift
        .ident  "GCC: (GNU) 12.3.0"

With patch disabled, -O2 -ml

$ cat f.s
        .file   "f.c"
        .text
        .little
        .text
        .align 1
        .align 2
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov.l   .L4,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L5:
        .align 2
.L4:
        .long   ___ashiftrt_r4_6
        .size   _f_char_var, .-_f_char_var
        .align 1
        .align 2
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov.l   .L8,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L9:
        .align 2
.L8:
        .long   ___ashiftrt_r4_6
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .align 2
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L12,r1
        sts.l   pr,@-r15
        mov.b   @r1,r4
        mov.l   .L13,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L14:
        .align 2
.L12:
        .long   -65536
.L13:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol, .-_f_symbol
        .align 1
        .align 2
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L16,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L17:
        .align 2
.L16:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .align 2
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L20,r1
        sts.l   pr,@-r15
        mov.b   @r1,r4
        mov.l   .L21,r1
        jsr     @r1
        nop
        mov     r4,r0
        and     #1,r0
        lds.l   @r15+,pr
        rts
        nop
.L22:
        .align 2
.L20:
        .long   -65536
.L21:
        .long   ___ashiftrt_r4_6
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .align 1
        .align 2
        .global _dyn_lshift
        .type   _dyn_lshift, @function
_dyn_lshift:
        mov.l   .L25,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L26:
        .align 2
.L25:
        .long   ___ashlsi3_r0
        .size   _dyn_lshift, .-_dyn_lshift
        .align 1
        .align 2
        .global _dyn_rshift
        .type   _dyn_rshift, @function
_dyn_rshift:
        mov.l   .L29,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L30:
        .align 2
.L29:
        .long   ___lshrsi3_r0
        .size   _dyn_rshift, .-_dyn_rshift
        .align 1
        .align 2
        .global _really_dyn_lshift
        .type   _really_dyn_lshift, @function
_really_dyn_lshift:
        mov.l   .L33,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L34:
        .align 2
.L33:
        .long   ___ashlsi3_r0
        .size   _really_dyn_lshift, .-_really_dyn_lshift
        .align 1
        .align 2
        .global _really_dyn_rshift
        .type   _really_dyn_rshift, @function
_really_dyn_rshift:
        mov.l   .L37,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L38:
        .align 2
.L37:
        .long   ___lshrsi3_r0
        .size   _really_dyn_rshift, .-_really_dyn_rshift
        .ident  "GCC: (GNU) 12.3.0"

With patch enabled -O2 -mb

$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .align 2
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_char_var, .-_f_char_var
        .align 1
        .align 2
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .align 2
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L5,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L6:
        .align 2
.L5:
        .long   -65536
        .size   _f_symbol, .-_f_symbol
        .align 1
        .align 2
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L8,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L9:
        .align 2
.L8:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .align 2
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L11,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L12:
        .align 2
.L11:
        .long   -65536
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .align 1
        .align 2
        .global _dyn_lshift
        .type   _dyn_lshift, @function
_dyn_lshift:
        mov.l   .L15,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L16:
        .align 2
.L15:
        .long   ___ashlsi3_r0
        .size   _dyn_lshift, .-_dyn_lshift
        .align 1
        .align 2
        .global _dyn_rshift
        .type   _dyn_rshift, @function
_dyn_rshift:
        mov.l   .L19,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L20:
        .align 2
.L19:
        .long   ___lshrsi3_r0
        .size   _dyn_rshift, .-_dyn_rshift
        .align 1
        .align 2
        .global _really_dyn_lshift
        .type   _really_dyn_lshift, @function
_really_dyn_lshift:
        mov.l   .L23,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L24:
        .align 2
.L23:
        .long   ___ashlsi3_r0
        .size   _really_dyn_lshift, .-_really_dyn_lshift
        .align 1
        .align 2
        .global _really_dyn_rshift
        .type   _really_dyn_rshift, @function
_really_dyn_rshift:
        mov.l   .L27,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L28:
        .align 2
.L27:
        .long   ___lshrsi3_r0
        .size   _really_dyn_rshift, .-_really_dyn_rshift
        .ident  "GCC: (GNU) 12.3.0"

With patch enabled, -O2 -ml

$ cat f.s
        .file   "f.c"
        .text
        .little
        .text
        .align 1
        .align 2
        .global _f_char_var
        .type   _f_char_var, @function
_f_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_char_var, .-_f_char_var
        .align 1
        .align 2
        .global _f_unsigned_char_var
        .type   _f_unsigned_char_var, @function
_f_unsigned_char_var:
        mov     r4,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
        .size   _f_unsigned_char_var, .-_f_unsigned_char_var
        .align 1
        .align 2
        .global _f_symbol
        .type   _f_symbol, @function
_f_symbol:
        mov.l   .L5,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L6:
        .align 2
.L5:
        .long   -65536
        .size   _f_symbol, .-_f_symbol
        .align 1
        .align 2
        .global _f_symbol_zero
        .type   _f_symbol_zero, @function
_f_symbol_zero:
        mov.l   .L8,r1
        mov.b   @r1,r0
        tst     #64,r0
        rts
        movt    r0
.L9:
        .align 2
.L8:
        .long   -65536
        .size   _f_symbol_zero, .-_f_symbol_zero
        .align 1
        .align 2
        .global _f_symbol_non_zero
        .type   _f_symbol_non_zero, @function
_f_symbol_non_zero:
        mov.l   .L11,r1
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        rts
        negc    r0,r0
.L12:
        .align 2
.L11:
        .long   -65536
        .size   _f_symbol_non_zero, .-_f_symbol_non_zero
        .align 1
        .align 2
        .global _dyn_lshift
        .type   _dyn_lshift, @function
_dyn_lshift:
        mov.l   .L15,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L16:
        .align 2
.L15:
        .long   ___ashlsi3_r0
        .size   _dyn_lshift, .-_dyn_lshift
        .align 1
        .align 2
        .global _dyn_rshift
        .type   _dyn_rshift, @function
_dyn_rshift:
        mov.l   .L19,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L20:
        .align 2
.L19:
        .long   ___lshrsi3_r0
        .size   _dyn_rshift, .-_dyn_rshift
        .align 1
        .align 2
        .global _really_dyn_lshift
        .type   _really_dyn_lshift, @function
_really_dyn_lshift:
        mov.l   .L23,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L24:
        .align 2
.L23:
        .long   ___ashlsi3_r0
        .size   _really_dyn_lshift, .-_really_dyn_lshift
        .align 1
        .align 2
        .global _really_dyn_rshift
        .type   _really_dyn_rshift, @function
_really_dyn_rshift:
        mov.l   .L27,r1
        sts.l   pr,@-r15
        jsr     @r1
        mov     r5,r0
        lds.l   @r15+,pr
        rts
        nop
.L28:
        .align 2
.L27:
        .long   ___lshrsi3_r0
        .size   _really_dyn_rshift, .-_really_dyn_rshift
        .ident  "GCC: (GNU) 12.3.0" 
 
> '-mdisable-dynshift-libcall' would be more appropriate for what it tries to
> do, I think.  Although that is a whole different issue ... but what is it
> going to do for real dynamic shifts on SH2?
> 
> What kind of code is it supposed to emit for things like
> 
> unsigned int dyn_shift (unsigned int x, unsigned int y)
> {
>   return x << (y & 31);
> }

As far as I understand from the GCC sources, the function I patched, 'expand_ashiftrt', processes only constant shift amounts. As you can see earlier, I added your examples and others to the tests. It looks like truly dynamic shifts translate to library calls.

Should I test more exotic situations? If so, could you please help me with some really exotic or weird examples?
Comment 38 Oleg Endo 2023-05-24 15:00:09 UTC
(In reply to Alexander Klepikov from comment #37)
> 
> As far as I understand from GCC sources, function I patched
> 'expand_ashiftrt' process only constant values of shift. As you can see
> earlier, I added your and other examples to tests.

OK, thanks for the additional test cases.  It really looks like the way the constant shift is expanded (via ashrsi3_n insn) on SH1/SH2 is getting in the way.

The tst insn is mainly formed by the combine pass, which relies on certain insn patterns and combinations thereof.  See also sh.md, around line 530.

You can look at the debug output with the -fdump-rtl-all option to see what's happening in the RTL passes.
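For reference, a sketch of such an invocation (the sh-elf-gcc triplet and -m2e flag are taken from the compiler configuration shown earlier in this PR; adjust for your build):

```shell
# Dump every RTL pass; the .combine and .split1 dumps show where the
# tst pattern is (or isn't) being formed.
sh-elf-gcc -O2 -m2e -fdump-rtl-all -S f.c
# Produces numbered dump files such as f.c.*r.combine and f.c.*r.split1
# next to the source file.
```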

So what your patch does is make it not emit the ashrsi3_n insn for constant shifts altogether?  I guess it will make code that actually needs those real shifts larger, as it will always emit the whole shift-stitching sequence.  That might be a good thing or not.


> It looks like really
> dynamic shifts translate to library calls.

So the option name '-mdisable-dynshift-libcall' doesn't make sense.
What it actually does is more like '-mdisable-constshift-libcall'.

> 
> Should I test more exotic situations? If so, could you please help me with
> really exotic or weired examples?

Have you had a look at the existing test cases for this in gcc/testsuite/gcc.target/sh ?
Comment 39 Alexander Klepikov 2023-05-25 17:53:04 UTC
> The tst insn is mainly formed by the combine pass, which relies on certain
> insn patterns and combinations thereof.  See also sh.md, around line 530.

I'm sorry, but .md lang is too complicated for me.
 
> You can look at the debug output with the -fdump-rtl-all option to see
> what's happening in the RTL passes.

It looks like with optimization enabled, it converts the bitwise AND to a right shift and then optimizes again. But SH4 has 'shad', and 'shad' can be optimized to 'tst'. Whereas SH2E has a libcall instead of a dynamic shift, and the libcall cannot be converted. It seems that the very first optimization spoils things.

But when we have numerous 'shar' instructions, the optimization joins the game again and converts them to 'tst'.
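The equivalence the compiler is (or isn't) finding here can be sketched in plain C. This is a host-side illustration only; the function names are made up, and the comments describe the expected SH codegen rather than a guarantee:

```c
#include <assert.h>
#include <stdint.h>

/* Two equivalent ways to test bit 6 of a byte.  The masked form maps
   directly onto SH's "TST #64,R0"; the shift form needs a right shift,
   which on SH1/SH2 (no dynamic shift insn) may turn into a libcall
   that combine can no longer rewrite into a tst. */
static int bit6_via_shift (uint8_t v) { return (v >> 6) & 1; }
static int bit6_via_mask  (uint8_t v) { return (v & 0x40) != 0; }
```

Combine can only rewrite the shift form into the mask form while the shift is still visible as an RTL shift, not once it has been hidden inside a library call.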

> What your patch is doing is to make it not emit the ashrsi3_n insn for
> constant shifts altogether?  I guess it will make code that actually needs
> those real shifts larger, as it will always emit the whole shift stitching
> sequence.  That might be a good thing or not.

You are absolutely right, the code will be larger when we do right shifts. But there are situations where you can't use the library. For example, the SH7055 engine control unit can barely contain the program. The library just won't fit.
Comment 40 Oleg Endo 2023-05-25 21:32:35 UTC
(In reply to Alexander Klepikov from comment #39)
> 
> I'm sorry, but .md lang is too complicated for me.

Yeah, it looks alien at first sight.  But it's where a lot of things happen w.r.t. instruction selection.

> It looks like with optimization enabled it converts bitwise AND to right
> shift and then optimizes again. But SH4 has 'shad' and 'shad' can be
> optimized to 'tst'. And SH2E has libcall instead of dynamic shift and libcll
> cannot be converted. It seems that very first optimization spoils things.
> 
> But when we have numerous 'shar' instructions, optimization joins the game
> again and converts them to 'tst'.

Yes, something like this is what I remember happening there.  I'll try to look into the issue with your test cases and see if it's possible to add some patterns to catch those.

BTW, have you tried it on a more recent GCC?  There have also been some optimizations in the middle-end (a bit more backend independent) for this kind of thing.

> You are absolutely right, the code will be larger when we do right shifts.
> But there's situations when you can't use library. For exmple, SH7055 engine
> control unit can barely contain the program. The library just won't fit.

Have you tried to use whole program optimization via -flto and -ffunction-sections? It should be able to strip out all unnecessary library functions.
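A sketch of such an invocation (flags only; the triplet and file names are placeholders):

```shell
# Whole-program optimization plus linker section GC.  --gc-sections lets
# the linker drop unreferenced functions; LTO additionally allows
# inlining across translation units (the library must also be built
# with -flto for that).
sh-elf-gcc -O2 -m2e -flto -ffunction-sections -fdata-sections \
           -Wl,--gc-sections -o firmware.elf main.c util.c
```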
Comment 41 Alexander Klepikov 2023-05-26 12:03:34 UTC
> > It looks like with optimization enabled it converts bitwise AND to right
> > shift and then optimizes again. But SH4 has 'shad' and 'shad' can be
> > optimized to 'tst'. And SH2E has libcall instead of dynamic shift and libcll
> > cannot be converted. It seems that very first optimization spoils things.
> > 
> > But when we have numerous 'shar' instructions, optimization joins the game
> > again and converts them to 'tst'.
> 
> Yes, something like this is what I remember happening there.  I'll try to
> look into the issue with your test cases and see if it's possible to add
> some patterns to catch those.

Thank you! I have an idea. If it's impossible to defer initial optimization, maybe it's possible to emit some intermediate insn and catch it and optimize later?
 
> BTW, have you tried it on a more recent GCC?  There have also been some
> optimizations in the middle-end (a bit more backend independent) for this
> kind of thing.

I tried 13.1 about a week or two ago with the same result.

> Have you tried to use whole program optimization via -flto and
> -ffunction-sections? It  should be able to strip out all unnecessary library
> functions.

Thank you for the advice, I'll give it a try.
Comment 42 Oleg Endo 2023-05-26 17:44:30 UTC
(In reply to Alexander Klepikov from comment #41)
> 
> Thank you! I have an idea. If it's impossible to defer initial optimization,
> maybe it's possible to emit some intermediate insn and catch it and optimize
> later?

This is basically what is supposed to be happening there already.

However, it's a bit of a dilemma.

1) If we don't have a dynamic shift insn and we smash the constant shift into individual stitching shifts early, it might open some new optimization opportunities, e.g. by sharing intermediate shift results.  Not sure if that actually happens in practice, though.

2) Whether we use the dynamic shift insn, emit a function call, or use a stitched shift sequence, it all has an impact on register allocation and other instruction use.  This can be problematic during the course of the RTL optimization passes.

3) Even if we have a dynamic shift, sometimes it's more beneficial to emit a shorter stitched shift sequence.  Which one is better depends on the surrounding code.  I don't think there is a good mechanism there to make the proper choice.
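For reference, "stitching" here means composing an arbitrary constant shift from the shift amounts SH1/SH2 can do in one insn (1, 2, 8, 16 via shll/shll2/shll8/shll16). A rough C model of a greedy decomposition, illustrative only — the real md patterns and shift tables pick the cheapest sequence, which is not always the greedy one:

```c
#include <assert.h>
#include <stdint.h>

/* Compose a left shift by n (0..31) from the single-insn shift
   amounts available on SH1/SH2: 16, 8, 2, 1.  Each inner iteration
   models one shll16 / shll8 / shll2 / shll instruction. */
static uint32_t shl_stitched (uint32_t x, unsigned n)
{
  static const unsigned steps[] = { 16, 8, 2, 1 };
  for (unsigned i = 0; i < 4; ++i)
    while (n >= steps[i])
      {
        x <<= steps[i];
        n -= steps[i];
      }
  return x;
}
```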

Some other shift related PRs: PR 54089, PR 65317, PR 67691, PR 67869, PR 52628, PR 58017


> > BTW, have you tried it on a more recent GCC?  There have also been some
> > optimizations in the middle-end (a bit more backend independent) for this
> > kind of thing.
> 
> I tried 13.1 about week or two ago with the same result.
> 

Good to know.  Thanks for checking it!
Comment 43 Alexander Klepikov 2023-05-28 10:24:22 UTC
> > Thank you! I have an idea. If it's impossible to defer initial optimization,
> > maybe it's possible to emit some intermediate insn and catch it and optimize
> > later?
> 
> This is basically what is supposed to be happening there already.

Well, not really. Look at what happens during the expand pass when 'ashrsi3' is expanded. The function 'expand_ashiftrt' is called, and what it does at the end is explicitly emit three insns:

  wrk = gen_reg_rtx (Pmode);

  //This one
  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);

  sprintf (func, "__ashiftrt_r4_%d", value);
  rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;

  //This one
  emit_insn (gen_ashrsi3_n (GEN_INT (value), wrk, lab));

  //And this one
  emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));

As far as I understand, these insns could be caught later by a peephole and converted to a 'tstsi_t' insn, as is done for other, much simpler insn sequences.

What I'm thinking about is emitting only one, say 'compound', insn, which could then be split later somewhere in the split pass into the function call, i.e. into those three insns.

I wrote test code that emits only one bogus insn. This insn expands to pure asm code. Of course, that asm code is invalid, because it is impossible to place a libcall label at the end of a function with pure asm code injection. But then everything that could be converted to 'tst' is converted to 'tst'. And all real right shifts convert to invalid asm code, of course.

That's why I am thinking about the possibility of emitting some intermediate insn at the expand pass that defers its real expansion. But I still don't know how to do it right, or even whether it is possible.

By the way, a right shift of an int expands to only one 'lshiftrt' insn, and that's why it can be caught and converted to 'tst'.

> 
> However, it's a bit of a dilemma.
> 
> 1) If we don't have a dynamic shift insn and we smash the constant shift
> into individual 
> stitching shifts early, it might open some new optimization opportunities,
> e.g. by sharing intermediate shift results.  Not sure though if that
> actually happens in practice though.
> 
> 2) Whether to use the dynamic shift insn or emit a function call or use
> stitching shifts sequence, it all has an impact on register allocation and
> other instruction use.  This can be problematic during the course of RTL
> optimization passes.
> 
> 3) Even if we have a dynamic shift, sometimes it's more beneficial to emit a
> shorter stitching shift sequence.  Which one is better depends on the
> surrounding code.  I don't think there is anything good there to make the
> proper choice.
> 
> Some other shift related PRs: PR 54089, PR 65317, PR 67691, PR 67869, PR
> 52628, PR 58017

Thank you for your time and detailed explanations! I agree with you on all points. Software cannot be perfect, and it's OK for GCC not to be super optimized, so this part had better be left intact.

By the way, I tried linking the library into my project and found that the linker is smart enough to pull in only the necessary library functions, even without LTO. So the increase in size is about 100 or 200 bytes, which is acceptable. Thank you very much for the help!
Comment 44 Oleg Endo 2023-05-28 10:48:16 UTC
(In reply to Alexander Klepikov from comment #43)
> 
> Well, not really. Look what's happening during expand pass when 'ashrsi3' is
> expanding. Function 'expand_ashiftrt' is called and what it does at the end
> - it explicitly emits 3 insns:
> [...]

> 
> By the way, right shift for integers expands to only one 'lshiftrt' insn and
> that's why it can be catched and converted to 'tst'.
> 

Yeah, I might have dropped the ball on the right shift patterns back then and only reworked the left shift patterns to do that. 


> 
> As far as I understand these insns could be catched later by a peephole and
> converted to 'tstsi_t' insn like it is done for other much simple insn
> sequences.

It's the combine RTL pass and the split1 RTL pass that do most of this work here.  The peephole pass in GCC is too simplistic for this.


> 
> Thank you for your time and detailed explanations! I agree with you on all
> points. Software cannot be perfect and it's OK for GCC not to be super
> optimized, so this part better sholud be left intact.

We can't have it perfect, but we can try ;)

> 
> By the way, I tried to link library to my project and I figured out that
> linker is smart enough to link only necessary library functions even without
> LTO. So increase in size is about 100 or 200 bytes, that is acceptable.
> Thank you very much for help!

You're welcome.

Yes, stripping out unused library functions doesn't need LTO.  But using LTO for embedded/MCU firmware, I find it can reduce the code size by about 20%.  For example, it can also inline small library functions into your code (if the library was also compiled with LTO).
Comment 45 Alexander Klepikov 2023-05-29 14:54:11 UTC
>I have an idea. If it's impossible to defer initial optimization,
> maybe it's possible to emit some intermediate insn and catch it and optimize
> later?


Good news. I've made a proof of concept. It works at least sometimes - on simple tests.

$ cat f.c
#define A 0xFFFF0000
#define P ((unsigned char *)A)
#define F 64
#define S 8

unsigned char f_non_zero(unsigned char v){
    return (v & F) != 0;
}

unsigned f_sym_non_zero(void){
    return (*P & F) != 0;
}

unsigned f_sym_mask(void){
    return (*P & F) == F;
}

int f_rshift(char v){
    return v >> S;
}

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -O2 -mb -m2e -da -S f.c

$ cat f.s
        .file   "f.c"
        .text
        .text
        .align 1
        .align 2
        .global _f_non_zero
        .type   _f_non_zero, @function
_f_non_zero:
        mov     r4,r0
        sts.l   pr,@-r15
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
        .size   _f_non_zero, .-_f_non_zero
        .align 1
        .align 2
        .global _f_sym_non_zero
        .type   _f_sym_non_zero, @function
_f_sym_non_zero:
        mov.l   .L6,r1
        sts.l   pr,@-r15
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
.L7:
        .align 2
.L6:
        .long   -65536
        .size   _f_sym_non_zero, .-_f_sym_non_zero
        .align 1
        .align 2
        .global _f_sym_mask
        .type   _f_sym_mask, @function
_f_sym_mask:
        mov.l   .L10,r1
        sts.l   pr,@-r15
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
.L11:
        .align 2
.L10:
        .long   -65536
        .size   _f_sym_mask, .-_f_sym_mask
        .align 1
        .align 2
        .global _f_rshift
        .type   _f_rshift, @function
_f_rshift:
        mov.l   .L14,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        lds.l   @r15+,pr
        rts
        nop
.L15:
        .align 2
.L14:
        .long   ___ashiftrt_r4_8
        .size   _f_rshift, .-_f_rshift
        .ident  "GCC: (GNU) 13.1.0"

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -O2 -ml -m2e -da -S f.c

$ cat f.s
        .file   "f.c"
        .text
        .little
        .text
        .align 1
        .align 2
        .global _f_non_zero
        .type   _f_non_zero, @function
_f_non_zero:
        mov     r4,r0
        sts.l   pr,@-r15
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
        .size   _f_non_zero, .-_f_non_zero
        .align 1
        .align 2
        .global _f_sym_non_zero
        .type   _f_sym_non_zero, @function
_f_sym_non_zero:
        mov.l   .L6,r1
        sts.l   pr,@-r15
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
.L7:
        .align 2
.L6:
        .long   -65536
        .size   _f_sym_non_zero, .-_f_sym_non_zero
        .align 1
        .align 2
        .global _f_sym_mask
        .type   _f_sym_mask, @function
_f_sym_mask:
        mov.l   .L10,r1
        sts.l   pr,@-r15
        mov.b   @r1,r0
        tst     #64,r0
        mov     #-1,r0
        negc    r0,r0
        lds.l   @r15+,pr
        rts
        nop
.L11:
        .align 2
.L10:
        .long   -65536
        .size   _f_sym_mask, .-_f_sym_mask
        .align 1
        .align 2
        .global _f_rshift
        .type   _f_rshift, @function
_f_rshift:
        mov.l   .L14,r1
        sts.l   pr,@-r15
        jsr     @r1
        exts.b  r4,r4
        mov     r4,r0
        lds.l   @r15+,pr
        rts
        nop
.L15:
        .align 2
.L14:
        .long   ___ashiftrt_r4_8
        .size   _f_rshift, .-_f_rshift
        .ident  "GCC: (GNU) 13.1.0"

Splitting takes place at split1 pass as expected. Here is the patch itself.

$ cat gcc-13.1.0-ashrsi3_libcall.patch
diff -ur gcc-13.1.0.orig/gcc/config/sh/sh-protos.h gcc-13.1.0/gcc/config/sh/sh-protos.h
--- gcc-13.1.0.orig/gcc/config/sh/sh-protos.h   2023-04-26 10:09:39.000000000 +0300
+++ gcc-13.1.0/gcc/config/sh/sh-protos.h        2023-05-29 11:45:05.134723435 +0300
@@ -78,6 +78,7 @@
 extern void gen_shifty_op (int, rtx *);
 extern void gen_shifty_hi_op (int, rtx *);
 extern bool expand_ashiftrt (rtx *);
+extern bool expand_ashrsi3_libcall (rtx *);//delete
 extern bool sh_dynamicalize_shift_p (rtx);
 extern int shl_and_kind (rtx, rtx, int *);
 extern int shl_and_length (rtx);
diff -ur gcc-13.1.0.orig/gcc/config/sh/sh.cc gcc-13.1.0/gcc/config/sh/sh.cc
--- gcc-13.1.0.orig/gcc/config/sh/sh.cc 2023-04-26 10:09:39.000000000 +0300
+++ gcc-13.1.0/gcc/config/sh/sh.cc      2023-05-29 17:09:54.602787537 +0300
@@ -3875,11 +3877,37 @@
   wrk = gen_reg_rtx (Pmode);

   /* Load the value into an arg reg and call a helper.  */
-  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
+  /*emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
   sprintf (func, "__ashiftrt_r4_%d", value);
   rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;
   emit_insn (gen_ashrsi3_n (GEN_INT (value), wrk, lab));
   emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));
+  return true;*/
+
+  if (dump_file)
+    fprintf(dump_file, "ashrsi3: Emitting collapsed libcall\n");
+  emit_insn (gen_ashrsi3_libcall_collapsed (operands[0], operands[1], GEN_INT(value)));//delete
+  return true;//delete
+}
+
+//delete
+bool
+expand_ashrsi3_libcall (rtx *operands) {
+  char func[18];
+
+  if (dump_file)
+    fprintf(dump_file, "ashrsi3_libcall_collapsed: Expanding ashrsi3 libcall\n");
+
+  rtx wrk = gen_reg_rtx (Pmode);
+  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
+
+  sprintf (func, "__ashiftrt_r4_%d", INTVAL (operands[2]));
+
+  rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;
+
+  emit_insn (gen_ashrsi3_n (operands[2], wrk, lab));
+  emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));
+
   return true;
 }

diff -ur gcc-13.1.0.orig/gcc/config/sh/sh.md gcc-13.1.0/gcc/config/sh/sh.md
--- gcc-13.1.0.orig/gcc/config/sh/sh.md 2023-04-26 10:09:39.000000000 +0300
+++ gcc-13.1.0/gcc/config/sh/sh.md      2023-05-29 17:10:42.752779922 +0300
@@ -3867,6 +3867,35 @@
   [(set_attr "type" "sfunc")
    (set_attr "needs_delay_slot" "yes")])

+(define_insn "ashrsi3_libcall_collapsed"
+  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
+       (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand" "0")
+                    (match_operand:SI 2 "const_int_operand")))
+   (clobber (reg:SI T_REG))
+   (clobber (reg:SI PR_REG))]
+  "TARGET_SH1"
+  "OOPS"
+  [(set_attr "type" "dyn_shift")
+   (set_attr "needs_delay_slot" "yes")])
+
+(define_insn_and_split "ashrsi3_libcall_expand"
+  [(parallel [(set (match_operand:SI 0 "arith_reg_dest")
+       (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand")
+           (match_operand:SI 2 "const_int_operand"))
+       )(clobber (reg:SI T_REG))
+       (clobber (reg:SI PR_REG))
+  ])]
+  "TARGET_SH1"
+  "OOPS_1"
+  "&& 1"
+  [(const_int 0)]
+{
+  if (expand_ashrsi3_libcall(operands))
+    DONE;
+  else
+    FAIL;
+})
+
 ;; . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 ;; DImode arithmetic shift right

I did it by feel, basically tweaking the parameters until it worked. So please check and improve it, because I'm sure it will break something. Thank you.
Comment 46 Eric Gallager 2023-05-30 01:48:18 UTC
Reminder that patches go to the gcc-patches mailing list
Comment 47 Oleg Endo 2023-05-30 01:56:31 UTC
(In reply to Eric Gallager from comment #46)
> Reminder that patches go to the gcc-patches mailing list

It's OK.  We're just throwing around ideas, nothing final.
Comment 48 Alexander Klepikov 2023-05-30 12:42:00 UTC
I ran tests (including *.c files from the GCC testsuite) and everything looks fine for now. But I'm still afraid that the pattern for 'ashrsi3_libcall_expand' is too broad. It is possible to narrow it down by adding a distinct attribute, setting it when emitting 'ashrsi3_libcall_collapsed', and then checking it and failing if it is not set:

(define_attr "libcall_collapsed"
 "ashrsi3,nil"
 (const_string "nil"))

(define_insn "ashrsi3_libcall_collapsed"
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
        (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand" "0")
                     (match_operand:SI 2 "const_int_operand")))
   (clobber (reg:SI T_REG))
   (clobber (reg:SI PR_REG))]
  "TARGET_SH1"
  "OOPS"
  [(set_attr "type" "dyn_shift")
   (set_attr "libcall_collapsed" "ashrsi3")
   (set_attr "needs_delay_slot" "yes")])

 if (get_attr_libcall_collapsed(insn) != LIBCALL_COLLAPSED_ASHRSI3)
    return false;

It would be super safe then, but a little bit ugly.
Comment 49 Oleg Endo 2023-05-30 19:57:07 UTC
(In reply to Alexander Klepikov from comment #48)
> I made tests (including *.c files from GCC testsuite) and everything looks
> fine for now. But I'm still afraid that pattern for 'ashrsi3_libcall_expand'
> is too wide. It is possible to narrow it down as much as possible by adding
> distinct attribute and set when emitting 'ashrsi3_libcall_collapsed' and
> then check it and fail if not set:
> 

For this kind of change, the whole GCC test suite needs to be run for at least the big/little endian -m2/-m4 variants.


+(define_insn_and_split "ashrsi3_libcall_expand"
+  [(parallel [(set (match_operand:SI 0 "arith_reg_dest")
+       (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand")
+           (match_operand:SI 2 "const_int_operand"))
+       )(clobber (reg:SI T_REG))
+       (clobber (reg:SI PR_REG))
+  ])]

The 'parallel' construct looks strange.
Comment 50 Oleg Endo 2023-05-30 20:00:18 UTC
Actually, let's take any further discussion of shift patterns to PR 54089.