Created attachment 24411 [details]
test for various integer types and constant values 0...255

The "TST #imm, R0" instruction is a bit underutilized on SH targets. For some
bit patterns of the immediate constant, the compiler tries to extract the bits
in question by various means and tests the result against zero/non-zero,
missing the straightforward instruction. In particular:

* one single bit
* n contiguous bits starting at bit 0

When testing a byte against 0x80 it uses "CMP/PZ", which is as good as a "TST"
instruction (in terms of costs), so this is OK.

I first spotted this around version 4.4. It still happens with 4.6 and the
following config:

Using built-in specs.
COLLECT_GCC=sh-elf-gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.6.1/lto-wrapper
Target: sh-elf
Configured with: ../gcc-4.6-20110527/configure --target=sh-elf
--prefix=/usr/local --enable-languages=c,c++ --enable-multilib
--disable-libssp --without-headers --disable-nls --disable-werror
--enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib
Thread model: single
gcc version 4.6.1 20110527 (prerelease) (GCC)
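For illustration, a minimal C sketch of the two missed cases described above
(function names and constants are mine, chosen for illustration; the attached
test covers the integer types and constants 0...255 more systematically):

/* Single bit: should become "tst #32,r0" instead of a bit extraction.  */
int test_single_bit (int x)
{
  return (x & 0x20) ? 20 : 40;
}

/* n contiguous bits starting at bit 0: should become "tst #15,r0".  */
int test_low_bits (int x)
{
  return (x & 0x0F) ? 20 : 40;
}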
Created attachment 24412 [details]
Proposed Patch

Although the patch gets the job done, my programmer's sense tells me it is
fishy, or at best a brute-force cure of the symptoms, not the cause. It's my
first GCC patch, so any feedback is highly appreciated.

What I did was look at the RTL, in particular in the combine pass, identify
patterns it failed to find a "shortcut" (tst insn) for, and add them to the
insn descriptions. I also had to extend the cost calculation for the AND
instruction to cover AND, OR and XOR (on SH they cost the same anyway), or
else the cost of a matched replacement insn would lead to a rejection in the
combine pass.
It looks like playing with the internal behavior of the combine pass is a bit
fragile. Perhaps an alternative is to define a peephole for tst #imm8,r0,
something like:

;; A peephole for the TST immediate instruction.
(define_peephole2
  [(set (match_operand:SI 2 "arith_reg_operand" "r")
	(and:SI (match_operand:SI 0 "arith_reg_operand" "z")
		(match_operand:SI 1 "const_int_operand" "i")))
   (set (reg:SI T_REG)
	(eq:SI (match_dup 2)
	       (const_int 0)))]
  "TARGET_SH1
   && peep2_reg_dead_p (2, operands[2])
   && (satisfies_constraint_I08 (operands[1])
       || satisfies_constraint_K08 (operands[1]))"
  [(set (reg:SI T_REG)
	(eq:SI (and:SI (match_dup 0) (match_dup 1))
	       (const_int 0)))]
  "
{
  operands[1] = GEN_INT (INTVAL (operands[1]) & 0xff);
}")

which will work at -O2 or with -fpeephole2, though there are pros and cons to
the peephole approach.
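As an illustration of what this peephole is meant to catch (my example, not
from the original comment; the and result dies after the comparison, so the
peep2_reg_dead_p condition holds):

int test (int x)
{
  /* Expanded as "and #15,r0; tst r0,r0"; the peephole above would
     rewrite the pair into a single "tst #15,r0".  */
  return (x & 0x0F) == 0;
}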
Thanks for having a look at it, Kaz. Yes, fiddling with the combine pass is
fragile.

I've tried out your peephole idea. It works in most cases, but not all the
time. E.g. it doesn't catch the following example:

int test_imm (int i)
{
  return (i & 3) ? 20 : 40;
}

	mov	#3,r1
	and	r1,r4
	tst	r4,r4
	bt/s	.L5
	mov	#40,r0
	mov	#20,r0
.L5:
	rts
	nop

Also the following ...

int test_imm (int* i)
{
  return (*i & 255) ? 20 : 40;
}

becomes on little endian (OK):

	mov.b	@r4,r1
	tst	r1,r1
	bt/s	.L10
	mov	#40,r0
	mov	#20,r0
.L10:
	rts
	nop

but on big endian (NG):

	mov.l	@r4,r1
	extu.b	r1,r1
	tst	r1,r1
	bt/s	.L10
	mov	#40,r0
	mov	#20,r0
.L10:
	rts
	nop

I'll give the thing another try.

Regarding the change to the andcosts function, the following should be
better:

--- sh.c	(revision 175188)
+++ sh.c	(working copy)
@@ -243,7 +243,7 @@
 static int flow_dependent_p (rtx, rtx);
 static void flow_dependent_p_1 (rtx, const_rtx, void *);
 static int shiftcosts (rtx);
-static int andcosts (rtx);
+static int and_xor_ior_costs (rtx, int code);
 static int addsubcosts (rtx);
 static int multcosts (rtx);
 static bool unspec_caller_rtx_p (rtx);
@@ -2841,14 +2841,14 @@
   return shift_insns[value];
 }
 
-/* Return the cost of an AND operation.  */
+/* Return the cost of an AND/XOR/IOR operation.  */
 static inline int
-andcosts (rtx x)
+and_xor_ior_costs (rtx x, int code)
 {
   int i;
 
-  /* Anding with a register is a single cycle and instruction.  */
+  /* register x register is a single cycle instruction.  */
   if (!CONST_INT_P (XEXP (x, 1)))
     return 1;
@@ -2864,17 +2864,17 @@
     }
 
   /* These constants are single cycle extu.[bw] instructions.  */
-  if (i == 0xff || i == 0xffff)
+  if ((i == 0xff || i == 0xffff) && code == AND)
     return 1;
 
-  /* Constants that can be used in an and immediate instruction in a single
+  /* Constants that can be used in an instruction with an immediate are a single
      cycle, but this requires r0, so make it a little more expensive.  */
   if (CONST_OK_FOR_K08 (i))
     return 2;
 
-  /* Constants that can be loaded with a mov immediate and an and.
+  /* Constants that can be loaded with a mov immediate need one more cycle.
      This case is probably unnecessary.  */
   if (CONST_OK_FOR_I08 (i))
     return 2;
 
-  /* Any other constants requires a 2 cycle pc-relative load plus an and.
+  /* Any other constant requires an additional 2 cycle pc-relative load.
      This case is probably unnecessary.  */
   return 3;
 }
@@ -3043,7 +3043,9 @@
       return true;
 
     case AND:
-      *total = COSTS_N_INSNS (andcosts (x));
+    case XOR:
+    case IOR:
+      *total = COSTS_N_INSNS (and_xor_ior_costs (x, code));
       return true;
 
     case MULT:
Yes, that peephole doesn't catch all the patterns which could make use of
tst #imm8,r0. Perhaps it would be a good idea to get numbers for a test like
the CSiBE test, comparing the vanilla compiler against one patched with the
new insns/peepholes. If something covers 80% of the possible cases in the
usual working set, that would be successful enough for such a
micro-optimization, I guess.

The cost patch looks fine to me. Could you propose it as a separate patch on
the gcc-patches list with an appropriate ChangeLog entry? When proposing it,
please describe how you've tested it. The numbers obtained with the patch
are also highly welcome.

BTW, do you have an FSF copyright assignment for your GCC work? The cost
patch itself is essentially several lines, which doesn't require a copyright
assignment, but the other changes you've proposed clearly require the paper
work, I think.
> Yes, that peephole doesn't catch all the patterns which could
> make use of tst #imm8,r0. Perhaps it would be a good idea to get
> numbers for a test like the CSiBE test, comparing the vanilla
> compiler against one patched with the new insns/peepholes. If
> something covers 80% of the possible cases in the usual working
> set, that would be successful enough for such a
> micro-optimization, I guess.

I'd also like to see some numbers on those micro-improvements. I'll have a
look at CSiBE.

Anyway, why not just add all the currently known-to-work cases? What are
your concerns regarding that? I can imagine that it is a maintenance burden
to keep all those definitions and special cases in the MD up-to-date (bit
rot etc.). Do you have anything other than that in mind?

It seems that others have been trying to solve the same problem in a very
similar way: http://patchwork.ozlabs.org/patch/58832/ ;)

I've found that the following pattern works quite well for this particular
case. It seems to catch all the bit patterns.
(sh_tst_const_ok and sh_zero_extract_val are functions added in sh.c)

(define_insn_and_split "*tstsi3"
  [(set (reg:SI T_REG)
	(zero_extract:SI (match_operand:SI 0 "arith_reg_operand" "")
			 (match_operand:SI 1 "const_int_operand")
			 (match_operand:SI 2 "const_int_operand")))]
  "TARGET_SH1
   && sh_tst_const_ok (sh_zero_extract_val (operands[1], operands[2]))"
  "#"
  "&& 1"
  [(const_int 0)]
"{
  int tstval = sh_zero_extract_val (operands[1], operands[2]);
  emit_insn (gen_tstsi (operands[0], GEN_INT (tstval)));
  DONE;
}"
)

... however, the problem with that one is that the T bit is inverted
afterwards, and thus the following branch logic (or whatever) gets inverted
as well. One option could be to emit some T bit inversion insn after
gen_tstsi and then optimize it away later on with e.g. a peephole (?), but I
haven't tried that yet.

The actual "problem" is that the combine pass first sees the andsi, eq, ...
insns and correctly matches them to those in the MD. But instead of
continuing to match patterns from the MD, it expands the and insn into
something like zero_extract, which in turn will never be combined back into
the tst insn. Isn't there a way to tell the combine pass not to do so, but
instead first look deeper at what is in the MD?

Regarding the peephole:

> (satisfies_constraint_I08 (operands[1])
>  || satisfies_constraint_K08 (operands[1]))

I guess this might generate wrong code for e.g. "if (x & -2)". When x has
any of bits [31:1] set, this must return true. The code after the peephole
optimization will look only at the lower 8 bits and would possibly return
false for x = 0xFF000000, which is wrong. So it should be
satisfies_constraint_K08 only, shouldn't it? (See the worked example at the
end of this comment.)

> The cost patch looks fine to me. Could you propose it as a separate
> patch on the gcc-patches list with an appropriate ChangeLog entry?
> When proposing it, please describe how you've tested it. The numbers
> obtained with the patch are also highly welcome.
>
> BTW, do you have an FSF copyright assignment for your GCC work? The
> cost patch itself is essentially several lines, which doesn't require
> a copyright assignment, but the other changes you've proposed clearly
> require the paper work, I think.

I'll contact you directly regarding that.
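As a worked example of the I08 concern above (my construction, not from the
attached test):

int test (unsigned int x)
{
  /* With I08 allowed, the peephole would turn "x & -2" (mask
     0xFFFFFFFE) into "tst #254,r0", which only tests bits [7:1].
     For x = 0xFF000000 the correct result is "non-zero", but
     tst #254 would report "zero".  Hence only K08 (unsigned 8-bit)
     immediates are safe here.  */
  return (x & -2) != 0;
}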
(In reply to comment #5)
> Anyway, why not just add all the currently known-to-work cases? What are
> your concerns regarding that? I can imagine that it is a maintenance
> burden to keep all those definitions and special cases in the MD
> up-to-date (bit rot etc.). Do you have anything other than that in mind?

Yep, maintenance burden, but I don't mean ack/nak for anything. If it's
fruitful enough, we should take that route. When it gives a 5% improvement
in the usual working set such as CSiBE, hundreds of lines would be OK, but
if it's ~0.5% or less, it doesn't look worth adding many patterns for that.

> Isn't there a way to tell the combine pass not to do so, but instead
> first look deeper at what is in the MD?

I don't know how to do it cleanly.

> I guess this might generate wrong code for e.g. "if (x & -2)". When x has
> any of bits [31:1] set, this must return true. The code after the
> peephole optimization will look only at the lower 8 bits and would
> possibly return false for x = 0xFF000000, which is wrong. So it should be
> satisfies_constraint_K08 only, shouldn't it?

You are right. That peephole was simply 'something like this'.
(In reply to comment #6)
> Yep, maintenance burden, but I don't mean ack/nak for anything. If it's
> fruitful enough, we should take that route. When it gives a 5%
> improvement in the usual working set such as CSiBE, hundreds of lines
> would be OK, but if it's ~0.5% or less, it doesn't look worth adding many
> patterns for that.
>
> > Isn't there a way to tell the combine pass not to do so, but instead
> > first look deeper at what is in the MD?
>
> I don't know how to do it cleanly.

I've tried out a couple of things and got some CSiBE numbers based on trunk
rev 179430. Unfortunately only code size comparisons, no run time
performance numbers. The tests were compiled with
-ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd
-freg-struct-return

Option 1) Use many (~10) patterns in the MD and some cost calculation
tuning. The last patch required some adaptation, because the combine pass
started trying to match things slightly differently. I've also noticed that
it requires a special case for one pattern on SH4A ...

size of all modules:  2916390 -> 2909026   -7364 / -0.252504 %
avg difference over all modules:  -409.111111 / -0.273887 %
max: compiler      22808 -> 22812     +4 / +0.017538 %
min: libpng-1.2.5  99120 -> 97804  -1316 / -1.327684 %

Option 2) I've added another combine pass which has the
make_compound_operation function turned off. The make_compound_operation
function is used to produce zero_extract patterns. If the resulting
"simplified" pattern does not match anything in the MD, combine reverts the
transformation and proceeds with the next insn. That way, it never tries to
match the tst #imm pattern in the MD. With this option only ~5 patterns
seem to be required, plus a small extension of the cost calculation.

size of all modules:  2916390 -> 2909170   -7220 / -0.247566 %
avg difference over all modules:  -401.111111 / -0.254423 %
max: compiler      22808 -> 22812     +4 / +0.017538 %
min: libpng-1.2.5  99120 -> 97836  -1284 / -1.295400 %

Not so spectacular on average. It highly depends on the type of SW being
compiled, but it hits quite a lot of files in CSiBE.

Option 2 seems more robust even if it seems less effective. What do you
think?
(In reply to comment #7)
> Option 2 seems more robust even if it seems less effective. What do you
> think?

Another combine pass to reduce size by less than 0.3% on one target would
not be acceptable, I guess. ~10 new patterns would be overkill for that
result, though I still expect that a few of those patterns are dominant.
Could you get numbers showing which patterns were used in the former
option?
Created attachment 25461 [details]
CSiBE comparisons

(In reply to comment #8)
> Another combine pass to reduce size by less than 0.3% on one target would
> not be acceptable, I guess.

I'm sorry, I forgot to mention that it was just a proof-of-concept hack of
mine, just to see whether it has any chance of working at all.
I think it would be better to change/fix the behavior of the combine pass
in this regard, so that it tries matching combined patterns without
sophisticated transformations. I will try asking on the gcc list about
that.

> ~10 new patterns would be overkill for that result, though I still expect
> that a few of those patterns are dominant.

Yep. Even if it turned out to be actually only 8 patterns in total ... but
still. I haven't looked at the issue with SH4A yet, and most likely it
would add another one or two patterns ... so basically ~10 :)

> Could you get numbers showing which patterns were used in the former
> option?

Checking out each individual pattern would be a bit too much, I think.
Instead I grouped them by what they are effectively doing. While I was at
it, I also added your peephole idea, and a top 10 listing of the individual
files.
(In reply to comment #9)
> 3) only zero_extract special cases

looks to be dominant.

> I'm sorry, I forgot to mention that it was just a proof-of-concept hack
> of mine, just to see whether it has any chance of working at all.
> I think it would be better to change/fix the behavior of the combine pass
> in this regard, so that it tries matching combined patterns without
> sophisticated transformations. I will try asking on the gcc list about
> that.

I see. I also expect that the experts have some ideas for this issue.

> Checking out each individual pattern would be a bit too much, I think.

I don't think that it's too much. Those numbers can be easily collected for
CSiBE. If your patterns are named, you could simply add "-dap -save-temps"
to the compiler options which are specified when running CSiBE's
create-config, and then get the occurrences of testsi_6, for example, with
something like

  grep "testsi_6" `find . -name "*.s" -print` | wc -l

after running the CSiBE size test.
Created attachment 25491 [details]
Proposed patch including test case

> > 3) only zero_extract special cases
>
> looks to be dominant.

Yes. I've briefly looked through the test sources. A popular use case for
bit test instructions seems to be single bit tests, which is basically what
the patch adds.

> I see. I also expect that the experts have some ideas for this issue.

Hm ... http://gcc.gnu.org/ml/gcc/2011-10/msg00189.html
Eric pointed me to the i386 back-end. Unfortunately, what I found there is
where I originally started ...

;; Combine likes to form bit extractions for some tests.  Humor it.

I.e. it is also coded against the behavior of the combine pass, with a
bunch of pattern variations. I guess that's the way it's supposed to be
done then :T

> I don't think that it's too much. Those numbers can be easily collected
> for CSiBE. If your patterns are named, you could simply add
> "-dap -save-temps" to the compiler options which are specified when
> running CSiBE's create-config, and then get the occurrences of testsi_6,
> for example, with something like
>   grep "testsi_6" `find . -name "*.s" -print` | wc -l
> after running the CSiBE size test.

Ah, right! With the attached latest patch applied to trunk rev 179778, the
numbers for "-ml -m4-single -Os -mnomacsave -mpretend-cmove -mfused-madd
-freg-struct-return" look like this:

tstsi_t:                  1391
tsthi_t:                     4
tstqi_t:                    23
tstqi_t_zero:              667
tstsi_t_and_not:           598
tstsi_t_zero_extract_eq:    70
tstsi_t_zero_extract_xor:  923

Notice that the split contributes to the tstsi_t number. Also, the three
patterns

tstsi_t_zero_extract_xor
tstsi_t_zero_extract_subreg_xor_little
tstsi_t_zero_extract_subreg_xor_big

are basically one and the same. On SH4A the subreg variants are required,
because tstsi_t_zero_extract_xor will never match.

I've also added a special case to sh_rtx_costs to detect at least the
tstsi_t pattern. However, the other patterns are not really covered by
that, and the combine pass calculates the cost as the sum of all the
operations of the pattern. I guess the selection of the test instruction
could be encouraged a bit more by a more accurate cost calculation, but my
feeling is that it won't do a lot.

Cheers,
Oleg
(In reply to comment #11)
> Created attachment 25491 [details]
> Proposed patch including test case

Looks fine. Two very minor style nits:

> +  if (GET_CODE (XEXP (x, 0)) == AND /* tst instruction.  */

This comment looks a bit bogus. A full sentence comment would be better.

> +
> +

There are some extra empty lines. GNU/GCC coding style says that only one
empty line is needed. I know that there are extra empty lines already, but
we should not add new ones :-)
Author: kkojima
Date: Sat Oct 15 02:32:53 2011
New Revision: 180020

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=180020
Log:
	PR target/49263
	* config/sh/sh.h (ZERO_EXTRACT_ANDMASK): New macro.
	* config/sh/sh.c (sh_rtx_costs): Add test instruction case.
	* config/sh/sh.md (tstsi_t): Name existing insn.  Make inner
	and instruction commutative.
	(tsthi_t, tstqi_t, tstqi_t_zero, tstsi_t_and_not,
	tstsi_t_zero_extract_eq, tstsi_t_zero_extract_xor,
	tstsi_t_zero_extract_subreg_xor_little,
	tstsi_t_zero_extract_subreg_xor_big): New insns.
	(*movsicc_t_false, *movsicc_t_true): Replace space with tab in
	asm output.
	(*andsi_compact): Reorder alternatives so that K08 is considered
	first.
	* gcc.target/sh/pr49263.c: New.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.h
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog
With trunk rev 181517 I have observed the following problem, which happens
when compiling for -m2*, -m3*, -m4* at -Os. The function

int test_int64_lowword (long long* x)
{
  return *x & 0xFFFFFFFF ? -20 : -40;
}

compiled with -m2 -Os results in the following code:

	mov	#0,r2		! 42	movsi_i/3	[length = 2]
	tst	r2,r2		! 44	cmpeqsi_t/1	[length = 2]
	bf.s	.L4		! 45	branch_false	[length = 2]
	mov.l	@(4,r4),r3	! 12	movsi_i/5	[length = 2]
	tst	r3,r3		! 46	cmpeqsi_t/1	[length = 2]
.L4:
	bt	.L3		! 14	branch_true	[length = 2]
	mov	#-20,r0		! 4	movsi_i/3	[length = 2]
	rts
	nop			! 52	*return_i	[length = 4]
.L3:
	rts			! 54	*return_i	[length = 2]
	mov	#-40,r0		! 5	movsi_i/3	[length = 2]

When compiled with -O1 or -O2, the high-word test is correctly eliminated:

	mov.l	@(4,r4),r1	! 10	movsi_i/5	[length = 2]
	tst	r1,r1		! 17	cmpeqsi_t/1	[length = 2]
	bt	.L4		! 18	branch_true	[length = 2]
	mov	#-20,r0		! 4	movsi_i/3	[length = 2]
	rts
	nop			! 50	*return_i	[length = 4]
.L4:
	rts			! 52	*return_i	[length = 2]
	mov	#-40,r0		! 5	movsi_i/3	[length = 2]

I'm not sure whether this is actually related to this PR, but I have
noticed it with the test cases of this PR. It seems to happen only for
comparison-like insns. If I remember correctly, this problem did not exist
when I started working on this PR.
(In reply to comment #14)
> With trunk rev 181517 I have observed the following problem, which
> happens when compiling for -m2*, -m3*, -m4* at -Os.

This is still present as of rev 182713 and seems to be a different issue.
I've created PR51697 for it.
Author: olegendo
Date: Sun Feb 26 13:31:32 2012
New Revision: 184585

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=184585
Log:
	PR target/49263
	* gcc.target/sh/pr49263.c: New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263.c
Modified:
    trunk/gcc/testsuite/ChangeLog
I guess this one is done now.
Not quite so done, actually. The following example cases do not work
properly:

void test00 (int* x, int xb)
{
  if (! (xb & 128))
    x[0] = 0;
}

-O2 -m4:
	mov	r5,r0
	and	#128,r0
	tst	r0,r0
	bf	.L3
	mov.l	r0,@r4
.L3:
	rts
	nop

void test01 (int* x, int xb)
{
  if (! (xb & 0x55))
    x[0] = 0;
}

-O2 -m4:
	mov	r5,r0
	and	#85,r0
	tst	r0,r0
	bf	.L7
	mov.l	r0,@r4
.L7:
	rts
	nop

It seems that combine is trying to look for the following patterns:

Failed to match this instruction:
(set (pc)
    (if_then_else (ne (zero_extract:SI (reg:SI 5 r5 [ xb ])
                (const_int 1 [0x1])
                (const_int 7 [0x7]))
            (const_int 0 [0]))
        (label_ref:SI 15)
        (pc)))

Failed to match this instruction:
(set (pc)
    (if_then_else (ne (and:SI (reg:SI 5 r5 [ xb ])
                (const_int 85 [0x55]))
            (const_int 0 [0]))
        (label_ref:SI 15)
        (pc)))

Another case that could see some improvement ...

bool test04 (int* x, int xb)
{
  return ! (xb & 0x55);
}

-O2 -m4 (OK):
	mov	r5,r0
	tst	#85,r0
	rts
	movt	r0

bool test02 (int* x, int xb)
{
  return ! (xb & (1 << 6));
}

-O2 -m4 (NG):
	mov	r5,r0
	mov	#-6,r1
	shld	r1,r0
	xor	#1,r0
	rts
	and	#1,r0
Another thing I've noticed ... cases such as:

	mov.l	r0,@r2		! LS
	mov	r13,r0		! MT
	and	#7,r0		! EX
	tst	r0,r0		! MT
	bt/s	.L8		! BR
	mov.l	r0,@(16,r1)

where the result of the and op is re-used, would be slightly better as:

	mov.l	r0,@r2		! LS
	mov	r13,r0		! MT
	tst	#7,r0		! MT
	and	#7,r0		! EX
	bt/s	.L8		! BR
	mov.l	r0,@(16,r1)

because it reduces the dependency on the result of the and op and thus has
a higher chance of being executed in parallel.
(In reply to comment #19)
> Another thing I've noticed ... cases such as:
>
> 	mov.l	r0,@r2		! LS
> 	mov	r13,r0		! MT
> 	and	#7,r0		! EX
> 	tst	r0,r0		! MT
> 	bt/s	.L8		! BR
> 	mov.l	r0,@(16,r1)
>
> where the result of the and op is re-used, would be slightly better as:
>
> 	mov.l	r0,@r2		! LS
> 	mov	r13,r0		! MT
> 	tst	#7,r0		! MT
> 	and	#7,r0		! EX
> 	bt/s	.L8		! BR
> 	mov.l	r0,@(16,r1)
>
> because it reduces the dependency on the result of the and op and thus
> has a higher chance of being executed in parallel.

Other similar cases where hoisting the tst insn would make sense:

	mov.b	@(13,r2),r0
	extu.b	r0,r0
	tst	#1,r0
	bt	.L53

	mov.b	@(13,r1),r0
	extu.b	r0,r0
	tst	r0,r0
	bt/s	.L37
(In reply to Oleg Endo from comment #18)
>
> It seems that combine is trying to look for the following patterns:
>
> Failed to match this instruction:
> (set (pc)
>     (if_then_else (ne (and:SI (reg:SI 5 r5 [ xb ])
>                 (const_int 85 [0x55]))
>             (const_int 0 [0]))
>         (label_ref:SI 15)
>         (pc)))

Implementing such a combine pattern like ...

(define_insn_and_split "*tst_cbranch"
  [(set (pc)
	(if_then_else
	    (ne (and:SI (match_operand:SI 0 "logical_operand")
			(match_operand:SI 1 "const_int_operand"))
		(const_int 0))
	    (label_ref (match_operand 2))
	    (pc)))
   (clobber (reg:SI T_REG))]
  "TARGET_SH1"
  "#"
  "&& 1"
  [(set (reg:SI T_REG)
	(eq:SI (and:SI (match_dup 0) (match_dup 1))
	       (const_int 0)))
   (set (pc)
	(if_then_else (eq (reg:SI T_REG) (const_int 0))
		      (label_ref (match_dup 2))
		      (pc)))])

... results in code like the following:

	mov	#33,r1
	mov	r5,r0
	tst	#33,r0
	bf/s	.L3
	and	r5,r1
	mov.l	r1,@r4
.L3:
	rts
	nop

which is worse. What happens is that the sequence is expanded to RTL as
follows:

(insn 7 4 8 2 (set (reg:SI 163 [ D.1856 ])
        (and:SI (reg/v:SI 162 [ xb ])
            (const_int 33 [0x21]))) sh_tmp.cpp:17 -1
     (nil))
(insn 8 7 9 2 (set (reg:SI 147 t)
        (eq:SI (reg:SI 163 [ D.1856 ])
            (const_int 0 [0]))) sh_tmp.cpp:17 -1
     (nil))
(jump_insn 9 8 10 2 (set (pc)
        (if_then_else (eq (reg:SI 147 t)
                (const_int 0 [0]))
            (label_ref:SI 15)
            (pc))) sh_tmp.cpp:17 301 {*cbranch_t}
     (int_list:REG_BR_PROB 3900 (nil))
 -> 15)
(note 10 9 11 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 11 10 12 4 (set (reg:SI 164)
        (const_int 0 [0])) sh_tmp.cpp:18 -1
     (nil))
(insn 12 11 15 4 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
        (reg:SI 164)) sh_tmp.cpp:18 -1
     (nil))

and the cse1 pass decides that the result of the and operation can be
shared and replaces the operand in insn 12 with reg:SI 163:

(insn 12 11 15 3 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
        (reg:SI 163 [ D.1856 ])) sh_tmp.cpp:18 258 {movsi_ie}
     (expr_list:REG_DEAD (reg:SI 164)
        (expr_list:REG_DEAD (reg/v/f:SI 161 [ x ])
            (nil))))

so insn 11 becomes dead code and is eliminated. All of that happens a long
time before combine, so the tst combine patterns have no chance to
reconstruct the original code.

A sequence such as

	mov	r5,r0
	mov	#0,r1
	tst	#33,r0
	bf	.L3
	mov.l	r1,@r4
.L3:
	rts
	nop

could probably be achieved by combining insn 7 and insn 8 shortly after RTL
expansion, or even during the expansion of insn 8 (by looking at previously
expanded insns and emitting a tst insn directly). The idea would be to
reduce dependencies on the tested register, which allows better scheduling.
In addition to that, on SH4A "mov #imm8,Rn" is an MT group instruction,
which has a higher probability of being executed in parallel.
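For reference, the RTL above presumably corresponds to a source function
along these lines (my reconstruction of the sh_tmp.cpp test, following the
test00/test01 shape from comment #18, not the actual file):

void test (int* x, int xb)
{
  if (! (xb & 33))   /* line 17: and + eq + cbranch  */
    x[0] = 0;        /* line 18: store of the constant 0  */
}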
(In reply to Oleg Endo from comment #21)
> What happens is that the sequence is expanded to RTL as follows:
>
> (insn 7 4 8 2 (set (reg:SI 163 [ D.1856 ])
>         (and:SI (reg/v:SI 162 [ xb ])
>             (const_int 33 [0x21]))) sh_tmp.cpp:17 -1
>      (nil))
> (insn 8 7 9 2 (set (reg:SI 147 t)
>         (eq:SI (reg:SI 163 [ D.1856 ])
>             (const_int 0 [0]))) sh_tmp.cpp:17 -1
>      (nil))
> (jump_insn 9 8 10 2 (set (pc)
>         (if_then_else (eq (reg:SI 147 t)
>                 (const_int 0 [0]))
>             (label_ref:SI 15)
>             (pc))) sh_tmp.cpp:17 301 {*cbranch_t}
>      (int_list:REG_BR_PROB 3900 (nil))
>  -> 15)
> (note 10 9 11 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
> (insn 11 10 12 4 (set (reg:SI 164)
>         (const_int 0 [0])) sh_tmp.cpp:18 -1
>      (nil))
> (insn 12 11 15 4 (set (mem:SI (reg/v/f:SI 161 [ x ]) [2 *x_5(D)+0 S4 A32])
>         (reg:SI 164)) sh_tmp.cpp:18 -1
>      (nil))
>
> and insn 11 becomes dead code and is eliminated.
> All of that happens a long time before combine, so the tst combine
> patterns have no chance to reconstruct the original code.

Adding an early peephole pass as described in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59533#c2 and then adding the
following peephole:

;; Peephole after initial expansion.
(define_peephole2
  [(set (match_operand:SI 0 "arith_reg_dest")
	(and:SI (match_operand:SI 1 "arith_reg_operand")
		(match_operand:SI 2 "logical_operand")))
   (set (reg:SI T_REG)
	(eq:SI (match_dup 0) (const_int 0)))]
  "TARGET_SH1 && can_create_pseudo_p ()"
  [(set (reg:SI T_REG)
	(eq:SI (and:SI (match_dup 1) (match_dup 2)) (const_int 0)))
   (set (match_dup 0)
	(and:SI (match_dup 1) (match_dup 2)))])

... fixes the problem and results in more uses of the tst #imm,r0 insn
according to the CSiBE set. On the other hand, there is a total code size
increase of 792 bytes on the whole set. Below are some things that get
worse in the Linux sources (mm/filemap.c):

mov.b	@(15,r1),r0   ->   mov.b	@(15,r1),r0
cmp/pz	r0                 tst	#128,r0     // cmp/pz has less
bf	.L1016             bf	.L1001      // pressure on r0

mov.b	@(15,r0),r0   ->   mov.b	@(15,r0),r0
tst	#4,r0              shar	r0
bf	.L107              shar	r0
                           tst	#1,r0

add	#16,r0        ->   add	#16,r0
mov.b	@(15,r0),r0        mov.b	@(15,r0),r0
tst	#16,r0             mov	#-4,r1
bf/s	.L509              shad	r1,r0
                           tst	#1,r0
                           bf/s	.L509
Author: olegendo
Date: Tue Dec 30 18:44:27 2014
New Revision: 219111

URL: https://gcc.gnu.org/viewcvs?rev=219111&root=gcc&view=rev
Log:
gcc/testsuite/
	PR target/49263
	* gcc.target/sh/pr49263-1.c: New.
	* gcc.target/sh/pr49263-2.c: New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263-2.c
Modified:
    trunk/gcc/testsuite/ChangeLog
Author: olegendo
Date: Tue Dec 30 19:11:42 2014
New Revision: 219113

URL: https://gcc.gnu.org/viewcvs?rev=219113&root=gcc&view=rev
Log:
gcc/testsuite/
	PR target/49263
	* gcc.target/sh/sh.exp (check_effective_target_sh2a): New.
	* gcc.target/sh/pr49263-3.c: New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263-3.c
Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/sh/sh.exp
Author: olegendo
Date: Sat Jan 24 13:04:53 2015
New Revision: 220081

URL: https://gcc.gnu.org/viewcvs?rev=220081&root=gcc&view=rev
Log:
gcc/
	PR target/49263
	PR target/53987
	PR target/64345
	PR target/59533
	PR target/52933
	PR target/54236
	PR target/51244
	* config/sh/sh-protos.h
	(sh_extending_set_of_reg::can_use_as_unextended_reg,
	sh_extending_set_of_reg::use_as_unextended_reg, sh_is_nott_insn,
	sh_movt_set_dest, sh_movrt_set_dest, sh_is_movt_insn,
	sh_is_movrt_insn, sh_insn_operands_modified_between_p,
	sh_reg_dead_or_unused_after_insn, sh_in_recog_treg_set_expr,
	sh_recog_treg_set_expr, sh_split_treg_set_expr): New functions.
	(sh_treg_insns): New class.
	* config/sh/sh.c (TARGET_LEGITIMATE_COMBINED_INSN): Define target
	hook.
	(scope_counter): New class.
	(sh_legitimate_combined_insn, sh_is_nott_insn, sh_movt_set_dest,
	sh_movrt_set_dest, sh_reg_dead_or_unused_after_insn,
	sh_extending_set_of_reg::can_use_as_unextended_reg,
	sh_extending_set_of_reg::use_as_unextended_reg,
	sh_recog_treg_set_expr, sh_in_recog_treg_set_expr,
	sh_try_split_insn_simple, sh_split_treg_set_expr): New functions.
	(addsubcosts): Handle treg_set_expr.
	(sh_rtx_costs): Handle IF_THEN_ELSE and ZERO_EXTRACT.
	(sh_rtx_costs): Use arith_reg_operand in SIGN_EXTEND and
	ZERO_EXTEND.
	(sh_rtx_costs): Handle additional bit test patterns in EQ and AND
	cases.
	(sh_insn_operands_modified_between_p): Make non-static.
	* config/sh/predicates.md (zero_extend_movu_operand): Allow
	simple_mem_operand in addition to displacement_mem_operand.
	(zero_extend_operand): Don't allow zero_extend_movu_operand.
	(treg_set_expr, treg_set_expr_not_const01,
	arith_reg_or_treg_set_expr): New predicates.
	* config/sh/sh.md (tstsi_t): Use arith_reg_operand and
	arith_or_int_operand instead of logical_operand.  Convert to
	insn_and_split.  Try to optimize constant operand in splitter.
	(tsthi_t, tstqi_t): Fold into *tst<mode>_t.  Convert to
	insn_and_split.
	(*tstqi_t_zero): Delete.
	(*tst<mode>_t_subregs): Add !sh_in_recog_treg_set_expr split
	condition.
	(tstsi_t_and_not): Delete.
	(tst<mode>_t_zero_extract_eq): Rename to
	*tst<mode>_t_zero_extract.  Convert to insn_and_split.
	(unnamed split, tstsi_t_zero_extract_xor,
	tstsi_t_zero_extract_subreg_xor_little,
	tstsi_t_zero_extract_subreg_xor_big): Delete.
	(*tstsi_t_shift_mask): New insn_and_split.
	(cmpeqsi_t, cmpgesi_t): Add new split for const_int 0 operands
	and try to recombine with surrounding insns when splitting.
	(*negtstsi): Add !sh_in_recog_treg_set_expr condition.
	(cmp_div0s_0, cmp_div0s_1, *cmp_div0s_0, *cmp_div0s_1): Rewrite
	as ...
	(cmp_div0s, *cmp_div0s_1, *cmp_div0s_2, *cmp_div0s_3,
	*cmp_div0s_4, *cmp_div0s_5, *cmp_div0s_6): ... these new
	insn_and_split patterns.
	(*cbranch_div0s): Delete.
	(*addc): Convert to insn_and_split.  Use treg_set_expr as 3rd
	operand.  Try to recombine with surrounding insns when
	splitting.  Add operand order variants.
	(*addc_t_r, *addc_r_t): Use treg_set_expr_not_const01.
	(*addc_r_r_1, *addc_r_lsb, *addc_r_r_lsb, *addc_r_lsb_r,
	*addc_r_msb, *addc_r_r_msb, *addc_2r_msb): Delete.
	(*addc_2r_lsb): Rename to *addc_2r_t.  Use treg_set_expr.  Add
	operand order variant.
	(*addc_negreg_t): New insn_and_split.
	(*subc): Convert to insn_and_split.  Use treg_set_expr as 3rd
	operand.  Try to recombine with surrounding insns when
	splitting.  Add operand order variants.
	(*subc_negt_reg, *subc_negreg_t, *reg_lsb_t, *reg_msb_t): New
	insn_and_split patterns.
	(*rotcr): Use arith_reg_or_treg_set_expr.  Try to recombine with
	surrounding insns when splitting.
	(unnamed rotcr split): Use arith_reg_or_treg_set_expr.
	(*rotcl): Likewise.  Add zero_extract variant.
	(*ashrsi2_31): New insn_and_split.
	(*negc): Convert to insn_and_split.  Use treg_set_expr.
	(*zero_extend<mode>si2_disp_mem): Update comment.
	(movrt_negc, *movrt_negc, nott): Add !sh_in_recog_treg_set_expr
	split condition.
	(*mov_t_msb_neg, mov_neg_si_t): Use treg_set_expr.  Try to
	recombine with surrounding insns when splitting.
	(any_treg_expr_to_reg): New insn_and_split.
	(*neg_zero_extract_0, *neg_zero_extract_1, *neg_zero_extract_2,
	*neg_zero_extract_3, *neg_zero_extract_4, *neg_zero_extract_5,
	*neg_zero_extract_6, *zero_extract_0, *zero_extract_1,
	*zero_extract_2): New single bit zero extract patterns.
	(bld_reg, *bld_regqi): Fold into bld<mode>_reg.
	(*get_thread_pointersi, store_gbr, *mov<mode>_gbr_load,
	*mov<mode>_gbr_load, *mov<mode>_gbr_load, *mov<mode>_gbr_load,
	*movdi_gbr_load): Use arith_reg_dest instead of register_operand
	for set destination.
	(set_thread_pointersi, load_gbr): Use arith_reg_operand instead
	of register_operand for set source.

gcc/testsuite/
	PR target/49263
	PR target/53987
	PR target/64345
	PR target/59533
	PR target/52933
	PR target/54236
	PR target/51244
	* gcc.target/sh/pr64345-1.c: New.
	* gcc.target/sh/pr64345-2.c: New.
	* gcc.target/sh/pr59533-1.c: New.
	* gcc.target/sh/pr49263.c: Adjust matching of expected insns.
	* gcc.target/sh/pr52933-2.c: Likewise.
	* gcc.target/sh/pr54089-1.c: Likewise.
	* gcc.target/sh/pr54236-1.c: Likewise.
	* gcc.target/sh/pr51244-20-sh2a.c: Likewise.
	* gcc.target/sh/pr49263-1.c: Remove xfails.
	* gcc.target/sh/pr49263-2.c: Likewise.
	* gcc.target/sh/pr49263-3.c: Likewise.
	* gcc.target/sh/pr53987-1.c: Likewise.
	* gcc.target/sh/pr52933-1.c: Adjust matching of expected insns.
	(test_24, test_25, test_26, test_27, test_28, test_29, test_30):
	New.
	* gcc.target/sh/pr51244-12.c: Adjust matching of expected insns.
	(test05, test06, test07, test08, test09, test10, test11,
	test12): New.
	* gcc.target/sh/pr54236-3.c: Adjust matching of expected insns.
	(test_002, test_003, test_004, test_005, test_006, test_007,
	test_008, test_009): New.
	* gcc.target/sh/pr51244-4.c: Adjust matching of expected insns.
	(test_02): New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr59533-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr64345-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr64345-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/predicates.md
    trunk/gcc/config/sh/sh-protos.h
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/sh/pr49263-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr49263.c
    trunk/gcc/testsuite/gcc.target/sh/pr51244-12.c
    trunk/gcc/testsuite/gcc.target/sh/pr51244-20-sh2a.c
    trunk/gcc/testsuite/gcc.target/sh/pr51244-4.c
    trunk/gcc/testsuite/gcc.target/sh/pr52933-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr52933-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr53987-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr54089-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr54236-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr54236-3.c
Author: olegendo
Date: Mon Jan 26 23:56:05 2015
New Revision: 220144

URL: https://gcc.gnu.org/viewcvs?rev=220144&root=gcc&view=rev
Log:
gcc/
	PR target/49263
	* config/sh/sh.c (sh_split_treg_set_expr): Invoke emit_insn
	before remove_insn.
	* config/sh/sh.md (tstsi_t): Don't try to optimize constant
	with right shifts if it already fits into K08.

gcc/testsuite/
	PR target/49263
	* gcc.target/sh/pr49263-4.c: New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr49263-4.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/sh/sh.c
    trunk/gcc/config/sh/sh.md
    trunk/gcc/testsuite/ChangeLog
(In reply to Oleg Endo from comment #26)
> Author: olegendo
> Date: Mon Jan 26 23:56:05 2015
> New Revision: 220144
>
> URL: https://gcc.gnu.org/viewcvs?rev=220144&root=gcc&view=rev
> Log:
> gcc/
> 	PR target/49263
> 	* config/sh/sh.c (sh_split_treg_set_expr): Invoke emit_insn
> 	before remove_insn.
> 	* config/sh/sh.md (tstsi_t): Don't try to optimize constant
> 	with right shifts if it already fits into K08.
>
> gcc/testsuite/
> 	PR target/49263
> 	* gcc.target/sh/pr49263-4.c: New.
>
> Added:
>     trunk/gcc/testsuite/gcc.target/sh/pr49263-4.c
> Modified:
>     trunk/gcc/ChangeLog
>     trunk/gcc/config/sh/sh.c
>     trunk/gcc/config/sh/sh.md
>     trunk/gcc/testsuite/ChangeLog

Did this fix it?
(In reply to Eric Gallager from comment #27)
> (In reply to Oleg Endo from comment #26)
> > Author: olegendo
> > Date: Mon Jan 26 23:56:05 2015
> > New Revision: 220144

Well, it fixed some of the cases mentioned in this PR, but not all. It's
quite a broad issue, actually.
(In reply to Oleg Endo from comment #28)
> (In reply to Eric Gallager from comment #27)
> > (In reply to Oleg Endo from comment #26)
> > > Author: olegendo
> > > Date: Mon Jan 26 23:56:05 2015
> > > New Revision: 220144
>
> Well, it fixed some of the cases mentioned in this PR, but not all. It's
> quite a broad issue, actually.

So maybe it's worth splitting up into sub-issues?
(In reply to Eric Gallager from comment #29)
>
> So maybe it's worth splitting up into sub-issues?

It'd be better to, yes. But at the moment I don't have a lot of time to go
through all the cases and factor out the individual issues. Please leave
this open. It will be useful if I (or others) get back to active SH
development in the future.
I've found new cases, for SH2 and SH2E CPUs only:

#define FLAG 0x40

unsigned int f(char v){
    return (v & FLAG) == FLAG;
}

For both big and little endian this translates to a dynamic shift library
call:

-O -m2 (or -m2e)

_f:
	sts.l	pr,@-r15
	mov.l	.L3,r1
	jsr	@r1
	exts.b	r4,r4
	mov	r4,r0
	and	#1,r0
	lds.l	@r15+,pr
	rts
	nop
.L3:
	.long	___ashiftrt_r4_6

And

#define FLAG 0x40
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)

unsigned int f(void){
    return (*P & FLAG) == FLAG;
}

translates to

_f:
	sts.l	pr,@-r15
	mov.l	.L3,r1
	mov.b	@r1,r4
	mov.l	.L4,r1
	jsr	@r1
	nop
	mov	r4,r0
	and	#1,r0
	lds.l	@r15+,pr
	rts
	nop
.L3:
	.long	-65536
.L4:
	.long	___ashiftrt_r4_6

The assembler output does not depend on the ADDR value, but it does depend
on the variable (or pointer) type. When the type is an integer, the
assembler code uses 'tst' for all of the options '-m4', '-m2' and '-m2e':

#define FLAG 0x40

unsigned int f(unsigned int v){
    return (v & FLAG) == FLAG;
}

translates to

_f:
	mov	r4,r0
	tst	#64,r0
	mov	#-1,r0
	rts
	negc	r0,r0

and

#define FLAG 0x40
#define ADDR 0xFFFF0000
#define P ((unsigned int *)ADDR)

unsigned int f(void){
    return (*P & FLAG) == FLAG;
}

translates to

_f:
	mov.l	.L2,r1
	mov.l	@r1,r0
	tst	#64,r0
	mov	#-1,r0
	rts
	negc	r0,r0
.L2:
	.long	-65536

It is interesting that when the '-m4' flag is specified, later GCC versions
always translate this to code with a 'tst' instruction. I played with
godbolt and here is what I found. GCC version 4 uses the dynamic shift
'shad' with the '-m4' option and a library call with the '-m2' or '-m2e'
options. GCC 9.5 and later uses 'tst' with '-m4', a library call with both
'-m2' and '-m2e' when FLAG == 0x40, and a 'shll' instruction with both
'-m2' and '-m2e' when FLAG == 0x80. I remind you that this happens only
when the char type is used.

Maybe the SH4 solution can be extended to support SH2/SH2E?
I'm not sure whether I should write here or open a new discussion, but
these topics are very closely related.

I've been writing a patch to eliminate the generation of the dynamic shift
instructions 'shad' and 'shld' completely, at least for the SH4 CPU. And
then I got a surprising result: in all the examples I gave earlier, the
library calls were converted to 'tst' instructions!

Here is the patch itself (I will also attach it as a file):

--- ../gcc-12.3.0.orig/gcc/config/sh/sh.cc	2023-05-08 15:14:39.681161695 +0300
+++ ./gcc/config/sh/sh.cc	2023-05-23 12:23:25.964375731 +0300
@@ -3061,7 +3061,7 @@
   else
     insn_count = ashl_lshr_seq[shift_amount_i].insn_count;
 
-  return TARGET_DYNSHIFT && (insn_count > 1 + SH_DYNAMIC_SHIFT_COST);
+  return TARGET_DYNSHIFT && (insn_count > 1 + SH_DYNAMIC_SHIFT_COST) && ! disable_dynshift;
 }
 
 /* Assuming we have a value that has been sign-extended by at least one bit,
@@ -3812,8 +3812,10 @@
   rtx wrk;
   char func[18];
   int value;
+  int long_shift = disable_dynshift ? 30 : 19;
+  int short_shift = disable_dynshift ? 15 : 5;
 
-  if (TARGET_DYNSHIFT)
+  if (TARGET_DYNSHIFT && ! disable_dynshift)
     {
       if (!CONST_INT_P (operands[2]))
	{
@@ -3851,7 +3853,7 @@
	  emit_insn (gen_ashrsi2_31 (operands[0], operands[1]));
	  return true;
	}
-      else if (value >= 16 && value <= 19)
+      else if (value >= 16 && value <= long_shift)
	{
	  wrk = gen_reg_rtx (SImode);
	  emit_insn (gen_ashrsi2_16 (wrk, operands[1]));
@@ -3862,7 +3864,7 @@
	  return true;
	}
       /* Expand a short sequence inline, longer call a magic routine.  */
-      else if (value <= 5)
+      else if (value <= short_shift)
	{
	  wrk = gen_reg_rtx (SImode);
	  emit_move_insn (wrk, operands[1]);
diff -ur ../gcc-12.3.0.orig/gcc/config/sh/sh.opt ./gcc/config/sh/sh.opt
--- ../gcc-12.3.0.orig/gcc/config/sh/sh.opt	2023-05-08 15:14:39.689161810 +0300
+++ ./gcc/config/sh/sh.opt	2023-05-23 10:45:36.814371159 +0300
@@ -301,3 +301,7 @@
 mlra
 Target Var(sh_lra_flag) Init(0) Save
 Use LRA instead of reload (transitional).
+
+mdisable-dynshift
+Target Var(disable_dynshift) Init(0)
+Disable dynamic shift 'shad' and 'shld' instructions

And here are my tests:

$ cat f.c
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f(char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_(unsigned char v){
    return (v & FLAG) == FLAG;
}

unsigned char f1(void){
    return (*P & FLAG) == FLAG;
}

int f_signed_rshift(int v){
    return v >> S;
}

int f_signed_lshift(int v){
    return v << S;
}

unsigned int f_unsigned_rshift(unsigned int v){
    return v >> S;
}

unsigned int f_unsigned_lshift(unsigned int v){
    return v << S;
}

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -c -mrenesas -m2e -mb -O -fno-toplevel-reorder -mdisable-dynshift -S f.c
$ cat f.s
	.file	"f.c"
	.text
	.text
	.align 1
	.global	_f
	.type	_f, @function
_f:
	mov	r4,r0
	tst	#64,r0
	mov	#-1,r0
	rts
	negc	r0,r0
	.size	_f, .-_f
	.align 1
	.global	_f_
	.type	_f_, @function
_f_:
	mov	r4,r0
	tst	#64,r0
	mov	#-1,r0
	rts
	negc	r0,r0
	.size	_f_, .-_f_
	.align 1
	.global	_f1
	.type	_f1, @function
_f1:
	mov.l	.L4,r1
	mov.b	@r1,r0
	tst	#64,r0
	mov	#-1,r0
	rts
	negc	r0,r0
.L5:
	.align 2
.L4:
	.long	-65536
	.size	_f1, .-_f1
	.align 1
	.global	_f_signed_rshift
	.type	_f_signed_rshift, @function
_f_signed_rshift:
	mov	r4,r0
	shar	r0
	shar	r0
	shar	r0
	shar	r0
	shar	r0
	shar	r0
	rts
	shar	r0
	.size	_f_signed_rshift, .-_f_signed_rshift
	.align 1
	.global	_f_signed_lshift
	.type	_f_signed_lshift, @function
_f_signed_lshift:
	mov	r4,r0
	shll2	r0
	shll2	r0
	add	r0,r0
	rts
	shll2	r0
	.size	_f_signed_lshift, .-_f_signed_lshift
	.align 1
	.global	_f_unsigned_rshift
	.type	_f_unsigned_rshift, @function
_f_unsigned_rshift:
	mov	r4,r0
	shlr2	r0
	shlr2	r0
	shlr	r0
	rts
	shlr2	r0
	.size	_f_unsigned_rshift, .-_f_unsigned_rshift
	.align 1
	.global	_f_unsigned_lshift
	.type	_f_unsigned_lshift, @function
_f_unsigned_lshift:
	mov	r4,r0
	shll2	r0
	shll2	r0
	add	r0,r0
	rts
	shll2	r0
	.size	_f_unsigned_lshift, .-_f_unsigned_lshift
	.ident	"GCC: (GNU) 12.3.0"

I also compiled my project with the '-m2e' and new '-mdisable-dynshift'
options and tested it in SH-2E mode on Renesas's emulator that comes with
High-performance Embedded Workshop, and all unit tests ran as expected.

If this patch is useful, let's include it in GCC.
Created attachment 55142 [details] Disable dynamic shift instructions patch
(In reply to Alexander Klepikov from comment #33)
> Created attachment 55142 [details]
> Disable dynamic shift instructions patch

First of all, thanks for digging into this. This issue has been a can of
worms, for all sorts of reasons. As you have discovered, some code patterns
take the shift instruction route, which is basically decided earlier by the
various middle-end optimizations. There have also been some changes to
those parts recently, but I haven't been watching what they do for SH.

> unsigned int f(char v){
>     return (v & FLAG) == FLAG;
> }

Bit-tests of char and unsigned char should be covered by the test suite and
should work, at least originally. However, what might be triggering this
problem is the '== FLAG' comparison. When I was working on this issue I
only used '== 0' or '!= 0' comparisons. I can imagine that your test code
triggers some other middle-end optimizations and hence we get this.

Can you try to rewrite your test code to something like this?

unsigned int f(char v){
    return (v & FLAG) != 0;
}

... and see if it generates the tst instruction as expected?

> I also compiled my project with the '-m2e' and new '-mdisable-dynshift'
> options and tested it in SH-2E mode on Renesas's emulator that comes with
> High-performance Embedded Workshop, and all unit tests ran as expected.

I'm not sure what the purpose of the '-mdisable-dynshift' option would be
here though. For '-m2e' TARGET_DYNSHIFT is already 'false'. So the option
seems misnamed.
(In reply to Oleg Endo from comment #34)
> Bit-tests of char and unsigned char should be covered by the test suite
> and should work, at least originally. However, what might be triggering
> this problem is the '== FLAG' comparison. When I was working on this
> issue I only used '== 0' or '!= 0' comparisons. I can imagine that your
> test code triggers some other middle-end optimizations and hence we get
> this.

Yes, I am sure that the problem is the '== FLAG' comparison. Before I
reported this bug, I tried to work around it, and this macro does not
produce shift instructions even on GCC 4.7:

#define BIT_MASK_IS_SET_(VALUE, BITMASK)\
({int _value = VALUE & BITMASK,\
     _result;\
  if (_value == BITMASK){\
      _result = 1;\
  }\
  else {\
      _result = 0;\
  }\
  _result;})

So it is definitely the comparison.

> Can you try to rewrite your test code to something like this?
>
> unsigned int f(char v){
>     return (v & FLAG) != 0;
> }
>
> ... and see if it generates the tst instruction as expected?

As I understand it, you meant the following (I added the new functions at
the end of the file):

$ cat f.c
#define ADDR 0xFFFF0000
#define P ((unsigned char *)ADDR)
#define FLAG 0x40
#define S 7

unsigned char f_char_var(char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_unsigned_char_var(unsigned char v){
    return (v & FLAG) == FLAG;
}

unsigned char f_symbol(void){
    return (*P & FLAG) == FLAG;
}

unsigned char f_symbol_zero(void){
    return (*P & FLAG) == 0;
}

unsigned char f_symbol_non_zero(void){
    return (*P & FLAG) != 0;
}

Compiler flags: -c -mrenesas -m2e -mb -O -fno-toplevel-reorder

With the patch disabled:

$ cat f_clean.s
	.file	"f.c"
	.text
	.text
	.align 1
	.global	_f_char_var
	.type	_f_char_var, @function
_f_char_var:
	sts.l	pr,@-r15
	mov.l	.L3,r1
	jsr	@r1
	exts.b	r4,r4
	mov	r4,r0
	and	#1,r0
	lds.l	@r15+,pr
	rts
	nop
.L4:
	.align 2
.L3:
	.long	___ashiftrt_r4_6
	.size	_f_char_var, .-_f_char_var
	.align 1
	.global	_f_unsigned_char_var
	.type	_f_unsigned_char_var, @function
_f_unsigned_char_var:
	sts.l	pr,@-r15
	mov.l	.L7,r1
	jsr	@r1
	exts.b	r4,r4
	mov	r4,r0
	and	#1,r0
	lds.l	@r15+,pr
	rts
	nop
.L8:
	.align 2
.L7:
	.long	___ashiftrt_r4_6
	.size	_f_unsigned_char_var, .-_f_unsigned_char_var
	.align 1
	.global	_f_symbol
	.type	_f_symbol, @function
_f_symbol:
	sts.l	pr,@-r15
	mov.l	.L11,r1
	mov.b	@r1,r4
	mov.l	.L12,r1
	jsr	@r1
	nop
	mov	r4,r0
	and	#1,r0
	lds.l	@r15+,pr
	rts
	nop
.L13:
	.align 2
.L11:
	.long	-65536
.L12:
	.long	___ashiftrt_r4_6
	.size	_f_symbol, .-_f_symbol
	.align 1
	.global	_f_symbol_zero
	.type	_f_symbol_zero, @function
_f_symbol_zero:
	mov.l	.L15,r1
	mov.b	@r1,r0
	tst	#64,r0
	rts
	movt	r0
.L16:
	.align 2
.L15:
	.long	-65536
	.size	_f_symbol_zero, .-_f_symbol_zero
	.align 1
	.global	_f_symbol_non_zero
	.type	_f_symbol_non_zero, @function
_f_symbol_non_zero:
	sts.l	pr,@-r15
	mov.l	.L19,r1
	mov.b	@r1,r4
	mov.l	.L20,r1
	jsr	@r1
	nop
	mov	r4,r0
	and	#1,r0
	lds.l	@r15+,pr
	rts
	nop
.L21:
	.align 2
.L19:
	.long	-65536
.L20:
	.long	___ashiftrt_r4_6
	.size	_f_symbol_non_zero, .-_f_symbol_non_zero
	.ident	"GCC: (GNU) 12.3.0"

With the patch enabled:

$ cat f.s
	.file	"f.c"
	.text
	.text
	.align 1
	.global	_f_char_var
	.type	_f_char_var, @function
_f_char_var:
	mov	r4,r0
	tst	#64,r0
	mov	#-1,r0
	rts
	negc	r0,r0
	.size	_f_char_var, .-_f_char_var
	.align 1
	.global	_f_unsigned_char_var
	.type	_f_unsigned_char_var, @function
_f_unsigned_char_var:
	mov	r4,r0
	tst	#64,r0
	mov	#-1,r0
	rts
	negc	r0,r0
	.size	_f_unsigned_char_var, .-_f_unsigned_char_var
	.align 1
	.global	_f_symbol
	.type	_f_symbol, @function
_f_symbol:
	mov.l	.L5,r1
	mov.b	@r1,r0
	tst	#64,r0
	mov	#-1,r0
	rts
	negc	r0,r0
.L6:
	.align 2
.L5:
	.long	-65536
	.size	_f_symbol, .-_f_symbol
	.align 1
	.global	_f_symbol_zero
	.type	_f_symbol_zero, @function
_f_symbol_zero:
	mov.l	.L8,r1
	mov.b	@r1,r0
	tst	#64,r0
	rts
	movt	r0
.L9:
	.align 2
.L8:
	.long	-65536
	.size	_f_symbol_zero, .-_f_symbol_zero
	.align 1
	.global	_f_symbol_non_zero
	.type	_f_symbol_non_zero, @function
_f_symbol_non_zero:
	mov.l	.L11,r1
	mov.b	@r1,r0
	tst	#64,r0
	mov	#-1,r0
	rts
	negc	r0,r0
.L12:
	.align 2
.L11:
	.long	-65536
	.size	_f_symbol_non_zero, .-_f_symbol_non_zero
	.ident	"GCC: (GNU) 12.3.0"

> I'm not sure what the purpose of the '-mdisable-dynshift' option would be
> here though. For '-m2e' TARGET_DYNSHIFT is already 'false'. So the option
> seems misnamed.

I chose that name because I wanted to disable dynamic shift instructions
for all CPUs. I did not expect that it would affect SH-2E code in this way.

I can rewrite the patch so that it only affects CPUs that do not support
dynamic shifts and disables the library call for dynamic shifts. I'll do it
anyway because I need it badly. What do you think, which option name would
be better: '-mdisable-dynshift-libcall' or '-mhw-shift'? Or, if you like,
please suggest another one. Thank you!
(In reply to Alexander Klepikov from comment #35)
> As I understand it, you meant the following (I added the new functions at
> the end of the file):
>
> $ cat f.c
> #define ADDR 0xFFFF0000
> #define P ((unsigned char *)ADDR)
> #define FLAG 0x40
> #define S 7
> ....

Yes, that's what I meant, thanks. Can you also compile for little endian,
and most of all, use the -O2 optimization level? Some optimizations are not
done below -O2.

> I chose that name because I wanted to disable dynamic shift instructions
> for all CPUs. I did not expect that it would affect SH-2E code in this
> way.
>
> I can rewrite the patch so that it only affects CPUs that do not support
> dynamic shifts and disables the library call for dynamic shifts. I'll do
> it anyway because I need it badly. What do you think, which option name
> would be better: '-mdisable-dynshift-libcall' or '-mhw-shift'? Or, if you
> like, please suggest another one. Thank you!

'-mdisable-dynshift-libcall' would be more appropriate for what it tries to
do, I think. Although that is a whole different issue ... but what is it
going to do for real dynamic shifts on SH2? What kind of code is it
supposed to emit for things like

unsigned int dyn_shift (unsigned int x, unsigned int y)
{
  return x << (y & 31);
}
> Can you also compile for little endian, and most of all, use -O2 > optimization level. Some optimizations are not done below -O2. Here's source file, I added functions with non-constant shifts $ cat f.c #define ADDR 0xFFFF0000 #define P ((unsigned char *)ADDR) #define FLAG 0x40 #define S 7 unsigned char f_char_var(char v){ return (v & FLAG) == FLAG; } unsigned char f_unsigned_char_var(unsigned char v){ return (v & FLAG) == FLAG; } unsigned char f_symbol(void){ return (*P & FLAG) == FLAG; } unsigned char f_symbol_zero(void){ return (*P & FLAG) == 0; } unsigned char f_symbol_non_zero(void){ return (*P & FLAG) != 0; } unsigned int dyn_lshift (unsigned int x, unsigned int y) { return x << (y & 31); } unsigned int dyn_rshift (unsigned int x, unsigned int y) { return x >> (y & 31); } unsigned int really_dyn_lshift (unsigned int x, unsigned int y) { return x << y; } unsigned int really_dyn_rshift (unsigned int x, unsigned int y) { return x >> y; } With patch disabled, -O2 -mb: $ cat f.s .file "f.c" .text .text .align 1 .align 2 .global _f_char_var .type _f_char_var, @function _f_char_var: mov.l .L4,r1 sts.l pr,@-r15 jsr @r1 exts.b r4,r4 mov r4,r0 and #1,r0 lds.l @r15+,pr rts nop .L5: .align 2 .L4: .long ___ashiftrt_r4_6 .size _f_char_var, .-_f_char_var .align 1 .align 2 .global _f_unsigned_char_var .type _f_unsigned_char_var, @function _f_unsigned_char_var: mov.l .L8,r1 sts.l pr,@-r15 jsr @r1 exts.b r4,r4 mov r4,r0 and #1,r0 lds.l @r15+,pr rts nop .L9: .align 2 .L8: .long ___ashiftrt_r4_6 .size _f_unsigned_char_var, .-_f_unsigned_char_var .align 1 .align 2 .global _f_symbol .type _f_symbol, @function _f_symbol: mov.l .L12,r1 sts.l pr,@-r15 mov.b @r1,r4 mov.l .L13,r1 jsr @r1 nop mov r4,r0 and #1,r0 lds.l @r15+,pr rts nop .L14: .align 2 .L12: .long -65536 .L13: .long ___ashiftrt_r4_6 .size _f_symbol, .-_f_symbol .align 1 .align 2 .global _f_symbol_zero .type _f_symbol_zero, @function _f_symbol_zero: mov.l .L16,r1 mov.b @r1,r0 tst #64,r0 rts movt r0 .L17: .align 2 .L16: .long -65536 .size _f_symbol_zero, .-_f_symbol_zero .align 1 .align 2 .global _f_symbol_non_zero .type _f_symbol_non_zero, @function _f_symbol_non_zero: mov.l .L20,r1 sts.l pr,@-r15 mov.b @r1,r4 mov.l .L21,r1 jsr @r1 nop mov r4,r0 and #1,r0 lds.l @r15+,pr rts nop .L22: .align 2 .L20: .long -65536 .L21: .long ___ashiftrt_r4_6 .size _f_symbol_non_zero, .-_f_symbol_non_zero .align 1 .align 2 .global _dyn_lshift .type _dyn_lshift, @function _dyn_lshift: mov.l .L25,r1 sts.l pr,@-r15 jsr @r1 mov r5,r0 lds.l @r15+,pr rts nop .L26: .align 2 .L25: .long ___ashlsi3_r0 .size _dyn_lshift, .-_dyn_lshift .align 1 .align 2 .global _dyn_rshift .type _dyn_rshift, @function _dyn_rshift: mov.l .L29,r1 sts.l pr,@-r15 jsr @r1 mov r5,r0 lds.l @r15+,pr rts nop .L30: .align 2 .L29: .long ___lshrsi3_r0 .size _dyn_rshift, .-_dyn_rshift .align 1 .align 2 .global _really_dyn_lshift .type _really_dyn_lshift, @function _really_dyn_lshift: mov.l .L33,r1 sts.l pr,@-r15 jsr @r1 mov r5,r0 lds.l @r15+,pr rts nop .L34: .align 2 .L33: .long ___ashlsi3_r0 .size _really_dyn_lshift, .-_really_dyn_lshift .align 1 .align 2 .global _really_dyn_rshift .type _really_dyn_rshift, @function _really_dyn_rshift: mov.l .L37,r1 sts.l pr,@-r15 jsr @r1 mov r5,r0 lds.l @r15+,pr rts nop .L38: .align 2 .L37: .long ___lshrsi3_r0 .size _really_dyn_rshift, .-_really_dyn_rshift .ident "GCC: (GNU) 12.3.0" With patch disabled, -O2 -ml $ cat f.s .file "f.c" .text .little .text .align 1 .align 2 .global _f_char_var .type _f_char_var, @function _f_char_var: mov.l .L4,r1 sts.l pr,@-r15 jsr @r1 
exts.b r4,r4 mov r4,r0 and #1,r0 lds.l @r15+,pr rts nop .L5: .align 2 .L4: .long ___ashiftrt_r4_6 .size _f_char_var, .-_f_char_var .align 1 .align 2 .global _f_unsigned_char_var .type _f_unsigned_char_var, @function _f_unsigned_char_var: mov.l .L8,r1 sts.l pr,@-r15 jsr @r1 exts.b r4,r4 mov r4,r0 and #1,r0 lds.l @r15+,pr rts nop .L9: .align 2 .L8: .long ___ashiftrt_r4_6 .size _f_unsigned_char_var, .-_f_unsigned_char_var .align 1 .align 2 .global _f_symbol .type _f_symbol, @function _f_symbol: mov.l .L12,r1 sts.l pr,@-r15 mov.b @r1,r4 mov.l .L13,r1 jsr @r1 nop mov r4,r0 and #1,r0 lds.l @r15+,pr rts nop .L14: .align 2 .L12: .long -65536 .L13: .long ___ashiftrt_r4_6 .size _f_symbol, .-_f_symbol .align 1 .align 2 .global _f_symbol_zero .type _f_symbol_zero, @function _f_symbol_zero: mov.l .L16,r1 mov.b @r1,r0 tst #64,r0 rts movt r0 .L17: .align 2 .L16: .long -65536 .size _f_symbol_zero, .-_f_symbol_zero .align 1 .align 2 .global _f_symbol_non_zero .type _f_symbol_non_zero, @function _f_symbol_non_zero: mov.l .L20,r1 sts.l pr,@-r15 mov.b @r1,r4 mov.l .L21,r1 jsr @r1 nop mov r4,r0 and #1,r0 lds.l @r15+,pr rts nop .L22: .align 2 .L20: .long -65536 .L21: .long ___ashiftrt_r4_6 .size _f_symbol_non_zero, .-_f_symbol_non_zero .align 1 .align 2 .global _dyn_lshift .type _dyn_lshift, @function _dyn_lshift: mov.l .L25,r1 sts.l pr,@-r15 jsr @r1 mov r5,r0 lds.l @r15+,pr rts nop .L26: .align 2 .L25: .long ___ashlsi3_r0 .size _dyn_lshift, .-_dyn_lshift .align 1 .align 2 .global _dyn_rshift .type _dyn_rshift, @function _dyn_rshift: mov.l .L29,r1 sts.l pr,@-r15 jsr @r1 mov r5,r0 lds.l @r15+,pr rts nop .L30: .align 2 .L29: .long ___lshrsi3_r0 .size _dyn_rshift, .-_dyn_rshift .align 1 .align 2 .global _really_dyn_lshift .type _really_dyn_lshift, @function _really_dyn_lshift: mov.l .L33,r1 sts.l pr,@-r15 jsr @r1 mov r5,r0 lds.l @r15+,pr rts nop .L34: .align 2 .L33: .long ___ashlsi3_r0 .size _really_dyn_lshift, .-_really_dyn_lshift .align 1 .align 2 .global _really_dyn_rshift .type _really_dyn_rshift, @function _really_dyn_rshift: mov.l .L37,r1 sts.l pr,@-r15 jsr @r1 mov r5,r0 lds.l @r15+,pr rts nop .L38: .align 2 .L37: .long ___lshrsi3_r0 .size _really_dyn_rshift, .-_really_dyn_rshift .ident "GCC: (GNU) 12.3.0" With patch enabled -O2 -mb $ cat f.s .file "f.c" .text .text .align 1 .align 2 .global _f_char_var .type _f_char_var, @function _f_char_var: mov r4,r0 tst #64,r0 mov #-1,r0 rts negc r0,r0 .size _f_char_var, .-_f_char_var .align 1 .align 2 .global _f_unsigned_char_var .type _f_unsigned_char_var, @function _f_unsigned_char_var: mov r4,r0 tst #64,r0 mov #-1,r0 rts negc r0,r0 .size _f_unsigned_char_var, .-_f_unsigned_char_var .align 1 .align 2 .global _f_symbol .type _f_symbol, @function _f_symbol: mov.l .L5,r1 mov.b @r1,r0 tst #64,r0 mov #-1,r0 rts negc r0,r0 .L6: .align 2 .L5: .long -65536 .size _f_symbol, .-_f_symbol .align 1 .align 2 .global _f_symbol_zero .type _f_symbol_zero, @function _f_symbol_zero: mov.l .L8,r1 mov.b @r1,r0 tst #64,r0 rts movt r0 .L9: .align 2 .L8: .long -65536 .size _f_symbol_zero, .-_f_symbol_zero .align 1 .align 2 .global _f_symbol_non_zero .type _f_symbol_non_zero, @function _f_symbol_non_zero: mov.l .L11,r1 mov.b @r1,r0 tst #64,r0 mov #-1,r0 rts negc r0,r0 .L12: .align 2 .L11: .long -65536 .size _f_symbol_non_zero, .-_f_symbol_non_zero .align 1 .align 2 .global _dyn_lshift .type _dyn_lshift, @function _dyn_lshift: mov.l .L15,r1 sts.l pr,@-r15 jsr @r1 mov r5,r0 lds.l @r15+,pr rts nop .L16: .align 2 .L15: .long ___ashlsi3_r0 .size _dyn_lshift, .-_dyn_lshift .align 1 .align 2 
	.global	_dyn_rshift
	.type	_dyn_rshift, @function
_dyn_rshift:
	mov.l	.L19,r1
	sts.l	pr,@-r15
	jsr	@r1
	mov	r5,r0
	lds.l	@r15+,pr
	rts
	nop
.L20:
	.align 2
.L19:
	.long	___lshrsi3_r0
	.size	_dyn_rshift, .-_dyn_rshift
	.align 1
	.align 2
	.global	_really_dyn_lshift
	.type	_really_dyn_lshift, @function
_really_dyn_lshift:
	mov.l	.L23,r1
	sts.l	pr,@-r15
	jsr	@r1
	mov	r5,r0
	lds.l	@r15+,pr
	rts
	nop
.L24:
	.align 2
.L23:
	.long	___ashlsi3_r0
	.size	_really_dyn_lshift, .-_really_dyn_lshift
	.align 1
	.align 2
	.global	_really_dyn_rshift
	.type	_really_dyn_rshift, @function
_really_dyn_rshift:
	mov.l	.L27,r1
	sts.l	pr,@-r15
	jsr	@r1
	mov	r5,r0
	lds.l	@r15+,pr
	rts
	nop
.L28:
	.align 2
.L27:
	.long	___lshrsi3_r0
	.size	_really_dyn_rshift, .-_really_dyn_rshift
	.ident	"GCC: (GNU) 12.3.0"

With patch enabled, -O2 -ml, the generated f.s is again identical apart from an added '.little' directive.

> '-mdisable-dynshift-libcall' would be more appropriate for what it tries to
> do, I think. Although that is a whole different issue ... but what is it
> going to do for real dynamic shifts on SH2?
>
> What kind of code is it supposed to emit for things like
>
> unsigned int dyn_shift (unsigned int x, unsigned int y)
> {
>   return x << (y & 31);
> }

As far as I understand from the GCC sources, the function I patched, 'expand_ashiftrt', handles only constant shift amounts. As you can see above, I added your examples and a few others to the tests. It looks like truly dynamic shifts are translated to library calls.

Should I test more exotic situations? If so, could you please help me with some really exotic or weird examples?
(In reply to Alexander Klepikov from comment #37)
>
> As far as I understand from the GCC sources, the function I patched,
> 'expand_ashiftrt', handles only constant shift amounts. As you can see
> above, I added your examples and a few others to the tests.

OK, thanks for the additional test cases. It really looks like the way the constant shift is expanded (via the ashrsi3_n insn) on SH1/SH2 is getting in the way.

The tst insn is mainly formed by the combine pass, which relies on certain insn patterns and combinations thereof. See also sh.md, around line 530. You can look at the debug output with the -fdump-rtl-all option to see what's happening in the RTL passes.

Does your patch simply stop emitting the ashrsi3_n insn for constant shifts altogether? I'd guess it makes code that actually needs those real shifts larger, since it will always emit the whole shift stitching sequence. That might or might not be a good thing.

> It looks like truly dynamic shifts are translated to library calls.

So the option name '-mdisable-dynshift-libcall' doesn't make sense. What it actually does is more like '-mdisable-constshift-libcall'.

> Should I test more exotic situations? If so, could you please help me with
> some really exotic or weird examples?

Have you had a look at the existing test cases for this in gcc/testsuite/gcc.target/sh?
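For anyone following along, inspecting the combine pass that way might look like the following sketch, assuming the sh-elf cross toolchain used elsewhere in this thread; the number in the dump file name varies between GCC versions, and the grep string is the marker recent combine dumps print when a replacement pattern is accepted:

$ sh-elf-gcc -O2 -m2e -fdump-rtl-all -S f.c
$ ls f.c.*r.combine        # one dump file per RTL pass; numbering varies
$ grep -B2 -A6 'Successfully matched this instruction' f.c.*r.combine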
> The tst insn is mainly formed by the combine pass, which relies on certain
> insn patterns and combinations thereof. See also sh.md, around line 530.

I'm sorry, but the .md language is too complicated for me.

> You can look at the debug output with the -fdump-rtl-all option to see
> what's happening in the RTL passes.

It looks like with optimization enabled it converts the bitwise AND to a right shift and then optimizes again. SH4 has 'shad', and 'shad' can be optimized to 'tst'. But SH2E has a libcall instead of a dynamic shift, and the libcall cannot be converted. It seems that the very first optimization spoils things.

However, when we get a sequence of plain 'shar' instructions, the optimizer joins the game again and converts them to 'tst'.

> Does your patch simply stop emitting the ashrsi3_n insn for constant shifts
> altogether? I'd guess it makes code that actually needs those real shifts
> larger, since it will always emit the whole shift stitching sequence. That
> might or might not be a good thing.

You are absolutely right, the code will be larger when we do real right shifts. But there are situations where you can't use the library. For example, an SH7055 engine control unit can barely contain the program as it is. The library just won't fit.
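The two forms being flipped back and forth here are the same single-bit predicate, which is what makes or breaks the later tst #imm,r0 selection. A minimal, self-contained illustration (the function names are made up for this example, not from the thread):

/* Both functions compute "is bit 6 of v set?".  The middle end may
   canonicalize one form into the other. */
#include <assert.h>

static int by_and (unsigned char v)   { return (v & 0x40) == 0x40; }
static int by_shift (unsigned char v) { return (v >> 6) & 1; }

int main (void)
{
  for (unsigned int v = 0; v < 256; ++v)
    assert (by_and ((unsigned char) v) == by_shift ((unsigned char) v));
  return 0;
}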
(In reply to Alexander Klepikov from comment #39)
>
> I'm sorry, but the .md language is too complicated for me.

Yeah, it looks alien at first sight. But it's where a lot of things happen w.r.t. instruction selection.

> It looks like with optimization enabled it converts the bitwise AND to a
> right shift and then optimizes again. SH4 has 'shad', and 'shad' can be
> optimized to 'tst'. But SH2E has a libcall instead of a dynamic shift, and
> the libcall cannot be converted. It seems that the very first optimization
> spoils things.
>
> However, when we get a sequence of plain 'shar' instructions, the optimizer
> joins the game again and converts them to 'tst'.

Yes, something like this is what I remember happening there. I'll try to look into the issue with your test cases and see if it's possible to add some patterns to catch those.

BTW, have you tried it on a more recent GCC? There have also been some optimizations in the middle-end (a bit more backend independent) for this kind of thing.

> You are absolutely right, the code will be larger when we do real right
> shifts. But there are situations where you can't use the library. For
> example, an SH7055 engine control unit can barely contain the program as it
> is. The library just won't fit.

Have you tried whole-program optimization via -flto and -ffunction-sections? It should be able to strip out all unnecessary library functions.
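Such a size-focused build might look like the sketch below; the flags are standard GCC/Binutils options, --gc-sections being the linker side that actually drops the unused sections, while the file names are placeholders:

$ sh-elf-gcc -O2 -m2e -flto -ffunction-sections -fdata-sections -c main.c
$ sh-elf-gcc -O2 -m2e -flto -Wl,--gc-sections -o firmware.elf main.o
$ sh-elf-size firmware.elf    # compare against a build without these flags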
> > It looks like with optimization enabled it converts the bitwise AND to a
> > right shift and then optimizes again. SH4 has 'shad', and 'shad' can be
> > optimized to 'tst'. But SH2E has a libcall instead of a dynamic shift,
> > and the libcall cannot be converted. It seems that the very first
> > optimization spoils things.
> >
> > However, when we get a sequence of plain 'shar' instructions, the
> > optimizer joins the game again and converts them to 'tst'.
>
> Yes, something like this is what I remember happening there. I'll try to
> look into the issue with your test cases and see if it's possible to add
> some patterns to catch those.

Thank you! I have an idea: if it's impossible to defer the initial optimization, maybe it's possible to emit some intermediate insn, then catch it and optimize it later?

> BTW, have you tried it on a more recent GCC? There have also been some
> optimizations in the middle-end (a bit more backend independent) for this
> kind of thing.

I tried 13.1 about a week or two ago, with the same result.

> Have you tried whole-program optimization via -flto and
> -ffunction-sections? It should be able to strip out all unnecessary library
> functions.

Thank you for the advice, I'll give it a try.
(In reply to Alexander Klepikov from comment #41)
>
> Thank you! I have an idea: if it's impossible to defer the initial
> optimization, maybe it's possible to emit some intermediate insn, then
> catch it and optimize it later?

This is basically what is supposed to be happening there already. However, it's a bit of a dilemma:

1) If we don't have a dynamic shift insn and we smash the constant shift into individual stitching shifts early, it might open some new optimization opportunities, e.g. by sharing intermediate shift results. Not sure if that actually happens in practice though.

2) Whether we use the dynamic shift insn, emit a function call, or use a stitching shift sequence, it all has an impact on register allocation and other instruction use. This can be problematic during the course of the RTL optimization passes.

3) Even if we have a dynamic shift, sometimes it's more beneficial to emit a shorter stitching shift sequence. Which one is better depends on the surrounding code. I don't think there is anything good there to make the proper choice.

Some other shift related PRs: PR 54089, PR 65317, PR 67691, PR 67869, PR 52628, PR 58017

> > BTW, have you tried it on a more recent GCC? There have also been some
> > optimizations in the middle-end (a bit more backend independent) for this
> > kind of thing.
>
> I tried 13.1 about a week or two ago, with the same result.

Good to know. Thanks for checking it!
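To illustrate the 'stitching' mentioned in points 1 and 3 above: SH1 only has fixed-count logical shifts (by 1, 2, 8 and 16) plus the 1-bit arithmetic 'shar'. A hedged sketch of what that means for a shift by 6:

! sketch: logical right shift of r0 by 6, stitched from 2-bit shifts
	shlr2	r0
	shlr2	r0
	shlr2	r0
! arithmetic right shift has only the 1-bit shar, so an ashiftrt by 6
! would take six insns, which is why the __ashiftrt_r4_N helpers exist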
> > Thank you! I have an idea: if it's impossible to defer the initial
> > optimization, maybe it's possible to emit some intermediate insn, then
> > catch it and optimize it later?
>
> This is basically what is supposed to be happening there already.

Well, not really. Look at what happens during the expand pass when 'ashrsi3' is expanded. The function 'expand_ashiftrt' is called, and what it does at the end is explicitly emit 3 insns:

  wrk = gen_reg_rtx (Pmode);

  //This one
  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
  sprintf (func, "__ashiftrt_r4_%d", value);
  rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;
  //This one
  emit_insn (gen_ashrsi3_n (GEN_INT (value), wrk, lab));
  //And this one
  emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));

As far as I understand, these insns could be caught later by a peephole and converted to a 'tstsi_t' insn, like it is done for other, much simpler insn sequences.

What I'm thinking about is emitting only one, say 'compound', insn instead, which could then be split later, somewhere in the split pass, into the function call, i.e. into those 3 insns.

I wrote test code that emits only one bogus insn. This insn expands to pure asm code. Of course, that asm code is invalid, because it is impossible to place a libcall label at the end of a function with a pure asm code injection. But then everything that can be converted to 'tst' is converted to 'tst'. And all real right shifts are converted to invalid asm code, of course. That's why I am thinking about the possibility of emitting some intermediate insn in the expand pass that defers the real expansion. But I still don't know how to do it right, or even whether it is possible.

By the way, a right shift of an integer expands to only one 'lshiftrt' insn, and that's why it can be caught and converted to 'tst'.

> However, it's a bit of a dilemma:
>
> 1) If we don't have a dynamic shift insn and we smash the constant shift
> into individual stitching shifts early, it might open some new optimization
> opportunities, e.g. by sharing intermediate shift results. Not sure if that
> actually happens in practice though.
>
> 2) Whether we use the dynamic shift insn, emit a function call, or use a
> stitching shift sequence, it all has an impact on register allocation and
> other instruction use. This can be problematic during the course of the RTL
> optimization passes.
>
> 3) Even if we have a dynamic shift, sometimes it's more beneficial to emit
> a shorter stitching shift sequence. Which one is better depends on the
> surrounding code. I don't think there is anything good there to make the
> proper choice.
>
> Some other shift related PRs: PR 54089, PR 65317, PR 67691, PR 67869,
> PR 52628, PR 58017

Thank you for your time and detailed explanations! I agree with you on all points. Software cannot be perfect, and it's OK for GCC not to be super optimized, so this part had better be left intact.

By the way, I tried linking the library into my project and figured out that the linker is smart enough to link in only the necessary library functions, even without LTO. So the increase in size is about 100 or 200 bytes, which is acceptable. Thank you very much for your help!
(In reply to Alexander Klepikov from comment #43)
>
> Well, not really. Look at what happens during the expand pass when
> 'ashrsi3' is expanded. The function 'expand_ashiftrt' is called, and what
> it does at the end is explicitly emit 3 insns:
> [...]
>
> By the way, a right shift of an integer expands to only one 'lshiftrt'
> insn, and that's why it can be caught and converted to 'tst'.

Yeah, I might have dropped the ball on the right shift patterns back then and only reworked the left shift patterns to do that.

> As far as I understand, these insns could be caught later by a peephole and
> converted to a 'tstsi_t' insn, like it is done for other, much simpler insn
> sequences.

It's the combine RTL pass and the split1 RTL pass that do most of this work here. The peephole pass in GCC is too simplistic for this.

> Thank you for your time and detailed explanations! I agree with you on all
> points. Software cannot be perfect, and it's OK for GCC not to be super
> optimized, so this part had better be left intact.

We can't have it perfect, but we can try ;)

> By the way, I tried linking the library into my project and figured out
> that the linker is smart enough to link in only the necessary library
> functions, even without LTO. So the increase in size is about 100 or 200
> bytes, which is acceptable. Thank you very much for your help!

You're welcome. Yes, stripping out unused library functions doesn't need LTO. But using LTO for embedded/MCU firmware, I find it can reduce the code size by about 20%. For example, it can also inline small library functions into your code (if the library was also compiled with LTO).
> I have an idea: if it's impossible to defer the initial optimization, maybe
> it's possible to emit some intermediate insn, then catch it and optimize it
> later?

Good news. I've made a proof of concept. It works at least sometimes, on simple tests.

$ cat f.c
#define A 0xFFFF0000
#define P ((unsigned char *)A)
#define F 64
#define S 8

unsigned char f_non_zero(unsigned char v){
  return (v & F) != 0;
}

unsigned f_sym_non_zero(void){
  return (*P & F) != 0;
}

unsigned f_sym_mask(void){
  return (*P & F) == F;
}

int f_rshift(char v){
  return v >> S;
}

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -O2 -mb -m2e -da -S f.c
$ cat f.s
	.file	"f.c"
	.text
	.text
	.align 1
	.align 2
	.global	_f_non_zero
	.type	_f_non_zero, @function
_f_non_zero:
	mov	r4,r0
	sts.l	pr,@-r15
	tst	#64,r0
	mov	#-1,r0
	negc	r0,r0
	lds.l	@r15+,pr
	rts
	nop
	.size	_f_non_zero, .-_f_non_zero
	.align 1
	.align 2
	.global	_f_sym_non_zero
	.type	_f_sym_non_zero, @function
_f_sym_non_zero:
	mov.l	.L6,r1
	sts.l	pr,@-r15
	mov.b	@r1,r0
	tst	#64,r0
	mov	#-1,r0
	negc	r0,r0
	lds.l	@r15+,pr
	rts
	nop
.L7:
	.align 2
.L6:
	.long	-65536
	.size	_f_sym_non_zero, .-_f_sym_non_zero
	.align 1
	.align 2
	.global	_f_sym_mask
	.type	_f_sym_mask, @function
_f_sym_mask:
	mov.l	.L10,r1
	sts.l	pr,@-r15
	mov.b	@r1,r0
	tst	#64,r0
	mov	#-1,r0
	negc	r0,r0
	lds.l	@r15+,pr
	rts
	nop
.L11:
	.align 2
.L10:
	.long	-65536
	.size	_f_sym_mask, .-_f_sym_mask
	.align 1
	.align 2
	.global	_f_rshift
	.type	_f_rshift, @function
_f_rshift:
	mov.l	.L14,r1
	sts.l	pr,@-r15
	jsr	@r1
	exts.b	r4,r4
	mov	r4,r0
	lds.l	@r15+,pr
	rts
	nop
.L15:
	.align 2
.L14:
	.long	___ashiftrt_r4_8
	.size	_f_rshift, .-_f_rshift
	.ident	"GCC: (GNU) 13.1.0"

$ /usr/local/sh-toolchain/bin/sh-elf-gcc -O2 -ml -m2e -da -S f.c

The little endian output is identical apart from an added '.little' directive.

Splitting takes place in the split1 pass, as expected. Here is the patch itself.
$ cat gcc-13.1.0-ashrsi3_libcall.patch
diff -ur gcc-13.1.0.orig/gcc/config/sh/sh-protos.h gcc-13.1.0/gcc/config/sh/sh-protos.h
--- gcc-13.1.0.orig/gcc/config/sh/sh-protos.h	2023-04-26 10:09:39.000000000 +0300
+++ gcc-13.1.0/gcc/config/sh/sh-protos.h	2023-05-29 11:45:05.134723435 +0300
@@ -78,6 +78,7 @@
 extern void gen_shifty_op (int, rtx *);
 extern void gen_shifty_hi_op (int, rtx *);
 extern bool expand_ashiftrt (rtx *);
+extern bool expand_ashrsi3_libcall (rtx *);//delete
 extern bool sh_dynamicalize_shift_p (rtx);
 extern int shl_and_kind (rtx, rtx, int *);
 extern int shl_and_length (rtx);
diff -ur gcc-13.1.0.orig/gcc/config/sh/sh.cc gcc-13.1.0/gcc/config/sh/sh.cc
--- gcc-13.1.0.orig/gcc/config/sh/sh.cc	2023-04-26 10:09:39.000000000 +0300
+++ gcc-13.1.0/gcc/config/sh/sh.cc	2023-05-29 17:09:54.602787537 +0300
@@ -3875,11 +3877,37 @@
   wrk = gen_reg_rtx (Pmode);
 
   /* Load the value into an arg reg and call a helper.  */
-  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
+  /*emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
   sprintf (func, "__ashiftrt_r4_%d", value);
   rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;
   emit_insn (gen_ashrsi3_n (GEN_INT (value), wrk, lab));
   emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));
+  return true;*/
+
+  if (dump_file)
+    fprintf(dump_file, "ashrsi3: Emitting collapsed libcall\n");
+  emit_insn (gen_ashrsi3_libcall_collapsed (operands[0], operands[1], GEN_INT(value)));//delete
+  return true;//delete
+}
+
+//delete
+bool
+expand_ashrsi3_libcall (rtx *operands) {
+  char func[18];
+
+  if (dump_file)
+    fprintf(dump_file, "ashrsi3_libcall_collapsed: Expanding ashrsi3 libcall\n");
+
+  rtx wrk = gen_reg_rtx (Pmode);
+  emit_move_insn (gen_rtx_REG (SImode, 4), operands[1]);
+
+  sprintf (func, "__ashiftrt_r4_%d", INTVAL (operands[2]));
+
+  rtx lab = function_symbol (wrk, func, SFUNC_STATIC).lab;
+
+  emit_insn (gen_ashrsi3_n (operands[2], wrk, lab));
+  emit_move_insn (operands[0], gen_rtx_REG (SImode, 4));
+  return true;
+}
diff -ur gcc-13.1.0.orig/gcc/config/sh/sh.md gcc-13.1.0/gcc/config/sh/sh.md
--- gcc-13.1.0.orig/gcc/config/sh/sh.md	2023-04-26 10:09:39.000000000 +0300
+++ gcc-13.1.0/gcc/config/sh/sh.md	2023-05-29 17:10:42.752779922 +0300
@@ -3867,6 +3867,35 @@
   [(set_attr "type" "sfunc")
    (set_attr "needs_delay_slot" "yes")])
 
+(define_insn "ashrsi3_libcall_collapsed"
+  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
+	(ashiftrt:SI (match_operand:SI 1 "arith_reg_operand" "0")
+		     (match_operand:SI 2 "const_int_operand")))
+   (clobber (reg:SI T_REG))
+   (clobber (reg:SI PR_REG))]
+  "TARGET_SH1"
+  "OOPS"
+  [(set_attr "type" "dyn_shift")
+   (set_attr "needs_delay_slot" "yes")])
+
+(define_insn_and_split "ashrsi3_libcall_expand"
+  [(parallel [(set (match_operand:SI 0 "arith_reg_dest")
+		   (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand")
+				(match_operand:SI 2 "const_int_operand")))
+	      (clobber (reg:SI T_REG))
+	      (clobber (reg:SI PR_REG))])]
+  "TARGET_SH1"
+  "OOPS_1"
+  "&& 1"
+  [(const_int 0)]
+{
+  if (expand_ashrsi3_libcall (operands))
+    DONE;
+  else
+    FAIL;
+})
+
 ;; . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
 ;; DImode arithmetic shift right

I did it by feel, basically adjusting the parameters until it worked. So please check it and improve it, because I'm sure it will break something. Thank you.
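Since the patch prints its markers to dump_file, the expand/split behavior can be observed directly, along the lines of the following sketch (the pass number in the dump file name varies between GCC versions):

$ sh-elf-gcc -O2 -m2e -fdump-rtl-expand -fdump-rtl-split1 -S f.c
$ grep 'Emitting collapsed libcall' f.c.*r.expand
$ grep 'Expanding ashrsi3 libcall' f.c.*r.split1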
Reminder that patches go to the gcc-patches mailing list
(In reply to Eric Gallager from comment #46)
> Reminder that patches go to the gcc-patches mailing list

It's OK. We're just throwing around ideas, nothing final.
I ran tests (including the *.c files from the GCC testsuite) and everything looks fine so far. But I'm still afraid that the pattern for 'ashrsi3_libcall_expand' is too broad. It is possible to narrow it down as much as possible by adding a distinct attribute, setting it when emitting 'ashrsi3_libcall_collapsed', and then checking it and failing if it is not set:

(define_attr "libcall_collapsed" "ashrsi3,nil"
  (const_string "nil"))

(define_insn "ashrsi3_libcall_collapsed"
  [(set (match_operand:SI 0 "arith_reg_dest" "=r")
	(ashiftrt:SI (match_operand:SI 1 "arith_reg_operand" "0")
		     (match_operand:SI 2 "const_int_operand")))
   (clobber (reg:SI T_REG))
   (clobber (reg:SI PR_REG))]
  "TARGET_SH1"
  "OOPS"
  [(set_attr "type" "dyn_shift")
   (set_attr "libcall_collapsed" "ashrsi3")
   (set_attr "needs_delay_slot" "yes")])

  if (get_attr_libcall_collapsed (insn) != LIBCALL_COLLAPSED_ASHRSI3)
    return false;

That would make it super safe, but a little bit ugly.
(In reply to Alexander Klepikov from comment #48)
> I ran tests (including the *.c files from the GCC testsuite) and everything
> looks fine so far. But I'm still afraid that the pattern for
> 'ashrsi3_libcall_expand' is too broad. It is possible to narrow it down as
> much as possible by adding a distinct attribute, setting it when emitting
> 'ashrsi3_libcall_collapsed', and then checking it and failing if it is not
> set:

For this kind of change, the whole GCC test suite needs to be run, at least for the big/little endian -m2 and -m4 variants.

+(define_insn_and_split "ashrsi3_libcall_expand"
+  [(parallel [(set (match_operand:SI 0 "arith_reg_dest")
+		   (ashiftrt:SI (match_operand:SI 1 "arith_reg_operand")
+				(match_operand:SI 2 "const_int_operand")))
+	      (clobber (reg:SI T_REG))
+	      (clobber (reg:SI PR_REG))])]

The 'parallel' construct looks strange.
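One likely reason it looks strange: in a define_insn or define_insn_and_split, a pattern vector with more than one element is already treated as an implicit parallel, so the explicit wrapper is redundant. A sketch of the conventional spelling of the same pattern, assuming the behavior is otherwise unchanged:

;; The top-level vector is implicitly a parallel, so the explicit
;; (parallel ...) wrapper can simply be dropped.
(define_insn_and_split "ashrsi3_libcall_expand"
  [(set (match_operand:SI 0 "arith_reg_dest")
	(ashiftrt:SI (match_operand:SI 1 "arith_reg_operand")
		     (match_operand:SI 2 "const_int_operand")))
   (clobber (reg:SI T_REG))
   (clobber (reg:SI PR_REG))]
  "TARGET_SH1"
  "OOPS_1"
  "&& 1"
  [(const_int 0)]
{
  if (expand_ashrsi3_libcall (operands))
    DONE;
  else
    FAIL;
})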
Actually, let's take any further discussion of shift patterns to PR 54089.