gcc 10.3.1 misses an easy oportunity for optimization: Instruction: if ((bits<<=1)>=0) goto p3; generated code: strh r7, [r0, #4] lsls r3, r2, #1 lsls r2, r2, #1 bpl .L8 i believe the 2nd "lsls" is superfluous. It would save me 3MHz if it wasn't there. :-) full example: https://godbolt.org/z/xocbvjn5x
Not a C++ FE problem.
Created attachment 53032 [details] testcase
Combine: Trying 41 -> 44: 41: r120:SI=r119:SI<<0x1a REG_DEAD r119:SI 44: pc={(r120:SI<0)?L114:pc} REG_BR_PROB 283038348 Failed to match this instruction: (parallel [ (set (pc) (if_then_else (ne (zero_extract:SI (reg:SI 119 [ _7 ]) (const_int 1 [0x1]) (const_int 5 [0x5])) (const_int 0 [0])) (label_ref 114) (pc))) (set (reg/v:SI 120 [ bits ]) (ashift:SI (reg:SI 119 [ _7 ]) (const_int 26 [0x1a]))) ]) Failed to match this instruction: (parallel [ (set (pc) (if_then_else (ne (zero_extract:SI (reg:SI 119 [ _7 ]) (const_int 1 [0x1]) (const_int 5 [0x5])) (const_int 0 [0])) (label_ref 114) (pc))) (set (reg/v:SI 120 [ bits ]) (ashift:SI (reg:SI 119 [ _7 ]) (const_int 26 [0x1a]))) ]) Failed to match this instruction: (parallel [ (set (pc) (if_then_else (ne (and:SI (lshiftrt:SI (reg:SI 119 [ _7 ]) (const_int 5 [0x5])) (const_int 1 [0x1])) (const_int 0 [0])) (label_ref 114) (pc))) (set (reg/v:SI 120 [ bits ]) (ashift:SI (reg:SI 119 [ _7 ]) (const_int 26 [0x1a]))) ]) Failed to match this instruction: (parallel [ (set (pc) (if_then_else (ne (and:SI (lshiftrt:SI (reg:SI 119 [ _7 ]) (const_int 5 [0x5])) (const_int 1 [0x1])) (const_int 0 [0])) (label_ref 114) (pc))) (set (reg/v:SI 120 [ bits ]) (ashift:SI (reg:SI 119 [ _7 ]) (const_int 26 [0x1a]))) ]) Successfully matched this instruction: (set (reg/v:SI 120 [ bits ]) (ashift:SI (reg:SI 119 [ _7 ]) (const_int 26 [0x1a]))) Successfully matched this instruction: (set (pc) (if_then_else (ne (zero_extract:SI (reg:SI 119 [ _7 ]) (const_int 1 [0x1]) (const_int 5 [0x5])) (const_int 0 [0])) (label_ref 114) (pc))) allowing combination of insns 41 and 44 original costs 4 + 14 = 18 replacement costs 4 + 14 = 18 modifying insn i2 41: r120:SI=r119:SI<<0x1a deferring rescan insn with uid = 41. modifying insn i3 44: {pc={(zero_extract(r119:SI,0x1,0x5)!=0)?L114:pc};clobber scratch;} REG_DEAD r119:SI REG_BR_PROB 283038348 deferring rescan insn with uid = 44. The cost for the following instruction is wrong: (jump_insn 113 110 114 13 (parallel [ (set (pc) (if_then_else (eq (zero_extract:SI (reg:SI 119 [ _7 ]) (const_int 1 [0x1]) (const_int 5 [0x5])) (const_int 0 [0])) (label_ref 59) (pc))) (clobber (scratch:SI)) ]) "/app/example.cpp":33:26 961 {*tbit_cbranch} (expr_list:REG_DEAD (reg:SI 119 [ _7 ]) (int_list:REG_BR_PROB 457091900 (nil))) -> 59) It should be 18 since it gets splitted into two one which has a cost of 4 and another which is a cost of 14.
Confirmed.
The thumb1 rtx-costing function needs a complete rewrite along the lines and style of the Thumb2 and Arm costing routines. Another thing for my copious free time (TM).
i just came across the test on godbolt again and found they added gcc 11.2.1. compiled code no change. so i updated gcc version for the ticket.