Consider:

  typedef unsigned char T;
  T i[2];
  int f() {
    T *p = &i[0], *q = &i[1];
    T c = __builtin_add_overflow(*p, 1, p);
    *q += c;
  }

The desired code sequence on x86_64 is:

  addb $1, i(%rip)
  adcb $0, i+1(%rip)

What we get instead of the desired addb is a separate load, addb, and store.
There are two reasons why we don't combine them to form the addb:

- when we try_combine the three of them, the flag-store insn is still present,
  between M (the add) and W (the store), so can_combine_p fails.  After we
  combine the flag-store into the adcb, we do not retry.

- if I manually force the retry, we break up the M parallel insn into a naked
  add in i2 and a flag-setting non-canonical compare in i0.  We substitute R
  and M into W, yielding an add without flag-setting.  Finally, we deal with
  added_sets, building a new parallel to hold the RMW add and appending the
  flag-setter as the second item, after the combined add.  Alas, recog won't
  match them in this order: *add<mode>3_cc_overflow_1 requires the flag-setter
  before the reg-setter.

Here's discussion and combine dumps from a slightly different testcase that
triggers the same combine behavior:
https://gcc.gnu.org/pipermail/gcc-patches/2020-September/553242.html
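For reference, here is a minimal self-contained variant of the testcase (the
inc/main wrapper is mine, purely illustrative) showing the intended semantics:
the two bytes of i act as a little-endian 16-bit counter, and the carry out of
the low-byte increment is propagated into the high byte, which is exactly what
the addb/adcb pair implements in place.

  /* Illustrative sketch, not the reproducer above.  */
  #include <stdio.h>

  typedef unsigned char T;
  static T i[2];

  static void inc(void)
  {
    T c = __builtin_add_overflow(i[0], 1, &i[0]);  /* low byte += 1, c = carry */
    i[1] += c;                                     /* propagate carry          */
  }

  int main(void)
  {
    i[0] = 0xff;
    i[1] = 0x01;
    inc();
    printf("%02x %02x\n", i[1], i[0]);  /* expect: 02 00 */
    return 0;
  }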
One nit: I wrote that the flag-setting non-canonical compare ended up in i0,
but it actually becomes i1, with the original i1 (R) moved to i0.
  insn_cost 8 for     6: r91:SI=zero_extend([`i'])
  insn_cost 4 for     8: {flags:CCC=cmp(r91:SI#0+0x1,r91:SI#0);r92:QI=r91:SI#0+0x1;}
        REG_DEAD r91:SI
  insn_cost 4 for    33: r89:QI=ltu(flags:CCC,0)
        REG_DEAD flags:CCC
  insn_cost 4 for    17: [`i']=r92:QI
        REG_DEAD r92:QI

8+17 does not work, because the combined insn replaces 17, but it has to stay
before 33.  8+33+17 is never tried; I don't immediately see why not.  It may
still not work because of the irregularities in the x86 ISA, but 6+8+33+17
should work (if that would be tried...  but if 8+33+17 already is not done,
that needs to be fixed first).
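As a side note, a small C model (my own naming, purely illustrative) of what
insns 6, 8, 33 and 17 compute, i.e. the load, the flag-setting add, the carry
extraction via ltu(flags:CCC,0), and the store that a single RMW addb would
cover:

  #include <stdio.h>

  static unsigned char mem_i = 0xff;   /* models [`i'] */

  int main(void)
  {
    unsigned int  r91 = mem_i;                     /* insn 6:  zero_extend load      */
    unsigned char r92 = (unsigned char)(r91 + 1);  /* insn 8:  low byte of the sum   */
    unsigned char r89 = (r91 + 1) > 0xff;          /* insn 33: carry out (flags:CCC) */
    mem_i = r92;                                   /* insn 17: store back            */

    printf("mem_i=%02x carry=%u\n", mem_i, r89);   /* expect: mem_i=00 carry=1 */
    return 0;
  }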