Bug 96965 - combine RMW and flags
Summary: combine RMW and flags
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 11.0
: P3 normal
Target Milestone: ---
Assignee: Segher Boessenkool
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-09-07 22:09 UTC by Alexandre Oliva
Modified: 2021-05-04 12:31 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-09-08 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Alexandre Oliva 2020-09-07 22:09:16 UTC
Consider:

typedef unsigned char T;
T i[2];
int f() {
  T *p = &i[0], *q = &i[1];
  T c = __builtin_add_overflow(*p, 1, p);
  *q += c;
}

The desired code sequence on x86_64 is:

  addb $1, i(%rip)
  adcb $0, i+1(%rip)

What we get instead of the desired addb are separate load, addb, and store instructions.  There are two reasons why we don't combine them to form the addb:

- when we try_combine the 3 of them, the flag-store insn is still present, between M (add) and W (store), thus can_combine_p fails.  after we combine the flag-store into adcb, we do not retry

- if I manually force the retry, we break up the M parallel insn into a naked add in i2, and a flag-setting non-canonical compare in i0.  we substitute R and M into W, for an add without flag-setting.  finally, we deal with added_sets, building a new parallel to hold the RMW add and appending the flag-setter as the second item, after the combined add.  alas, recog won't match them in this order.  *add<mode>3_cc_overflow_1 requires the flag-setter before the reg-setter.

Here's discussion and combine dumbs from a slightly different testcase that triggers the same combine behavior:

https://gcc.gnu.org/pipermail/gcc-patches/2020-September/553242.html
Comment 1 Alexandre Oliva 2020-09-07 22:16:29 UTC
One nit: I wrote the flag-setting non-canonical compare ended up in i0, but it actually becomes i1, with the original i1 (R) moved to i0.
Comment 2 Segher Boessenkool 2020-09-08 19:04:22 UTC
insn_cost 8 for     6: r91:SI=zero_extend([`i'])
insn_cost 4 for     8: {flags:CCC=cmp(r91:SI#0+0x1,r91:SI#0);r92:QI=r91:SI#0+0x1
;}
      REG_DEAD r91:SI
insn_cost 4 for    33: r89:QI=ltu(flags:CCC,0)
      REG_DEAD flags:CCC
insn_cost 4 for    17: [`i']=r92:QI
      REG_DEAD r92:QI

8+17 does not work, because the combined insn replaces 17, but it has to
stay before 33.

8+33+17 is never tried.  I don't immediately see why not?

It may still not work because of the irregularities in the x86 ISA, but
6+8+33+17 should work (if that would be tried...  but if 8+33+17 already
is not done, that needs to be fixed first).