Take:
```
unsigned long func(unsigned long x) { return x * 240; }
```
On most modern CPUs (I think), it is faster to use imul here instead of a shift/add/sub sequence. Even clang's default tuning uses imul, and -mtune=skylake and -mtune=znver[123] all produce imul. Note that -mtune=lunarlake/arrowlake do not, but maybe they are wrong. Split out from PR 115749.
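For reference, the sal/sub/sal sequence that GCC's generic tuning emits instead corresponds to this decomposition (a sketch in C, not the compiler's literal output):
```
/* 240*x == ((x << 4) - x) << 4, i.e. (16x - x) * 16 == 15x * 16 */
unsigned long func_shifts(unsigned long x) {
    unsigned long t = x << 4;  /* 16x  (sal $4) */
    t -= x;                    /* 15x  (sub)    */
    return t << 4;             /* 240x (sal $4) */
}
```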
I'm surprised that the difference in performance is (so) observable. The sal/sub/sal sequence (which admittedly forms a serial dependency chain) consists of three single-cycle-latency instructions, whereas imul is documented by Agner Fog to also have a latency of 3 cycles. The biggest difference may be the number of instructions and bytes (a decoder bottleneck?). Interestingly, with -Os GCC uses the imul. Is the register value being multiplied by 240 always 1 or 0, allowing the hardware to invoke some form of bypass? The constant 240 has four set bits (popcount 4), so implementing the multiplication in only three instructions (which can still overlap with surrounding, independent operations) is pretty impressive. Perhaps someone can post a microbenchmark?
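Since a microbenchmark was requested, here is a minimal latency sketch (mine, not from this PR; assumes x86-64 with GCC or Clang, and results will vary by core). Inline asm pins each sequence so the compiler cannot canonicalize one into the other, and each result feeds the next input so the dependency chain's latency dominates:
```
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* x * 240 via the immediate form of imul */
static inline uint64_t mul_imul(uint64_t x) {
    uint64_t r;
    __asm__("imulq $240, %1, %0" : "=r"(r) : "r"(x));
    return r;
}

/* x * 240 via the sal/sub/sal sequence: ((x << 4) - x) << 4.
   The earlyclobber (&) keeps x out of the output register,
   which is overwritten before x is read. */
static inline uint64_t mul_shift(uint64_t x) {
    uint64_t t = x;
    __asm__("salq $4, %0\n\t"
            "subq %1, %0\n\t"
            "salq $4, %0"
            : "+&r"(t) : "r"(x));
    return t;
}

static double elapsed_ns(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int main(void) {
    enum { N = 100000000 };
    struct timespec t0, t1;
    uint64_t x = 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        x = mul_imul(x) | 1;   /* | 1 keeps x nonzero (240^16 * x would wrap to 0) */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("imul:        %.3f ns/op (x=%llu)\n",
           elapsed_ns(t0, t1) / N, (unsigned long long)x);

    x = 1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        x = mul_shift(x) | 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("sal/sub/sal: %.3f ns/op (x=%llu)\n",
           elapsed_ns(t0, t1) / N, (unsigned long long)x);
    return 0;
}
```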
It probably depends very much on the surrounding code and port utilization, but for generic tuning I'd lean towards shift+add because, for example, Intel E-cores have a slow imul.
The current rtx_cost for imulq in generic_cost is COSTS_N_INSNS (4); lowering it to COSTS_N_INSNS (3) makes GCC generate the imulq:
```
  {COSTS_N_INSNS (3),	/* cost of starting multiply for QI */
   COSTS_N_INSNS (4),	/*				 HI */
   COSTS_N_INSNS (3),	/*				 SI */
   COSTS_N_INSNS (4),	/*				 DI */
```
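For concreteness, a sketch of that tweak against the table above (assuming this is the generic_cost entry in gcc/config/i386/x86-tune-costs.h; not necessarily the committed patch):
```
  {COSTS_N_INSNS (3),	/* cost of starting multiply for QI */
   COSTS_N_INSNS (4),	/*				 HI */
   COSTS_N_INSNS (3),	/*				 SI */
   COSTS_N_INSNS (3),	/*				 DI: was 4; at 3, expand keeps the imulq */
```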
Fixed in GCC 15.