Bug 115756 - default tuning for x86_64 produces shifts for `*240`
Summary: default tuning for x86_64 produces shifts for `*240`
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 15.0
Importance: P3 normal
Target Milestone: 15.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 115749
 
Reported: 2024-07-02 18:13 UTC by Andrew Pinski
Modified: 2024-08-16 04:59 UTC
CC: 2 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Andrew Pinski 2024-07-02 18:13:49 UTC
Take:
```

unsigned long func(unsigned long x)
{
  return x * 240;
}
```

For most modern CPUs (I think), it is faster to use imul here than a sequence of shifts/adds/subs. clang's default tuning even uses imul. -mtune=skylake and -mtune=znver[123] all produce imul. Note that -mtune=lunarlake/arrowlake does not, but maybe those tunings are wrong.

Split out from PR 115749.
Comment 1 Roger Sayle 2024-07-02 18:55:40 UTC
I'm surprised that the difference in performance is (so) observable.  The sal/sub/sal sequence (which admittedly has a long dependency chain) consists of three single-cycle latency instructions, whereas the imul is documented by Agner Fog to also have a latency of 3.  The biggest difference may be the number of instructions and bytes (a decoder bottleneck?).
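
For reference, the shift-based sequence relies on the identity 240*x = ((x << 4) - x) << 4; the following C sketch only illustrates that arithmetic identity, not GCC's actual internal expansion:

```
/* The arithmetic behind the sal/sub/sal sequence:
   240 * x == ((x << 4) - x) << 4, i.e. (16*x - x) * 16 == 15*x * 16.
   Illustration of the identity only, not GCC's expansion.  */
unsigned long mul240_shifts(unsigned long x)
{
  unsigned long t = (x << 4) - x;   /* 15 * x: one shift, one subtract */
  return t << 4;                    /* * 16: one more shift, total 240 * x */
}
```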

Interestingly, if you specify -Os, GCC uses the imul.

Is the register value being multiplied by 240 always 1 or 0, allowing the hardware to invoke some form of bypass?  The constant 240 has a popcount of 4 (four set bits), so implementing the multiplication in only three instructions (which may be scheduled concurrently with other operations) is pretty impressive.  Perhaps someone can post a microbenchmark?
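
This is not from the original report, but a minimal sketch of the kind of microbenchmark asked for above might look like the following; the serially dependent chain makes latency (rather than throughput) dominate, and the function name, seed, and iteration count are illustrative only:

```
/* Illustrative sketch only (not from the report): time a long chain of
   serially dependent multiplies by 240, so the result is dominated by the
   latency of whatever sequence (imulq vs. shift/sub/shift) GCC emits.
   Build e.g. with -O2 and different -mtune values and compare.  */
#include <stdio.h>
#include <time.h>

__attribute__((noinline))
static unsigned long chain(unsigned long x, unsigned long iters)
{
  for (unsigned long i = 0; i < iters; i++)
    x = x * 240 + 1;              /* +1 keeps the chain from collapsing to 0 */
  return x;
}

int main(void)
{
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  unsigned long r = chain(3, 1000000000UL);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("result %lu, %.3f s for 1e9 dependent multiplies by 240\n", r, s);
  return 0;
}
```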
Comment 2 Richard Biener 2024-07-03 07:59:48 UTC
It probably very much depends on the surrounding code and port utilization, but for generic tuning I'd lean towards shift+add because, for example, Intel E-cores have a slow imul.
Comment 3 Hongtao Liu 2024-07-03 08:09:53 UTC
The current rtx cost for imulq in generic_cost is COSTS_N_INSNS (4); changing it to COSTS_N_INSNS (3) would make GCC generate imulq:


  {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
   COSTS_N_INSNS (4),			/*				 HI */
   COSTS_N_INSNS (3),			/*				 SI */
   COSTS_N_INSNS (4),			/*				 DI */
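
For illustration, the change described above would amount to something like the following (a sketch only; the initializer is assumed to be the one in generic_cost in gcc/config/i386/x86-tune-costs.h, and the actual committed fix may differ):

```
/* Sketch of the adjustment described in comment #3 (assumed location:
   generic_cost in gcc/config/i386/x86-tune-costs.h; the committed patch
   may differ).  With a DImode multiply start cost of 3, the rtx-cost
   comparison no longer prefers the three-instruction shift/sub/shift
   expansion over a single imulq.  */
  {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
   COSTS_N_INSNS (4),			/*				 HI */
   COSTS_N_INSNS (3),			/*				 SI */
   COSTS_N_INSNS (3),			/*				 DI (was 4) */
```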
Comment 4 Hongtao Liu 2024-08-15 05:11:13 UTC
Fixed in GCC 15.