Bug 115756 - default tuning for x86_64 produces shifts for `*240`
Summary: default tuning for x86_64 produces shifts for `*240`
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 15.0
Importance: P3 normal
Target Milestone: 15.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 115749
 
Reported: 2024-07-02 18:13 UTC by Andrew Pinski
Modified: 2024-08-16 04:59 UTC
CC: 2 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Andrew Pinski 2024-07-02 18:13:49 UTC
Take:
```

unsigned long func(unsigned long x)
{
  return x * 240;
}
```

For most modern CPUs (I think), it is faster to use imul here than a sequence of shifts/adds/subs. clang's default tuning even uses imul. -mtune=skylake and -mtune=znver[123] all produce imul. Note that -mtune=lunarlake/arrowlake does not, but maybe those tunings are wrong.

Split out from PR 115749.
Comment 1 Roger Sayle 2024-07-02 18:55:40 UTC
I'm surprised that the difference in performance is (so) observable.  The sal/sub/sal sequence (which admittedly has a long dependency chain) consists of three single-cycle latency instructions, whereas the imul is documented by Agner Fog to also have a latency of 3.  The biggest difference may be the number of instructions and bytes (a decoder bottleneck?).
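
For reference, the shift-based sequence relies on the identity 240*x = ((x << 4) - x) << 4; the following C sketch only illustrates that arithmetic identity, not GCC's actual internal expansion:

```
/* The arithmetic behind the sal/sub/sal sequence:
   240 * x == ((x << 4) - x) << 4, i.e. (16*x - x) * 16 == 15*x * 16.
   Illustration of the identity only, not GCC's expansion.  */
unsigned long mul240_shifts(unsigned long x)
{
  unsigned long t = (x << 4) - x;   /* 15 * x: one shift, one subtract */
  return t << 4;                    /* * 16: one more shift, total 240 * x */
}
```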

Interestingly, if you specify -Os, GCC uses the imul.

Is the register value being multiplied by 240 always 1 or 0, allowing the hardware to invoke some form of bypass?  The constant 240 has a popcount of 4 (four set bits), so implementing the multiplication in only three instructions (which may be scheduled concurrently with other operations) is pretty impressive.  Perhaps someone can post a microbenchmark?
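
This is not from the original report, but a minimal sketch of the kind of microbenchmark asked for above might look like the following; the serially dependent chain makes latency (rather than throughput) dominate, and the function name, seed, and iteration count are illustrative only:

```
/* Illustrative sketch only (not from the report): time a long chain of
   serially dependent multiplies by 240, so the result is dominated by the
   latency of whatever sequence (imulq vs. shift/sub/shift) GCC emits.
   Build e.g. with -O2 and different -mtune values and compare.  */
#include <stdio.h>
#include <time.h>

__attribute__((noinline))
static unsigned long chain(unsigned long x, unsigned long iters)
{
  for (unsigned long i = 0; i < iters; i++)
    x = x * 240 + 1;              /* +1 keeps the chain from collapsing to 0 */
  return x;
}

int main(void)
{
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  unsigned long r = chain(3, 1000000000UL);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf("result %lu, %.3f s for 1e9 dependent multiplies by 240\n", r, s);
  return 0;
}
```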
Comment 2 Richard Biener 2024-07-03 07:59:48 UTC
It probably very much depends on the surrounding code and port utilization, but for generic tuning I'd lean towards shift+add because, for example, Intel E-cores have a slow imul.
Comment 3 Hongtao Liu 2024-07-03 08:09:53 UTC
The current rtx cost for imulq in generic_cost is COSTS_N_INSNS (4); changing it to COSTS_N_INSNS (3) would make GCC generate imulq:


  {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
   COSTS_N_INSNS (4),			/*				 HI */
   COSTS_N_INSNS (3),			/*				 SI */
   COSTS_N_INSNS (4),			/*				 DI */
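
For illustration, the change described above would amount to something like the following (a sketch only; the initializer is assumed to be the one in generic_cost in gcc/config/i386/x86-tune-costs.h, and the actual committed fix may differ):

```
/* Sketch of the adjustment described in comment #3 (assumed location:
   generic_cost in gcc/config/i386/x86-tune-costs.h; the committed patch
   may differ).  With a DImode multiply start cost of 3, the rtx-cost
   comparison no longer prefers the three-instruction shift/sub/shift
   expansion over a single imulq.  */
  {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
   COSTS_N_INSNS (4),			/*				 HI */
   COSTS_N_INSNS (3),			/*				 SI */
   COSTS_N_INSNS (3),			/*				 DI (was 4) */
```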
Comment 4 Hongtao Liu 2024-08-15 05:11:13 UTC
Fixed in GCC 15.