Bug 55583 - Extended shift instruction on x86-64 is not used, producing suboptimal code
Summary: Extended shift instruction on x86-64 is not used, producing suboptimal code
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Depends on:
Reported: 2012-12-04 00:06 UTC by Mikko Markus Torni
Modified: 2016-07-25 05:43 UTC
CC: 5 users

See Also:
Target: x86_64-*-*, i?86-*-*
Known to work:
Known to fail:
Last reconfirmed: 2014-06-07 00:00:00

Source code demonstrating bad code generation (304 bytes, text/x-csrc)
2012-12-04 00:06 UTC, Mikko Markus Torni
gcc-HEAD compiler output (849 bytes, text/plain)
2012-12-04 00:08 UTC, Mikko Markus Torni
Source code demonstrating code generated (updated) (346 bytes, text/plain)
2012-12-04 01:03 UTC, Mikko Markus Torni
Patch from comment #4 (574 bytes, patch)
2013-04-01 13:45 UTC, Marc Glisse

Description Mikko Markus Torni 2012-12-04 00:06:00 UTC
Created attachment 28866 [details]
Source code demonstrating bad code generation

On x86-64, the extended shift instructions (shld/shrd) are not generated for some reason.
Combined with other problems this produces very bad code.

Test functions are included for signed and unsigned 16-, 32-, and 64-bit types, for both left and right shifts, and for both a constant shift count n and n passed as a function parameter.

Code of this form:
  unsigned int a, b; const int n = 2;
  void test32l (void) { b = (b << n) | (a >> (32 - n)); }

expected code:
  mov     a(%rip),%eax
  shld    $0x2,%eax,b(%rip)

produced code:
  mov    b(%rip), %edx   ; Size of register used here depends on gcc version
  mov    a(%rip), %eax   ; Size of register used here depends on gcc version
  sal    $2, %edx        ; Size of register used here depends on gcc version
  shr    $30, %eax       ; 32 - n = 30 for n = 2
  or     %edx, %eax
  mov    %eax, b(%rip)

Tested with:
COLLECT_GCC_OPTIONS='-v' '-c' '-save-temps' '-O2' '-Wall' '-W' '-o' 'gcc_shld_not_used' '-mtune=generic'

I tried gcc versions:
GNU C (Debian 4.7.2-4) version 4.7.2 (x86_64-linux-gnu)
GNU C (Debian 4.6.3-11) version 4.6.3 (x86_64-linux-gnu)
GNU C (Debian 4.5.3-9) version 4.5.3 (x86_64-linux-gnu)
GNU C (Debian 4.4.7-2) version 4.4.7 (x86_64-linux-gnu)
GNU C (GCC) version 4.8.0 20121203 (experimental) [trunk revision 194106] (x86_64-unknown-linux-gnu)

All produce the same code, modulo the register-size differences mentioned above. gcc HEAD changes the sal to leal (,%rcx,4),%eax.
Comment 1 Mikko Markus Torni 2012-12-04 00:08:21 UTC
Created attachment 28867 [details]
gcc-HEAD compiler output
Comment 2 H.J. Lu 2012-12-04 00:21:02 UTC
Clang generates:

	movl	a(%rip), %eax
	shldl	$2, %eax, b(%rip)

at -O2.
Comment 3 Mikko Markus Torni 2012-12-04 01:03:44 UTC
Created attachment 28868 [details]
Source code demonstrating code generated (updated)

Bug fixes in signed integer testcases.

Clang 3.0 seems to produce optimal looking code in the following test cases:
  test32rn testu64l testu32l testu16l testu64ln testu64rn testu32ln testu32rn

Clang 3.0 manages to use shld/shrd, but generates extra moves in the following test cases:
  test64r test32r test16r test64rn test32rn testu64rn testu32r testu16r

Clang 3.0 fails to use shld/shrd in the following test cases:
  test64l test32l test16l test64ln test32ln test16ln test16rn testu16ln testu16rn

Tested with clang:
 "/usr/bin/clang" -cc1 -triple x86_64-pc-linux-gnu -S -disable-free -disable-llvm-verifier -main-file-name gcc_shld_not_used.c -mrelocation-model static -mdisable-fp-elim -masm-verbose -mconstructor-aliases -munwind-tables -target-cpu x86-64 -target-linker-version 2.22 -momit-leaf-frame-pointer -v -coverage-file gcc_shld_not_used.s -resource-dir /usr/bin/../lib/clang/3.0 -O2 -Wall -W -ferror-limit 19 -fmessage-length 0 -fgnu-runtime -fobjc-runtime-has-arc -fobjc-runtime-has-weak -fobjc-fragile-abi -fdiagnostics-show-option -o gcc_shld_not_used.s -x cpp-output gcc_shld_not_used.i
clang -cc1 version 3.0 based upon llvm 3.0 hosted on x86_64-pc-linux-gnu
Comment 4 Marc Glisse 2012-12-04 10:15:27 UTC
It looks like the patterns all look for 32-i as the second shift amount. Writing an additional version that takes a constant (with an extra check that the sum of the two constants is 32; we then have to specify length_immediate manually) and replacing (match_dup 0) with an extra operand that has the constraint "0" seems to work (and breaks again if I swap the two sides of operator |).
Comment 5 Marc Glisse 2013-04-01 13:45:33 UTC
Created attachment 29764 [details]
Patch from comment #4

I apparently forgot to attach a patch when I posted comment #4. This is just to show the idea, it doesn't handle many cases, and the length_immediate value was randomly filled just to let it compile.
Comment 6 Marc Glisse 2014-06-07 09:12:06 UTC
Several things:

1) https://gcc.gnu.org/ml/gcc/2014-06/msg00063.html points out that our shrd patterns wrongly use ashiftrt instead of lshiftrt

2) We can convince the current compiler to generate shrd by constructing ((((unsigned long long)a)<<32) | b) >> n (take care not to use '+' in place of '|', because gcc is unable to realize that x+0 has no carry and thus leaves plenty of unneeded code in that case). For a constant shift, it manages to clean up all the useless code. This works at least for the 32-bit version with -m32 and for the 64-bit version (using unsigned __int128) with -m64; it does not work for the 32-bit version with -m64.

3) With the extra patterns attached here, combine can handle the case where the shift amount is constant. However, the non-constant pattern is too big for combine. The closest it gets to matching is (b<<n)|(a>>(l-n)), but replacing l with 32 is one more substitution than it is willing to try (it also ignores the REG_EQUAL note that would give (32-n) with one substitution fewer). Improving combine would be nice. I am not sure what intermediate pattern (not too artificial) we could introduce to help it. Maybe a>>(32-n), though I don't even know whether it is better to implement that as a subtraction and a shift, or as generating zero and then using sh[lr]d.