Bug 65871 - bzhi builtin/intrinsic wrongly assumes bzhi instruction doesn't set the ZF flag
Summary: bzhi builtin/intrinsic wrongly assumes bzhi instruction doesn't set the ZF flag
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 5.1.0
: P3 normal
Target Milestone: 6.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2015-04-23 23:34 UTC by James Almer
Modified: 2018-01-29 22:55 UTC (History)
1 user (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-04-28 00:00:00


Attachments
Prototype patch for bextr and bzhi (655 bytes, patch)
2015-04-28 12:58 UTC, Uroš Bizjak
Details | Diff
Preprocessed code where a test instruction is still generated (52.25 KB, application/x-gzip)
2015-06-18 23:20 UTC, James Almer
Details

Note You need to log in before you can comment on or make changes to this bug.
Description James Almer 2015-04-23 23:34:32 UTC
unsigned foo(void);
int main(void)
{
    if (__builtin_ia32_bzhi_si(foo(), foo()))
        return 1;
    return 0;
}

Compiled with -mbmi2 -O3

0000000000000000 <main>:
   0:   53                      push   rbx
   1:   e8 00 00 00 00          call   6 <main+0x6>
   6:   89 c3                   mov    ebx,eax
   8:   e8 00 00 00 00          call   d <main+0xd>
   d:   c4 e2 60 f5 c0          bzhi   eax,eax,ebx
  12:   85 c0                   test   eax,eax
  14:   0f 95 c0                setne  al
  17:   0f b6 c0                movzx  eax,al
  1a:   5b                      pop    rbx
  1b:   c3                      ret

It generates a redundant test instruction. According to http://www.felixcloutier.com/x86/BZHI.html bzhi already sets the ZF flag on its own.
Same happens when using inline assembly instead of the builtin to generate the bzhi instruction. In all cases reproducible with GCC 4.9.2 and GCC 5.1.0. Didn't test the 4.8 branch or trunk.

This aside, it would be nice if gcc could generate a bzhi instruction on its own if it detects "X & ((1 << Y) - 1)" where Y is not a constant, same as it does for several other bmi and tbm instructions, instead of needing to use the builtin (Which is only available when targeting bmi2).
I can open a new bug report for that if needed.
Comment 1 James Almer 2015-04-25 05:46:07 UTC
The same apparently happens with bextr, blsi, blsr, and most (if not all) of AMD's tbm instructions. They set the ZF flag but gcc still generates a test instruction.

http://www.felixcloutier.com/x86/BEXTR.html
http://www.felixcloutier.com/x86/BLSI.html
http://www.felixcloutier.com/x86/BLSR.html
http://support.amd.com/TechDocs/24594.pdf
Comment 2 Uroš Bizjak 2015-04-28 12:58:00 UTC
Created attachment 35411 [details]
Prototype patch for bextr and bzhi

Prototype patch that removes flag checks for bextr and bzhi insns.
Comment 3 Uroš Bizjak 2015-04-28 13:06:42 UTC
(In reply to James Almer from comment #1)
> The same apparently happens with bextr, blsi, blsr, and most (if not all) of
> AMD's tbm instructions. They set the ZF flag but gcc still generates a test
> instruction.

Please see the patch, attached in Comment #2.

While I can see the use (and benefit) to model the patterns that also set CC register internally for BEXTR and BZHI instructions, I don't think other listed instructions have clear usage scenarios to warrant additional patterns.

Can you perhaps show the benefit to have more insns modelled this way?
Comment 4 James Almer 2015-04-28 16:29:22 UTC
(In reply to Uroš Bizjak from comment #3)
> Please see the patch, attached in Comment #2.
> 
> While I can see the use (and benefit) to model the patterns that also set CC
> register internally for BEXTR and BZHI instructions, I don't think other
> listed instructions have clear usage scenarios to warrant additional
> patterns.
> 
> Can you perhaps show the benefit to have more insns modelled this way?

Not really, i simply assumed that taking into consideration what flags these instructions affected in every case was the intended behavior, so figured it was worth pointing them out.
I'm mainly interested in bzhi (and by extension bextr).
Comment 5 uros 2015-04-29 18:53:51 UTC
Author: uros
Date: Wed Apr 29 18:53:19 2015
New Revision: 222588

URL: https://gcc.gnu.org/viewcvs?rev=222588&root=gcc&view=rev
Log:
	PR target/65871
	* config/i386/i386.md (*bmi_bextr_<mode>_cczonly): New pattern.
	(*bmi2_bzhi_<mode>3_1_cczonly): Ditto.

testsuite/ChangeLog:

	PR target/65871
	* gcc.target/i386/pr65871-1.c: New test
	* gcc.target/i386/pr65871-2.c: Ditto.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr65871-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr65871-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
Comment 6 uros 2015-04-29 20:58:57 UTC
Author: uros
Date: Wed Apr 29 20:58:25 2015
New Revision: 222592

URL: https://gcc.gnu.org/viewcvs?rev=222592&root=gcc&view=rev
Log:
	PR target/65871
        * config/i386/i386.md (*bmi_bextr_<mode>_cczonly): New pattern.
        (*bmi2_bzhi_<mode>3_1_cczonly): Ditto.
        (setcc+movzbl peephole2): Check also clobbered reg.
        (setcc+andl peephole2): Ditto.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
Comment 7 James Almer 2015-05-04 03:03:26 UTC
Thanks for the above fix.

I forgot to test BMI1's andn. That one should have an insn modeled this way as well.

int foo (unsigned int x, unsigned int y)
{
    if (~x & y)
        return 1;
    return 0;
}


gcc -O2 -mbmi -c andn.c

0000000000000000 <foo>:
   0:   c4 e2 40 f2 fe          andn   %esi,%edi,%edi
   5:   31 c0                   xor    %eax,%eax
   7:   85 ff                   test   %edi,%edi
   9:   0f 95 c0                setne  %al
   c:   c3                      retq

http://www.felixcloutier.com/x86/ANDN.html
Comment 8 uros 2015-05-05 04:36:51 UTC
Author: uros
Date: Tue May  5 04:36:19 2015
New Revision: 222795

URL: https://gcc.gnu.org/viewcvs?rev=222795&root=gcc&view=rev
Log:
	PR target/65871
	* config/i386/i386.md (*bmi_andn_<mode>_ccno): New pattern.

testsuite/ChangeLog:

	PR target/65871
	* gcc.target/i386/pr65871-3.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/i386/pr65871-3.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
Comment 9 James Almer 2015-06-18 23:20:07 UTC
Created attachment 35804 [details]
Preprocessed code where a test instruction is still generated

Please look at the attachment.

[jamrial@archVM ~]$ gcc -std=c99 -march=haswell -O3 -c hevc_ps.i

[jamrial@archVM ~]$ objdump --disassemble hevc_ps.o | grep -B2 -A2 bzhi
    5854:       c4 62 22 f7 d6          sarx   %r11d,%esi,%r10d
    5859:       c4 62 22 f7 da          sarx   %r11d,%edx,%r11d
    585e:       c4 e2 70 f5 f6          bzhi   %ecx,%esi,%esi
    5863:       45 89 9e 68 33 00 00    mov    %r11d,0x3368(%r14)
    586a:       41 89 c3                mov    %eax,%r11d
--
    589d:       85 f6                   test   %esi,%esi
    589f:       75 0d                   jne    58ae <ff_hevc_decode_nal_sps+0x1a6e>
    58a1:       c4 e2 70 f5 d2          bzhi   %ecx,%edx,%edx
    58a6:       85 d2                   test   %edx,%edx
    58a8:       0f 84 5a 03 00 00       je     5c08 <ff_hevc_decode_nal_sps+0x1dc8>

[jamrial@archVM ~]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/local/lib/gcc/x86_64-unknown-linux-gnu/6.0.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /home/jamrial/gcc-svn/configure --prefix=/usr/local --libdir=/usr/local/lib --libexecdir=/usr/local/lib --mandir=/usr/local/share/man --infodir=/usr/local/share/info --enable-languages=c,c++,lto --enable-shared --disable-bootstrap --enable-threads=posix --enable-libmpx --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror
Thread model: posix
gcc version 6.0.0 20150618 (experimental) (GCC)


Not sure why the new model is not working here.
Comment 10 Uroš Bizjak 2015-06-21 09:57:50 UTC
(In reply to James Almer from comment #9)
> Created attachment 35804 [details]
> Preprocessed code where a test instruction is still generated
> 
> Please look at the attachment.
> 
> [jamrial@archVM ~]$ gcc -std=c99 -march=haswell -O3 -c hevc_ps.i
> 
> [jamrial@archVM ~]$ objdump --disassemble hevc_ps.o | grep -B2 -A2 bzhi
>     5854:       c4 62 22 f7 d6          sarx   %r11d,%esi,%r10d
>     5859:       c4 62 22 f7 da          sarx   %r11d,%edx,%r11d
>     585e:       c4 e2 70 f5 f6          bzhi   %ecx,%esi,%esi
>     5863:       45 89 9e 68 33 00 00    mov    %r11d,0x3368(%r14)
>     586a:       41 89 c3                mov    %eax,%r11d
> --
>     589d:       85 f6                   test   %esi,%esi
>     589f:       75 0d                   jne    58ae
> <ff_hevc_decode_nal_sps+0x1a6e>
>     58a1:       c4 e2 70 f5 d2          bzhi   %ecx,%edx,%edx
>     58a6:       85 d2                   test   %edx,%edx
>     58a8:       0f 84 5a 03 00 00       je     5c08
> <ff_hevc_decode_nal_sps+0x1dc8>

[...]

> Not sure why the new model is not working here.

Because of the cost model. Combine pass says:

Trying 2631 -> 2633:
Successfully matched this instruction:
(set (reg:CCZ 17 flags)
    (compare:CCZ (zero_extract:SI (reg:SI 2360 [ D.14992 ])
            (umin:SI (zero_extend:SI (subreg:QI (reg:SI 301 [ D.14999 ]) 0))
                (const_int 32 [0x20]))
            (const_int 0 [0]))
        (const_int 0 [0])))
rejecting combination of insns 2631 and 2633
original costs 12 + 4 = 16
replacement cost 18
Comment 11 Uroš Bizjak 2015-06-21 11:09:26 UTC
(In reply to Uroš Bizjak from comment #10)

> Because of the cost model. Combine pass says:

Testing the patch:

--cut here--
Index: i386.c
===================================================================
--- i386.c      (revision 224630)
+++ i386.c      (working copy)
@@ -42533,6 +42533,12 @@ ix86_rtx_costs (rtx x, int code_i, int outer_code_
                    + rtx_cost (const1_rtx, outer_code, opno, speed));
          return true;
        }
+
+      /* The embedded comparison operand is completely free.  */
+      if (!general_operand (XEXP (x, 0), GET_MODE (XEXP (x, 0)))
+         && XEXP (x, 1) == const0_rtx)
+       *total = 0;
+
       return false;
 
     case FLOAT_EXTEND:
--cut here--
Comment 12 uros 2015-06-22 13:55:31 UTC
Author: uros
Date: Mon Jun 22 13:54:58 2015
New Revision: 224729

URL: https://gcc.gnu.org/viewcvs?rev=224729&root=gcc&view=rev
Log:
	PR target/65871
	* config/i386/i386.c (ix86_rtx_costs) <case COMPARE>: Ignore the
	cost of embedded comparison.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
Comment 13 Uroš Bizjak 2015-06-22 14:54:57 UTC
Fixed for 6.0.
Comment 14 Yann Collet 2018-01-29 22:55:17 UTC
Is gcc -mbmi2 currently able to generate automatically a bzhi instruction when it detects a "X & ((1 << Y) - 1)" sequence, as suggested by James Almer ?
If yes, are there some available examples ?