Bug 101185 - [12 Regression] pr96814.c failed after r12-1669 on non-avx512 platform
Summary: [12 Regression] pr96814.c failed after r12-1669 on non-avx512 platform
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 12.0
Importance: P1 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2021-06-24 01:32 UTC by Hongtao.liu
Modified: 2022-01-05 06:00 UTC
CC: 5 users

See Also:
Host: x86_64-pc-linux-gnu
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-06-24 00:00:00


Attachments

Description Hongtao.liu 2021-06-24 01:32:18 UTC
> > > Running target unix/-m32
> > > FAIL: gcc.target/i386/avx512bw-pr70329-1.c execution test
> > > FAIL: gcc.target/i386/pr96814.c execution test
> > >
> > > Debugging pr96814 failure:
> > >
> > >   0x0804921d <+66>:    mov    %edx,%ecx
> > >   0x0804921f <+68>:    cpuid
> > > => 0x08049221 <+70>:    kmovd  %edx,%k0
> > >   0x08049225 <+74>:    mov    %eax,-0x8(%ebp)
> > >   0x08049228 <+77>:    mov    %ebx,-0xc(%ebp)
> > >


Mask instructions are generated after cpuid; they raise SIGILL on non-AVX512 platforms.
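
For reference, a minimal sketch (not the actual avx512-check.h code) of the kind of runtime check involved: when the whole file is compiled with -mavx512bw -mavx512vl, the register allocator may move one of the cpuid outputs through a mask register (the kmovd above), which faults on machines without AVX-512.

#include <cpuid.h>

static int
have_avx512f (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* Leaf 7, subleaf 0: EBX bit 16 reports AVX512F.  */
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;
  return (ebx & bit_AVX512F) != 0;
}
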
Comment 1 Hongtao.liu 2021-06-24 01:44:50 UTC
Allocation order is just another kind of cost, which can be compensated for by increasing the cost of mask->integer and integer->mask moves.

With the patch below, pr96814 wouldn't generate any mask instructions except for

	kmovd	%eax, %k1
	vpcmpeqd	%ymm1, %ymm1, %ymm1
	vmovdqu8	%ymm1, %ymm0{%k1}{z}

which is what we want.


modified   gcc/config/i386/i386.md
@@ -1335,7 +1335,7 @@
 (define_insn "*cmp<mode>_ccz_1"
   [(set (reg FLAGS_REG)
 	(compare (match_operand:SWI1248_AVX512BWDQ_64 0
-			"nonimmediate_operand" "<r>,?m<r>,$k")
+			"nonimmediate_operand" "<r>,?m<r>,*k")
 		 (match_operand:SWI1248_AVX512BWDQ_64 1 "const0_operand")))]
   "TARGET_AVX512F && ix86_match_ccmode (insn, CCZmode)"
   "@
modified   gcc/config/i386/x86-tune-costs.h
@@ -2768,7 +2768,7 @@ struct processor_costs intel_cost = {
   {6, 6, 6, 6, 6},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
   4, 4,				/* SSE->integer and integer->SSE moves */
-  4, 4,				/* mask->integer and integer->mask moves */
+  6, 6,				/* mask->integer and integer->mask moves */
   {4, 4, 4},				/* cost of loading mask register
 					   in QImode, HImode, SImode.  */
   {6, 6, 6},				/* cost if storing mask register
@@ -2882,7 +2882,7 @@ struct processor_costs generic_cost = {
   {6, 6, 6, 10, 15},			/* cost of storing SSE registers
 					   in 32,64,128,256 and 512-bit */
   6, 6,				/* SSE->integer and integer->SSE moves */
-  6, 6,				/* mask->integer and integer->mask moves */
+  8, 8,				/* mask->integer and integer->mask moves */
   {6, 6, 6},				/* cost of loading mask register
 					   in QImode, HImode, SImode.  */
   {6, 6, 6},			/* cost if storing mask register


So would the solution of increasing the cost of mask->integer and integer->mask moves by one more unit (or maybe more), as compensation for changing the allocation order, be acceptable to you? Or do you insist on reverting the x86_order_regs_for_local_alloc part?
Comment 2 Hongtao.liu 2021-06-24 01:52:29 UTC
About the long-term part, I'm working slowly on that; it's in PR98478.
Comment 3 Hongtao.liu 2021-06-24 01:56:38 UTC
(In reply to Hongtao.liu from comment #1)
> Allocation order is just another kind of cost, which can be compensated for
> by increasing the cost of mask->integer and integer->mask moves.
> 
> With the patch below, pr96814 wouldn't generate any mask instructions
> except for
> 
> 	kmovd	%eax, %k1
> 	vpcmpeqd	%ymm1, %ymm1, %ymm1
> 	vmovdqu8	%ymm1, %ymm0{%k1}{z}
> 
> which is what we want.
> 
> 
> modified   gcc/config/i386/i386.md
> @@ -1335,7 +1335,7 @@
>  (define_insn "*cmp<mode>_ccz_1"
>    [(set (reg FLAGS_REG)
>  	(compare (match_operand:SWI1248_AVX512BWDQ_64 0
> -			"nonimmediate_operand" "<r>,?m<r>,$k")
> +			"nonimmediate_operand" "<r>,?m<r>,*k")
>  		 (match_operand:SWI1248_AVX512BWDQ_64 1 "const0_operand")))]
>    "TARGET_AVX512F && ix86_match_ccmode (insn, CCZmode)"
>    "@
> modified   gcc/config/i386/x86-tune-costs.h
> @@ -2768,7 +2768,7 @@ struct processor_costs intel_cost = {
>    {6, 6, 6, 6, 6},			/* cost of storing SSE registers
>  					   in 32,64,128,256 and 512-bit */
>    4, 4,				/* SSE->integer and integer->SSE moves */
> -  4, 4,				/* mask->integer and integer->mask moves */
> +  6, 6,				/* mask->integer and integer->mask moves */
I changed intel_cost just to validate that one more unit of cost is also enough for -mtune=intel to prevent the generation of mask instructions.
Comment 4 Uroš Bizjak 2021-06-24 06:01:49 UTC
(In reply to Hongtao.liu from comment #1)
> So would the solution of increasing the cost of mask->integer and
> integer->mask moves by one more unit (or maybe more), as compensation for
> changing the allocation order, be acceptable to you? Or do you insist on
> reverting the x86_order_regs_for_local_alloc part?

To avoid catastrophic failures (mask insns on non-avx512f targets will fault at execution), I'd propose reverting the x86_order_regs_for_local_alloc part. The net effect of the reversion is slightly suboptimal code for avx512f targets, where we have a path forward for improvement with PR98478.
Comment 5 Richard Biener 2021-06-24 06:44:46 UTC
Of course I wonder why the RA even chooses registers that are not available on the architecture.  I suppose there's no real way to turn regs on/off but the way is to make them never match any instruction pattern and thus give RA no incentive to use them?  That is, why was kmovd  %edx,%k0 deemed a valid recognizable instruction?
Comment 6 Uroš Bizjak 2021-06-24 07:08:09 UTC
(In reply to Richard Biener from comment #5)
> Of course I wonder why the RA even chooses registers that are not available
> on the architecture.  I suppose there's no real way to turn regs on/off but
> the way is to make them never match any instruction pattern and thus give RA
> no incentive to use them?  That is, why was kmovd  %edx,%k0 deemed a valid
> recognizable instruction?

The RA is not problematic here, because the registers are available due to -mavx512f. The problem is that, due to the register pressure around the cpuid insn on 32-bit targets, the RA chooses to move the value to a mask register. cpuid and the logic around it are used to determine whether avx512f is supported on the runtime target, and mask instructions acting as logic instructions around cpuid cause SIGILL in case avx512 is not supported during execution.

The solution to this is to allocate mask registers last and raise the cost of moves from GPRs to mask regs, so that (similar to MMX) mask regs are only used when absolutely necessary.

In the past, I introduced separate instruction patterns, instantiated exclusively from builtins, but they didn't support generic logic operators, e.g. "mask1 & mask2". To solve this, mask reg alternatives were added to the standard logic patterns, so there is now no clear separation between mask and GPR alternatives.
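
A minimal sketch of the "generic logic operators" point (not code from the testsuite): with the mask alternatives present in the standard patterns, the compiler is free to put a plain C expression like this either into a k-register (kandw) or into a GPR (and), depending on costs and allocation order.

#include <immintrin.h>

/* Compile with -mavx512f; __mmask16 is the AVX512F mask type.  */
__mmask16
combine (__mmask16 a, __mmask16 b)
{
  /* Generic "&" on masks, no kand builtin needed.  */
  return a & b;
}
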
Comment 7 Hongtao.liu 2021-06-24 07:14:11 UTC
The key point here is that the cpuid check function should not be compiled with -mavx512vl or -mavx512bw; relying on the cost model or allocation order is not all-inclusive.
Comment 8 Jakub Jelinek 2021-06-24 08:41:37 UTC
Yeah, ideally main, including the cpuid check, should be compiled with the least possible target, and if the check is successful, call a noipa function with the command-line chosen attributes.
We've always been playing with fire here...
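
A minimal sketch of that structure (not the actual testsuite code; the test body and the exact target flags are placeholders): main is built for the baseline target, and only the noipa function carries the AVX-512 attributes, so no AVX-512 instruction can be emitted before the runtime check succeeds.

__attribute__ ((noipa, target ("avx512bw,avx512vl")))
static void
run_avx512_test (void)
{
  /* The real AVX-512 test body would go here.  */
}

int
main (void)
{
  /* main itself uses only baseline instructions; the builtin performs the
     runtime CPU detection.  */
  if (__builtin_cpu_supports ("avx512bw")
      && __builtin_cpu_supports ("avx512vl"))
    run_avx512_test ();
  return 0;
}
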
Comment 9 Hongtao.liu 2021-06-24 08:52:39 UTC
(In reply to Jakub Jelinek from comment #8)
> Yeah, ideally main including the cpuid check should be compiled with the
> least possible target and if the check is successful call a noipa function
> with the command line chosen attributes.
> We've always been playing with fire here...

Yes. Does this solution sound good to you, Uros? If so, please ignore my patch [1]; I'll resend a new one.

[1]https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573591.html
Comment 10 Uroš Bizjak 2021-06-24 09:12:50 UTC
(In reply to Hongtao.liu from comment #9)
> (In reply to Jakub Jelinek from comment #8)
> > Yeah, ideally main including the cpuid check should be compiled with the
> > least possible target and if the check is successful call a noipa function
> > with the command line chosen attributes.
> > We've always been playing with fire here...
> 
> Yes, does this solution sound good to you, uros? If yes, please ignore my
> patch[1], I'll resend a new one.
> 
> [1]https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573591.html

I'm afraid the proposed solution above would shift the burden from the compiler to the user, and that burden is not justified by the relatively minor and solvable issue of supporting generic "a & b" on masks.

IMO, using VxBI modes offers a clear path to a solution; VxBI move patterns and logic insns should be trivial to implement, whereas more "esoteric" operations, like shifts, unpacks and arithmetic, need some more thought, since they don't operate on vectors.

Another issue with the current implementation is DImode logic operations on masks for 32-bit targets. ATM, they are simply disabled, e.g.:

(define_insn "*<code><mode>_1"
  [(set (match_operand:SWI248 0 "nonimmediate_operand" "=rm,r,?k")
	(any_or:SWI248
	 (match_operand:SWI248 1 "nonimmediate_operand" "%0,0,k")
	 (match_operand:SWI248 2 "<general_operand>" "r<i>,m,k")))
   (clobber (reg:CC FLAGS_REG))]

where

;; Single word integer modes without QImode.
(define_mode_iterator SWI248 [HI SI (DI "TARGET_64BIT")])

Please also note that mask operations do not clobber flags.
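
A sketch of the 64-bit mask case (assuming -mavx512bw, where __mmask64 is available): on 32-bit targets, DImode is excluded from the SWI248 iterator quoted above, so this AND cannot use the mask alternative and gets split into two GPR operations instead of a single kandq.

#include <immintrin.h>

__mmask64
combine64 (__mmask64 a, __mmask64 b)
{
  return a & b;
}
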
Comment 11 Hongtao.liu 2021-06-24 09:18:40 UTC
(In reply to Uroš Bizjak from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Jakub Jelinek from comment #8)
> > > Yeah, ideally main including the cpuid check should be compiled with the
> > > least possible target and if the check is successful call a noipa function
> > > with the command line chosen attributes.
> > > We've always been playing with fire here...
> > 
> > Yes, does this solution sound good to you, uros? If yes, please ignore my
> > patch[1], I'll resend a new one.
> > 
> > [1]https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573591.html
> 
> I'm afraid the above proposed solution would shift the burden from the
> compiler to the user, and the burden does not justify relatively minor and
> solvable issue to use generic "a & b" on masks.
Then we need a new type for __mmask8 (perhaps with a new psABI); otherwise, in the backend we always see an integer mode for a & b.
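
For context, a sketch of why the backend only ever sees an integer mode here: the mask types are plain integer typedefs (as in avx512fintrin.h), so nothing at the tree or RTL level marks the operands as masks.

typedef unsigned char __mmask8;		/* as defined in avx512fintrin.h */

__mmask8
and_masks (__mmask8 a, __mmask8 b)
{
  return a & b;			/* just a QImode AND to the compiler */
}
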
Comment 12 Richard Biener 2021-06-24 09:36:39 UTC
(In reply to Hongtao.liu from comment #7)
> The key point here is that the cpuid check function should not be compiled
> with -mavx512vl or -mavx512bw; relying on the cost model or allocation order
> is not all-inclusive.

Yeah, it looks bogus to check for AVX512 in a function that has the ISA
already enabled ...
Comment 13 H.J. Lu 2021-06-24 12:13:50 UTC
(In reply to Richard Biener from comment #12)
> (In reply to Hongtao.liu from comment #7)
> > The key point here is that the cpuid check function should not be compiled
> > with -mavx512vl or -mavx512bw; relying on the cost model or allocation
> > order is not all-inclusive.
> 
> Yeah, it looks bogus to check for AVX512 in a function that has the ISA
> already enabled ...

This worked by accident, not by design.  The cost of the mask register is
orthogonal to this CPUID issue.  The CPUID check should be compiled with
__attribute__ ((target("general-regs-only"))):

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573616.html
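
A minimal sketch of the idea (not H.J.'s actual patch; based on the check_osxsave helper from the testsuite's avx512-check.h): with target("general-regs-only"), only GPRs may be used in the function, so no mask or vector instruction can be generated around cpuid.

#include <cpuid.h>

__attribute__ ((target ("general-regs-only")))
int
check_osxsave (void)
{
  unsigned int eax, ebx, ecx, edx;

  __cpuid (1, eax, ebx, ecx, edx);
  return (ecx & bit_OSXSAVE) != 0;
}
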
Comment 14 GCC Commits 2021-06-25 01:17:39 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:980e278dbe5b50dc5a856ea627359c521f1cda53

commit r12-1800-g980e278dbe5b50dc5a856ea627359c521f1cda53
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Jun 24 16:14:13 2021 +0800

    Revert x86_order_regs_for_local_alloc changes in r12-1669.
    
    Still put general regs as first alloca order.
    
    gcc/ChangeLog:
    
            PR target/101185
            * config/i386/i386.c (x86_order_regs_for_local_alloc):
            Revert r12-1669.
    
    gcc/testsuite/ChangeLog
    
            PR target/101185
            * gcc.target/i386/bitwise_mask_op-3.c: Add xfail to
            temporarily avoid regression, eventually xfail should be
            removed.
Comment 15 GCC Commits 2021-07-14 12:10:42 UTC
The releases/gcc-11 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:5bde7650caa84aa1dee979122834619a8cc748d4

commit r11-8738-g5bde7650caa84aa1dee979122834619a8cc748d4
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Jun 24 16:14:13 2021 +0800

    Revert x86_order_regs_for_local_alloc changes in r12-1669.
    
    Still put general regs as first alloca order.
    
    gcc/ChangeLog:
    
            PR target/101185
            * config/i386/i386.c (x86_order_regs_for_local_alloc):
            Revert r12-1669.
    
    gcc/testsuite/ChangeLog
    
            PR target/101185
            * gcc.target/i386/bitwise_mask_op-3.c: Add xfail to
            temporarily avoid regression, eventually xfail should be
            removed.
Comment 16 Andrew Pinski 2021-09-20 04:33:18 UTC
> > > FAIL: gcc.target/i386/avx512bw-pr70329-1.c execution test
> > > FAIL: gcc.target/i386/pr96814.c execution test

That seems like a testsuite issue.
main and check_osxsave don't turn off AVX512F.

This should fix the issue in the testsuite:
diff --git a/gcc/testsuite/gcc.target/i386/avx512-check.h b/gcc/testsuite/gcc.target/i386/avx512-check.h
index 0ad9064f637..212dcea7600 100644
--- a/gcc/testsuite/gcc.target/i386/avx512-check.h
+++ b/gcc/testsuite/gcc.target/i386/avx512-check.h
@@ -25,6 +25,8 @@ do_test (void)
 }
 #endif

+static int
+check_osxsave (void) __attribute__((target("no-avx")));
 static int
 check_osxsave (void)
 {
@@ -34,6 +36,8 @@ check_osxsave (void)
   return (ecx & bit_OSXSAVE) != 0;
 }

+int
+main () __attribute__((target("no-avx")));
 int
 main ()
 {
Comment 17 Andrew Pinski 2021-09-20 04:43:18 UTC
The code is not wrong; the testcase is just written in such a way that it thinks it can get away with not using AVX code, which is not always going to be true.  I have a patch which fixes avx-check.h, avx2-check.h, and avx512-check.h.  Obviously it just accidentally works, really.
Comment 18 Andrew Pinski 2021-09-20 04:47:35 UTC
Never mind; this requires more thought on the side of the x86 headers.
Comment 19 rguenther@suse.de 2021-09-20 06:19:05 UTC
On Mon, 20 Sep 2021, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101185
> 
> Andrew Pinski <pinskia at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            Keywords|wrong-code                  |
>              Status|NEW                         |ASSIGNED
> 
> --- Comment #17 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> The code is not wrong; the testcase is just written in such a way that it
> thinks it can get away with not using AVX code, which is not always going to
> be true.  I have a patch which fixes avx-check.h, avx2-check.h, and
> avx512-check.h.  Obviously it just accidentally works, really.

IIRC HJ posted quite a few variants of patches for the underlying issue
(but eventually none were applied).
Comment 20 Hongtao.liu 2022-01-05 06:00:16 UTC
Fixed in GCC 12.