> > > Running target unix/-m32
> > > FAIL: gcc.target/i386/avx512bw-pr70329-1.c execution test
> > > FAIL: gcc.target/i386/pr96814.c execution test
> > >
> > > Debugging pr96814 failure:
> > >
> > >    0x0804921d <+66>:	mov    %edx,%ecx
> > >    0x0804921f <+68>:	cpuid
> > > => 0x08049221 <+70>:	kmovd  %edx,%k0
> > >    0x08049225 <+74>:	mov    %eax,-0x8(%ebp)
> > >    0x08049228 <+77>:	mov    %ebx,-0xc(%ebp)
> > >
> > > Mask instructions are generated after cpuid, which raise SIGILL on non-avx512 platforms.
Alloc order is just another kind of cost, which can be compensated for by increasing the cost of mask->integer and integer->mask moves.

With the patch below, pr96814 wouldn't generate any mask instructions except for

	kmovd	%eax, %k1
	vpcmpeqd	%ymm1, %ymm1, %ymm1
	vmovdqu8	%ymm1, %ymm0{%k1}{z}

which is what we want.

modified   gcc/config/i386/i386.md
@@ -1335,7 +1335,7 @@ (define_insn "*cmp<mode>_ccz_1"
   [(set (reg FLAGS_REG)
 	(compare (match_operand:SWI1248_AVX512BWDQ_64 0
-			"nonimmediate_operand" "<r>,?m<r>,$k")
+			"nonimmediate_operand" "<r>,?m<r>,*k")
 		 (match_operand:SWI1248_AVX512BWDQ_64 1 "const0_operand")))]
   "TARGET_AVX512F && ix86_match_ccmode (insn, CCZmode)"
   "@
modified   gcc/config/i386/x86-tune-costs.h
@@ -2768,7 +2768,7 @@ struct processor_costs intel_cost = {
   {6, 6, 6, 6, 6},			/* cost of storing SSE registers
					   in 32,64,128,256 and 512-bit */
   4, 4,				/* SSE->integer and integer->SSE moves */
-  4, 4,				/* mask->integer and integer->mask moves */
+  6, 6,				/* mask->integer and integer->mask moves */
   {4, 4, 4},				/* cost of loading mask register
					   in QImode, HImode, SImode.  */
   {6, 6, 6},				/* cost if storing mask register
@@ -2882,7 +2882,7 @@ struct processor_costs generic_cost = {
   {6, 6, 6, 10, 15},			/* cost of storing SSE registers
					   in 32,64,128,256 and 512-bit */
   6, 6,				/* SSE->integer and integer->SSE moves */
-  6, 6,				/* mask->integer and integer->mask moves */
+  8, 8,				/* mask->integer and integer->mask moves */
   {6, 6, 6},				/* cost of loading mask register
					   in QImode, HImode, SImode.  */
   {6, 6, 6},				/* cost if storing mask register

So would the solution of increasing the cost of mask->integer and integer->mask moves by one more unit (or maybe more), as compensation for changing the alloc order, be acceptable to you? Or do you insist on reverting the x86_order_regs_for_local_alloc part?
About the long-term part, I'm working slowly on that; it's tracked in PR98478.
(In reply to Hongtao.liu from comment #1)
> Alloc order is just another kind of cost, which can be compensated for by
> increasing the cost of mask->integer and integer->mask moves.
>
> With the patch below, pr96814 wouldn't generate any mask instructions
> except for
>
> 	kmovd	%eax, %k1
> 	vpcmpeqd	%ymm1, %ymm1, %ymm1
> 	vmovdqu8	%ymm1, %ymm0{%k1}{z}
>
> which is what we want.
>
> modified   gcc/config/i386/i386.md
> @@ -1335,7 +1335,7 @@ (define_insn "*cmp<mode>_ccz_1"
>   [(set (reg FLAGS_REG)
> 	(compare (match_operand:SWI1248_AVX512BWDQ_64 0
> -			"nonimmediate_operand" "<r>,?m<r>,$k")
> +			"nonimmediate_operand" "<r>,?m<r>,*k")
> 		 (match_operand:SWI1248_AVX512BWDQ_64 1 "const0_operand")))]
>   "TARGET_AVX512F && ix86_match_ccmode (insn, CCZmode)"
>   "@
> modified   gcc/config/i386/x86-tune-costs.h
> @@ -2768,7 +2768,7 @@ struct processor_costs intel_cost = {
>   {6, 6, 6, 6, 6},			/* cost of storing SSE registers
> 					   in 32,64,128,256 and 512-bit */
>   4, 4,				/* SSE->integer and integer->SSE moves */
> -  4, 4,				/* mask->integer and integer->mask moves */
> +  6, 6,				/* mask->integer and integer->mask moves */

I changed intel_cost just to validate that one more unit of cost is also enough for -mtune=intel to prevent generation of mask instructions.
(In reply to Hongtao.liu from comment #1)
> So would the solution of increasing the cost of mask->integer and
> integer->mask moves by one more unit (or maybe more), as compensation for
> changing the alloc order, be acceptable to you? Or do you insist on
> reverting the x86_order_regs_for_local_alloc part?

To avoid catastrophic failures (mask insns on non-avx512f targets will fault the execution), I'd propose to revert the x86_order_regs_for_local_alloc part. The net effect of the reversion is slightly suboptimal code for avx512f targets, where we have a path forward for improvement with PR98478.
Of course I wonder why the RA even chooses registers that are not available on the architecture. I suppose there's no real way to turn regs on/off, but the way is to make them never match any instruction pattern and thus give the RA no incentive to use them? That is, why was kmovd %edx,%k0 deemed a valid recognizable instruction?
(In reply to Richard Biener from comment #5)
> Of course I wonder why the RA even chooses registers that are not available
> on the architecture. I suppose there's no real way to turn regs on/off, but
> the way is to make them never match any instruction pattern and thus give
> the RA no incentive to use them? That is, why was kmovd %edx,%k0 deemed a
> valid recognizable instruction?

The RA is not problematic here, because the registers are available due to -mavx512f. The problem is that, due to the register pressure around the cpuid insn on 32-bit targets, the RA chooses to move the value to a mask register.

cpuid and the logic around it are used to determine whether avx512f is supported on the runtime target, and mask instructions acting as logic instructions around cpuid cause SIGILL in case avx512 is not supported during execution. The solution to this is to allocate mask registers last and raise the cost of moves from GPRs to mask regs, so that (similar to MMX) mask regs are only used when absolutely necessary.

In the past, I introduced separate instruction patterns, instantiated exclusively from builtins, but they didn't support generic logic operators, e.g. "mask1 & mask2". To solve this, mask register alternatives were added to the standard logic patterns, so there is now no clear cut between mask and GPR alternatives.
The key point here is that the cpuid check function should not be compiled with -mavx512vl or -mavx512bw; relying on the cost model or alloc order is not all-inclusive.
Yeah, ideally main, including the cpuid check, should be compiled with the least possible target and, if the check is successful, call a noipa function with the command-line chosen attributes. We've always been playing with fire here...
(In reply to Jakub Jelinek from comment #8)
> Yeah, ideally main, including the cpuid check, should be compiled with the
> least possible target and, if the check is successful, call a noipa
> function with the command-line chosen attributes.
> We've always been playing with fire here...

Yes, does this solution sound good to you, Uros? If yes, please ignore my patch [1]; I'll resend a new one.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573591.html
(In reply to Hongtao.liu from comment #9)
> (In reply to Jakub Jelinek from comment #8)
> > Yeah, ideally main, including the cpuid check, should be compiled with
> > the least possible target and, if the check is successful, call a noipa
> > function with the command-line chosen attributes.
> > We've always been playing with fire here...
>
> Yes, does this solution sound good to you, Uros? If yes, please ignore my
> patch [1]; I'll resend a new one.
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573591.html

I'm afraid the proposed solution would shift the burden from the compiler to the user, and that burden does not justify the relatively minor and solvable issue of using generic "a & b" on masks.

IMO, using VxBI modes offers a clear path to a solution; VxBI move patterns and logic insns should be trivial to implement, whereas more "esoteric" operations, like shifts, unpacks and arithmetic, need some more thought, since they don't operate on vectors.

Another issue with the current implementation is with DImode logic operations on masks for 32-bit targets. ATM, they are simply disabled, e.g.:

(define_insn "*<code><mode>_1"
  [(set (match_operand:SWI248 0 "nonimmediate_operand" "=rm,r,?k")
	(any_or:SWI248
	 (match_operand:SWI248 1 "nonimmediate_operand" "%0,0,k")
	 (match_operand:SWI248 2 "<general_operand>" "r<i>,m,k")))
   (clobber (reg:CC FLAGS_REG))]

where

;; Single word integer modes without QImode.
(define_mode_iterator SWI248 [HI SI (DI "TARGET_64BIT")])

Please also note that mask operations do not clobber flags.
(In reply to Uroš Bizjak from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Jakub Jelinek from comment #8)
> > > Yeah, ideally main, including the cpuid check, should be compiled with
> > > the least possible target and, if the check is successful, call a noipa
> > > function with the command-line chosen attributes.
> > > We've always been playing with fire here...
> >
> > Yes, does this solution sound good to you, Uros? If yes, please ignore my
> > patch [1]; I'll resend a new one.
> >
> > [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573591.html
>
> I'm afraid the proposed solution would shift the burden from the compiler
> to the user, and that burden does not justify the relatively minor and
> solvable issue of using generic "a & b" on masks.

Then we need a new type for __mmask8 (perhaps with a new psABI); or else in the backend we always see an integer mode for a & b.
(In reply to Hongtao.liu from comment #7)
> The key point here is that the cpuid check function should not be compiled
> with -mavx512vl or -mavx512bw; relying on the cost model or alloc order is
> not all-inclusive.

Yeah, it looks bogus to check for AVX512 in a function that has the ISA already enabled ...
(In reply to Richard Biener from comment #12)
> (In reply to Hongtao.liu from comment #7)
> > The key point here is that the cpuid check function should not be
> > compiled with -mavx512vl or -mavx512bw; relying on the cost model or
> > alloc order is not all-inclusive.
>
> Yeah, it looks bogus to check for AVX512 in a function that has the ISA
> already enabled ...

This worked by accident, not by design. The cost of mask registers is orthogonal to this CPUID issue. The CPUID check should be compiled with __attribute__ ((target("general-regs-only"))):

https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573616.html
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:980e278dbe5b50dc5a856ea627359c521f1cda53

commit r12-1800-g980e278dbe5b50dc5a856ea627359c521f1cda53
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Jun 24 16:14:13 2021 +0800

    Revert x86_order_regs_for_local_alloc changes in r12-1669.

    Still put general regs as first alloca order.

    gcc/ChangeLog:

            PR target/101185
            * config/i386/i386.c (x86_order_regs_for_local_alloc):
            Revert r12-1669.

    gcc/testsuite/ChangeLog

            PR target/101185
            * gcc.target/i386/bitwise_mask_op-3.c: Add xfail to
            temporarily avoid regression, eventually xfail should be
            removed.
The releases/gcc-11 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:5bde7650caa84aa1dee979122834619a8cc748d4

commit r11-8738-g5bde7650caa84aa1dee979122834619a8cc748d4
Author: liuhongt <hongtao.liu@intel.com>
Date:   Thu Jun 24 16:14:13 2021 +0800

    Revert x86_order_regs_for_local_alloc changes in r12-1669.

    Still put general regs as first alloca order.

    gcc/ChangeLog:

            PR target/101185
            * config/i386/i386.c (x86_order_regs_for_local_alloc):
            Revert r12-1669.

    gcc/testsuite/ChangeLog

            PR target/101185
            * gcc.target/i386/bitwise_mask_op-3.c: Add xfail to
            temporarily avoid regression, eventually xfail should be
            removed.
> > > FAIL: gcc.target/i386/avx512bw-pr70329-1.c execution test
> > > FAIL: gcc.target/i386/pr96814.c execution test

That seems like a testsuite issue: main and check_osxsave don't turn off AVX512F. This should fix the issue in the testsuite:

diff --git a/gcc/testsuite/gcc.target/i386/avx512-check.h b/gcc/testsuite/gcc.target/i386/avx512-check.h
index 0ad9064f637..212dcea7600 100644
--- a/gcc/testsuite/gcc.target/i386/avx512-check.h
+++ b/gcc/testsuite/gcc.target/i386/avx512-check.h
@@ -25,6 +25,8 @@ do_test (void)
 }
 #endif
 
+static int
+check_osxsave (void) __attribute__((target("no-avx")));
 static int
 check_osxsave (void)
 {
@@ -34,6 +36,8 @@ check_osxsave (void)
   return (ecx & bit_OSXSAVE) != 0;
 }
 
+int
+main () __attribute__((target("no-avx")));
 int
 main ()
 {
The code is not wrong; the testcase is just written in such a way that it thinks it can get away with not using avx code, which is not always going to be true. I have a patch which fixes avx-check.h, avx2-check.h, and avx512-check.h. Obviously it just accidentally works, really.
Never mind; this requires more thought on the side of the x86 headers.
On Mon, 20 Sep 2021, pinskia at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101185
>
> Andrew Pinski <pinskia at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            Keywords|wrong-code                  |
>              Status|NEW                         |ASSIGNED
>
> --- Comment #17 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> The code is not wrong; the testcase is just written in such a way that it
> thinks it can get away with not using avx code, which is not always going
> to be true. I have a patch which fixes avx-check.h, avx2-check.h, and
> avx512-check.h. Obviously it just accidentally works, really.

IIRC HJ posted quite a few variants of patches for the underlying issue (but eventually none were applied).
Fixed in GCC 12.