This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
On 8/12/19 6:27 AM, Richard Biener wrote:
> On Fri, 9 Aug 2019, Uros Bizjak wrote:
>
>> On Fri, Aug 9, 2019 at 3:00 PM Richard Biener <rguenther@suse.de> wrote:
>>
>>>>>>>>>>>> (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
>>>>>>>>>>>>
>>>>>>>>>>>> and then we need to split DImode for 32bits, too.
>>>>>>>>>>>
>>>>>>>>>>> For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
>>>>>>>>>>> condition, I'll provide _doubleword splitter later.
>>>>>>>>>>
>>>>>>>>>> Shouldn't that be TARGET_AVX512VL instead? Or does the insn use %g0 etc.
>>>>>>>>>> to force use of %zmmN?
>>>>>>>>>
>>>>>>>>> It generates V4SI mode, so - yes, AVX512VL.
>>>>>>>>
>>>>>>>> case SMAX:
>>>>>>>> case SMIN:
>>>>>>>> case UMAX:
>>>>>>>> case UMIN:
>>>>>>>> if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
>>>>>>>> || (mode == SImode && !TARGET_SSE4_1))
>>>>>>>> return false;
>>>>>>>>
>>>>>>>> so there's no way to use AVX512VL for 32bit?
>>>>>>>
>>>>>>> There is a way, but on 32bit targets, we need to split DImode
>>>>>>> operation to a sequence of SImode operations for unconverted pattern.
>>>>>>> This is of course doable, but somehow more complex than simply
>>>>>>> emitting a DImode compare + DImode cmove, which is what current
>>>>>>> splitter does. So, a follow-up task.
>>>>>>
>>>>>> Please find attached the complete .md part that enables SImode for
>>>>>> TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit
>>>>>> targets. The patterns also allows for memory operand 2, so STV has
>>>>>> chance to create the vector pattern with implicit load. In case STV
>>>>>> fails, the memory operand 2 is loaded to the register first; operand
>>>>>> 2 is used in compare and cmove instruction, so pre-loading of the
>>>>>> operand should be beneficial.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>> Also note, that splitting should happen rarely. Due to the cost
>>>>>> function, STV should effectively always convert minmax to a vector
>>>>>> insn.
>>>>>
>>>>> I've analyzed the 464.h264ref slowdown on Haswell and it is due to
>>>>> this kind of "simple" conversion:
>>>>>
>>>>> 5.50 │1d0: test %esi,%es
>>>>> 0.07 │ mov $0x0,%ex
>>>>> │ cmovs %eax,%es
>>>>> 5.84 │ imul %r8d,%es
>>>>>
>>>>> to
>>>>>
>>>>> 0.65 │1e0: vpxor %xmm0,%xmm0,%xmm0
>>>>> 0.32 │ vpmaxs -0x10(%rsp),%xmm0,%xmm0
>>>>> 40.45 │ vmovd %xmm0,%eax
>>>>> 2.45 │ imul %r8d,%eax
>>>>>
>>>>> which looks like a RA artifact in the end. We spill %esi only
>>>>> with -mstv here as STV introduces a (subreg:V4SI ...) use
>>>>> of a pseudo ultimatively set from di. STV creates an additional
>>>>> pseudo for this (copy-in) but it places that copy next to the
>>>>> original def rather than next to the start of the chain it
>>>>> converts which is probably the issue why we spill. And this
>>>>> is because it inserts those at each definition of the pseudo
>>>>> rather than just at the reaching definition(s) or at the
>>>>> uses of the pseudo in the chain (that because there may be
>>>>> defs of that pseudo in the chain itself). Note that STV emits
>>>>> such "conversion" copies as simple reg-reg moves:
>>>>>
>>>>> (insn 1094 3 4 2 (set (reg:SI 777)
>>>>> (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
>>>>> (nil))
>>>>>
>>>>> but those do not prevail very long (this one gets removed by CSE2).
>>>>> So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
>>>>> and computes
>>>>>
>>>>> r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
>>>>> a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618
>>>>>
>>>>> so I wonder if STV shouldn't instead emit gpr->xmm moves
>>>>> here (but I guess nothing again prevents RTL optimizers from
>>>>> combining that with the single-use in the max instruction...).
>>>>>
>>>>> So this boils down to STV splitting live-ranges but other
>>>>> passes undoing that and then RA not considering splitting
>>>>> live-ranges here, arriving at unoptimal allocation.
>>>>>
>>>>> A testcase showing this issue is (simplified from 464.h264ref
>>>>> UMVLine16Y_11):
>>>>>
>>>>> unsigned short
>>>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
>>>>> {
>>>>> if (y != width)
>>>>> {
>>>>> y = y < 0 ? 0 : y;
>>>>> return Pic[y * width];
>>>>> }
>>>>> return Pic[y];
>>>>> }
>>>>>
>>>>> where the condition and the Pic[y] load mimics the other use of y.
>>>>> Different, even worse spilling is generated by
>>>>>
>>>>> unsigned short
>>>>> UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
>>>>> {
>>>>> y = y < 0 ? 0 : y;
>>>>> return Pic[y * width] + y;
>>>>> }
>>>>>
>>>>> I guess this all shows that STVs "trick" of simply wrapping
>>>>> integer mode pseudos in (subreg:vector-mode ...) is bad?
>>>>>
>>>>> I've added a (failing) testcase to reflect the above.
>>>>
>>>> Experimenting a bit with just for the conversion insns using
>>>> V4SImode pseudos we end up preserving those moves (but I
>>>> do have to use a lowpart set, using reg:V4SI = subreg:V4SI Simode-reg
>>>> ends up using movv4si_internal which only leaves us with
>>>> memory for the SImode operand) _plus_ moving the move next
>>>> to the actual use has an effect. Not necssarily a good one
>>>> though:
>>>>
>>>> vpxor %xmm0, %xmm0, %xmm0
>>>> vmovaps %xmm0, -16(%rsp)
>>>> movl %esi, -16(%rsp)
>>>> vpmaxsd -16(%rsp), %xmm0, %xmm0
>>>> vmovd %xmm0, %eax
>>>>
>>>> eh? I guess the lowpart set is not good (my patch has this
>>>> as well, but I got saved by never having vector modes to subset...).
>>>> Using
>>>>
>>>> (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 83 [ i ]))
>>>> (const_vector:V4SI [
>>>> (const_int 0 [0]) repeated x4
>>>> ])
>>>> (const_int 1 [0x1]))) "t3.c":5:10 -1
>>>>
>>>> for the move ends up with
>>>>
>>>> vpxor %xmm1, %xmm1, %xmm1
>>>> vpinsrd $0, %esi, %xmm1, %xmm0
>>>>
>>>> eh? LRA chooses the correct alternative here but somehow
>>>> postreload CSE CSEs the zero with the xmm1 clearing, leading
>>>> to the vpinsrd... (I guess a general issue, not sure if really
>>>> worse - definitely a larger instruction). Unfortunately
>>>> postreload-cse doesn't add a reg-equal note. This happens only
>>>> when emitting the reg move before the use, not doing that emits
>>>> a vmovd as expected.
>>>>
>>>> At least the spilling is gone here.
>>>>
>>>> I am re-testing as follows, the main change is that
>>>> general_scalar_chain::make_vector_copies now generates a
>>>> vector pseudo as destination (and I've fixed up the code
>>>> to not generate (subreg:V4SI (reg:V4SI 1234) 0)).
>>>>
>>>> Hope this fixes the observed slowdowns (it fixes the new testcase).
>>>
>>> It fixes the slowdown observed in 416.gamess and 464.h264ref.
>>>
>>> Bootstrapped on x86_64-unknown-linux-gnu, testing still in progress.
>>>
>>> CCing Jeff who "knows RTL".
>>>
>>> OK?
>>
>> Please add -mno-stv to gcc.target/i386/minmax-{1,2}.c to avoid
>> spurious test failures on SSE4.1 targets.
>
> Done. I've also adjusted the i386.md changelog as follows:
>
> * config/i386/i386.md (<maxmin><MAXMIN_IMODE>3): New expander.
> (*<maxmin><MAXMIN_IMODE>3_1): New insn-and-split.
> (*<maxmin>di3_doubleword): Likewise.
>
> I see
>
> FAIL: gcc.target/i386/pr65105-3.c scan-assembler ptest
> FAIL: gcc.target/i386/pr65105-5.c scan-assembler ptest
> FAIL: gcc.target/i386/pr78794.c scan-assembler pandn
>
> with the latest patch (this is with -m32) where -mstv causes
> all spills to go away and the cmoves replaced (so clearly
> better code after the patch) for pr65105-5.c, no obvious
> improvements for pr65105-3.c where cmov does appear with -mstv.
> I'd rather not "fix" those by adding -mno-stv but instead have
> the Intel people fix costing for slm and/or decide what to do.
> For pr65105-3.c I'm not sure why if-conversion didn't choose
> to use cmov, so clearly the enabled minmax patterns expose the
> "failure" here.
I'm not sure how much effort Intel is putting into Silvermont tuning
these days. So I'd suggest giving HJ a heads-up and a reasonable period
of time to take a looksie, but I wouldn't hold the patch for long due to
a Silvermont tuning issue.
jeff
- References:
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs
- Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs