This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: PATCH: PR target/46519: Missing vzeroupper


On Sat, Nov 20, 2010 at 3:11 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

>>>>>>> Here is the patch for
>>>>>>>
>>>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>>>>>
>>>>>>> We have 2 blocks pointing to each other. This patch first scans
>>>>>>> all blocks without moving vzeroupper so that we can have accurate
>>>>>>> information about the upper 128 bits at block entry.
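[Editor: the pre-scan described above is essentially a forward dataflow fixed point over the CFG. A minimal sketch of the idea in Python follows; the block/function names are illustrative only, not taken from the patch, and the real pass of course operates on GCC's RTL CFG.]

```python
# Minimal sketch of the pre-scan described above: a forward dataflow
# fixed point computing, for each basic block, whether the upper
# 128 bits of the YMM registers may be dirty at block entry.
# Names are hypothetical; this is not the patch's actual code.

def upper128_dirty_at_entry(blocks, edges):
    """blocks: {name: True if the block itself dirties the upper bits}
    edges: {name: [successor names]}"""
    dirty_in = {b: False for b in blocks}
    changed = True
    while changed:                      # iterate to a fixed point;
        changed = False                 # this handles looping blocks
        for b, succs in edges.items():
            # upper bits are dirty at a successor's entry if they are
            # dirty at this block's entry or this block dirties them
            out = dirty_in[b] or blocks[b]
            for s in succs:
                if out and not dirty_in[s]:
                    dirty_in[s] = True
                    changed = True
    return dirty_in

# Two blocks pointing to each other (the PR 46519 shape): a single
# linear scan cannot know B's entry state before finishing A and vice
# versa, but the fixed-point iteration converges.
blocks = {"A": True, "B": False}
edges = {"A": ["B"], "B": ["A"]}
print(upper128_dirty_at_entry(blocks, edges))
```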
>>>>>>
>>>>>> This introduces another insn scanning pass, almost the same as the
>>>>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>>>>>
>>>>>> So, if I understand correctly:
>>>>>> - The patch removes the detection of whether the function ever touches AVX registers.
>>>>>> - Due to this, all call_insn RTXes have to be decorated with
>>>>>> CALL_NEEDS_VZEROUPPER.
>>>>>> - A new pre-pass is required that scans all functions in order to
>>>>>> detect functions with live AVX registers at exit, and at the same time
>>>>>> marks the functions that *do not* use AVX registers.
>>>>>> - The existing pass then re-scans everything to again detect functions
>>>>>> with live AVX registers at exit, and handles vzeroupper emission.
>>>>>>
>>>>>> I don't think this approach is acceptable. Maybe an LCM infrastructure
>>>>>> could be used to handle this case?
>>>>>>
>>>>>
>>>>> Here is the rewrite of the vzeroupper optimization pass.
>>>>> To avoid circular dependency, it has 2 passes. It
>>>>> delays the circular dependency to the second pass
>>>>> and avoids rescanning as much as possible.
>>>>>
>>>>> I compared the bootstrap times with/without this patch
>>>>> on 64-bit Sandy Bridge with multilib and --with-fpmath=avx.
>>>>> I enabled c, c++, fortran, java, lto, and objc.
>>>>>
>>>>> Without patch:
>>>>>
>>>>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>>>>>
>>>>> With patch
>>>>>
>>>>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>>>>>
>>>>> The overhead is about 1.6%.
>>>>
>>>> That's quite a big overhead for something that doesn't use FP
>>>> math (and thus no AVX).
>>>
>>> AVX256 vector insns are independent of FP math. They can be
>>> generated by the vectorizer as well as by loop unrolling. We can
>>> limit it to -O2 or -O3 if overhead is a big concern.
>>
>> Limiting it to -fexpensive-optimizations would be a good start. Btw,
>> how is code size affected? Does it make sense to disable it when
>> optimizing a function for size? Since it affects the performance of
>> callees, whether the caller is optimized for size or speed probably
>> isn't the best thing to check.
>>
>
> We pay a penalty at the SSE<->AVX transition, not exactly in the
> callee/caller. We can just check optimize_size.
>
> Here is the updated patch to limit the vzeroupper optimization to
> -fexpensive-optimizations and not optimizing for size. OK for trunk?
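[Editor: a hedged sketch of where the pass would actually emit vzeroupper within one block, under the gating discussed above. The insn kinds and flag names below are hypothetical stand-ins for the RTL insns and the -fexpensive-optimizations/optimize_size checks; this is an illustration of the technique, not the patch's code.]

```python
# Sketch: walk a block's insns, track whether the upper YMM halves are
# dirty, and insert a vzeroupper before a call or return while they
# are, to avoid the SSE<->AVX transition penalty.  The optimize_size
# gate mirrors the check discussed above.  Insn names are hypothetical.

def place_vzeroupper(insns, optimize_size=False):
    if optimize_size:                 # skip the optimization entirely
        return list(insns)            # when optimizing for size
    out, dirty = [], False
    for insn in insns:
        if insn == "avx256_op":       # a 256-bit AVX insn dirties the
            dirty = True              # upper halves of the YMM regs
        elif insn in ("call", "return") and dirty:
            out.append("vzeroupper")  # clear the upper halves first
            dirty = False
        out.append(insn)
    return out

print(place_vzeroupper(["avx256_op", "call", "sse_op", "return"]))
```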

ATM, I have no other (obvious) solution to the two-pass problem, although
I think LCM (please look at gcc/lcm.c) should be investigated to see if
it fits this job.

The patch demonstrates better generated code, so I propose to proceed
with it, although the solution is IMO non-optimal; it depends on
TARGET_VZEROUPPER and -fexpensive-optimizations.

So, since it looks that there are no other objections, the patch is OK
for mainline.

Uros.

