This is the mail archive of the
mailing list for the GCC project.
Re: PATCH: PR target/46519: Missing vzeroupper
On Fri, Nov 19, 2010 at 2:48 PM, Richard Guenther wrote:
> On Fri, Nov 19, 2010 at 10:30 PM, H.J. Lu <firstname.lastname@example.org> wrote:
>> On Thu, Nov 18, 2010 at 1:11 AM, Uros Bizjak <email@example.com> wrote:
>>> On Thu, Nov 18, 2010 at 12:36 AM, H.J. Lu <firstname.lastname@example.org> wrote:
>>>> Here is the patch for
>>>> We have 2 blocks pointing to each other. This patch first scans
>>>> all blocks without moving vzeroupper so that we can have accurate
>>>> information about upper 128bits at block entry.
>>> This introduces another insn scanning pass, almost the same as
>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>> So, if I understand correctly:
>>> - The patch removes the detection of whether the function ever touches AVX registers.
>>> - Due to this, all call_insn RTXes have to be decorated with
>>> - A new pre-pass is required that scans all functions in order to
>>> detect functions with live AVX registers at exit, and at the same time
>>> marks the functions that *do not* use AVX registers.
>>> - Existing pass then re-scans everything to again detect functions
>>> with live AVX registers at exit and handles vzeroupper emission.
>>> I don't think this approach is acceptable. Maybe the LCM infrastructure
>>> can be used to handle this case?
>> Here is the rewrite of the vzeroupper optimization pass.
>> To avoid circular dependency, it has 2 passes.  It
>> delays the circular dependency to the second pass
>> and avoids rescanning as much as possible.
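
For illustration, the two-pass idea described above can be sketched roughly as follows. This is not the code from the patch: the `block` structure, the `effect` field and the `analyze` function are hypothetical names. The sketch assumes that the function entry starts with the upper 128 bits unused, and that a cycle which never sees a 256-bit use resolves to unused. The first pass scans every block once. The second pass rescans only blocks whose entry state is still unknown because a predecessor had not been resolved yet, as happens with two blocks pointing to each other.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state of the upper 128 bits of the AVX registers. */
enum upper_state { UNKNOWN, UNUSED, USED };

/* Hypothetical summary of what a block does to that state. */
enum block_effect { PASS_THROUGH, MAKES_USED, MAKES_UNUSED };

#define MAX_PREDS 4

/* Illustrative per-block record; not GCC's block_info_def. */
struct block {
  int n_preds;
  int preds[MAX_PREDS];        /* indices of predecessor blocks */
  enum block_effect effect;
  enum upper_state entry, exit;
};

/* Entry state: USED if any predecessor exits USED, UNKNOWN if some
   predecessor is not resolved yet, otherwise UNUSED.  A block without
   predecessors is assumed to be the function entry (UNUSED). */
static enum upper_state merge_preds(const struct block *b, int i)
{
  bool unknown = false;
  if (b[i].n_preds == 0)
    return UNUSED;
  for (int k = 0; k < b[i].n_preds; k++) {
    enum upper_state s = b[b[i].preds[k]].exit;
    if (s == USED)
      return USED;
    if (s == UNKNOWN)
      unknown = true;
  }
  return unknown ? UNKNOWN : UNUSED;
}

/* Recompute one block; return true if its entry state changed. */
static bool scan_block(struct block *b, int i)
{
  enum upper_state old_entry = b[i].entry;
  b[i].entry = merge_preds(b, i);
  if (b[i].effect == MAKES_USED)
    b[i].exit = USED;
  else if (b[i].effect == MAKES_UNUSED)
    b[i].exit = UNUSED;
  else
    b[i].exit = b[i].entry;
  return b[i].entry != old_entry;
}

void analyze(struct block *b, int n)
{
  bool rescan = false;

  for (int i = 0; i < n; i++) {
    assert(b[i].n_preds <= MAX_PREDS);
    b[i].entry = b[i].exit = UNKNOWN;
  }

  /* Pass 1: scan every block once, remember if anything is unresolved. */
  for (int i = 0; i < n; i++) {
    scan_block(b, i);
    if (b[i].entry == UNKNOWN)
      rescan = true;
  }

  /* Pass 2: rescan only unresolved blocks until nothing changes. */
  while (rescan) {
    rescan = false;
    for (int i = 0; i < n; i++)
      if (b[i].entry == UNKNOWN && scan_block(b, i))
        rescan = true;
  }

  /* Assumption: whatever is still unknown never saw a 256-bit use. */
  for (int i = 0; i < n; i++) {
    if (b[i].entry == UNKNOWN)
      b[i].entry = UNUSED;
    if (b[i].exit == UNKNOWN)
      b[i].exit = UNUSED;
  }
}
```

With blocks 1 and 2 as predecessors of each other and block 2 touching a 256-bit register, the first pass leaves the entry of block 1 unknown and the rescan resolves it to USED.
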
>> I compared the bootstrap times with/without this patch
>> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
>> I enabled c,c++,fortran,java,lto,objc
>> Without patch:
>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>> With patch:
>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>> The overhead is about 1.6%.
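
As a quick check of the arithmetic, the timings quoted above can be compared with a small helper. The function names `overhead_percent` and `to_seconds` are made up for this illustration. The quoted figure of about 1.6% appears to correspond to the user time; the elapsed time gives a somewhat smaller number.

```c
#include <assert.h>

/* Relative overhead, in percent, of `patched` over `baseline`. */
static double overhead_percent(double baseline, double patched)
{
  assert(baseline > 0.0);
  return (patched - baseline) / baseline * 100.0;
}

/* Convert a minutes:seconds wall-clock reading to seconds. */
static double to_seconds(double minutes, double seconds)
{
  return minutes * 60.0 + seconds;
}
```

Calling `overhead_percent(12378.70, 12580.56)` gives roughly 1.63 for the user time, and `overhead_percent(to_seconds(41, 54.21), to_seconds(42, 25.41))` gives roughly 1.24 for the elapsed time.
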
> That's quite a big overhead for something that doesn't use FP
> math (and thus no AVX).
AVX256 vector insns are independent of FP math. They can be
generated by the vectorizer as well as by loop unrolling. We can limit
it to -O2 or -O3 if the overhead is a big concern.
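
As a small illustration of the point about the vectorizer, a plain loop like the one below contains no explicit AVX code, yet when compiled with something like `-O3 -mavx` the vectorizer may turn it into 256-bit instructions on the ymm registers, whatever the scalar FP math setting is. Whether that happens depends on the compiler version and tuning options; the function name `add_arrays` is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise dst[i] += src[i].  The arrays are assumed not to
   overlap (hence `restrict`), which makes the loop an easy target
   for auto-vectorization. */
void add_arrays(float *restrict dst, const float *restrict src, size_t n)
{
  assert(dst != NULL && src != NULL);
  for (size_t i = 0; i < n; i++)
    dst[i] += src[i];
}
```

A function like this can leave the upper 128 bits of the AVX registers dirty without the source ever mentioning AVX, which is the situation the vzeroupper pass has to detect.
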
>> 2010-11-19  H.J. Lu  <email@example.com>
>>         PR target/46519
>>         * config/i386/i386.c (upper_128bits_state): New.
>>         (block_info_def): Remove upper_128bits_set and done.  Add state,
>>         referenced, count, processed and rescanned.
>>         (check_avx256_stores): Updated.
>>         (move_or_delete_vzeroupper_2): Updated.  Handle deleted BB_END.
>>         Call note_stores only if needed.  Set referenced and count.
>>         (move_or_delete_vzeroupper_1): Updated.  Set rescan_vzeroupper_p.
>>         (rescan_move_or_delete_vzeroupper): New.
>>         (move_or_delete_vzeroupper): Process and rescan all basic
>>         blocks instead of predecessor blocks of all exit points.
>>         (use_avx256_p): Removed.
>>         (init_cumulative_args): Don't set use_avx256_p.
>>         (ix86_function_arg): Likewise.
>>         (ix86_expand_move): Likewise.
>>         (ix86_expand_vector_move_misalign): Likewise.
>>         (ix86_local_alignment): Likewise.
>>         (ix86_minimum_alignment): Likewise.
>>         (ix86_expand_epilogue): Don't check use_avx256_p when generating
>>         vzeroupper.
>>         (ix86_expand_call): Likewise.
>>         * config/i386/i386.h (machine_function): Remove use_vzeroupper_p
>>         and use_avx256_p.  Add rescan_vzeroupper_p.
>> 2010-11-17  H.J. Lu  <firstname.lastname@example.org>
>>         PR target/46519
>>         * gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
>>         * gcc.target/i386/avx-vzeroupper-11.c: Likewise.
>>         * gcc.target/i386/avx-vzeroupper-20.c: New.
>>         * gcc.target/i386/avx-vzeroupper-21.c: Likewise.
>>         * gcc.target/i386/avx-vzeroupper-22.c: Likewise.
>>         * gcc.target/i386/avx-vzeroupper-23.c: Likewise.
>>         * gcc.target/i386/avx-vzeroupper-24.c: Likewise.