This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
On Sat, Nov 20, 2010 at 2:53 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Sat, Nov 20, 2010 at 12:31 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Fri, Nov 19, 2010 at 2:48 PM, Richard Guenther
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Nov 19, 2010 at 10:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>> On Thu, Nov 18, 2010 at 1:11 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>>>>> On Thu, Nov 18, 2010 at 12:36 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>>>>
>>>>>> Here is the patch for
>>>>>>
>>>>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46519
>>>>>>
>>>>>> We have 2 blocks pointing to each others. This patch first scans
>>>>>> all blocks without moving vzeroupper so that we can have accurate
>>>>>> information about upper 128bits at block entry.
>>>>>
>>>>> This introduces another insn scanning pass, almost the same as
>>>>> existing vzeroupper pass (modulo CALL_INSN/JUMP_INSN handling).
>>>>>
>>>>> So, if I understand correctly:
>>>>> - The patch removes the detection if the function ever touches AVX registers.
>>>>> - Due to this, all call_insn RTXes have to be decorated with
>>>>> CALL_NEEDS_VZEROUPPER.
>>>>> - A new pre-pass is required that scans all functions in order to
>>>>> detect functions with live AVX registers at exit, and at the same time
>>>>> marks the functions that *do not* use AVX registers.
>>>>> - Existing pass then re-scans everything to again detect functions
>>>>> with live AVX registers at exit and handles vzeroupper emission.
>>>>>
>>>>> I don't think this approach is acceptable. Maybe a LCM infrastructure
>>>>> can be used to handle this case?
>>>>>
>>>>
>>>> Here is the rewrite of the vzeroupper optimization pass.
>>>> To avoid circular dependency, it has 2 passes.  It
>>>> delays the circular dependency to the second pass
>>>> and avoid rescan as much as possible.
>>>>
>>>> I compared the bootstrap times with/without this patch
>>>> on 64bit Sandy Bridge with multilib and --with-fpmath=avx.
>>>> I enabled c,c++,fortran,java,lto,objc
>>>>
>>>> Without patch:
>>>>
>>>> 12378.70user 573.02system 41:54.21elapsed 515%CPU
>>>>
>>>> With patch
>>>>
>>>> 12580.56user 578.07system 42:25.41elapsed 516%CPU
>>>>
>>>> The overhead is about 1.6%.
>>>
>>> That's a quite big overhead for something that doesn't use FP
>>> math (and thus no AVX).
>>
>> AVX256 vector insns are independent of FP math.  They can be
>> generated by vectorizer as well as loop unroll.  We can limit
>> it to -O2 or -O3 if overhead is a big concern.
>
> Limiting it to -fexpensive-optimizations would be a good start.  Btw,
> how is code-size affected?  Does it make sense to disable it when
> optimizing a function for size?  As it affects performance of callees
> whether the caller is optimized for size or speed probably isn't the
> best thing to check.
>

We pay penalty at SSE<->AVX transition, not exactly in callee/caller.
We can just check optimize_size.

Here is the updated patch to limit vzeroupper optimization to
-fexpensive-optimizations and not optimizing for size.  OK for trunk?

Thanks.

-- 
H.J.
---
gcc/

2010-11-20  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* config/i386/i386.c (upper_128bits_state): New.
	(block_info_def): Remove upper_128bits_set and done.  Add
	state, referenced, count, processed and rescanned.
	(check_avx256_stores): Updated.
	(move_or_delete_vzeroupper_2): Updated.  Handle deleted BB_END.
	Call note_stores only if needed.  Set referenced and count.
	(move_or_delete_vzeroupper_1): Updated.  Set rescan_vzeroupper_p.
	(rescan_move_or_delete_vzeroupper): New.
	(move_or_delete_vzeroupper): Process and rescan all basic
	blocks instead of predecessor blocks of all exit points.
	(ix86_option_override_internal): Enable vzeroupper optimization
	only for -fexpensive-optimizations and not optimizing for size.
	(use_avx256_p): Removed.
	(init_cumulative_args): Don't set use_avx256_p.
	(ix86_function_arg): Likewise.
	(ix86_expand_move): Likewise.
	(ix86_expand_vector_move_misalign): Likewise.
	(ix86_local_alignment): Likewise.
	(ix86_minimum_alignment): Likewise.
	(ix86_expand_epilogue): Don't check use_avx256_p when generating
	vzeroupper.
	(ix86_expand_call): Likewise.

	* config/i386/i386.h (machine_function): Remove use_vzeroupper_p
	and use_avx256_p.  Add rescan_vzeroupper_p.

gcc/testsuite/

2010-11-20  H.J. Lu  <hongjiu.lu@intel.com>

	PR target/46519
	* gcc.target/i386/avx-vzeroupper-10.c: Expect no avx_vzeroupper.
	* gcc.target/i386/avx-vzeroupper-11.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-14.c: Replace -O0 with -O2.
	* gcc.target/i386/avx-vzeroupper-15.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-16.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-17.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-20.c: New.
	* gcc.target/i386/avx-vzeroupper-21.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-22.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-23.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-24.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-25.c: Likewise.
	* gcc.target/i386/avx-vzeroupper-26.c: Likewise.
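
For readers following the thread, the circular dependency mentioned above (two blocks pointing to each other, each needing the other's exit state to know its own entry state) can be illustrated with a toy model. The sketch below is not the code from the attached patch and does not reproduce its two-pass structure; it is a generic iterative fixpoint over a small control-flow graph. Every name in it (`upper_state`, `block_effect`, `struct block`, `solve`, `mutual_cycle_entry`) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names come from i386.c.  */
enum upper_state { UNUSED, USED };

enum block_effect
{
  TRANSPARENT,   /* block leaves the upper 128 bits as it found them */
  SETS_UPPER,    /* block ends with the upper 128 bits in use */
  CLEARS_UPPER   /* block ends after clearing the upper 128 bits */
};

#define MAX_PREDS 4

struct block
{
  enum block_effect effect;
  int npreds;
  int preds[MAX_PREDS];          /* indices of predecessor blocks */
  enum upper_state entry_state;  /* state on entry to the block */
  enum upper_state exit_state;   /* state on exit from the block */
};

/* Exit state of a block, given its effect and its entry state.  */
static enum upper_state
transfer (enum block_effect effect, enum upper_state entry)
{
  switch (effect)
    {
    case SETS_UPPER:
      return USED;
    case CLEARS_UPPER:
      return UNUSED;
    default:
      return entry;
    }
}

/* Sweep all blocks until no state changes.  Entry is USED when any
   predecessor exits with USED.  Returns the number of sweeps.  */
int
solve (struct block *blocks, int n)
{
  int sweeps = 0;
  bool changed = true;

  for (int i = 0; i < n; i++)
    {
      assert (blocks[i].npreds <= MAX_PREDS);
      blocks[i].entry_state = UNUSED;
      blocks[i].exit_state = transfer (blocks[i].effect, UNUSED);
    }

  while (changed)
    {
      changed = false;
      sweeps++;
      for (int i = 0; i < n; i++)
        {
          enum upper_state in = UNUSED;
          for (int p = 0; p < blocks[i].npreds; p++)
            if (blocks[blocks[i].preds[p]].exit_state == USED)
              in = USED;

          enum upper_state out = transfer (blocks[i].effect, in);
          if (in != blocks[i].entry_state || out != blocks[i].exit_state)
            {
              blocks[i].entry_state = in;
              blocks[i].exit_state = out;
              changed = true;
            }
        }
    }
  return sweeps;
}

/* Block 0 feeds block 1; blocks 1 and 2 point to each other.
   Returns the entry state computed for block 2.  */
enum upper_state
mutual_cycle_entry (enum block_effect first)
{
  struct block blocks[3] = {
    { .effect = first,       .npreds = 0 },
    { .effect = TRANSPARENT, .npreds = 2, .preds = { 0, 2 } },
    { .effect = TRANSPARENT, .npreds = 1, .preds = { 1 } },
  };
  solve (blocks, 3);
  return blocks[2].entry_state;
}
```

In this model, `mutual_cycle_entry (SETS_UPPER)` reports the upper bits as in use at the entry of the second block in the cycle, while `mutual_cycle_entry (CLEARS_UPPER)` reports them as unused. A single scan in block order cannot settle a cycle in general, which is why some form of rescan is needed.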
Attachment:
gcc-pr46519-4.patch
Description: Text document
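
As a side note on the bootstrap timings quoted in the message, the "about 1.6%" figure appears to correspond to the user-time column; the elapsed (wall-clock) times differ by a somewhat smaller fraction. A minimal sketch of the arithmetic, with hypothetical helper names `overhead_percent` and `elapsed_seconds`:

```c
#include <assert.h>

/* Relative increase from BEFORE to AFTER, in percent.  */
double
overhead_percent (double before, double after)
{
  assert (before > 0.0);
  return (after - before) / before * 100.0;
}

/* Convert a "minutes:seconds" elapsed time, as printed by time(1),
   to seconds.  */
double
elapsed_seconds (int minutes, double seconds)
{
  return minutes * 60.0 + seconds;
}
```

With the numbers from the message, `overhead_percent (12378.70, 12580.56)` is roughly 1.63, and `overhead_percent (elapsed_seconds (41, 54.21), elapsed_seconds (42, 25.41))` is roughly 1.24.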