After the patch that improves register preferencing in IRA and *removes the regmove* pass, we noticed performance degradation on several benchmarks from the eembc2.0 suite in 32-bit mode for all x86 targets (such as atom, slm, hsw, etc.). This can be reproduced with the attached test case: after the fix, 3 more instructions are generated for the innermost loop (compiled with -O2 -m32 -march=core-avx2 options):

before fix

.L4:
        movl    12(%esp), %edx
        addl    $3, %ecx
        movl    4(%esp), %ebx
        movl    (%esp), %ebp
        movl    8(%esp), %esi
        movzbl  (%edx,%eax), %edi
        movl    16(%esp), %edx
        movzbl  (%ebx,%eax), %ebx
        movzbl  (%esi,%eax), %esi
        addl    $1, %eax
        addl    (%edx,%edi,4), %ebp
        movzbl  0(%ebp,%ebx), %edx
        movl    28(%esp), %ebp
        movb    %dl, -3(%ecx)
        movl    24(%esp), %edx
        movl    (%edx,%edi,4), %edx
        movl    (%esp), %edi
        addl    0(%ebp,%esi,4), %edx
        leal    (%edi,%ebx), %ebp
        sarl    $16, %edx
        movzbl  0(%ebp,%edx), %edx
        movl    20(%esp), %ebp
        movb    %dl, -2(%ecx)
        movl    0(%ebp,%esi,4), %edx
        addl    %edi, %edx
        movzbl  (%edx,%ebx), %edx
        movb    %dl, -1(%ecx)
        cmpl    80(%esp), %eax
        jne     .L4

after fix

.L4:
        movl    8(%esp), %ebx
        addl    $3, %edx
        movl    12(%esp), %esi
        movl    4(%esp), %ecx
        movzbl  (%ebx,%eax), %ebx
        movzbl  (%esi,%eax), %esi
        movzbl  (%ecx,%eax), %ecx
        addl    $1, %eax
        movb    %bl, (%esp)
        movl    16(%esp), %ebx
        movl    (%ebx,%esi,4), %ebp
        addl    %edi, %ebp
        movzbl  0(%ebp,%ecx), %ebx
        movzbl  (%esp), %ebp
        movb    %bl, -3(%edx)
        movl    24(%esp), %ebx
        movl    %ebp, (%esp)
        movl    (%ebx,%esi,4), %esi
        movl    28(%esp), %ebx
        addl    (%ebx,%ebp,4), %esi
        leal    (%edi,%ecx), %ebp
        sarl    $16, %esi
        movzbl  0(%ebp,%esi), %ebx
        movl    20(%esp), %esi
        movl    (%esp), %ebp
        movb    %bl, -2(%edx)
        movl    %edi, %ebx
        addl    (%esi,%ebp,4), %ebx
        movzbl  (%ebx,%ecx), %ecx
        movb    %cl, -1(%edx)
        cmpl    80(%esp), %eax
        jne     .L4
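For readers without the attachment: the shape of the problem is an inner loop that does several byte-table lookups per iteration. A minimal sketch of that kind of loop follows; it is purely illustrative and is NOT attachment 31178 (the function and table names are invented). With only about seven allocatable integer registers in -m32 mode, the table base pointers and the temporaries cannot all stay in registers, so the allocator's preferencing decisions determine how many stack reloads land inside the loop.

/* Purely illustrative reproducer sketch -- NOT the attached test case.
   Three byte streams are combined through several lookup tables per
   iteration, creating register pressure similar to the assembly above.  */
void
convert (unsigned char *dst, const unsigned char *s0,
         const unsigned char *s1, const unsigned char *s2,
         const int *t0, const int *t1, const int *t2, const int *t3,
         const unsigned char *clip, int n)
{
  int i;

  for (i = 0; i < n; i++)
    {
      int y = s0[i], u = s1[i], v = s2[i];

      dst[3 * i + 0] = clip[y + t0[u]];
      dst[3 * i + 1] = clip[y + ((t1[u] + t2[v]) >> 16)];
      dst[3 * i + 2] = clip[y + t3[v]];
    }
}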
Created attachment 31178 [details]
test case to reproduce

The test needs to be compiled with the -m32 option for any x86 target.
(In reply to Yuri Rumyantsev from comment #0)
> After the patch that improves register preferencing in IRA and *removes
> the regmove* pass, we noticed performance degradation on several benchmarks
> from the eembc2.0 suite in 32-bit mode for all x86 targets (such as atom,
> slm, hsw, etc.). This can be reproduced with the attached test case: after
> the fix, 3 more instructions are generated for the innermost loop (compiled
> with -O2 -m32 -march=core-avx2 options):

I am just curious what the overall score change is. Are there only performance degradations? Was something improved? In general, would you prefer to revert this patch? I ask because I am afraid that would be the only solution for the PR. Heuristic-based optimizations very frequently make some things better and some things worse; that is their nature. When I worked on this optimization I had to adjust about 15 tests from the GCC testsuites checking AVX, and in every one of them unnecessary register-shuffling moves were deleted after applying the patch.
Author: vmakarov
Date: Wed Nov 13 18:00:43 2013
New Revision: 204752

URL: http://gcc.gnu.org/viewcvs?rev=204752&root=gcc&view=rev
Log:
2013-11-13  Vladimir Makarov  <vmakarov@redhat.com>

        PR rtl-optimization/59036
        * ira-color.c (struct allocno_color_data): Add new members
        first_thread_allocno, next_thread_allocno, thread_freq.
        (sorted_copies): New static var.
        (allocnos_conflict_by_live_ranges_p, copy_freq_compare_func): Move up.
        (allocno_thread_conflict_p, merge_threads)
        (form_threads_from_copies, form_threads_from_bucket)
        (form_threads_from_colorable_allocno, init_allocno_threads): New
        functions.
        (bucket_allocno_compare_func): Add comparison by thread frequency
        and threads.
        (add_allocno_to_ordered_bucket): Rename to
        add_allocno_to_ordered_colorable_bucket.  Remove parameter.
        (push_only_colorable): Call form_threads_from_bucket.
        (color_pass): Call init_allocno_threads.  Use
        consideration_allocno_bitmap instead of coloring_allocno_bitmap
        to nullify allocno color data.
        (ira_initiate_assign, ira_finish_assign): Allocate/free
        sorted_copies.
        (coalesce_allocnos): Use static sorted copies.

Modified:
        trunk/gcc/ChangeLog
        trunk/gcc/ira-color.c
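As I read the log, the fix chains copy-connected allocnos into "threads" and lets bucket_allocno_compare_func order the coloring bucket by accumulated thread frequency, so allocnos tied by frequent moves tend to be colored together and get the same hard register. Below is a minimal standalone sketch of the merge step; it reuses the member names from the ChangeLog but with simplified, invented data types, and it is an illustration, not GCC's actual code.

/* Simplified illustration -- NOT the real ira-color.c code.
   Allocnos connected by copies form a "thread": FIRST_THREAD_ALLOCNO
   points at the thread head, NEXT_THREAD_ALLOCNO links the members,
   and THREAD_FREQ (meaningful only in the head) accumulates the
   members' execution frequencies.  */

#include <stddef.h>

struct allocno
{
  int freq;                             /* Execution frequency.  */
  int thread_freq;                      /* Total frequency of the thread.  */
  struct allocno *first_thread_allocno; /* Head of the containing thread.  */
  struct allocno *next_thread_allocno;  /* Next thread member or NULL.  */
};

/* Make allocno A a singleton thread initially.  */
static void
init_thread (struct allocno *a)
{
  a->first_thread_allocno = a;
  a->next_thread_allocno = NULL;
  a->thread_freq = a->freq;
}

/* Merge the thread headed by T2 into the thread headed by T1, so that
   the coloring order can treat all members as one unit.  */
static void
merge_threads (struct allocno *t1, struct allocno *t2)
{
  struct allocno *a, *last = NULL;

  if (t1 == t2)
    return;
  /* Retarget every member of T2's thread at the new head, finding
     the tail of T2's chain along the way.  */
  for (a = t2; a != NULL; a = a->next_thread_allocno)
    {
      a->first_thread_allocno = t1;
      last = a;
    }
  /* Splice T2's member chain in right after T1.  */
  last->next_thread_allocno = t1->next_thread_allocno;
  t1->next_thread_allocno = t2;
  t1->thread_freq += t2->thread_freq;
}

Presumably form_threads_from_copies then applies this merge to copies in decreasing frequency order, skipping pairs whose threads already conflict (allocno_thread_conflict_p), which would be how the extra shuffling moves in the reproducer disappear.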
I suppose this is fixed.