Bug 59036 - [4.9 regression] Performance degradation after r204212 on 32-bit x86 targets.
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.9.0
Importance: P3 normal
Target Milestone: 4.9.0
Assignee: Not yet assigned to anyone
 
Reported: 2013-11-07 10:16 UTC by Yuri Rumyantsev
Modified: 2013-11-19 15:03 UTC
CC: 2 users

Target: i?86-*-*


Attachments
test-case to reproduce (316 bytes, text/plain)
2013-11-07 10:18 UTC, Yuri Rumyantsev

Description Yuri Rumyantsev 2013-11-07 10:16:08 UTC
After the patch to improve register preferencing in IRA and to *remove the regmove* pass, we noticed performance degradation on several benchmarks from the EEMBC 2.0 suite in 32-bit mode for all x86 targets (such as Atom, SLM, HSW, etc.).
This can be reproduced with the attached test case: after the fix, 3 more instructions are generated for the innermost loop (compiled with the -O2 -m32 -march=core-avx2 options):

  before fix
.L4:
	movl	12(%esp), %edx
	addl	$3, %ecx
	movl	4(%esp), %ebx
	movl	(%esp), %ebp
	movl	8(%esp), %esi
	movzbl	(%edx,%eax), %edi
	movl	16(%esp), %edx
	movzbl	(%ebx,%eax), %ebx
	movzbl	(%esi,%eax), %esi
	addl	$1, %eax
	addl	(%edx,%edi,4), %ebp
	movzbl	0(%ebp,%ebx), %edx
	movl	28(%esp), %ebp
	movb	%dl, -3(%ecx)
	movl	24(%esp), %edx
	movl	(%edx,%edi,4), %edx
	movl	(%esp), %edi
	addl	0(%ebp,%esi,4), %edx
	leal	(%edi,%ebx), %ebp
	sarl	$16, %edx
	movzbl	0(%ebp,%edx), %edx
	movl	20(%esp), %ebp
	movb	%dl, -2(%ecx)
	movl	0(%ebp,%esi,4), %edx
	addl	%edi, %edx
	movzbl	(%edx,%ebx), %edx
	movb	%dl, -1(%ecx)
	cmpl	80(%esp), %eax
	jne	.L4

  after fix
.L4:
	movl	8(%esp), %ebx
	addl	$3, %edx
	movl	12(%esp), %esi
	movl	4(%esp), %ecx
	movzbl	(%ebx,%eax), %ebx
	movzbl	(%esi,%eax), %esi
	movzbl	(%ecx,%eax), %ecx
	addl	$1, %eax
	movb	%bl, (%esp)
	movl	16(%esp), %ebx
	movl	(%ebx,%esi,4), %ebp
	addl	%edi, %ebp
	movzbl	0(%ebp,%ecx), %ebx
	movzbl	(%esp), %ebp
	movb	%bl, -3(%edx)
	movl	24(%esp), %ebx
	movl	%ebp, (%esp)
	movl	(%ebx,%esi,4), %esi
	movl	28(%esp), %ebx
	addl	(%ebx,%ebp,4), %esi
	leal	(%edi,%ecx), %ebp
	sarl	$16, %esi
	movzbl	0(%ebp,%esi), %ebx
	movl	20(%esp), %esi
	movl	(%esp), %ebp
	movb	%bl, -2(%edx)
	movl	%edi, %ebx
	addl	(%esi,%ebp,4), %ebx
	movzbl	(%ebx,%ecx), %ecx
	movb	%cl, -1(%edx)
	cmpl	80(%esp), %eax
	jne	.L4
Comment 1 Yuri Rumyantsev 2013-11-07 10:18:37 UTC
Created attachment 31178 [details]
test-case to reproduce

The test needs to be compiled with the -m32 option; it reproduces on any x86 target.
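A reproduction might look like the following sketch (the filename `pr59036.c` stands in for the attachment and is illustrative, as is the inspection step):

```shell
# Hypothetical reproduction; "pr59036.c" stands in for the attached test case.
gcc -O2 -m32 -march=core-avx2 -S pr59036.c -o pr59036.s
# Inspect the innermost loop (label .L4 in the listings above):
sed -n '/^\.L4:/,/jne/p' pr59036.s
```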
Comment 2 Vladimir Makarov 2013-11-07 15:34:18 UTC
(In reply to Yuri Rumyantsev from comment #0)
> After the patch to improve register preferencing in IRA and to *remove
> the regmove* pass, we noticed performance degradation on several
> benchmarks from the EEMBC 2.0 suite in 32-bit mode for all x86 targets
> (such as Atom, SLM, HSW, etc.).
> This can be reproduced with the attached test case: after the fix,
> 3 more instructions are generated for the innermost loop (compiled with
> the -O2 -m32 -march=core-avx2 options):
> 

I am just curious: what is the overall score change?  Are there only performance degradations, or was something improved as well?

In general, would you prefer to revert this patch?  I am afraid that may be the only solution for this PR.

I am asking because heuristic-based optimizations very frequently generate something better and something worse.  That is their nature.

When I worked on this optimization I had to change about 15 AVX tests in the GCC testsuite, and found that in every one of them unnecessary register-shuffling moves were deleted after applying the patch.
Comment 3 Vladimir Makarov 2013-11-13 18:00:44 UTC
Author: vmakarov
Date: Wed Nov 13 18:00:43 2013
New Revision: 204752

URL: http://gcc.gnu.org/viewcvs?rev=204752&root=gcc&view=rev
Log:
2013-11-13  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/59036
	* ira-color.c (struct allocno_color_data): Add new members
	first_thread_allocno, next_thread_allocno, thread_freq.
	(sorted_copies): New static var.
	(allocnos_conflict_by_live_ranges_p, copy_freq_compare_func): Move
	up.
	(allocno_thread_conflict_p, merge_threads)
	(form_threads_from_copies, form_threads_from_bucket)
	(form_threads_from_colorable_allocno, init_allocno_threads): New
	functions.
	(bucket_allocno_compare_func): Add comparison by thread frequency
	and threads.
	(add_allocno_to_ordered_bucket): Rename to
	add_allocno_to_ordered_colorable_bucket.  Remove parameter.
	(push_only_colorable): Call form_threads_from_bucket.
	(color_pass): Call init_allocno_threads.  Use
	consideration_allocno_bitmap instead of coloring_allocno_bitmap
	to nullify allocno color data.
	(ira_initiate_assign, ira_finish_assign): Allocate/free
	sorted_copies.
	(coalesce_allocnos): Use static sorted copies.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira-color.c
Comment 4 Richard Biener 2013-11-19 15:03:28 UTC
I suppose fixed.