After the patch that improves register preferencing in IRA and *removes the regmove* pass, we noticed performance degradation on several benchmarks from the eembc2.0 suite in 32-bit mode for all x86 targets (such as atom, slm, hsw, etc.). This can be reproduced with the attached test case: after the fix, 3 more instructions are generated for the innermost loop (compiled with -O2 -m32 -march=core-avx2 options):

before fix

.L4:
        movl    12(%esp), %edx
        addl    $3, %ecx
        movl    4(%esp), %ebx
        movl    (%esp), %ebp
        movl    8(%esp), %esi
        movzbl  (%edx,%eax), %edi
        movl    16(%esp), %edx
        movzbl  (%ebx,%eax), %ebx
        movzbl  (%esi,%eax), %esi
        addl    $1, %eax
        addl    (%edx,%edi,4), %ebp
        movzbl  0(%ebp,%ebx), %edx
        movl    28(%esp), %ebp
        movb    %dl, -3(%ecx)
        movl    24(%esp), %edx
        movl    (%edx,%edi,4), %edx
        movl    (%esp), %edi
        addl    0(%ebp,%esi,4), %edx
        leal    (%edi,%ebx), %ebp
        sarl    $16, %edx
        movzbl  0(%ebp,%edx), %edx
        movl    20(%esp), %ebp
        movb    %dl, -2(%ecx)
        movl    0(%ebp,%esi,4), %edx
        addl    %edi, %edx
        movzbl  (%edx,%ebx), %edx
        movb    %dl, -1(%ecx)
        cmpl    80(%esp), %eax
        jne     .L4

after fix

.L4:
        movl    8(%esp), %ebx
        addl    $3, %edx
        movl    12(%esp), %esi
        movl    4(%esp), %ecx
        movzbl  (%ebx,%eax), %ebx
        movzbl  (%esi,%eax), %esi
        movzbl  (%ecx,%eax), %ecx
        addl    $1, %eax
        movb    %bl, (%esp)
        movl    16(%esp), %ebx
        movl    (%ebx,%esi,4), %ebp
        addl    %edi, %ebp
        movzbl  0(%ebp,%ecx), %ebx
        movzbl  (%esp), %ebp
        movb    %bl, -3(%edx)
        movl    24(%esp), %ebx
        movl    %ebp, (%esp)
        movl    (%ebx,%esi,4), %esi
        movl    28(%esp), %ebx
        addl    (%ebx,%ebp,4), %esi
        leal    (%edi,%ecx), %ebp
        sarl    $16, %esi
        movzbl  0(%ebp,%esi), %ebx
        movl    20(%esp), %esi
        movl    (%esp), %ebp
        movb    %bl, -2(%edx)
        movl    %edi, %ebx
        addl    (%esi,%ebp,4), %ebx
        movzbl  (%ebx,%ecx), %ecx
        movb    %cl, -1(%edx)
        cmpl    80(%esp), %eax
        jne     .L4
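For readers without the attachment: the shape of the problem is an inner loop that does several byte-table lookups per iteration. A minimal sketch of that kind of loop follows; it is purely illustrative and is NOT attachment 31178 (the function and table names are invented). With only about seven allocatable integer registers in -m32 mode, the table base pointers and the temporaries cannot all stay in registers, so the allocator's preferencing decisions determine how many stack reloads land inside the loop.

/* Purely illustrative reproducer sketch -- NOT the attached test case.
   Three byte streams are combined through several lookup tables per
   iteration, creating register pressure similar to the assembly above.  */
void
convert (unsigned char *dst, const unsigned char *s0,
         const unsigned char *s1, const unsigned char *s2,
         const int *t0, const int *t1, const int *t2, const int *t3,
         const unsigned char *clip, int n)
{
  int i;

  for (i = 0; i < n; i++)
    {
      int y = s0[i], u = s1[i], v = s2[i];

      dst[3 * i + 0] = clip[y + t0[u]];
      dst[3 * i + 1] = clip[y + ((t1[u] + t2[v]) >> 16)];
      dst[3 * i + 2] = clip[y + t3[v]];
    }
}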
Created attachment 31178 [details]
test case to reproduce

The test needs to be compiled with the -m32 option for any x86 target.
(In reply to Yuri Rumyantsev from comment #0)
> After the patch that improves register preferencing in IRA and *removes
> the regmove* pass, we noticed performance degradation on several benchmarks
> from the eembc2.0 suite in 32-bit mode for all x86 targets (such as atom,
> slm, hsw, etc.). This can be reproduced with the attached test case: after
> the fix, 3 more instructions are generated for the innermost loop (compiled
> with -O2 -m32 -march=core-avx2 options):

I am just curious what the overall score change is. Are there only performance degradations? Was something improved? In general, would you prefer to revert this patch? I ask because I am afraid that would be the only solution for the PR. Heuristic-based optimizations very frequently make some things better and some things worse; that is their nature. When I worked on this optimization I had to adjust about 15 tests from the GCC testsuites checking AVX, and in every one of them unnecessary register-shuffling moves were deleted after applying the patch.
Author: vmakarov
Date: Wed Nov 13 18:00:43 2013
New Revision: 204752

URL: http://gcc.gnu.org/viewcvs?rev=204752&root=gcc&view=rev
Log:
2013-11-13  Vladimir Makarov  <vmakarov@redhat.com>

        PR rtl-optimization/59036
        * ira-color.c (struct allocno_color_data): Add new members
        first_thread_allocno, next_thread_allocno, thread_freq.
        (sorted_copies): New static var.
        (allocnos_conflict_by_live_ranges_p, copy_freq_compare_func): Move up.
        (allocno_thread_conflict_p, merge_threads)
        (form_threads_from_copies, form_threads_from_bucket)
        (form_threads_from_colorable_allocno, init_allocno_threads): New
        functions.
        (bucket_allocno_compare_func): Add comparison by thread frequency
        and threads.
        (add_allocno_to_ordered_bucket): Rename to
        add_allocno_to_ordered_colorable_bucket.  Remove parameter.
        (push_only_colorable): Call form_threads_from_bucket.
        (color_pass): Call init_allocno_threads.  Use
        consideration_allocno_bitmap instead of coloring_allocno_bitmap
        to nullify allocno color data.
        (ira_initiate_assign, ira_finish_assign): Allocate/free
        sorted_copies.
        (coalesce_allocnos): Use static sorted copies.

Modified:
        trunk/gcc/ChangeLog
        trunk/gcc/ira-color.c
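As I read the log, the fix chains copy-connected allocnos into "threads" and lets bucket_allocno_compare_func order the coloring bucket by accumulated thread frequency, so allocnos tied by frequent moves tend to be colored together and get the same hard register. Below is a minimal standalone sketch of the merge step; it reuses the member names from the ChangeLog but with simplified, invented data types, and it is an illustration, not GCC's actual code.

/* Simplified illustration -- NOT the real ira-color.c code.
   Allocnos connected by copies form a "thread": FIRST_THREAD_ALLOCNO
   points at the thread head, NEXT_THREAD_ALLOCNO links the members,
   and THREAD_FREQ (meaningful only in the head) accumulates the
   members' execution frequencies.  */

#include <stddef.h>

struct allocno
{
  int freq;                             /* Execution frequency.  */
  int thread_freq;                      /* Total frequency of the thread.  */
  struct allocno *first_thread_allocno; /* Head of the containing thread.  */
  struct allocno *next_thread_allocno;  /* Next thread member or NULL.  */
};

/* Make allocno A a singleton thread initially.  */
static void
init_thread (struct allocno *a)
{
  a->first_thread_allocno = a;
  a->next_thread_allocno = NULL;
  a->thread_freq = a->freq;
}

/* Merge the thread headed by T2 into the thread headed by T1, so that
   the coloring order can treat all members as one unit.  */
static void
merge_threads (struct allocno *t1, struct allocno *t2)
{
  struct allocno *a, *last = NULL;

  if (t1 == t2)
    return;
  /* Retarget every member of T2's thread at the new head, finding
     the tail of T2's chain along the way.  */
  for (a = t2; a != NULL; a = a->next_thread_allocno)
    {
      a->first_thread_allocno = t1;
      last = a;
    }
  /* Splice T2's member chain in right after T1.  */
  last->next_thread_allocno = t1->next_thread_allocno;
  t1->next_thread_allocno = t2;
  t1->thread_freq += t2->thread_freq;
}

Presumably form_threads_from_copies then applies this merge to copies in decreasing frequency order, skipping pairs whose threads already conflict (allocno_thread_conflict_p), which would be how the extra shuffling moves in the reproducer disappear.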
I suppose this is fixed.