Bug 68654 - [6 Regression] CoreMark Pro performance degradation
Summary: [6 Regression] CoreMark Pro performance degradation
Status: RESOLVED DUPLICATE of bug 64058
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 6.0
Importance: P3 normal
Target Milestone: 6.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2015-12-02 13:34 UTC by Alexander Fomin
Modified: 2016-03-10 20:25 UTC
CC List: 4 users

See Also:
Host:
Target: i686-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-12-02 00:00:00


Attachments
Dumps (38.50 KB, application/x-compressed)
2015-12-08 15:47 UTC, Igor Zamyatin
Details
Detailed dumps (9.67 KB, application/x-xz)
2015-12-15 12:40 UTC, Alexander Fomin
Details
Detailed RTL expand dumps (59.62 KB, application/x-xz)
2016-01-25 12:58 UTC, Alexander Fomin
Details
patch for SSA name management (944 bytes, patch)
2016-01-26 12:31 UTC, Richard Biener
Details | Diff

Description Alexander Fomin 2015-12-02 13:34:09 UTC
After r228668 we can see a performance degradation in CoreMark Pro benchmark compiled with -m32 -Ofast -funroll-loops for Haswell CPU.
Comment 1 Alexander Fomin 2015-12-02 13:39:47 UTC
The degrading test is naturally CoreMark itself.
Comment 2 Richard Biener 2015-12-02 14:12:54 UTC
Can you please produce a testcase from the relevant part of CoreMark?
Comment 3 Richard Biener 2015-12-02 14:14:30 UTC
Btw, that revision is unlikely to be the correct bisection result, is it?  It should have zero effect on code generation.
Comment 4 Alexander Fomin 2015-12-02 14:30:21 UTC
It should, but for some reason we see a stable reproducible degradation between r228667 and r228668...
Comment 5 Richard Biener 2015-12-02 15:14:06 UTC
Interesting - the only possible effect is a different GC allocation pattern, because
the non-splice variant may end up re-allocating the target vector multiple
times.  But that alone should never change code generation.

So - can you share a preprocessed file where code generation changes with
this revision?
Comment 6 Igor Zamyatin 2015-12-08 15:47:59 UTC
Created attachment 36961 [details]
Dumps

Profilers show that core_state_transition and calc_func indeed became slower after r228668.

The first difference in the dumps seems to start in the expand pass.  I attached the dumps - a quick look shows extra register copies in several places.

Options used: -m32 -Ofast -funroll-loops -flto -static -march=core-avx2
Comment 7 Jakub Jelinek 2015-12-09 10:51:20 UTC
I think the difference is that the older flush_ssaname_freelist loop added the SSA_NAMEs from FREE_SSANAMES_QUEUE (cfun) in reverse order: it first pushed the last SSA_NAME from FREE_SSANAMES_QUEUE (cfun), then the one before that, and so on down to the first one, while the splice version pushes them in the order they appear.  The new behavior is closer to the pre-FREE_SSANAMES_QUEUE behavior, where the last SSA_NAME to be freed gets reused first.  The r228667 behavior was very weird: it would first try to reuse the oldest SSA_NAME freed since the previous flush_ssaname_freelist call, work up to the newest such SSA_NAME, and then continue with the even older ones, etc.
That said, on CoreMark with the options you've provided I don't see any difference in core_state_transition, nor in calc_func, but in core_list_join.c I see lots of changes in other functions (core_list_mergesort and its constprop.2 variant).
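The two flush orders described above can be sketched in standalone C++.  This is a hypothetical model, not GCC's actual code: versions are plain ints, and the freelist is popped from the back when a new name is allocated.

```cpp
#include <cassert>
#include <vector>

// r228667 behavior: walk the queue backwards, so the freelist ends with the
// OLDEST freed version on top (it gets popped and reused first).
std::vector<int> flush_reversed(std::vector<int> freelist,
                                const std::vector<int> &queue) {
  for (auto it = queue.rbegin(); it != queue.rend(); ++it)
    freelist.push_back(*it);
  return freelist;
}

// r228668 behavior: splice the queue in order, so the freelist ends with the
// NEWEST freed version on top (LIFO reuse, closer to the pre-queue behavior).
std::vector<int> flush_spliced(std::vector<int> freelist,
                               const std::vector<int> &queue) {
  freelist.insert(freelist.end(), queue.begin(), queue.end());
  return freelist;
}
```

With a queue of freed versions {7, 9, 12}, the reversed flush hands out 7 (the oldest freed) first, while the spliced flush hands out 12 (the newest freed) first, matching the two behaviors contrasted above.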
Comment 8 Richard Biener 2015-12-09 11:07:00 UTC
(In reply to Jakub Jelinek from comment #7)
> I think the difference is that the older flush_ssaname_freelist loop added
> the SSA_NAMEs from FREE_SSANAMES_QUEUE (cfun) in reverse order, i.e. it
> first pushed the last SSA_NAME from FREE_SSANAMES_QUEUE (cfun), then the one
> before that and so on until the first one, while the splice version pushes
> them in the order they appear.  The new one is closer to the
> pre-FREE_SSANAMES_QUEUE behavior, last SSA_NAME to be freed gets reused
> first, the r228667 behavior has been very weird, it would try to reuse first
> the oldest SSA_NAME freed since the one before last flush_ssaname_freelist
> call, until the newest such SSA_NAME, then continue with the even older ones
> etc.
> That said, on CoreMark with the options you've provided I don't see any
> difference in core_state_transition, nor in calc_func, but on
> core_list_join.c see lots of changes in other functions (core_list_mergesort
> and its constprop.2 variant).

I'd argue that ideally we'd want to re-use SSA names in SSA_NAME_VERSION order
to make ssa_names () dense.  And we'd release the "tail" to GC.

So as we now have two vectors for free SSA names we could as well keep
free_ssanames sorted, sort free_ssanames_queue and then merge them
appropriately... (sort them in reverse order so we can still pop() from
the free vector)
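The sorted-merge idea above can be sketched as standalone C++.  This is a hypothetical model under the assumptions that versions are plain ints, both lists are kept sorted in descending order, and allocation pops from the back of the vector.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <vector>

// Merge the queue of newly freed versions into the free list, keeping the
// result sorted DESCENDING so pop_back() always hands out the smallest
// available version -- which keeps the live name space dense.
std::vector<int> merge_freelists(std::vector<int> freelist,
                                 std::vector<int> queue) {
  std::sort(queue.begin(), queue.end(), std::greater<int>());
  std::vector<int> merged;
  std::merge(freelist.begin(), freelist.end(), queue.begin(), queue.end(),
             std::back_inserter(merged), std::greater<int>());
  return merged;  // pop_back() now yields the lowest free version
}
```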
Comment 9 Jeffrey A. Law 2015-12-10 07:11:17 UTC
Without looking at the dumps or the code we generate, the only thing that makes any sense would be the different SSA_NAME_VERSIONs resulting in different coalescing.

If that is indeed the case, then it's an indication that performance is likely to bounce around randomly due to seemingly innocuous changes.  Determining whether there's something we ought to do -- both to stabilize the resulting code and, hopefully, to get consistently good performance -- would require some deep investigation of the coalescing heuristics.
Comment 10 Jeffrey A. Law 2015-12-10 07:25:52 UTC
I've just scanned the provided dumps.  This is almost certainly a difference in the SSA_NAME_VERSIONs causing different coalescing.  The tell-tale sign is seeing identical code in the .optimized dump, except that the SSA_NAME versions are different, then seeing different code being generated in the .expand dump.

This can be confirmed by providing the same dumps, but with the "details" modifier, i.e., -fdump-tree-optimized-details.
Comment 11 Alexander Fomin 2015-12-15 12:40:31 UTC
Created attachment 37037 [details]
Detailed dumps

Attached the detailed dumps for optimized SSA tree.
Comment 12 Jeffrey A. Law 2016-01-19 23:13:43 UTC
Ah nuts.  The partitioning information is part of the .expand dump, not the .optimized dump.   -fdump-rtl-expand-details.
Comment 13 Alexander Fomin 2016-01-25 12:58:46 UTC
Created attachment 37460 [details]
Detailed RTL expand dumps

Ready.
Comment 14 Jeffrey A. Law 2016-01-25 19:45:49 UTC
Thanks.  This is definitely an issue with the changing version #s changing the ordering in which particular coalescing pairs are tried.

This is most likely coming from this (and related) code in tree-ssa-coalesce.c:

            case GIMPLE_ASSIGN:
              {
                tree lhs = gimple_assign_lhs (stmt);
                tree rhs1 = gimple_assign_rhs1 (stmt);
                if (gimple_assign_ssa_name_copy_p (stmt)
                    && gimple_can_coalesce_p (lhs, rhs1))
                  {
                    v1 = SSA_NAME_VERSION (lhs);
                    v2 = SSA_NAME_VERSION (rhs1);
                    cost = coalesce_cost_bb (bb);
                    add_coalesce (cl, v1, v2, cost);
                    bitmap_set_bit (used_in_copy, v1);
                    bitmap_set_bit (used_in_copy, v2);
                  }

Note how the version #s are passed into add_coalesce.

Those are then used to order the coalesce pair for hashing in find_coalesce_pair as well as in the qsort function compare_pairs.  They could easily be used elsewhere.

So while we do pull pairs out in a priority order, once the priorities are the same, the order selected is based on the SSA_NAME_VERSION, which, due to SSA_NAME recycling, is effectively random.  You can see this by looking at the Sorted Coalesce list in the dumps.  The format is

(cost) obj1 <-> obj2

If you compare the dumps you'll see that we have certain coalescing opportunities with the same cost, but which are ordered differently because of the SSA_NAME_VERSION differences.

The biggest obvious downside to coalescing being dependent on SSA_NAME_VERSION is that a change in how names are returned to the name manager can affect code generation -- essentially causing code to improve or degrade in a random fashion.

One approach to fixing this problem might be to recompute the SSA_NAME_VERSION #s just prior to coalescing.

Essentially do a domwalk or some other walk of the IL reassigning version #s for the LHS operand.   This won't be 100% foolproof, but would probably be more stable than what we're doing now.  That may be possible in the gcc-6 timeframe, I'm not really sure right now.
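The instability described above can be modeled in a few lines of standalone C++.  This is a hypothetical comparator in the spirit of compare_pairs, not GCC's actual code: when costs tie, the version numbers decide, and name recycling makes those effectively random.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A toy coalesce pair: two SSA version numbers and a cost.
struct coalesce_pair { int v1, v2, cost; };

// Higher cost sorts first; equal-cost pairs fall back to version numbers,
// which is the unstable part -- recycled versions reorder equal-cost pairs.
bool pair_before(const coalesce_pair &a, const coalesce_pair &b) {
  if (a.cost != b.cost)
    return a.cost > b.cost;   // primary key: cost
  if (a.v1 != b.v1)
    return a.v1 < b.v1;       // tie-break: SSA versions (unstable)
  return a.v2 < b.v2;
}
```

If two pairs have equal cost, whichever happened to receive smaller recycled versions wins the tie, so an unrelated change in freelist order changes which coalesce is attempted first.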
Comment 15 Richard Biener 2016-01-26 11:42:34 UTC
(In reply to Jeffrey A. Law from comment #14)
> Thanks.  This is definitely an issue with the changing version #s changing
> the ordering in which particular coalescing pairs are tried.
> 
> This is most likely coming from this (and related) code in
> tree-ssa-coalesce.c:
> 
>             case GIMPLE_ASSIGN:
>               {
>                 tree lhs = gimple_assign_lhs (stmt);
>                 tree rhs1 = gimple_assign_rhs1 (stmt);
>                 if (gimple_assign_ssa_name_copy_p (stmt)
>                     && gimple_can_coalesce_p (lhs, rhs1))
>                   {
>                     v1 = SSA_NAME_VERSION (lhs);
>                     v2 = SSA_NAME_VERSION (rhs1);
>                     cost = coalesce_cost_bb (bb);
>                     add_coalesce (cl, v1, v2, cost);
>                     bitmap_set_bit (used_in_copy, v1);
>                     bitmap_set_bit (used_in_copy, v2);
>                   }
> 
> Note how the version #s are passed into add_coalesce.
> 
> Those are then used to order the coalesce pair for hashing in
> find_coalesce_pair as well as in the qsort function compare_pairs.  They
> could easily be used elsewhere.
> 
> So while we do pull pairs out in a priority order, once the priorities are
> the same, the order selected is based on the SSA_NAME_VERSION, which, due to
> SSA_NAME recycling is effectively random.  You can see this by looking at
> the Sorted Coalesce list in the dumps.    The format is
> 
> (cost) obj1 <->obj2
> 
> If you compare the dumps you'll see that we have certain coalescing
> opportunities with the same cost, but which are ordered differently because
> of the SSA_NAME_VERSION differences.
> 
> The biggest obvious downside to coalescing  being dependent on
> SSA_NAME_VERSION is that a change in how names are returned to the name
> manager can affect code generation -- essentially causing code to improve or
> degrade in a random fashion.
> 
> One approach to fixing this problem might be to recompute the
> SSA_NAME_VERSION #s just prior to coalescing.
> 
> Essentially do a domwalk or some other walk of the IL reassigning version #s
> for the LHS operand.   This won't be 100% foolproof, but would probably be
> more stable than what we're doing now.  That may be possible in the gcc-6
> timeframe, I'm not really sure right now.

I think the easier solution is to preserve the order in which we created the coalesce
pairs by adding a "coalesce number" to sort against if the costs are equal.
coalesce_pair has three ints, so on 64-bit hosts there isn't any memory
wasted by adding another one, given we go via glibc malloc for allocating
them (ugh).

Of course, whether that produces the desired order for the benchmark is another
question ;)  (it's really random before as well)

But comment #8 also still holds (keep the free ssa name list we allocate
from sorted to pack names dense).
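The "coalesce number" idea can be sketched as a hypothetical comparator (standalone C++, not the actual GCC patch): the extra int records creation order and decides ties instead of the recycled version numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy coalesce pair extended with a creation-order "coalesce number".
struct coalesce_pair { int v1, v2, cost, index; };

// Still sort by cost first; on a tie, preserve the order in which the
// pairs were created (a BB walk), independent of SSA version recycling.
bool stable_pair_before(const coalesce_pair &a, const coalesce_pair &b) {
  if (a.cost != b.cost)
    return a.cost > b.cost;   // primary key: cost
  return a.index < b.index;   // equal cost: creation order, stable
}
```

With this tie-breaker, two equal-cost pairs keep their creation order even when name recycling would have given the later pair smaller version numbers.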
Comment 16 Richard Biener 2016-01-26 12:31:02 UTC
Created attachment 37474 [details]
patch for SSA name management

This patch changes SSA name freelist management to keep the freelist sorted,
and at flush_ssaname_freelist time it shrinks num_ssa_names as much as possible.

The goal is to make used SSA names as dense as possible.

Does this patch fix the CoreMark regression by chance?  (I'll note that a
proper fix would change the cost of the important coalesce so it is carried out
regardless of SSA name order.)
Comment 17 Alexander Fomin 2016-01-26 13:45:33 UTC
Unfortunately, it doesn't.  Moreover, it causes another performance loss (about 1.2%).
Comment 18 Richard Biener 2016-01-26 14:37:50 UTC
(In reply to Alexander Fomin from comment #17)
> Unfortunately, it doesn't. Moreover, it causes another perf loss (about
> 1.2%).

Heh.  What about testcases?
Comment 19 Jeffrey A. Law 2016-01-26 15:40:24 UTC
What's initially more important than a particular improvement or degradation of performance is bringing more stability to the coalescing results -- so that we can focus on issues with coalescing rather than getting distracted by the semi-random performance improvements/drops that we have now.
Comment 20 rguenther@suse.de 2016-01-26 15:44:31 UTC
On Tue, 26 Jan 2016, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68654
> 
> --- Comment #19 from Jeffrey A. Law <law at redhat dot com> ---
> What's initially more important than a particular improvement or degradation of
> performance is bringing more stability to the coalescing results -- so that we
> can focus on issues with coalescing rather than getting distracted by the
> semi-random performance improvements/drops that we have now.

Well, you can simply insert a pass_release_ssa_names before pass_expand,
which will compact SSA names for you.  It keeps SSA name ordering
the same, though, so it probably doesn't gain anything.

Then make the coalesce list stable by preserving the
coalesce candidate creation ordering (which works by walking BBs
and thus should be more stable in some sense).
Comment 21 Alexander Fomin 2016-01-26 17:49:52 UTC
(In reply to Richard Biener from comment #18)
> Heh.  What about testcases?
We don't have a reproducer yet, but I can provide any RTL dumps immediately (of course).
Comment 22 rguenther@suse.de 2016-01-27 08:23:13 UTC
On Tue, 26 Jan 2016, afomin.mailbox at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68654
> 
> --- Comment #21 from Alexander Fomin <afomin.mailbox at gmail dot com> ---
> (In reply to Richard Biener from comment #18)
> > Heh.  What about testcases?
> We don't have a reproducer yet, but I can provide any RTL dumps immediately (of
> course).

Can you simply attach preprocessed source of the relevant file?
Comment 23 Jeffrey A. Law 2016-02-08 21:23:24 UTC
I don't think it's worth the effort to try and keep that list sorted.  I think we can get what we want with a single walk over the IL just before coalescing.   That addresses the stability issue.

Then we will obviously want to look at whether or not there's something we're missing in the coalescer.  Maybe there's a reasonable tie-breaker that needs to be added or some such. But until we have stability in the coalescing phase, it's rather pointless to spend much effort on the coalescing algorithm as the results may easily be skewed by the instabilities.
Comment 24 rguenther@suse.de 2016-02-09 08:39:14 UTC
On Mon, 8 Feb 2016, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68654
> 
> --- Comment #23 from Jeffrey A. Law <law at redhat dot com> ---
> I don't think it's worth the effort to try and keep that list sorted.  I think
> we can get what we want with a single walk over the IL just before coalescing. 
>  That addresses the stability issue.
> 
> Then we will obviously want to look at whether or not there's something we're
> missing in the coalescer.  Maybe there's a reasonable tie-breaker that needs to
> be added or some such. But until we have stability in the coalescing phase,
> it's rather pointless to spend much effort on the coalescing algorithm as the
> results may easily be skewed by the instabilities.

Well, the results are only skewed for equal-cost candidates due to SSA
name allocation instabilities.  If we addressed this bug properly, the
important case to catch would have a higher cost.
Comment 25 Jeffrey A. Law 2016-03-10 20:25:24 UTC
Essentially the same problem as 64058.  64058 includes a testcase, so I'm going to make it the canonical bug to track this issue.

If I take my prototype patch for 64058 and apply it to r228667 and r228668, things stabilize considerably.  All the changes in core_list_mergesort and its constprop variant disappear, leaving us with trivial changes in core_bench_list.

The core_bench_list changes do not involve any additional copies -- they're cases where we swap the operands of comparisons due to changes in the SSA_NAME_VERSIONs, plus one trivial difference where a copy is sunk by one instruction.

If I compare r228667 to r228668+my patch, I do see considerable changes to the constprop variant of core_list_mergesort.  That's not unexpected, because r228667 essentially randomly orders coalesce pairs with the same cost, while r228668 will apply a real tie-breaking heuristic.  When I look at the resulting code, it appears that we get fewer copies with r228668+my patch when compared to r228667 -- which is likely a good thing.

Anyway, I'm confident that addressing 64058 will also address 68654.

*** This bug has been marked as a duplicate of bug 64058 ***