Bug 84490 - [11/12/13/14 regression] 436.cactusADM regressed by 6-8% with -Ofast on Zen and Haswell, compared to gcc 7.2
Summary: [11/12/13/14 regression] 436.cactusADM regressed by 6-8% with -Ofast ...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P2 normal
Target Milestone: 11.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-02-20 17:47 UTC by Martin Jambor
Modified: 2023-07-07 10:33 UTC
CC List: 1 user

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-04-10 00:00:00


Attachments
r254011 with peeling disabled (53.38 KB, text/plain)
2018-04-10 12:25 UTC, Richard Biener
r254012 with peeling disabled (53.68 KB, text/plain)
2018-04-10 12:26 UTC, Richard Biener

Description Martin Jambor 2018-02-20 17:47:26 UTC
Compared to gcc 7.2, 436.cactusADM is about 8% and 6% slower when run
on an AMD Ryzen and an AMD EPYC respectively, after compiling it on
trunk with -Ofast (and generic march and tuning).

So far I have not done any bisecting; I have only verified that this is
most probably unrelated to PR 82362, because r251713 does not seem to
have any effect on the benchmark (the benchmark is already slow at that
revision).
Comment 1 Richard Biener 2018-04-10 08:19:01 UTC
I also see this for Haswell: https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/index.html

There it's more like 10-14% depending on which parts you look at.

For bisection it's a bit weird:

201710240032 r254030  base 48.3  peak 52.2
201710230039 r253996  base 64.7  peak 57.2
201710221240 r253982  base 64.6  peak 65.8
201710210035 r253966  base 65.6  peak 65.2

where base is -Ofast -march=haswell and peak adds -flto.

Note it might be that around this time I disabled address-space randomization,
just in case it was an issue similar to PR82362.  I just don't remember exactly,
so I'd have to reproduce the regression around these revs.

Between r253982 and r253996 the culprit would likely be

r253993 | hubicka | 2017-10-23 00:09:47 +0200 (Mon, 23 Oct 2017) | 12 lines


        * i386.c (ix86_builtin_vectorization_cost): Use existing rtx_cost
        latencies instead of having separate table; make difference between
        integer and float costs.
        * i386.h (processor_costs): Remove scalar_stmt_cost,
        scalar_load_cost, scalar_store_cost, vec_stmt_cost, vec_to_scalar_cost,
        scalar_to_vec_cost, vec_align_load_cost, vec_unalign_load_cost,
        vec_store_cost.
        * x86-tune-costs.h: Remove entries which has been removed in
        procesor_costs from all tables; make cond_taken_branch_cost
        and cond_not_taken_branch_cost COST_N_INSNS based.

Similarly, the other range includes

r254012 | hubicka | 2017-10-23 17:10:09 +0200 (Mon, 23 Oct 2017) | 15 lines


        * i386.c (dimode_scalar_chain::compute_convert_gain): Use
        xmm_move instead of sse_move.
        (sse_store_index): New function.
        (ix86_register_move_cost): Be more sensible about mismatch stall;
        model AVX moves correctly; make difference between sse->integer and
        integer->sse.
        (ix86_builtin_vectorization_cost): Model correctly aligned and unaligned
        moves; make difference between SSE and AVX.
        * i386.h (processor_costs): Remove sse_move; add xmm_move, ymm_move
        and zmm_move. Increase size of sse load and store tables;
        add unaligned load and store tables; add ssemmx_to_integer.
        * x86-tune-costs.h: Update all entries according to real 
        move latencies from Agner Fog's manual and chip documentation.

So at first glance it indeed looks like a target (vectorization) cost model
issue.
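
For context: the vectorizer asks the target backend for a cost per kind of
statement (scalar vs. vector, aligned vs. unaligned loads and stores), and
the commits above changed x86 to derive those numbers from the real move
latencies instead of a separate hand-maintained table.  A minimal
self-contained C sketch of such a per-statement cost table (illustrative
names and numbers only, not GCC's actual hook or values):

  /* Hypothetical per-statement-kind costs, roughly "cycles".  If unaligned
     accesses are modeled as more expensive than aligned ones, peeling for
     alignment looks profitable; if they cost the same, the vectorizer
     skips the peel.  */
  enum stmt_kind { SCALAR_STMT, SCALAR_LOAD, SCALAR_STORE,
                   VECTOR_STMT, ALIGNED_LOAD, UNALIGNED_LOAD,
                   ALIGNED_STORE, UNALIGNED_STORE };

  static const int stmt_cost[] = {
    [SCALAR_STMT]     = 1,
    [SCALAR_LOAD]     = 1,
    [SCALAR_STORE]    = 1,
    [VECTOR_STMT]     = 1,
    [ALIGNED_LOAD]    = 1,
    [UNALIGNED_LOAD]  = 2,
    [ALIGNED_STORE]   = 1,
    [UNALIGNED_STORE] = 2,
  };

  /* Total cost the vectorizer would see for COUNT statements of KIND.  */
  static int vectorization_cost (enum stmt_kind kind, int count)
  {
    return count * stmt_cost[kind];
  }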

Profiling the difference between non-LTO r253982 and r254030 might single
out the important loop(s).  Note that we did recover performance later.
cactusADM is a bit noisy (see that other PR), but base is now in the range
of 51-55, with peak a little bit higher than that.
Comment 2 Richard Biener 2018-04-10 08:25:23 UTC
I'll get numbers/profiles for r253992, r253993 and r254012 for -Ofast -march=haswell.
Comment 3 Richard Biener 2018-04-10 09:05:44 UTC
Actually r253993 was just the changelog part, r253975 was the actual change.

So I'm doing r254012 vs r254011 instead.
Comment 4 Richard Biener 2018-04-10 12:05:48 UTC
(In reply to Richard Biener from comment #3)
> Actually r253993 was just the changelog part, r253975 was the actual change.
> 
> So I'm doing r254012 vs r254011 instead.

Base = r254011, Peak = r254012

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
436.cactusADM   11950        185       64.6 *   11950        233       51.4 S
436.cactusADM   11950        185       64.6 S   11950        230       51.9 S
436.cactusADM   11950        185       64.6 S   11950        232       51.5 *

so confirmed.  Unsurprisingly:

 53.66%  cactusADM_peak.  cactusADM_peak.amd64-m64-gcc42-nn  [.] bench_staggeredleapfrog2_                  
 43.53%  cactusADM_base.  cactusADM_base.amd64-m64-gcc42-nn  [.] bench_staggeredleapfrog2_         

and the difference is that for the fast version we peel for alignment.  And
we're probably lucky in that we end up aligning all stores given they are

      do k = 2,nz-1
...
         do j=2,ny-1
            do i=2,nx-1
...
               ADM_kxx_stag(i,j,k) = ADM_kxx_stag_p(i,j,k)+
     &              dkdt_dkxxdt*dt
...
               ADM_kxy_stag(i,j,k) = ADM_kxy_stag_p(i,j,k)+
     &              dkdt_dkxydt*dt
...
               ADM_kxz_stag(i,j,k) = ADM_kxz_stag_p(i,j,k)+
     &              dkdt_dkxzdt*dt
...
               ADM_kyy_stag(i,j,k) = ADM_kyy_stag_p(i,j,k)+
     &              dkdt_dkyydt*dt
...
               ADM_kyz_stag(i,j,k) = ADM_kyz_stag_p(i,j,k)+
     &              dkdt_dkyzdt*dt
...
               ADM_kzz_stag(i,j,k) = ADM_kzz_stag_p(i,j,k)+
     &              dkdt_dkzzdt*dt
            end do
         end do
      end do

All arrays have the same shape (but that fact isn't exposed in the IL),
and assuming the same alignment of the incoming pointers we could have
"guessed" that we are actually aligning more than one store.

Both variants spill _a lot_ (but hopefully to aligned stack slots),
so we are most probably store-bound here (without actually verifying).
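
A minimal 1-D C sketch of what the alignment peel buys us here (hypothetical
code, showing only two of the six arrays and assuming the incoming pointers
really do share their base alignment): peeling scalar iterations until one
chosen store is 32-byte aligned makes the other stores aligned as well.

  #include <stdint.h>

  void kernel (double *restrict kxx, const double *restrict kxx_p,
               double *restrict kxy, const double *restrict kxy_p,
               double dt, long n)
  {
    long i = 0;

    /* Scalar prologue: peel until the chosen store is 32-byte aligned.  */
    for (; i < n && ((uintptr_t) (kxx + i) & 31) != 0; i++)
      {
        kxx[i] = kxx_p[i] + dt;
        kxy[i] = kxy_p[i] + dt;   /* same base alignment, so aligned too */
      }

    /* Main loop: the vectorizer can use aligned stores for both arrays.  */
    for (; i < n; i++)
      {
        kxx[i] = kxx_p[i] + dt;
        kxy[i] = kxy_p[i] + dt;
      }
  }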

Disabling peeling for alignment on r254011 results in

436.cactusADM   11950        207       57.7 *    

so that recovers only about half of the difference.  Disabling peeling for
alignment on r254012 changes nothing (as expected).  So there is more to
this than just peeling for alignment.
Comment 5 Richard Biener 2018-04-10 12:25:33 UTC
Created attachment 43896 [details]
r254011 with peeling disabled

The other differences look like RA/scheduling; in the end the stack frame
in the new rev. is 32 bytes larger (up from $4800 to $4832).  Disabling the
2nd scheduling pass doesn't have any nice effects, btw.

All the spills in the code certainly make for bad code, so I'm not sure
that trying to fix things by somehow re-introducing the peeling for
alignment makes the most sense...

Looking for an opportunity to distribute the loop might make more sense,
possibly "spilling" shared intermediate results to memory more explicitly
during distribution.  The source is quite unwieldy and the dependences are
not obvious here.
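
A hedged C sketch of that idea (not the actual Fortran, and showing only two
of the six updates): the shared intermediates are assumed to have been
computed into scratch arrays by an earlier loop of the distributed nest, so
each store array gets its own small loop with far fewer live values and
therefore far less spilling.

  /* Hypothetical distributed form of the innermost loop: the dkdt_*
     intermediates are assumed to have been computed into scratch arrays
     by an earlier loop (the "..." parts of the source); each ADM_k??_stag
     update then becomes its own small, easily vectorized loop.  */
  void update_stag (long nx, double dt,
                    double *restrict kxx, const double *restrict kxx_p,
                    double *restrict kxy, const double *restrict kxy_p,
                    const double *restrict dkdt_dkxxdt,  /* scratch */
                    const double *restrict dkdt_dkxydt)  /* scratch */
  {
    for (long i = 1; i < nx - 1; i++)
      kxx[i] = kxx_p[i] + dkdt_dkxxdt[i] * dt;

    for (long i = 1; i < nx - 1; i++)
      kxy[i] = kxy_p[i] + dkdt_dkxydt[i] * dt;
  }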
Comment 6 Richard Biener 2018-04-10 12:26:03 UTC
Created attachment 43897 [details]
r254012 with peeling disabled
Comment 7 Jakub Jelinek 2018-05-02 10:05:03 UTC
GCC 8.1 has been released.
Comment 8 Jakub Jelinek 2018-07-26 11:01:47 UTC
GCC 8.2 has been released.
Comment 9 Richard Biener 2018-12-20 11:11:47 UTC
What's the state on trunk?
Comment 10 Martin Jambor 2018-12-20 11:31:11 UTC
I won't have my own numbers until January, but according to
https://lnt.opensuse.org/db_default/v4/SPEC/spec_report/branch there
is a 7% regression at -Ofast with generic march/mtune on Zen.
Comment 11 Jakub Jelinek 2019-02-22 15:20:44 UTC
GCC 8.3 has been released.
Comment 12 Richard Biener 2020-01-17 09:17:42 UTC
Wonder if we can have an update on this?
Comment 13 Martin Jambor 2020-02-28 19:10:05 UTC
(In reply to Richard Biener from comment #12)
> Wonder if we can have an update on this?

TL;DR: there still seems to be a regression, but smaller and difficult to pin down.

The benchmark often goes up and down a bit:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=134.100.0&plot.1=37.100.0&plot.2=69.100.0&plot.3=248.100.0&plot.4=27.100.0&

Nevertheless, I recently got the following numbers at -Ofast and
generic march (run times, so lower is better):

| GCC 7.5      | 100.00% |
| GCC 8.3      | 102.07% |
| GCC 9.2      | 104.15% |
| trunk 17 Feb | 109.33% |
| trunk 24 Feb | 104.66% |

Zen2 based CPU
| GCC 7.5      | 100.00% |
| GCC 8.3      | 100.00% |
| GCC 9.2      | 110.06% |
| trunk 17 Feb | 111.32% |
| trunk 24 Feb | 105.66% |

OTOH, on Zen2, GCC 9 and trunk are 23% better at -Ofast with -march=native
than GCC 8 or 7 (but those do not know znver2 and I did not try forcing
256-bit vectors).
Comment 14 Jakub Jelinek 2020-03-04 09:31:50 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 15 Martin Jambor 2020-07-30 19:52:30 UTC
The problem is sometimes still there, sometimes it isn't:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=37.100.0&plot.1=27.100.0&

I wonder whether we should keep this bug open; the benchmark seems too erratic.
Comment 16 Jakub Jelinek 2021-05-14 09:50:00 UTC
GCC 8 branch is being closed.
Comment 17 Richard Biener 2021-06-01 08:10:26 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 18 Richard Biener 2022-05-27 09:38:27 UTC
GCC 9 branch is being closed.
Comment 19 Jakub Jelinek 2022-06-28 10:34:37 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 20 Richard Biener 2023-07-07 10:33:21 UTC
GCC 10 branch is being closed.