This is the mail archive of the
mailing list for the GCC project.
Re: Generic tuning in x86-tune.def 1/2
- From: Jan Hubicka <hubicka at ucw dot cz>
- To: "H.J. Lu" <hjl dot tools at gmail dot com>
- Cc: Jan Hubicka <hubicka at ucw dot cz>, GCC Patches <gcc-patches at gcc dot gnu dot org>
- Date: Fri, 27 Sep 2013 17:36:22 +0200
- Subject: Re: Generic tuning in x86-tune.def 1/2
- Authentication-results: sourceware.org; auth=none
- References: <20130927085640 dot GD21484 at kam dot mff dot cuni dot cz> <CAMe9rOqQG1Lyb0aNKe2ShZ-_0Ong0cejPj=JBLv=vNyJSbO46Q at mail dot gmail dot com>
> On Fri, Sep 27, 2013 at 1:56 AM, Jan Hubicka <email@example.com> wrote:
> > Hi,
> > this is the second part of the generic tuning changes sanitizing the tuning flags.
> > This patch again is supposed to deal with the "obvious" part only.
> > I will send a separate patch for more changes.
> > The flags changed agree on all CPUs considered for generic (and their
> > optimization manuals) + amdfam10, core2 and Atom SLM.
> > I also added X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL to bobcat tuning, since it
> > seems like an obvious omission (after double checking in the optimization manual) and
> > dropped X86_TUNE_FOUR_JUMP_LIMIT for bulldozer cores. Implementation of this
> > feature was always a bit weird and its main purpose was to avoid terrible branch
> > predictor degeneration on the older AMD branch predictors. I benchmarked both
> > spec2k and 2k6 to verify there are no regressions.
> > Especially X86_TUNE_REASSOC_FP_TO_PARALLEL seems to bring nice improvements in specfp
> > benchmarks.
> > Bootstrapped/regtested x86_64-linux, will wait for comments and commit it
> > during the weekend. I will be happy to revisit any of the generic tuning if
> > regressions pop up.
> > Overall this patch also brings small code size improvements for smaller
> > loads/stores and less padding at -O2. Differences are sub 0.1% however.
> > Honza
> > * x86-tune.def (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Enable for generic.
> > (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Likewise.
> > (X86_TUNE_FOUR_JUMP_LIMIT): Drop for generic and bulldozer.
> > (X86_TUNE_PAD_RETURNS): Drop for newer AMD chips.
> Can we drop generic on X86_TUNE_PAD_RETURNS?
It is on my list for not-so-obvious changes. I tested and removed it from
BDVER with the intention to drop it from generic. But after further testing I lean
towards keeping it for some extra time.
I tested it on fam10 machines and removing the padding causes regressions of over
10% on some benchmarks, including bzip and botan (where it is up to a 4-fold regression).
Missing a return on amdfam10 hardware is bad, because it causes the return stack to
go out of sync. At the same time I cannot really measure any benefit from
disabling it - the code size cost is very small and the runtime cost on
non-amdfam10 cores is not important either, since the function call overhead hides
the extra nop quite easily.
So I am inclined to apply extra care to this flag and keep it for an extra
release or two. Most of gcc.opensuse.org testing runs on these and adding
random branch mispredictions will trash them.
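To make the "drop it for BDVER but keep it for generic" distinction concrete, here is a minimal sketch of how a per-processor tuning flag can be expressed as a bitmask built from an X-macro list, in the spirit of x86-tune.def. All names here (PROC_*, TUNE_*, tune_enabled) are hypothetical and the masks are purely illustrative; they are not the actual entries or macros from the GCC sources.

```c
/* Hypothetical processor indices, for illustration only. */
enum processor { PROC_GENERIC, PROC_AMDFAM10, PROC_BDVER, PROC_CORE2, PROC_COUNT };

/* One bit per processor. */
#define M(p) (1u << (p))

/* X-macro list: flag name plus the mask of processors that enable it.
   The masks are assumptions chosen to mirror the discussion above:
   return padding stays on for generic and amdfam10, off for bdver.  */
#define TUNE_LIST \
  DEF_TUNE (TUNE_PAD_RETURNS, M (PROC_GENERIC) | M (PROC_AMDFAM10)) \
  DEF_TUNE (TUNE_FOUR_JUMP_LIMIT, M (PROC_AMDFAM10))

/* Expand the list once into an enum of flag indices...  */
enum tune_flag {
#define DEF_TUNE(name, mask) name,
  TUNE_LIST
#undef DEF_TUNE
  TUNE_COUNT
};

/* ...and once into the table of processor masks.  */
static const unsigned tune_masks[TUNE_COUNT] = {
#define DEF_TUNE(name, mask) (mask),
  TUNE_LIST
#undef DEF_TUNE
};

/* Return nonzero if flag F is enabled when tuning for processor P.  */
static int
tune_enabled (enum tune_flag f, enum processor p)
{
  return (tune_masks[f] & M (p)) != 0;
}
```

With these illustrative masks, tune_enabled (TUNE_PAD_RETURNS, PROC_BDVER) is 0 while tune_enabled (TUNE_PAD_RETURNS, PROC_GENERIC) is nonzero, so dropping a flag for one target is a one-bit change in a single list entry.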
On a related note, what would you think of X86_TUNE_PARTIAL_FLAG_REG_STALL?
I benchmarked it on my i5 notebook and it seems to have no measurable effects.
I also did some benchmarking of the patch to disable alignments you proposed.
Unfortunately I can measure slowdowns on fam10/bdver, and on botan and hand-written
loops even for core.
I am considering dropping the branch target/function alignment and keeping only loop
alignment, but I did not test this yet.