This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH i386] Enable -freorder-blocks-and-partition


On 11/19/13 10:24, Teresa Johnson wrote:
On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
Martin,
can you, please, generate the updated systemtap with
-freorder-blocks-and-partition enabled?

I am in favour of enabling this - it is usefull pass and it is pointless ot
have passes that are not enabled by default.
Is there reason why this would not work on other ELF target? Is it working
with Darwin and Windows?

I don't know how to test these (I don't see any machines listed in the
gcc compile farm of those types). For Windows, I assume you mean
MinGW, which should be enabled as it is under i386. Should I disable
it there and for Darwin?


This patch enables -freorder-blocks-and-partition by default for x86
at -O2 and up. It is showing some modest gains in cpu2006 performance
with profile feedback and -O2 on an Intel Westmere system. Specifically,
I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk
(1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.

This actually sounds very good ;)

Lets see how the systemtap graphs goes.  If we will end up with problem
of too many accesses to cold section, I would suggest making cold section
subdivided into .unlikely and .unlikely.part (we could have better name)
with the second consisting only of unlikely parts of hot&normal functions.

This should reduce the problems we are seeing with mistakely identifying
code to be cold because of roundoff errors (and it probably makes sense
in general, too).
We will however need to update gold and ld for that.

Note that I don't think this would help much unless the linker is
changed to move the cold split section close to the hot section. There
is probably some fine-tuning we could do eventually in the linker
under -ffunction-sections without putting the split portions in a
separate section. I.e. clump the split parts together within unlikely.
But hopefully this can all be done later on as follow-on work to boost
the performance further.


Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal
bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were
configured with --enable-languages=all,obj-c++ and tested for both
32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}".

It would be good to enable this for additional targets as a follow on,
but it needs more testing for both correctness and performance on those
other targets (i.e for correctness because I see a number of places
in other config/*/*.c files that do some special handling under this
option for different targets or simply disable it, so I am not sure
how well-tested it is under different architectural constraints).

Ok for trunk?

Thanks,
Teresa

2013-11-19  Teresa Johnson  <tejohnson@google.com>

         * common/config/i386/i386-common.c: Enable
         -freorder-blocks-and-partition at -O2 and up for x86.
         * opts.c (finish_options): Only warn if -freorder-blocks-and-
         partition was set on command line.

You probably mis doc/invoke.texi update.
Thank you for working on this!

Yes, thanks. Here is the patch with the invoke.texi update.

Teresa


2013-11-19  Teresa Johnson  <tejohnson@google.com>

         * common/config/i386/i386-common.c: Enable
         -freorder-blocks-and-partition at -O2 and up for x86.
         * doc/invoke.texi: Update -freorder-blocks-and-partition default.
         * opts.c (finish_options): Only warn if -freorder-blocks-and-
         partition was set on command line.
This is fine.  Glad to see we're getting some good benefits from this code.

FWIW, I would expect the PA to show benefits from this work -- HP showed good results for block positioning and procedure splitting eons ago. A secondary effect you would expect to see would be an increase in the number of long branch stubs (static count), but a decrease in the number of those stubs that actually execute. Measuring those effects would obviously be out-of-scope.

I totally understand that the PA is irrelevant these days, but seeing good results on another architecture would go a long way to answering the question "should this be enabled by default across the board?"

jeff



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]