This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [RFC] Old school parallelization of WPA streaming
- From: Andi Kleen <ak at linux dot intel dot com>
- To: Jan Hubicka <hubicka at ucw dot cz>
- Cc: gcc-patches at gcc dot gnu dot org, rguenther at suse dot de, dnovillo at google dot com, dmalcolm at redhat dot com
- Date: Wed, 21 Aug 2013 07:58:53 -0700
- Subject: Re: [RFC] Old school parallelization of WPA streaming
- References: <20130821141747 dot GD24782 at kam dot mff dot cuni dot cz>
On Wed, Aug 21, 2013 at 04:17:48PM +0200, Jan Hubicka wrote:
> Hi,
> this is my attempt to bring GCC into the wonderful era of multicore CPUs :)
> It is a hack, but it seems to help quite a lot. About 50% of WPA time is spent
> streaming the individual ltrans .o files. This can easily be parallelized
> with fork: we do nothing afterwards, just exit and pass the list to the linker.
One risk is that if someone streams to a spinning disk, the parallel I/O may
add more seeks. But I think it's a reasonable tradeoff.
We should also use a faster compressor.
> For -flto=jobserver I simply fork all 32 processes. It may not be a disaster,
> but perhaps we should figure out how to communicate with the jobserver. At first
> glance at the document describing how it works, it seems easy to add. Perhaps we
> can even convince the GNU Make folks to put simple helpers into libiberty?
-flto=jobserver is still broken and gets its tokens confused on large builds
(it ends with a read returning 0). I did some debugging recently, and I now
suspect a Linux kernel bug, but I still haven't tracked it down.
Any workaround would unfortunately need changes to make.
>
> We could also determine the number of CPUs (is it available, e.g., from libgomp?)
sysconf(_SC_NPROCESSORS_ONLN)?
> and use it by default even if the user does not pass the number of processes.
> Naturally these streaming forks should be cheap memory-wise. I hope Martin
> will get me some actual numbers.
>
> With the patch, the WPA time of Firefox goes down to 2 minutes (GCC 4.8 needs
> about 30 minutes, and without the hack one needs about 5 minutes).
Cool!
I'll try it on my builds.
>
> +fparallelism=
> +LTO Joined
> +Run the link-time optimizer in whole program analysis (WPA) mode.
The description does not make sense for this option.
The rest of the patch looks good from a quick read, although I would prefer to
do the waiting for children in the "parent", not in the "last one".
-Andi
--
ak@linux.intel.com -- Speaking for myself only