[RFC] Old school parallelization of WPA streaming

Wed Aug 21 15:58:00 GMT 2013

Andi Kleen <ak@linux.intel.com> wrote:
>On Wed, Aug 21, 2013 at 04:17:48PM +0200, Jan Hubicka wrote:
>> Hi,
>> this is my attempt to bring GCC into wonderful era of multicore
>> CPUs :)
>> It is a hack, but it seems to help quite a lot.  About 50% of WPA
>> time is spent by streaming the individual ltrans .o files.  This can
>> be easily parallelized by fork - we do nothing afterwards, just exit
>> and pass the list to the linker.
>
>One risk is if someone streams to a spinning disk it may add more seeks
>for the parallel IO. But I think it's a reasonable tradeoff.

It'll also wreck all WPA dump files.

>We should also use a faster compressor

And we should avoid uncompressing the function sections...

That said, the patch is enough of a hack that I don't ever want to debug a bug in it....

I also fail to see why threads should not work here.  Maybe simply annotate gcc with openmp?

Richard.

>> For -flto=jobserver I simply fork all 32 processes.  It may not be a
>> disaster, but perhaps we should figure out how to communicate with
>> the jobserver.  At first glance at the document on how it works, it
>> seems easy to add.  Perhaps we can even convince GNU Make folks to
>> put simple helpers into libiberty?
>
>lto=jobserver is still broken and confuses tokens on large builds (ends
>with a 0 read). I did some debugging recently, and I suspect a Linux
>kernel bug now. Still haven't tracked it down.
>
>Any workarounds would need make changes, unfortunately.
>
>> 
>> We also may figure out the number of CPUs (is it available, e.g. from
>> libgomp?)
>
>sysconf(_SC_NPROCESSORS_ONLN) ? 
>
>> and use it by default even if the user does not care to pass the
>> number of processes.  Naturally these streaming forks should be cheap
>> memory-wise.  I hope Martin will get me some actual numbers.
>> 
>> With the patch the WPA time of firefox goes down to 2 minutes (4.8
>> needs about 30 minutes and without the hack one needs about 5
>> minutes)
>
>Cool!
>
>I'll try it on my builds
>>  
>> +fparallelism=
>> +LTO Joined
>> +Run the link-time optimizer in whole program analysis (WPA) mode.
>
>The description does not make sense
>
>Rest of patch looks good from a quick read, although I would prefer to 
>do the waiting for children in the "parent", not the "last one"
>
>-Andi
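
For reference, a minimal sketch of the scheme discussed in this thread. It
is not the code from the patch. It forks one child per partition, caps the
number of live children, falls back to sysconf(_SC_NPROCESSORS_ONLN) when no
count is given, and does all the waiting in the parent. The names
stream_partition, default_parallelism and stream_in_parallel are hypothetical,
and stream_partition only writes a placeholder file where the real code
would stream an ltrans object.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for streaming one ltrans partition: writes a
   small placeholder file named PREFIX.N.  Returns 0 on success.  */
static int
stream_partition (const char *prefix, int n)
{
  char name[256];
  snprintf (name, sizeof name, "%s.%d", prefix, n);
  FILE *f = fopen (name, "w");
  if (!f)
    return -1;
  fprintf (f, "partition %d\n", n);
  return fclose (f) == 0 ? 0 : -1;
}

/* Worker count to use: REQUESTED if positive, otherwise the number of
   online CPUs, or 1 if that cannot be determined.  */
static int
default_parallelism (int requested)
{
  if (requested > 0)
    return requested;
  long n = sysconf (_SC_NPROCESSORS_ONLN);
  return n > 0 ? (int) n : 1;
}

/* Wait for one child in the parent and update the counters.  */
static void
reap_one (int *running, int *failed)
{
  int status;
  pid_t pid;
  do
    pid = wait (&status);
  while (pid < 0 && errno == EINTR);
  if (pid < 0)
    {
      *running = 0;		/* No children left to wait for.  */
      return;
    }
  (*running)--;
  if (!WIFEXITED (status) || WEXITSTATUS (status) != 0)
    (*failed)++;
}

/* Stream NPARTS partitions with at most NJOBS children alive at once.
   Returns the number of failed children, or -1 if fork itself fails.  */
static int
stream_in_parallel (const char *prefix, int nparts, int njobs)
{
  int running = 0, failed = 0;
  assert (prefix != NULL && nparts >= 0);
  njobs = default_parallelism (njobs);

  for (int i = 0; i < nparts; i++)
    {
      if (running == njobs)
	reap_one (&running, &failed);
      fflush (NULL);		/* Do not duplicate buffered output.  */
      pid_t pid = fork ();
      if (pid < 0)
	{
	  while (running > 0)
	    reap_one (&running, &failed);
	  return -1;
	}
      if (pid == 0)
	/* Child: stream one partition and leave without running the
	   parent's atexit handlers.  */
	_exit (stream_partition (prefix, i) == 0 ? 0 : 1);
      running++;
    }
  while (running > 0)
    reap_one (&running, &failed);
  return failed;
}
```

A caller would use something like stream_in_parallel ("ltrans", 32, 0) and
hand the file list to the linker only when the result is 0. This version
does not talk to the make jobserver at all, so under -flto=jobserver it
would still oversubscribe in the way described above.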