[PATCH 0/6] Parallelize Intra-Procedural Optimizations using the LTO Engine.

Tue Aug 25 07:03:32 GMT 2020

On Mon, Aug 24, 2020 at 8:39 PM Giuliano Belinassi via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi, Josh.
>
> On 08/24, Josh Triplett wrote:
> > On Sat, Aug 22, 2020 at 06:04:48PM -0300, Giuliano Belinassi wrote:
> > > Hi, Josh
> > >
> > > On 08/21, Josh Triplett wrote:
> > > > On Thu, Aug 20, 2020 at 07:00:13PM -0300, Giuliano Belinassi wrote:
> > > > > This patch series adds a new flag "-fparallel-jobs=" to control whether the
> > > > > compiler should try to compile the current file in parallel.
> > > > [...]
> > > > > Bootstrapped and Regtested on Linux x86_64.
> > > > >
> > > > > Giuliano Belinassi (6):
> > > > >   Modify gcc driver for parallel compilation
> > > > >   Implement a new partitioner for parallel compilation
> > > > >   Implement fork-based parallelism engine
> > > > >   Add `+' for Jobserver Integration
> > > > >   Add invoke documentation
> > > > >   New tests for parallel compilation feature
> > > >
> > > > Very nice!
> > >
> > > Thank you for your interest in this :)
> > >
> > > >
> > > > I'm interested in testing this on a highly parallel system. What
> > > > baseline do these patches apply to?  They don't seem to apply to GCC
> > > > trunk.
> > >
> > > Hummm, this was supposed to work on trunk out of the box. However,
> > > there is a high probability that I messed up something while rebasing.
> > > I will post a version 2 of it when I get more comments and when I fix
> > > the Makefile issue that Joseph pointed out in another e-mail.
> > >
> > > If you want to test it on a highly parallel system, I think it would
> > > be cool to see how it also behaves with --param=promote-statics=1, as
> > > it increases the opportunities for parallelism. :)
> >
> > I plan to try several variations, including that.
> >
> > I'd like to see how it affects the performance of Linux kernel builds.
>
> Well, I expect little to no impact on that.  I ran an experiment back
> in 2018 looking for parallelism bottlenecks in the kernel, and what I
> found was that the developers did a good job of balancing the file sizes.
>
> This was run on a machine with 4x AMD Opteron CPUs (64 cores in total):
> https://www.ime.usp.br/~belinass/64cores-kernel-experiment.svg
>
> As you can see from this image, the jobs end at almost the same time.
>
> >
> > > > Also, I tried to bootstrap the current tip of the devel/autopar_devel
> > > > branch, but ended up with compiler segfaults that all look like this:
> > > > ../../gcc/zlib/compress.c:86:1: internal compiler error: Segmentation fault
> > > >    86 | }
> > > >       | ^
> > >
> > > Well, there was once a bug in this branch when compiling with -flto that
> > > caused the assembler output file not to be initialized early enough,
> > > resulting in the LTO LGEN stage writing into an invalid FILE pointer.
> > > I fixed this during rebasing but I forgot to push to the autopar_devel
> > > branch. In any case, I just pushed the recent changes to autopar_devel
> > > which fix this issue.
> >
> > That might explain the problem; I had tried to build gcc with the
> > bootstrap-lto configuration.
> >
> > > In any case, -fparallel-jobs= should NOT be used together with -flto.
> > > Although I used part of the LTO engine for development of this feature,
> > > they are meant for distinct things. I guess I should give a warning
> > > about that in the next version :)
> >
> > Interesting. Is that something that could change in the future? I'd like
> > to be able to get some parallelism when creating the object files, and
> > then more parallelism when doing the final LTO link.
>
> Well, if by "final LTO link" you mean LTO's Whole Program Analysis,
> that is quite a challenging task to parallelize :)
>
> As for "creating object files", if you mean LTO LGEN, I think it is
> not possible for now because, as far as I understand, LTO object
> files are just containers for an intermediate language and do not
> support partial linking.

It was designed to allow partially linked LTO IR files, but IIRC support
for this may have bit-rotted.  But since most of the compile time
for LTO LGEN is spent in the frontends parsing the code (and that's
incredibly hard if not impossible to parallelize), splitting this task
is not going to bring much improvement.

> However, I would not expect LGEN to be the bottleneck when compiling
> any project. Most compilation time is spent in optimization, that is,
> IPA and Intra-Procedural.

Indeed.

Btw, you can "emulate" what -fparallel-jobs=N does via

  gcc -c t.c -o t.il.o -flto -fno-fat-lto-objects
  gcc -o t.o t.il.o -r -flinker-output=nolto-rel -flto=N

with the twist that the partitioning done by the LTO link step
might not be exactly the same as the one done by -fparallel-jobs
(surprisingly we needed a different partitioner).  What -fparallel-jobs
improves over the manual -flto way above is that it completely
elides LTO IR streaming but otherwise it operates in the same
manner.
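
For anyone who wants to script the two-step recipe, a minimal sketch
follows.  The helper names `run` and `lto_emulate` and the `DRY_RUN`
variable are made up for illustration; only the gcc options come from
the two commands above.  It assumes the source file name ends in `.c`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of GCC or of the patch series.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: lto_emulate SOURCE.c N
lto_emulate() {
    src=$1
    njobs=$2
    base=${src%.c}

    # Step 1: compile to an object holding only LTO IR (no machine code).
    run gcc -c "$src" -o "$base.il.o" -flto -fno-fat-lto-objects &&
    # Step 2: turn the IR into a regular relocatable object, letting the
    # LTO link step partition the unit and compile with up to N jobs.
    run gcc -o "$base.o" "$base.il.o" -r -flinker-output=nolto-rel -flto="$njobs"
}
```

Setting `DRY_RUN=1` before calling `lto_emulate t.c 8` only prints the
two gcc command lines, which is handy for checking them before running
a real build.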

There's a regression with -fparallel-jobs when you use -g which
we still need to address since with -fparallel-jobs you get
duplicate DWARF for most of the "early" source-level debug info.

I guess for the final report of the GSoC project it would be nice
to include the two-step -flto "parallelization" in the tables comparing
compile speed.  At least for gimple-match.o it provided a
reasonable speedup (wall-clock) as well.
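
A rough sketch of how such wall-clock numbers could be collected is
below.  The `wallclock` helper is hypothetical, it only has one-second
resolution, and it relies on `date +%s`, which is widely available but
not strictly POSIX.  The commented invocations are assumptions too:
`t.c` is a placeholder file, and -fparallel-jobs= only exists in a
compiler built with this patch series.

```shell
#!/bin/sh
# Hypothetical helper: report the wall-clock seconds a command takes.
wallclock() {
    start=$(date +%s)
    status=0
    "$@" >/dev/null 2>&1 || status=$?   # keep going even if the command fails
    end=$(date +%s)
    printf '%ss (exit %s): %s\n' "$((end - start))" "$status" "$*"
}

# Assumed example invocations, one per table row:
#   wallclock gcc -O2 -c t.c -o t.o
#   wallclock gcc -O2 -c t.c -o t.o -fparallel-jobs=8
#   wallclock sh -c 'gcc -O2 -c t.c -o t.il.o -flto -fno-fat-lto-objects &&
#       gcc -O2 -o t.o t.il.o -r -flinker-output=nolto-rel -flto=8'
```

Timing the two -flto steps as one `sh -c` command keeps that row
comparable with the single-command rows.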

> >
> > > Also, I just tested bootstrap with
> > >
> > > ../gcc/configure --disable-multilib --enable-languages=c,c++
> > >
> > > on x86_64 linux and it is working.
> >
> > I'd used --enable-multilib, and --enable-languages=c,c++,lto . Would
> > that be expected to work?
>
> Yes. If it doesn't, that is a bug :)
>
> >
> > Thanks,
> > Josh