Automatic parallelization in GCC

Automatic parallelization converts sequential code into multi-threaded code. GCC automatically generates parallel (multi-threaded) code for specific loop constructs using the GOMP library (libgomp).

The first version of the code, which parallelizes innermost loops that carry no dependences, was contributed by Zdenek Dvorak and Sebastian Pop (integrated into GCC 4.3). The feature was later enhanced with support for reduction dependences and outer loops by Razya Ladelsky (GCC 4.3).

The number of threads is currently set by the user on the compile command line (for example, -ftree-parallelize-loops=4).
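As an illustration, here is a minimal sketch of the kind of loop the pass is aimed at. The function name scale_add and the file name are hypothetical, not taken from the GCC sources. Whether GCC actually parallelizes the loop depends on the GCC version and on the profitability checks.

```c
#include <stddef.h>

/* Hypothetical example (not from the GCC sources).
 *
 * Each iteration reads x[i] and y[i] and writes only y[i]; `restrict`
 * promises that x and y do not overlap, so no dependence is carried
 * from one iteration to the next.
 *
 * Possible compile command (4 threads requested):
 *   gcc -O2 -ftree-parallelize-loops=4 -c scale_add.c
 */
void scale_add(size_t n, double a,
               const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Adding -fdump-tree-parloops-details to the command should produce a dump file that reports what the pass decided for each loop. The option name is assumed from the pass name (parloops), so check it against the documentation of your GCC version.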

Two simple profitability conditions are checked:

Whether the loop is executed frequently enough, based on profile information.
Whether the number of iterations is large enough to justify creating new threads.

If a loop satisfies the correctness and profitability conditions, GIMPLE_OMP_PARALLEL and GIMPLE_OMP_FOR codes are added (plus OMP_ATOMIC for reduction support) and are later expanded by the OpenMP expansion machinery.
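A rough source-level analogue of this transformation, for a loop with a sum reduction, can be written with OpenMP pragmas. This is only a sketch: the pass works on GIMPLE and never emits source pragmas, and the function name sum_array and the thread count of 4 are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Hand-written analogue of what the pass generates internally.
 * "parallel" corresponds roughly to GIMPLE_OMP_PARALLEL, "for" to
 * GIMPLE_OMP_FOR, and the reduction clause to the reduction support.
 *
 * Compile with -fopenmp to honour the pragma; without it the pragma
 * is ignored and the loop runs serially with the same result.
 */
long sum_array(const int *a, size_t n)
{
    long sum = 0;

    #pragma omp parallel for reduction(+:sum) num_threads(4)
    for (size_t i = 0; i < n; i++)
        sum += a[i];

    return sum;
}
```

An integer reduction is used here because reordering the additions cannot change the result, which is not true in general for floating-point sums.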

SPEC2006 speedups with autopar

After the cost model was refined (http://gcc.gnu.org/ml/gcc-patches/2012-05/msg00881.html, GCC 4.8), the following speedups were obtained on a Power7 with 6 cores, 4-way SMT each, comparing trunk with -O3 plus autopar (parallelizing with 6 threads) against trunk with -O3 minus vectorization:

462.libquantum 2.5X
410.bwaves 3.3X
436.cactusADM 4.5X
459.GemsFDTD 1.27X
481.wrf 1.25X

Note: The speedup shown for libquantum was already obtained with previous versions of autopar; it did not require the cost model change.

Autopar integration with Graphite

With the integration of Graphite (http://gcc.gnu.org/wiki/Graphite) into GCC 4.4, a strong loop nest analysis and transformation engine was introduced, and the notion of using the polyhedral model to expose loop parallelism in GCC became feasible and relevant.

The first step, teaching Graphite that parallel code needs to be produced, was accomplished in GCC 4.4. Graphite recognizes simple parallel loops (using SCoP detection and data dependence analysis), annotates them, and passes that information all the way through CLooG to the current autopar code generator, which produces the parallel, GOMP-based code.

You can trigger it with two flags: -floop-parallelize-all -ftree-parallelize-loops=4. Both are needed: the first makes the Graphite pass mark the loops that can be parallelized, and the second triggers the code generation part. See also Automatic Parallelization in Graphite.
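A loop nest of the following shape is a plausible candidate for this path. It is a sketch only: the array size and the names fill_inputs and add_matrices are hypothetical, and whether Graphite actually marks the nest as parallel depends on the GCC version and on GCC having been built with Graphite support.

```c
#define N 128

/* Hypothetical kernel data: global arrays with constant bounds keep
 * the loop nests simple for SCoP detection. */
static double a[N][N], b[N][N], c[N][N];

/* Fill the inputs with values that are easy to check by hand. */
void fill_inputs(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
        }
}

/* Every (i, j) iteration writes a distinct element of c and reads only
 * a and b, so neither loop carries a dependence.
 *
 * Possible compile command:
 *   gcc -O2 -floop-parallelize-all -ftree-parallelize-loops=4 -c matadd.c
 */
void add_matrices(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = a[i][j] + b[i][j];
}
```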

Teaching Graphite to find loop transformations (such as skewing and interchange) that expose coarse-grained, synchronization-free parallelism, and handling parallelism that requires a small amount of synchronization, were part of the graphite-autopar integration plans/TODOs detailed in http://gcc.gnu.org/ml/gcc/2009-03/msg00239.html, but have not yet been pursued.