Autopar's(in trunk) Algorithms
Mostly about the data dependency algorithms in autopar.
Data dependency algorithms under polyhedral model vs. normal algorithms
Theoretically[1]:
Data dependency algorithms in trunk is mostly with I test and Omega test. Both of them are limited. Here is a comparison between them.
- Precise level
- I test (GCD test and Banerjee test): Only evaluates the ability to prove or disprove dependence between statements.
- Omega test: Can give a precise level (Tell which iteration of statements are in dependent), but less precise when there is one or more integer symbolic constant are present.
- Polyhedral dependence: Give the right precision level ( Will tell which iteration of statements are in dependent).
- Triangle loops
- I test and Omega test: Can not handle.
- Polyhedral dependence: Can handle.
- conditionals with boolean expressions of affine inequalities
- I test and Omega test: Can not handle.
- Polyhedral dependence: Can handle.
- Cost
- I test: It cost less and is always preferred because they capture most dependency under a low computational cost.
- Omega test: Worst exponential complexities, but works good in practice.
- Polyhedral dependence: High cost.
Reference: [1] Violated Dependence Analysis by Nicolas Vasilache, Albert Cohen, Cedric Bastoul, Sylvain Girbal.
Practically (Implement in Graphite):
/* FIXME: Need to be added. */
A General Test About Autopar pass in trunk
I test some loops to find out what kinds of simple loops can autopar handle. Because it is an experimental work, so if you find some mistakes, You can contact me(LiFeng) by nemokingdom at gmail dot com.
I test with GRAPHITE disabled with flags:
set args -O2 -fno-graphite-identity -fno-graphite -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-final_cleanup ../../mytest/parloop/par-1.c
I judge if autopar can/cannot handle with dumped file: par-1.c.128t.final_cleanup to see if there are libomp functions.
Loops that autopar can handle
1. One nested and simple loop
2. Nested loops with simple dependency
3. One nested loop with Not-very-simple dependency
Loops that autopar can not handle
1. Nested loop with not-very-simple dependency
This test fails at analyze_overlapping_iterations, yields scev_not_known.
(subscript_dependence_tester
(analyze_overlapping_iterations
(chrec_a = pretmp.13_2)
(chrec_b = {0, +, 1}_2)
(overlap_iterations_a = not known
)
(overlap_iterations_b = not known
)
)Actually we can find out if there are some dependency between Z[i][j] and Z[j+10][j+11] by some simple test:
0<=i,j,i1,j1<=10 i = j1 + 10 j = i1 + 11
From the equalities above, we can get:
10<=i<=10 11<=j<=10
Where we cannot get such an integer fit both. So there is no data dependency here.
2. Simple loop with if statement
3. Triangle loop
4. Some complicated loops
1 /*case 0: matrix multiplication*/
2 for (i = 0; i < 1000; i++){
3 for (j = 0; j < 1000; j++){
4 Z[i][j] = 0;
5 for (k = 0; k < 1000; k++){
6 Z[i][j]+=X[i][k]*Y[k][j];
7 }
8 }
9 }
10
11 /*
12 *case 1:Long chain data dependency
13 *A some sort of complicated one
14 */
15 for (i = 1; i < 100; i++)
16 for (j = 1; j < 100; j++){
17 X[i][j] = X[i][j] + Y[i-1][j];
18 Y[i][j] = Y[i][j] + X[i][j-1];
19 }
Puzzled results
Case 0 works while Case 1 not, but it seems that they are almost the same and both have no data dependency.
This bug has been fixed, see middle-end/PR39500
I test with this example but it fails to parrallelized, is this correct?:
Might be enhancement
The code below create thread twice, maybe we can do it only once:
Autopar's Performance
Here is a primary results about autopar's performance.
Attachment is the test result of run-time of loop-parallelization. Mainly include 4 figures.
- One nested loop, number of threads vs. run-time, with not all the code parallel.
- One nested loop, number of threads vs. run-time, almost all code got parallel.
- Two nested loops, with innermost parallel, number of threads vs. run-time. (It got worse after parallel.)
- Two nested loops, with outer loop parallel, number of threads vs. run-time. (It's better).
The 4 figures include comparing of:
- Runtime of openmp directive inserted c code.
- Runtime of Graphite autoparallel c code.
- Runtime of non-parallel c code.
autopar_performance_testresults.pdf