Autopar's(in trunk) Algorithms

Mostly about the data dependency algorithms in autopar.

autopar's algorithm

Data dependency algorithms under polyhedral model vs. normal algorithms

Theoretically[1]:

Data dependency algorithms in trunk is mostly with I test and Omega test. Both of them are limited. Here is a comparison between them.

Reference: [1] Violated Dependence Analysis by Nicolas Vasilache, Albert Cohen, Cedric Bastoul, Sylvain Girbal.

Practically (Implement in Graphite):

/* FIXME: Need to be added.  */

A General Test About Autopar pass in trunk

I test some loops to find out what kinds of simple loops can autopar handle. Because it is an experimental work, so if you find some mistakes, You can contact me(LiFeng) by nemokingdom at gmail dot com.

I test with GRAPHITE disabled with flags:

set args -O2 -fno-graphite-identity -fno-graphite -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-final_cleanup ../../mytest/parloop/par-1.c

I judge if autopar can/cannot handle with dumped file: par-1.c.128t.final_cleanup to see if there are libomp functions.

Loops that autopar can handle

1. One nested and simple loop

   1   for (i = 0; i < 1000; i++){
   2     x[i] = i + 3;

2. Nested loops with simple dependency

   1 /*case 0*/
   2   for (i = 0; i < 1000; i++){
   3     for (j = 0; j < 1000; j++){
   4     x[i][j] = i + j + 3;
   5     }
   6   }
   7 
   8 /*case 1*/
   9 
  10   for (i = 0; i < 100; i++)
  11     for (j = 0; j < 100; j++)
  12       X[i][j] = X[i][j] + Y[i-1][j];

3. One nested loop with Not-very-simple dependency

   1   for (i = 0; i < 10; i++)
   2     X[2*i+1] = X[2*i];

Loops that autopar can not handle

1. Nested loop with not-very-simple dependency

   1   for (i = 0; i <= 10; i++)
   2     for (j = 0; j <=10; j++)
   3       Z[i][j] = Z[j+10][i+11];

This test fails at analyze_overlapping_iterations, yields scev_not_known.

(subscript_dependence_tester 
(analyze_overlapping_iterations 
  (chrec_a = pretmp.13_2)
  (chrec_b = {0, +, 1}_2)
  (overlap_iterations_a = not known
)
  (overlap_iterations_b = not known
)
)

Actually we can find out if there are some dependency between Z[i][j] and Z[j+10][j+11] by some simple test:

0<=i,j,i1,j1<=10
i = j1 + 10
j = i1 + 11

From the equalities above, we can get:

10<=i<=10
11<=j<=10

Where we cannot get such an integer fit both. So there is no data dependency here.

2. Simple loop with if statement

   1     for (j = 0; j <=10; j++)
   2       if(j > 5) X[i] = i + 3;

3. Triangle loop

   1     for (i = 0; i < 100; i++)
   2       for (j = i; j < 100; j++)
   3         X[i][j] = 5;

4. Some complicated loops

   1 /*case 0: matrix multiplication*/
   2   for (i = 0; i < 1000; i++){
   3     for (j = 0; j < 1000; j++){
   4       Z[i][j] = 0;
   5       for (k = 0; k < 1000; k++){
   6         Z[i][j]+=X[i][k]*Y[k][j];
   7       }
   8     }
   9   }
  10 
  11 /*
  12  *case 1:Long chain data dependency
  13  *A some sort of complicated one
  14 */
  15   for (i = 1; i < 100; i++)
  16     for (j = 1; j < 100; j++){
  17       X[i][j] = X[i][j] + Y[i-1][j];
  18       Y[i][j] = Y[i][j] + X[i][j-1];
  19     }

Puzzled results

Case 0 works while Case 1 not, but it seems that they are almost the same and both have no data dependency.

This bug has been fixed, see middle-end/PR39500

   1 /*case 0: success*/
   2   for (i = 1; i < 100; i++)
   3     X[i] = X[i+100];
   4 
   5 /*case 1: failed at analyze_subscript_affine_affine 
   6   when analyze_siv_subscript*/
   7   for (i = 0; i < 100; i++)
   8     X[i] = X[i+100];

I test with this example but it fails to parrallelized, is this correct?:

   1     for (i = 0; i <= 10; i++)
   2         sum+=X[i];
   3     printf("%d\n",sum);

Might be enhancement

The code below create thread twice, maybe we can do it only once:

   1   for (i = 0; i < 1000; i++){
   2     x[i] = i + 3;
   3   }
   4   for(i = 0; i < 1000; i++){
   5     y[i] = i - 3;
   6   }
   7   __builtin_GOMP_parallel_start (main._loopfn.1, &.paral_data_store.40, 4);
   8   __builtin_GOMP_parallel_start (main._loopfn.0, &.paral_data_store.31, 4);

Autopar's Performance

Here is a primary results about autopar's performance.

Attachment is the test result of run-time of loop-parallelization. Mainly include 4 figures.

  1. One nested loop, number of threads vs. run-time, with not all the code parallel.
  2. One nested loop, number of threads vs. run-time, almost all code got parallel.
  3. Two nested loops, with innermost parallel, number of threads vs. run-time. (It got worse after parallel.)
  4. Two nested loops, with outer loop parallel, number of threads vs. run-time. (It's better).

The 4 figures include comparing of:

autopar_performance_testresults.pdf


CategoryCategory

None: AutoparRelated (last edited 2009-05-26 03:03:43 by 121)