Presentation by Gautam Gupta and Sanjay Rajopadhye
Speedup mostly obtained by removing indirections. Loop transforms were applied by hand. Loop fission/distribution was also contributing. Code worked on was not production code, but code in development.
- Write program
- Load program
- Specify parameters for the optimizer.
Is there a way to drive which transforms that should be performed? Currently there is no such API. In the case of Graphite there could be plug-ins to try different heuristics.
Internally PPL was used and modified to work on integer space. Set difference had to be implemented, plus some other operations. Implementation done through wrappers around PPL. Affine functions were added to transform polyhedra e.g. to map i,j,k to i,j in matrix multiply. Simplification was implemented.
Less than 1 programmer currently working on this project.
There may be problems introducing a new language. It may be better to use subset of FORTRAN. Intermediate language could be the same.
Performance Modeling and Heuristics
Must describe memory hierarchy: caches, memory and how computers are linked together.
Schedule selection, two options:
- Create schedule from scratch.
- Start with original schedule and refine, using transforms.
Instruction caches must also be considered: must not duplicate too much code.
Latencies and bandwidth must be described memory hierarchy.
The cost of duplicating code (a black box) must be described.
Can add a parameterization to CLooG to specify costs.
Volume of computation can be used to make decisions about code generation.
Domains can be marked for generating highly optimized code. Weights on statements can help determine if a statement should be copied or not.
Analytically solve equations for deciding a schedule may not be possible, simpler heuristics must be applied some times for a hybrid approach.
Initially simpler techniques can be implemented to get something working.
Implement interchange and loop distribution with an analytical solution could be done. This is currently in development in GCC for deciding between inner and outer vectorization.
Loop structure in GCC can be extended with flags that indicate if a loop can be vectorized. This way Graphite can be used for vectorization.
Iterative cost model can be very quick, patterns (pre-defined combinations of transforms) can be stored to "skip" through the search space. Database can be developed iteratively. If integrated in GCC it can be decided when stage 3 begins.
Similar idea to PCP. Input from AST (Clast) or source code (C) as a language to a polyhedral framework.
Currently array sizes are not specified but the pragmas can be extended to include this information.
Possible to use the same representation as PCP.
Cloog can emit debug code for counting to check if the code is correct, this could perhaps be used in GCC for debugging.
Runtime testing is available to check if the output is correct.
What should be done in 4.4
- Eliminate reductions from scops (fix in 4.5)
- Packaging.. Gloog
What can be done in GCC 4.5?
- Make identity work (understand overhead)
- Get rid of lambda framework.
- Translate scalars to zero-dimensional arrays (e.g. for reductions)
- Dependence analysis
- Program Structure Tree (PST) based implementation of SCoP detection