[RFC] A memory gathering optimization for loop

JiangNing OS jiangning@os.amperecomputing.com
Sun Jan 17 01:23:55 GMT 2021


> -----Original Message-----
> From: Feng Xue OS <fxue@os.amperecomputing.com>
> Sent: Thursday, January 14, 2021 12:28 PM
> To: gcc@gcc.gnu.org
> Cc: JiangNing OS <jiangning@os.amperecomputing.com>; Hao Liu OS
> <hliu@os.amperecomputing.com>
> Subject: [RFC] A memory gathering optimization for loop
> 
> 1. Background
> 
> In a loop, it is optimal if only one memory stream is activated, that is, if
> all memory operations sequentially access one data region. But that is not
> always the case, for example when traversing a linked list or manipulating
> discrete arrays. In such scenarios the loop contains multiple scattered
> memory accesses, which can trigger inefficient multi-way hardware prefetching
> and thus lower the CPU cache hit rate. The tracker
> pr98598 (https://gcc.gnu.org/bugzilla//show_bug.cgi?id=98598) describes one
> related problem that we intend to target.
> 
> 2. Memory gathering optimization (MGO)
> 
> For nested loops, if the scattered memory accesses inside the inner loop
> remain unchanged across iterations of the outer loop, we can dynamically
> gather the loaded data into a sequential memory region when the first access
> in the inner loop takes place. This way hardware prefetching pressure is
> reduced and the CPU cache hit rate improves. We name this optimization MGO
> (memory gathering optimization). An example is given below; it is slightly
> different from pr98598 and cannot be optimized by loop distribution.
> 
>   struct A { int **p1; };
> 
>   int foo (int n, struct A *array)
>   {
>       int sum = 0;
> 
>       for (int i = 0; i < n; i++) {
>           for (int j = 0; j < i; j++) {  /* iteration count is i */
>               int **p1 = array[j].p1;
>               int  *p2 = *p1;
> 
>               sum += *p2;
>           }
>       }
> 
>       return sum;
>   }
> 
> Although this example may look rather artificial, this kind of data reuse in
> nested loops is often seen in real applications, especially those written in
> Fortran or using the C++ STL.
> 
> To simplify the explanation of this memory gathering optimization, pseudo
> code for the original nested loops is given as:
> 
>   outer-loop ( ) { /* data in inner loop are also invariant in outer loop.  */
>       ...
>       inner-loop (iter, iter_count) {  /* inner loop to apply MGO */
> 
>           Type1 v1 = LOAD_FN1 (iter);
>           Type2 v2 = LOAD_FN2 (v1);
>           Type3 v3 = LOAD_FN3 (v2);
>           ...
>           iter = NEXT_FN (iter);
>       }
> 
>      ...
>   }
> 
> "LOAD_FNx()" is a conceptual function that translates its argument to a result,
> and is composed of a sequence of operations in which there is only one
> memory load. It is capable of representing simple memory dereference like
> "iter->field", or more complicated expression like "array[iter * 2 + cst].field".
> 
> Likewise, "NEXT_FN()" is also a conceptual function which defines how an
> iterator is advanced. It is able to describe simple integer iterator, list iterator,
> or even pure-function-based iterator on advanced stuff like hashset/tree.
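> 
> Taking the earlier foo() example, the mapping onto these conceptual
> functions would roughly be (annotations only, shown for illustration):
> 
>   for (int j = 0; j < i; j++) {   /* iter is j; NEXT_FN (iter) is j + 1 */
>       int **p1 = array[j].p1;     /* v1 = LOAD_FN1 (iter): one load     */
>       int  *p2 = *p1;             /* v2 = LOAD_FN2 (v1):   one load     */
>       sum += *p2;                 /* v3 = LOAD_FN3 (v2):   one load     */
>   }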
> 
> Then the code will be transformed to:
> 
>   typedef struct cache_elem {
>       bool   init;
>       Type1  c_v1;
>       Type2  c_v2;
>       Type3  c_v3;
>   } cache_elem;
> 
>   cache_elem *cache_arr = calloc (iter_count, sizeof(cache_elem));
> 
>   outer-loop () {
>       ...
>       size_t cache_idx = 0;
> 
>       inner-loop (iter, iter_count) {
> 
>           if (!cache_arr[cache_idx].init) {
>               v1 = LOAD_FN1 (iter);
>               v2 = LOAD_FN2 (v1);
>               v3 = LOAD_FN3 (v2);
> 
>               cache_arr[cache_idx].init = true;
>               cache_arr[cache_idx].c_v1 = v1;
>               cache_arr[cache_idx].c_v2 = v2;
>               cache_arr[cache_idx].c_v3 = v3;
>           }
>           else {
>               v1 = cache_arr[cache_idx].c_v1;
>               v2 = cache_arr[cache_idx].c_v2;
>               v3 = cache_arr[cache_idx].c_v3;
>           }
>           ...
>           cache_idx++;
>           iter = NEXT_FN (iter);
>       }
>       ...
>   }
> 
>   free (cache_arr);
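> 
> As a concrete illustration, applying the transformation above by hand to the
> earlier foo() example might look like the sketch below. The bound "n" for the
> cache size and the concrete layout of cache_elem are assumptions derived from
> that example (stdlib.h and stdbool.h are assumed to be included):
> 
>   typedef struct cache_elem {
>       bool   init;
>       int  **c_v1;
>       int   *c_v2;
>       int    c_v3;
>   } cache_elem;
> 
>   int foo (int n, struct A *array)
>   {
>       int sum = 0;
>       /* The inner loop never iterates more than n times, so n bounds the
>          number of cache entries.  */
>       cache_elem *cache_arr = calloc (n, sizeof (cache_elem));
> 
>       for (int i = 0; i < n; i++) {
>           size_t cache_idx = 0;
> 
>           for (int j = 0; j < i; j++) {
>               int v3;
> 
>               if (!cache_arr[cache_idx].init) {
>                   int **p1 = array[j].p1;
>                   int  *p2 = *p1;
> 
>                   v3 = *p2;
>                   cache_arr[cache_idx].init = true;
>                   cache_arr[cache_idx].c_v1 = p1;
>                   cache_arr[cache_idx].c_v2 = p2;
>                   cache_arr[cache_idx].c_v3 = v3;
>               }
>               else {
>                   /* Only the final value is used by this loop body, so
>                      only c_v3 has to be re-read here.  */
>                   v3 = cache_arr[cache_idx].c_v3;
>               }
> 
>               sum += v3;
>               cache_idx++;
>           }
>       }
> 
>       free (cache_arr);
>       return sum;
>   }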

The example in PR#98598 preloads the data before the outer loop, while the
transformation in this RFC lazily caches the data inside the inner loop, which
avoids introducing unnecessary data read/write hazards.
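
Roughly, the difference in shape is as follows (a sketch only, not the exact
code from the PR; "fill_entry" is a stand-in for the loads that MGO caches):

  /* Eager preload (PR#98598 style): every cache entry is filled before
     the outer loop, so loads may be issued that the original program
     would never have executed.  */
  for (size_t k = 0; k < iter_count; k++)
      cache_arr[k] = fill_entry (k);

  /* Lazy caching (this RFC): an entry is filled only on its first use
     inside the inner loop, so no load happens earlier than it did in
     the original program.  */
  if (!cache_arr[cache_idx].init)
      cache_arr[cache_idx] = fill_entry (cache_idx);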

> 
> In this transformation the cache space is compiler-generated temporary
> storage, but it is allocated from the user heap. We find this trick is very
> uncommon in gcc, so we would like to know whether there are any special
> considerations or restrictions on it. Using alloca() to obtain space from the
> stack is an alternative, but we would have to be cautious not to incur a
> stack overflow from non-user logic.
> 
> "iter_count" stands for an upper bound for inner-loop iteration count, which
> must be an outer-loop invariant expression, in that it determines data count
> in cache space, and the space is allocated before outer-loop.
> 
> In the example below, we can simply get such bound for inner-loop 1 from
> its niter_desc, while for inner-loop 2, it is not that easy, but it is possible to
> infer that the loop's iteration count never exceeds "n" via comparison
> predicate information.
> 
>   foo (int n, int m)
>   {
>       for (int i = 0; i < n; i++ ) {      /* outer-loop 1 */
>           ...
>           for (int j = 0; j < m; j++) {   /* inner-loop 1 */
>               ...
>           }
>           ...
>       }
> 
>       for (int i = 0; i < n; i ++) {     /* outer-loop 2 */
>           ...
>           for (int j = 0; j < i; j++) {  /* inner-loop 2 */
>               ...
>           }
>           ...
>       }
>   }
> 
> 3. Cache count threshold
> 
> We expect the cache space to behave like a normal stack temporary and not to
> impact program execution too much, especially when the heap is heavily used
> by user code; at the very least it must not exhaust memory and break the
> program. For this, a maximum cache count threshold needs to be introduced.
> 
> Since memory accesses are gathered dynamically at runtime, there is a cost,
> which lies in cache data filling, status checks and space allocation. The
> runtime overhead of MGO could outweigh its performance benefit if the cached
> data count is too small, so we also need a minimum cache count threshold.
> 
> With a new runtime check against these two cache data count thresholds, the
> MGO transformation becomes:
> 
>   trans_ok = iter_count >= min_cache_data_count &&
>              iter_count <= max_cache_data_count;
> 
>   if (trans_ok) {
>       cache_arr = calloc (iter_count, sizeof(cache_elem));
>   }
> 
>   outer-loop () {
>       ...
>       if (trans_ok)  {
>           /* transformed inner-loop */
>       }
>       else {
>           /* original inner-loop */
>       }
>   }
> 
>   if (trans_ok) {
>       free (cache_arr);
>   }
> 
> 4. Runtime alias check
> 
> In baseline MGO, memory disambiguation for a candidate load is based on
> static alias analysis, which often gives a very conservative result.
> A more aggressive option is to use a runtime alias check, which can easily be
> integrated into the data-caching code. In the following pseudo code, a store
> outside the inner loop is aliased with the candidate load, and it prevents
> the load and its dependent loads from being optimized if we only apply the
> optimization based on static alias analysis.
> 
>   Type1 v0;
> 
>   outer-loop () {
> 
>       v0->field = value;  /* Aliased with v1->field */
> 
>       inner-loop (iter) {
> 
>           Type1 v1 = LOAD_FN1 (iter);
>           Type2 v2 = v1->field;
>           ...
>       }
>      ...
>   }
> 
> 
> When the runtime alias check is enabled, MGO will generate code to
> incrementally compute the address range of the candidate load, and to check
> at runtime whether any may-alias store overlaps that range. Once a conflict
> is detected, execution switches back to the original inner loop. With the
> runtime alias check added, the transformation becomes:
> 
>   Type1 v0;
>   std::pair<void *, void *> v1_addr_range;
> 
>   outer-loop () {
> 
>       v0->field = value;
>       ...
>       /* Check whether v0 is covered by range.  */
>       trans_ok &= !IN_RANGE (&v1_addr_range, v0);
> 
>       if (trans_ok) {
> 
>           inner-loop (iter) {
> 
>               if (!cache_arr[cache_idx].init) {
>                  Type1 v1 = LOAD_FN1 (iter);
>                  Type2 v2 = v1->field;
> 
>                  /* Extend range to cover v1. */
>                  UPDATE_RANGE (&v1_addr_range, v1);
>                  ...
>               }
>               ...
>           }
>       }
>       else {
>          /* original inner-loop */
>       }
>       ...
>   }
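> 
> One possible shape for the range bookkeeping, written in plain C rather than
> the std::pair notation above, is sketched below. The names addr_range,
> update_range and in_range are purely illustrative, not what an implementation
> would actually emit:
> 
>   #include <stdbool.h>
>   #include <stddef.h>
>   #include <stdint.h>
> 
>   typedef struct addr_range {
>       uintptr_t low;    /* lowest loaded address so far        */
>       uintptr_t high;   /* one past the highest address so far */
>   } addr_range;
> 
>   /* Extend the range to cover SIZE bytes at ADDR (UPDATE_RANGE above).  */
>   static inline void
>   update_range (addr_range *r, const void *addr, size_t size)
>   {
>       uintptr_t lo = (uintptr_t) addr;
> 
>       if (r->high == 0 || lo < r->low)
>           r->low = lo;
>       if (lo + size > r->high)
>           r->high = lo + size;
>   }
> 
>   /* Return true if SIZE bytes at ADDR overlap the range (IN_RANGE above).  */
>   static inline bool
>   in_range (const addr_range *r, const void *addr, size_t size)
>   {
>       uintptr_t lo = (uintptr_t) addr;
> 
>       return r->high != 0 && lo + size > r->low && lo < r->high;
>   }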
> 
> We are currently working on an implementation of this proposal. It will
> surely have limitations, so we would appreciate your comments on it. Thanks.
> 
> Best Regards,
> 
> Feng

