Load/Store Hoisting

August 19, 1998

Mark Mitchell Consulting as part of a continuing contract with the Accelerated Strategic Computing Initiative (ASCI) program at Los Alamos National Laboratory, has contributed code to hoist loads from and stores to memory out of loops. Because many programs spend most of their time inside loops, making loops go faster can result in dramatic improvements in the execution time of the programs.

To see the utility of this new optimization, consider the following function:

  void f(int* k, int j)
  {
    int i;

    for (i = 0; i < j; ++i)
      *k = *k + 2 * ((*k) - 1);
  }

Without the new optimization, the code generated for the loop on an x86 looked like:

.L5:
	movl (%ecx),%eax           # Load *k
	leal -2(%eax,%eax,2),%eax  # Compute *k + 2 * ((*k) - 1)
	movl %eax,(%ecx)           # Store *k
	incl %edx                  # Increment i
	cmpl %ebx,%edx             # Test i < j
	jl .L5                     # Loop if i < j

With the optimization, the code generated for the loop becomes:

	movl (%ebx),%eax           # Load *k
.L5:
	leal -2(%eax,%eax,2),%eax  # Compute *k + 2 * ((*k) - 1)
	incl %edx                  # Increment i
	cmpl %ecx,%edx             # Test i < j
	jl .L5                     # Loop if i < j
	movl %eax,(%ebx)           # Store *k

Note that in the second case the loads and stores occur outside the loop. In this particular case, we can expect the loop to run about 33% faster, since there are 4 instructions instead of 6. (Of course, the vagaries of today's deeply pipelined architectures can play havoc with some estimates, so instruction count is only a rough guess at execution time.)

Another example, of interest to the Fortran community, is code like:

   subroutine gemm(a, b, c, m, n, k)
   integer i,m,n,k,l,j
   dimension a(k,m),  b(n,k),  c(n,m)
   do i=1,m
     do j=1,n
       do l=1,k
	 c(j,i) = c(j,i) + a(l,i)*b(j,l)
       end do
     end do
   end do
   end

Here, the code generated for the inner loop will not load or store c(j,i) inside the loop. Instead, the initial value will be loaded into a register, and the sum will be accumulated there. After the loop is complete, the value will be stored back to memory.