For the following code:

    #define MATRIX_SIZE 512

    static double a[MATRIX_SIZE][MATRIX_SIZE];
    static double b[MATRIX_SIZE][MATRIX_SIZE];
    static double c[MATRIX_SIZE][MATRIX_SIZE];

    double foo (void)
    {
      double s;
      int i, j, k;

      /* Section A */
      for (i = 0; i < MATRIX_SIZE; i++) {
        for (j = 0; j < MATRIX_SIZE; j++) {
          a[i][j] = (double)i * (double)j;
          b[i][j] = (double)i / (double)(j+5);
        }
      }

      /* Section B */
      for (j = 0; j < MATRIX_SIZE; j++) {
        for (i = 0; i < MATRIX_SIZE; i++) {
          s = 0;
          for (k = 0; k < MATRIX_SIZE; k++) {
            s += a[i][k] * b[k][j];
          }
          c[i][j] = s;
        }
      }

      s = 0.0; // (1)

    #if 0
      /* Section C */
      for (i = 0; i < MATRIX_SIZE; i++) {
        for (j = 0; j < MATRIX_SIZE; j++) {
          s += c[i][j];
        }
      }
    #endif

      return s;
    }

GCC does not manage to eliminate the code up to (1) and retains the expensive Section A. Clang manages to eliminate much more and produces the following on aarch64:

    foo:                                    // @foo
    // %bb.0:                               // %entry
            orr     w8, wzr, #0x200
    .LBB0_1:                                // %vector.ph
                                            // =>This Inner Loop Header: Depth=1
            subs    x8, x8, #1              // =1
            b.ne    .LBB0_1
    // %bb.2:                               // %for.cond20.preheader.preheader
            fmov    d0, xzr
            ret

This happens at -O2 as well as at -O3, and on other targets too (it also occurs on x86).
Confirmed. In general, though, a prerequisite for eliminating Section A is fusing both loop nests, then CSEing the loads from a and b, and finally eliminating the stores as dead because a and b are then only written to. For the testcase we eliminate Section B because 'c' is discovered to be write-only. But we do not update the write-only flag of a or b when removing their uses (and IMHO we cannot easily do so). Note that we perform this "DSE" in fixup_cfg.
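
To make the fusion / CSE / dead-store sequence described above concrete, here is a hand-written sketch on a much smaller, one-dimensional case. It is only an illustration of the source-level effect, not what GCC or Clang actually do internally; the names N, buf, sum_before and sum_after are made up for this example and do not come from the testcase.

```c
#define N 8

/* Hypothetical buffer standing in for 'a' and 'b' in the testcase. */
static double buf[N];

/* Before: a producer loop fills buf, a separate consumer loop reads it. */
double sum_before (void)
{
  double s = 0.0;
  int i;

  for (i = 0; i < N; i++)
    buf[i] = (double)i * 2.0;   /* store */

  for (i = 0; i < N; i++)
    s += buf[i];                /* load */

  return s;
}

/* After, done by hand:
   1. fuse the two loops so the store and the load sit in one iteration,
   2. reuse the stored value instead of loading it back (the CSE step),
   3. drop the store, since buf would now only ever be written.  */
double sum_after (void)
{
  double s = 0.0;
  int i;

  for (i = 0; i < N; i++)
    {
      double t = (double)i * 2.0;  /* value that used to be stored */
      s += t;                      /* former load replaced by t */
    }

  return s;
}
```

Both functions add the same values in the same order, so they return the same result. In the testcase the situation is harder, because Section B reads a by rows and b by columns, so the loop nests do not line up for fusion as directly as in this sketch.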