Created attachment 37966 [details] ira dump A new s390x pattern for shift-and-xor does not yield a satisfying result with this code when compiled with "-O3 -funroll-loops": -- snip -- unsigned long hash(unsigned long l) { unsigned long v = 0; unsigned long i; for (i = 0; i < 8; i++) { v <<= 1; v ^= l; } return v; } -- snip -- => lgr %r1,%r2 lgr %r3,%r2 rxsbg %r1,%r2,0,62,1 # (shift r2 by one bit left and xor with r1) rxsbg %r3,%r1,0,62,1 lgr %r1,%r2 rxsbg %r1,%r3,0,62,1 lgr %r4,%r1 <----- unnecessary lgr %r1,%r2 rxsbg %r1,%r4,0,62,1 lgr %r5,%r1 <----- unnecessary lgr %r1,%r2 rxsbg %r1,%r5,0,62,1 lgr %r0,%r1 <----- unnecessary lgr %r1,%r2 rxsbg %r1,%r0,0,62,1 rxsbg %r2,%r1,0,62,1 br %r14 ("%r1,%r2,0,62,1" means "r1 := r1 ^ (r2 << 1)"; the ",0,62,1" part of the instruction effectively means "shift left by one".) (gets worse with more loop passes). The code got unrolled in tree: v_16 = l_4(D) << 1; v_17 = l_4(D) ^ v_16; v_21 = v_17 << 1; v_22 = l_4(D) ^ v_21; v_26 = v_22 << 1; v_27 = l_4(D) ^ v_26; v_31 = v_27 << 1; v_32 = l_4(D) ^ v_31; v_36 = v_32 << 1; v_37 = l_4(D) ^ v_36; v_41 = v_37 << 1; v_42 = l_4(D) ^ v_41; v_3 = v_42 << 1; v_5 = v_3 ^ l_4(D); return v_5; Register allocation insists on having the value of "l" in r1. As the result of the previous pass through the loop is in r1, it's necessary to move that value out of the way first. Later on, regrename fails to clean up this situation, probaby because the problem is too complex with many sequential overlapping register use chains. This: lgr %r1,%r2 rxsbg %r1,%r3,0,62,1 lgr %r4,%r1 lgr %r1,%r2 rxsbg %r1,%r4,0,62,1 lgr %r5,%r1 lgr %r1,%r2 rxsbg %r1,%r5,0,62,1 lgr %r0,%r1 lgr %r1,%r2 rxsbg %r1,%r0,0,62,1 could be rewritten to lgr %r1,%r2 rxsbg %r1,%r3,0,62,1 lgr %r3,%r2 rxsbg %r3,%r1,0,62,1 lgr %r1,%r2 rxsbg %r1,%r3,0,62,1 lgr %r3,%r2 rxsbg %r3,%r1,0,62,1 just using three registers. The question is whether this situation can be improved, either in the register allocator, or regrename, or in the pattern.
Created attachment 37967 [details] rnreg dump