70236 – Register allocation and loop unrolling lead to waste of registers

Bug 70236 - Register allocation and loop unrolling lead to waste of registers

Summary: Register allocation and loop unrolling lead to waste of registers

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	6.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization, ra

Depends on:
Blocks:

Reported:	2016-03-15 10:40 UTC by Dominik Vogt
Modified:	2017-02-03 12:38 UTC (History)
CC List:	2 users (show)

See Also:	79057
Host:
Target:	s390x--
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
ira dump (2.48 KB, text/plain) 2016-03-15 10:40 UTC, Dominik Vogt	Details
rnreg dump (2.10 KB, text/plain) 2016-03-15 10:40 UTC, Dominik Vogt	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Dominik Vogt 2016-03-15 10:40:11 UTC

Created attachment 37966 [details]
ira dump

A new s390x pattern for shift-and-xor does not yield a satisfying result with this code when compiled with "-O3 -funroll-loops":

-- snip --
unsigned long hash(unsigned long l)
{
  unsigned long v = 0;
  unsigned long i;

  for (i = 0; i < 8; i++)
    {
      v <<= 1;
      v ^= l;
    }
  return v;
}
-- snip --

=>

  lgr     %r1,%r2
  lgr     %r3,%r2
  rxsbg   %r1,%r2,0,62,1   # (shift r2 by one bit left and xor with r1)
  rxsbg   %r3,%r1,0,62,1
  lgr     %r1,%r2
  rxsbg   %r1,%r3,0,62,1
  lgr     %r4,%r1          <----- unnecessary
  lgr     %r1,%r2
  rxsbg   %r1,%r4,0,62,1
  lgr     %r5,%r1          <----- unnecessary
  lgr     %r1,%r2
  rxsbg   %r1,%r5,0,62,1
  lgr     %r0,%r1          <----- unnecessary
  lgr     %r1,%r2
  rxsbg   %r1,%r0,0,62,1
  rxsbg   %r2,%r1,0,62,1
  br      %r14

("%r1,%r2,0,62,1" means "r1 := r1 ^ (r2 << 1)"; the ",0,62,1" part of the instruction effectively means "shift left by one".)

(gets worse with more loop passes).  The code got unrolled in tree:

  v_16 = l_4(D) << 1;
  v_17 = l_4(D) ^ v_16;
  v_21 = v_17 << 1;
  v_22 = l_4(D) ^ v_21;
  v_26 = v_22 << 1;
  v_27 = l_4(D) ^ v_26;
  v_31 = v_27 << 1;
  v_32 = l_4(D) ^ v_31;
  v_36 = v_32 << 1;
  v_37 = l_4(D) ^ v_36;
  v_41 = v_37 << 1;
  v_42 = l_4(D) ^ v_41;
  v_3 = v_42 << 1;
  v_5 = v_3 ^ l_4(D);
  return v_5;

Register allocation insists on having the value of "l" in r1.  As the result of the previous pass through the loop is in r1, it's necessary to move that value out of the way first.  Later on, regrename fails to clean up this situation, probaby because the problem is too complex with many sequential overlapping register use chains.

This:

  lgr     %r1,%r2
  rxsbg   %r1,%r3,0,62,1
  lgr     %r4,%r1
  lgr     %r1,%r2
  rxsbg   %r1,%r4,0,62,1
  lgr     %r5,%r1
  lgr     %r1,%r2
  rxsbg   %r1,%r5,0,62,1
  lgr     %r0,%r1
  lgr     %r1,%r2
  rxsbg   %r1,%r0,0,62,1

could be rewritten to

  lgr     %r1,%r2
  rxsbg   %r1,%r3,0,62,1
  lgr     %r3,%r2
  rxsbg   %r3,%r1,0,62,1
  lgr     %r1,%r2
  rxsbg   %r1,%r3,0,62,1
  lgr     %r3,%r2
  rxsbg   %r3,%r1,0,62,1

just using three registers.

The question is whether this situation can be improved, either in the register allocator, or regrename, or in the pattern.

Comment 1 Dominik Vogt 2016-03-15 10:40:54 UTC

Created attachment 37967 [details]
rnreg dump