target cost model tuning for x86

Sun Sep 9 07:09:00 GMT 2007

Hi,

>> - The patch also changes the number of branches per prologue and 
>> epilogue each to 1 instead of 2. Looking at some loops, the branches 
>> can get optimized out,
>
>could you please show an example where these guards are optimized away?
>
>> especially when the alignment and iterations are known at compile 
>> time.
>>
>
>so maybe there's room to reduce the number of estimated branches from 2
>to 1 only when the number of iterations is known at compile time, and
>not always?
>

At line 2092 in induct.f90 (built with -mtune=k8 -O3 -msse2
-ftree-vectorize), 

denominator =
sqrt(dot_product(rot_c_vector-rot_q_vector,rot_c_vector-rot_q_vector))

The vectors used by the dot product are double-precision, length-3
vectors that are 128-bit aligned; both the alignment and the length are
known at compile time. Such loops are repeated throughout the 2 major
hotspots of this benchmark.

The vector assembly generated for the loop is:
417498:       movapd (%rdi),%xmm0		=> vector_loop
41749c:       subpd  (%rsi),%xmm0
4174a9:       mulpd  %xmm0,%xmm0
4174ad:       movsd  %xmm0,%xmm3
4174b1:       unpckhpd %xmm0,%xmm0
4174b5:       addsd  %xmm0,%xmm3
4174b9:       movsd  %xmm5,%xmm0		=> epilogue
4174bd:       subsd  %xmm4,%xmm0
4174c1:       mulsd  %xmm0,%xmm0
4174c5:       addsd  %xmm3,%xmm0
4174c9:       sqrtsd %xmm0,%xmm0

I also tried a simple test case extracted from vect-31.c (loop at line
65) (built with -mtune=k8 -O2 -msse2 -ftree-vectorize), 

struct t{
  int k[N];
  int l;
};

struct s{
  char a;       /* aligned */
  char b[N-1];  /* unaligned (offset 1B) */
  char c[N];    /* aligned (offset NB) */
  struct t d;   /* aligned (offset 2NB) */
  struct t e;   /* unaligned (offset 2N+4N+4 B) */
};

/* unaligned */
for (i = 0; i < N/2; i++)
{
	tmp.e.k[i] = 8;
}

The vector assembly generated for the loop is:
movdqa 463(%rip),%xmm0
mov    %rsp,%rdx
movl   $0x8,0xc4(%rsp)		
movl   $0x8,0xc8(%rsp)		
movaps %xmm0,0xd0(%rsp)		
movl   $0x8,0xcc(%rsp)
movl   $0x8,0x100(%rsp)		
movaps %xmm0,0xe0(%rsp)
movaps %xmm0,0xf0(%rsp)

I thought I should also mention this run-time case in linpk.

In daxpy:
m = MOD(N,4)
IF ( m.NE.0 ) THEN
  DO i = 1 , m
    Dy(i) = Dy(i) + Da*Dx(i)
  ENDDO
  IF ( N.LT.4 ) RETURN
ENDIF

In this case, Dx and Dy are both double-precision vectors. Dx is aligned
and Dy is misaligned by an amount that is unknown at compile time.

(FWIW, it should be possible with bounds analysis to know that this loop
iterates at most 3 times and to use that in the cost model analysis, but
GCC does not do that.)

I can post the disassembly if needed, but basically this loop looks
like:

if (prologue_iters == 0)
  go to before_vector_loop
else
  go to prologue

prologue:
execute prologue
if (prologue_iters == num_iters)
  go to exit

before_vector_loop:
if (num_iters - prologue_iters - epilogue_iters == 0)
  go to epilogue
else
  go to vector loop

vector loop:
execute vector loop
if (num_iters - prologue_iters - epilogue_iters == num_iters - prologue_iters)
  go to exit		(i.e. epilogue_iters == 0)
else
  go to epilogue

epilogue:
execute epilogue

exit:

AFAIU here are the different cases that can occur:
----------------------------------------------------------------------
Case                      Prologue   Prologue   Epilogue   Epilogue
                          guard1     guard2     guard1     guard2
----------------------------------------------------------------------
Misaligned, 1 iteration   not-taken  taken
Misaligned, 2 iterations  not-taken  not-taken  taken
Misaligned, 3 iterations  not-taken  not-taken  not-taken  taken
Aligned, 1 iteration      taken      not-taken  taken
Aligned, 2 iterations     taken      not-taken  not-taken  taken
Aligned, 3 iterations     taken      not-taken  not-taken  not-taken
----------------------------------------------------------------------

Conservatively, I think it amounts to 1 taken and 1 not-taken branch in
the prologue and the epilogue each. Please correct me if you think I am
generalizing this incorrectly.

If you think it's OK, then perhaps the guards should be qualified as
taken and not-taken. In that case, we can count 2 guards each for the
prologue and the epilogue in the run-time case, but 1 guard will be
counted as taken with a higher cost and the other guard will be counted
as not-taken with a lower cost. The not-taken guard still involves a
compare of some sort and that should be counted.

Thanks,
Harsha