target cost model tuning for x86
Jagasia, Harsha
harsha.jagasia@amd.com
Sun Sep 9 07:09:00 GMT 2007
Hi,
>> - The patch also changes the number of branches per prologue and
>> epilogue each to 1 instead of 2. Looking at some loops, the branches
>> can get optimized out,
>
>could you please show an example where these guards are optimized away?
>
>> especially when the alignment and iterations are known at compile
>> time.
>>
>
>so maybe there's room to reduce the number of estimated branches from 2
>to 1 only when the number of iterations is known at compile time, and
>not always?
>
At line 2092 in induct.f90 (built with -mtune=k8 -O3 -msse2
-ftree-vectorize),
denominator =
sqrt(dot_product(rot_c_vector-rot_q_vector,rot_c_vector-rot_q_vector))
The vectors used by the dot product are double-precision vectors of
length 3, 128-bit aligned, with both length and alignment known at
compile time. Such loops are repeated throughout the 2 major hotspots
of this benchmark.
The vector assembly generated for the loop is:
417498: movapd (%rdi),%xmm0 => vector_loop
41749c: subpd (%rsi),%xmm0
4174a9: mulpd %xmm0,%xmm0
4174ad: movsd %xmm0,%xmm3
4174b1: unpckhpd %xmm0,%xmm0
4174b5: addsd %xmm0,%xmm3
4174b9: movsd %xmm5,%xmm0 => epilogue
4174bd: subsd %xmm4,%xmm0
4174c1: mulsd %xmm0,%xmm0
4174c5: addsd %xmm3,%xmm0
4174c9: sqrtsd %xmm0,%xmm0
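For reference, a rough C rendering of what the assembly above computes might look like the sketch below. It assumes a vectorization factor of 2 (two doubles per 128-bit register). The function name dot_diff3 and its parameter names are made up for illustration and do not come from induct.f90. The point is that, with length and alignment known at compile time, one vector iteration and one scalar epilogue iteration run back to back with no guard branches in between.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from induct.f90: sum of squared
   differences of two length-3 vectors, split the way the assembly
   above splits it.  The caller applies sqrt to the result.  */
static double
dot_diff3 (const double *c, const double *q)
{
  assert (c != NULL && q != NULL);

  /* Vector loop body: one iteration covering elements 0 and 1
     (movapd/subpd/mulpd, then movsd/unpckhpd/addsd to reduce).  */
  double d0 = c[0] - q[0];
  double d1 = c[1] - q[1];
  double sum = d0 * d0 + d1 * d1;

  /* Epilogue: one scalar iteration for element 2
     (movsd/subsd/mulsd/addsd).  */
  double d2 = c[2] - q[2];
  sum += d2 * d2;

  return sum;
}
```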
I also tried a simple test case extracted from vect-31.c (loop at line
65) (built with -mtune=k8 -O2 -msse2 -ftree-vectorize),
struct t{
  int k[N];
  int l;
};
struct s{
  char a;      /* aligned */
  char b[N-1]; /* unaligned (offset 1B) */
  char c[N];   /* aligned (offset NB) */
  struct t d;  /* aligned (offset 2NB) */
  struct t e;  /* unaligned (offset 2N+4N+4 B) */
};
/* unaligned */
for (i = 0; i < N/2; i++)
  {
    tmp.e.k[i] = 8;
  }
The vector assembly generated for the loop is:
movdqa 463(%rip),%xmm0
mov %rsp,%rdx
movl $0x8,0xc4(%rsp)
movl $0x8,0xc8(%rsp)
movaps %xmm0,0xd0(%rsp)
movl $0x8,0xcc(%rsp)
movl $0x8,0x100(%rsp)
movaps %xmm0,0xe0(%rsp)
movaps %xmm0,0xf0(%rsp)
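The split visible in the stores above (scalar stores, then aligned vector stores, then a scalar store) can be reproduced with a small sketch. This is only an illustration of peeling for alignment in general, not GCC's code. The struct and function names are hypothetical, and it assumes the base address is aligned to a full vector and that byte_offset is a multiple of elem_size.

```c
#include <assert.h>

/* Hypothetical illustration; none of these names come from GCC.  */
struct peel_split
{
  int prologue;      /* scalar iterations before the vector loop */
  int vector_iters;  /* iterations of the vector loop */
  int epilogue;      /* scalar iterations after the vector loop */
};

/* Split NITERS stores of ELEM_SIZE bytes each, starting BYTE_OFFSET
   bytes past a vector-aligned base, for a vectorization factor VF.  */
static struct peel_split
split_iterations (unsigned long byte_offset, int elem_size, int vf,
                  int niters)
{
  assert (elem_size > 0 && vf > 0 && niters >= 0);

  struct peel_split s;
  /* Number of elements the first store sits past a vector boundary.  */
  int misaligned
    = (int) ((byte_offset / (unsigned long) elem_size)
             % (unsigned long) vf);

  /* Peel scalar iterations until the store address is aligned.  */
  s.prologue = (vf - misaligned) % vf;
  if (s.prologue > niters)
    s.prologue = niters;

  int remaining = niters - s.prologue;
  s.vector_iters = remaining / vf;
  s.epilogue = remaining % vf;
  return s;
}
```

With byte_offset = 0xc4, elem_size = 4, vf = 4 and niters = 16, this gives 3 prologue stores, 3 vector stores and 1 epilogue store. That matches the three movl, three movaps and one movl above, assuming %rsp is 16-byte aligned and the loop runs 16 iterations (the number of elements those stores cover).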
I thought I should also mention this run-time case in linpk.
In daxpy:
m = MOD(N,4)
IF ( m.NE.0 ) THEN
  DO i = 1 , m
    Dy(i) = Dy(i) + Da*Dx(i)
  ENDDO
  IF ( N.LT.4 ) RETURN
ENDIF
In this case, Dx and Dy are both double-precision vectors. Dx is aligned
and Dy is misaligned by an amount that is unknown at compile time.
(FWIW, it should be possible with bound analysis to know that this loop
can iterate <= 3 times and use that for the cost model analysis, but gcc
does not do that.)
I can post the disassembly if needed, but basically this loop looks
like:
  if (prologue_iters == 0)
    go to before_vector_loop
  else
    go to prologue
prologue:
  execute prologue
  if (prologue == num_iters)
    go to exit
before_vector_loop:
  if (num_iters-prologue-epilogue == 0)
    go to epilogue
  else
    go to vector_loop
vector_loop:
  execute vector loop
  if (num_iters-prologue-epilogue == num_iters-prologue)
    go to exit
  else
    go to epilogue
epilogue:
  execute epilogue
exit:
AFAIU here are the different cases that can occur:
----------------------------------------------------------------------
                           Prolog     Prolog     Epilogue   Epilogue
                           guard1     guard2     guard1     guard2
----------------------------------------------------------------------
Misaligned, 1 iteration:   not-taken  taken
Misaligned, 2 iterations:  not-taken  not-taken  taken
Misaligned, 3 iterations:  not-taken  not-taken  not-taken  taken
Aligned, 1 iteration:      taken      not-taken  taken
Aligned, 2 iterations:     taken      not-taken  not-taken  taken
Aligned, 3 iterations:     taken      not-taken  not-taken  not-taken
----------------------------------------------------------------------
Conservatively, I think it amounts to 1 taken and 1 not-taken branch in
the epilogue and prologue each. Please correct me if you think I am
generalizing this incorrectly.
If you think it's OK, then perhaps the guards should be qualified as
taken or not-taken. In that case, we can count 2 guards each for the
prologue and the epilogue in the run-time case, with 1 guard counted as
taken at a higher cost and the other counted as not-taken at a lower
cost. The not-taken guard still involves a compare of some sort, and
that should be counted.
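To make the proposal concrete, here is a minimal sketch of how the guard overhead could be accumulated for the run-time case. The struct, its field names and the function name are hypothetical and are not GCC's actual cost tables or interfaces. Any cost values plugged in are placeholders.

```c
#include <assert.h>

/* Hypothetical per-branch costs; the field names are assumptions
   made for this sketch, not GCC's cost model.  */
struct guard_costs
{
  int taken;      /* cost of a guard whose branch is taken */
  int not_taken;  /* cost of a guard that falls through (compare only) */
};

/* Run-time case: the prologue and the epilogue are each charged one
   taken guard and one not-taken guard.  */
static int
runtime_guard_overhead (struct guard_costs gc)
{
  assert (gc.not_taken >= 0 && gc.taken >= gc.not_taken);

  int per_section = gc.taken + gc.not_taken;
  return 2 * per_section;  /* prologue + epilogue */
}
```

For example, with a placeholder taken cost of 3 and not-taken cost of 1, the total guard overhead charged to the vectorized loop would be 8.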
Thanks,
Harsha