This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
stepanov on sparc; sparc backend scheduling issues
- To: dje at watson dot ibm dot com (David Edelsohn)
- Subject: stepanov on sparc; sparc backend scheduling issues
- From: Joe Buck <jbuck at synopsys dot COM>
- Date: Fri, 25 May 2001 17:55:29 -0700 (PDT)
- Cc: jbuck at synopsys dot COM (Joe Buck), mark at codesourcery dot com (Mark Mitchell), jfm2 at club-internet dot fr, gcc at gcc dot gnu dot org
David Edelsohn wrote:
> I'd like to know why we're worse as well. Can you help me
> investigate this?
I've investigated the problem on the Sparc, and it's very interesting.
The issue seems to be in scheduling. For those who don't want to
plow through the whole thing, -mtune=ultrasparc or -mcpu=ultrasparc
makes the abstraction penalty disappear.
Here's what we get on the Sparc for 3.0-pre:
 test    absolute    additions    ratio with
number     time      per second      test0
   0      0.89sec      56.18M        1.00
   1      0.46sec     108.70M        0.52
   2      0.46sec     108.70M        0.52
   3      0.60sec      83.33M        0.67
   4      0.61sec      81.97M        0.69
   5      0.61sec      81.97M        0.69
   6      0.61sec      81.97M        0.69
   7      0.61sec      81.97M        0.69
   8      0.60sec      83.33M        0.67
   9      0.62sec      80.65M        0.70
  10      0.61sec      81.97M        0.69
  11      0.60sec      83.33M        0.67
  12      0.61sec      81.97M        0.69
mean:     0.60sec      83.37M        0.67
Total absolute time: 7.89 sec
Abstraction Penalty: 0.67
The first oddity to notice is that we get an "impossible" result: an
abstraction penalty below 1. This is because loop 0, a simple
Fortran-style loop, is misoptimized. This is filed as GNATS bug
target/859, and this bug is also in 2.95.2.
If this bug were fixed, we would get a report like
 test    absolute    additions    ratio with
number     time      per second      test0
   0      0.46sec     108.70M        1.00
   1      0.46sec     108.70M        1.00
   2      0.46sec     108.70M        1.00
   3      0.60sec      83.33M        1.30
   4      0.61sec      81.97M        1.32
   5      0.61sec      81.97M        1.32
   6      0.61sec      81.97M        1.32
   7      0.61sec      81.97M        1.32
   8      0.60sec      83.33M        1.30
   9      0.62sec      80.65M        1.34
  10      0.61sec      81.97M        1.32
  11      0.60sec      83.33M        1.30
  12      0.61sec      81.97M        1.32
mean:     0.60sec      83.37M        1.30
But on 2.95.2, all the loops from #1 to #12 take the same time.
So the regression shows up for loops #3 through #12.
All of tests #1 through #12 are expansions of
template <class Iterator, class T>
void test(Iterator first, Iterator last, T zero) {
    int i;
    start_timer();
    for (i = 0; i < iterations; ++i)
        check(double(accumulate(first, last, zero)));
    result_times[current_test++] = timer();
}
What really matters is the "accumulate" template, which is the inner loop:
struct {
    double operator()(const double& x, const double& y) { return x + y; }
    Double operator()(const Double& x, const Double& y) { return x + y; }
} plus;

template <class Iterator, class Number>
Number accumulate(Iterator first, Iterator last, Number result) {
    while (first != last) result = plus(result, *first++);
    return result;
}
Doing well on the Stepanov test means that a C++ compiler can crunch away
all this template abstraction and make as tight a loop as a C programmer
could hand-write.
For test #1, Iterator is a pointer to double and T is double.
For test #2, Iterator is a pointer to Double and T is Double,
where Double is a struct with a double member. This works just
as well as test #1 (for the original gcc-2.96-RH it did not).
For test #3, instead of using a pointer to double, the pointer is
replaced by an iterator class: an object with one data member,
a pointer to double. gcc 2.95.2 had no trouble with this, but
3.0 makes worse code.
Here's the inner loop of
double accumulate<double*, double>(double*, double*, double) :
.LL264:
        mov     %o4, %o0
        ldd     [%o0], %f2
        add     %o4, 8, %o4
        cmp     %o4, %o1
        bne     .LL264
        faddd   %f0, %f2, %f0
For
Double accumulate<Double*, Double>(Double*, Double*, Double):
it's the same except that %o3 is used instead of %o4.
I don't understand why we emit six instructions here instead of five,
but if I replace the first two with

        ldd     [%o4], %f2

the time stays the same, so the loop speed seems to be limited by
architectural constraints rather than by the extra mov.
For
double accumulate<double_pointer, double>(double_pointer, double_pointer, double)
we get
.LL282:
        ldd     [%o5], %f2
        add     %o5, 8, %o0
        faddd   %f0, %f2, %f0
        cmp     %o0, %o1
        bne     .LL282
        mov     %o0, %o5
This is interesting. We still have six instructions, but there is only
one instruction between the ldd and the use of %f2, and there's no gap
between setting %o5 and using it. Is this just a pipeline bubble issue?
I swapped the positions of the faddd and the mov, to get

        ldd     [%o5], %f2
        add     %o5, 8, %o0
        mov     %o0, %o5
        cmp     %o0, %o1
        bne     .LL282
        faddd   %f0, %f2, %f0
and immediately loop 3 runs as fast as loops 1 and 2.
So, at least in the case of the Sparc, it seems that we have an
instruction scheduling issue.
But then a thought hit me: what about the -m options?
Sure enough, if I turn on -mcpu=ultrasparc or -mtune=ultrasparc, the
problem disappears! We still have the bug in loop 0, but loops 1 through
12 all take the same time.
So, a question: since I configured the compiler on an Ultrasparc, why
doesn't it set -mtune=ultrasparc by default? What are we defaulting to?