
stepanov on sparc; sparc backend scheduling issues


David Edelsohn wrote:
> 	I'd like to know why we're worse as well.  Can you help me
> investigate this?

I've investigated the problem on the Sparc, and it's very interesting.
The issue seems to be in scheduling.  For those who don't want to
plow through the whole thing, -mtune=ultrasparc or -mcpu=ultrasparc
makes the abstraction penalty disappear.

Here's what we get on the Sparc for 3.0-pre:

test      absolute   additions      ratio with
number    time       per second     test0

 0        0.89sec    56.18M         1.00
 1        0.46sec    108.70M        0.52
 2        0.46sec    108.70M        0.52
 3        0.60sec    83.33M         0.67
 4        0.61sec    81.97M         0.69
 5        0.61sec    81.97M         0.69
 6        0.61sec    81.97M         0.69
 7        0.61sec    81.97M         0.69
 8        0.60sec    83.33M         0.67
 9        0.62sec    80.65M         0.70
10        0.61sec    81.97M         0.69
11        0.60sec    83.33M         0.67
12        0.61sec    81.97M         0.69
mean:     0.60sec    83.37M         0.67

Total absolute time: 7.89 sec

Abstraction Penalty: 0.67

The first oddity to notice is that we get an "impossible" result: an
abstraction penalty below 1.  This is because loop 0, a simple
Fortran-style loop, is mis-optimized.  That problem is filed as GNATS
bug target/859, and it is present in 2.95.2 as well.
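
For reference, test 0 is the benchmark's hand-written baseline: a plain
indexed loop over an array of doubles, with no templates involved.  It
looks roughly like this (a sketch, not the benchmark's exact source;
the harness calls are the same ones used by the template quoted further
down):

void test0(double* first, double* last) {
  int i, n;
  start_timer();
  for (i = 0; i < iterations; ++i) {
    double result = 0;
    // Fortran-style loop: index the array directly, no abstraction.
    for (n = 0; n < last - first; ++n)
      result += first[n];
    check(result);
  }
  result_times[current_test++] = timer();
}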

If this bug were fixed, we would get a report like this:

test      absolute   additions      ratio with
number    time       per second     test0

 0        0.46sec    108.70M        1.00
 1        0.46sec    108.70M        1.00
 2        0.46sec    108.70M        1.00
 3        0.60sec    83.33M         1.30
 4        0.61sec    81.97M         1.32
 5        0.61sec    81.97M         1.32
 6        0.61sec    81.97M         1.32
 7        0.61sec    81.97M         1.32
 8        0.60sec    83.33M         1.30
 9        0.62sec    80.65M         1.34
10        0.61sec    81.97M         1.32
11        0.60sec    83.33M         1.30
12        0.61sec    81.97M         1.32
mean:     0.60sec    83.37M         1.30

But on 2.95.2, all the loops from #1 to #12 take the same time.
So the regression shows up for loops #3 through #12.

All of tests #1 through #12 are expansions of

template <class Iterator, class T>
void test(Iterator first, Iterator last, T zero) {
  int i;
  start_timer();
  for(i = 0; i < iterations; ++i)
    check(double(accumulate(first, last, zero)));
  result_times[current_test++] = timer();
}

What really matters is the "accumulate" template, which is the inner loop:

struct {
  double operator()(const double& x, const double& y) {return x + y; }
  Double operator()(const Double& x, const Double& y) {return x + y; }
} plus;

template <class Iterator, class Number>
Number accumulate(Iterator first, Iterator last, Number result) {
  while (first != last) result = plus(result, *first++);
  return result;
}

Doing well on the Stepanov test means that a C++ compiler can crunch away
all this template abstraction and make as tight a loop as a C programmer
could hand-write.
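
In other words, every instantiation of accumulate should boil down to
the code you would get from a hand-written pointer loop like this one
(an illustrative sketch, not part of the benchmark):

double accumulate_by_hand(double* first, double* last, double result) {
  // What the template versions should compile down to once plus()
  // and the iterator operations are inlined away.
  while (first != last)
    result = result + *first++;
  return result;
}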

For test #1, Iterator is a pointer to double and T is double.

For test #2, Iterator is a pointer to Double and T is Double,
where Double is a struct with a double member.  This works just
as well as test #1 (for the original gcc-2.96-RH it did not).
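
Double is roughly the following (a sketch; the benchmark's exact
definition may differ in detail).  The conversion to double is what
lets the plus overload above return x + y, and lets the test template
apply double() to the result:

struct Double {
  double value;                           // the only data member
  Double() {}
  Double(const double& x) : value(x) {}
  operator double() const { return value; }
};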

For test #3, instead of using a pointer to double, the pointer is
replaced by an iterator class: an object whose only data member is a
pointer to double.  gcc 2.95.2 had no trouble with this, but 3.0
generates worse code.
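
The iterator class looks roughly like this (again a sketch, using the
double_pointer name that appears in the instantiation below; the
benchmark's real class may be a template over the element type):

struct double_pointer {
  double* current;                                // the only data member
  double_pointer(double* p) : current(p) {}
  double& operator*() const { return *current; }  // *first
  double_pointer operator++(int) {                // first++
    double_pointer tmp(*this);
    ++current;
    return tmp;
  }
};

inline bool operator!=(const double_pointer& x, const double_pointer& y) {
  return x.current != y.current;
}

Everything accumulate needs (!=, unary *, and post-increment) is a
trivial inline wrapper around the underlying double*, so in principle
the generated code should be identical to that of test #1.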

Here's the inner loop of

double accumulate<double*, double>(double*, double*, double):

.LL264:
	mov	%o4, %o0
	ldd	[%o0], %f2
	add	%o4, 8, %o4
	cmp	%o4, %o1
	bne	.LL264
	faddd	%f0, %f2, %f0

For
Double accumulate<Double*, Double>(Double*, Double*, Double):

it's the same except that %o3 is used instead of %o4.

I don't understand why we emit six instructions here instead of five,
but if I replace the first two with

	ldd	[%o4], %f2

the time stays the same, so the loop speed seems to be limited by
architecture constraints rather than by the extra instruction.

For
double accumulate<double_pointer, double>(double_pointer, double_pointer, double)

we get

.LL282:
	ldd	[%o5], %f2
	add	%o5, 8, %o0
	faddd	%f0, %f2, %f0
	cmp	%o0, %o1
	bne	.LL282
	mov	%o0, %o5

This is interesting.  We still have six instructions, but now there is
only one instruction between the ldd and the faddd that uses %f2, and
%o5 is set in the delay slot and then used immediately by the ldd at
the top of the next iteration.  Is this just a pipeline bubble issue?

I swapped the positions of the faddd and the mov, to get

	ldd	[%o5], %f2
	add	%o5, 8, %o0
	mov	%o0, %o5
	cmp	%o0, %o1
	bne	.LL282
	faddd	%f0, %f2, %f0

and immediately loop 3 runs as fast as loops 1 and 2.

So, at least in the case of the Sparc, it seems that we have an
instruction scheduling issue.

But then a thought hit me: what about the -m options?

Sure enough, if I turn on -mcpu=ultrasparc or -mtune=ultrasparc, the
problem disappears!  We still have the bug in loop 0, but loops 1 through
12 all take the same time.

So, a question: since I configured the compiler on an Ultrasparc, why
doesn't it set -mtune=ultrasparc by default?  What are we defaulting to?

