Often in inner loops (no child functions called, big loop count), there will be several (many!) Altivec registers left unused. The attached file demonstrates the problem when built with tree-ssa with:

  %PREFIX/bin/g++ -Winline -mdynamic-no-pic -fno-exceptions -O3 -maltivec -fstrict-aliasing -finline-functions -finline-limit=1000000000 -falign-loops=16 --param large-function-growth=1000000 --param inline-unit-growth=1000 iterator_10.cpp -S -o /tmp/iterator_10.s

In my real-world code, this is a big problem since I have code that does:

  - long computation for some answer
  - compute some address to ADD the answer to (and the address is almost never in cache)
  - load from that address
  - start on next loop iteration
  - add answer and old answer and store

The problem is that the compiler totally defeats this approach: it immediately stores the loaded old answer to the stack, causing a stall waiting for the load to complete (and thus preventing any overlap between the load and the computation in the next loop iteration).
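For readers without the attachment, the loop shape described above can be sketched roughly as follows. This is only an illustration in scalar C++, not the code from the attachment: compute_answer, compute_index, and both accumulate functions are hypothetical names, and the "long computation" is a trivial stand-in.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the "long computation for some answer".
static inline float compute_answer(std::size_t i) {
    return static_cast<float>(i % 7 + 1);
}

// Hypothetical stand-in for "compute some address to add the answer to".
static inline std::size_t compute_index(std::size_t i, std::size_t n) {
    return (i * 5 + 3) % n;
}

// Straightforward version: load, add, store in the same iteration.
void accumulate_simple(std::vector<float>& table, std::size_t iterations) {
    assert(!table.empty());
    const std::size_t n = table.size();
    for (std::size_t i = 0; i < iterations; ++i) {
        table[compute_index(i, n)] += compute_answer(i);
    }
}

// Pipelined version: the load of the old value is issued at the end of one
// iteration and only consumed after the next iteration's computation, so a
// cache miss can overlap with that computation, provided the loaded value
// stays in a register instead of being stored to the stack right away.
void accumulate_pipelined(std::vector<float>& table, std::size_t iterations) {
    assert(!table.empty());
    if (iterations == 0) return;
    const std::size_t n = table.size();

    float answer = compute_answer(0);
    std::size_t idx = compute_index(0, n);
    float old_value = table[idx];               // load issued early

    for (std::size_t i = 1; i < iterations; ++i) {
        float next_answer = compute_answer(i);  // work that can hide the load
        std::size_t next_idx = compute_index(i, n);

        table[idx] = old_value + answer;        // finish the previous step

        answer = next_answer;
        idx = next_idx;
        old_value = table[idx];                 // issue the next load
    }
    table[idx] = old_value + answer;            // drain the last step
}
```

Both functions compute the same result; the second only reorders the work so that the load has a full iteration to complete. If the compiler spills old_value to the stack immediately after the load, that reordering buys nothing.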
Created attachment 5874 [details] Test input
The problem I see is that there is no store/load motion in the loop for C.* and state.* (note that they really are C->* and state->*, as they are references). This is caused by aliasing analysis not knowing that they cannot be the same object.
Here is a workaround:

  void foo(const Constants &C1, State &state1)
  {
    Constants C = C1;
    State state = state1;
    for (int i = 0; i < 100; i++) {
      state.step(C);
    }
    state1 = state;
  }
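A self-contained sketch of the same idea, for comparison with the by-reference form. The Constants and State definitions here are hypothetical placeholders (plain unsigned members rather than the vector types in the attachment), and the function names are made up for illustration. The two functions are only equivalent when C and state do not overlap in memory, which is exactly the fact the alias analysis fails to establish.

```cpp
#include <cassert>

// Hypothetical placeholder types; the real ones are in the attachment.
struct Constants {
    unsigned scale;
    unsigned bias;
};

struct State {
    unsigned a;
    unsigned b;
    // One update step that reads the constants and rewrites the state.
    void step(const Constants& c) {
        unsigned t = a * c.scale + c.bias;
        a = b;
        b = t;
    }
};

// By-reference form: every access goes through C and state, so without
// alias information the members may be reloaded and stored each iteration.
void run_by_reference(const Constants& C, State& state) {
    for (int i = 0; i < 100; i++) {
        state.step(C);
    }
}

// Workaround form: local copies cannot alias anything else, so they are
// candidates for being kept entirely in registers during the loop.
void run_with_local_copies(const Constants& C1, State& state1) {
    Constants C = C1;
    State state = state1;
    for (int i = 0; i < 100; i++) {
        state.step(C);
    }
    state1 = state;  // write the result back once, after the loop
}
```

Whether the locals actually end up in registers depends on the compiler version and on how the copies are implemented.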
This workaround doesn't entirely solve the problem, I think. The issue is that passing 'C' here invokes an implicit copy constructor. If the constructor isn't defined, memcpy is used (bad for inlining). If the constructor is defined, it must be defined to take 'const Constants &C' as the argument, and we're back in a similar boat as before. Attaching an updated example. This code is definitely better, but there should be zero loads/stores in the inner loop and there are still a few.
Created attachment 6048 [details] Updated test case for the (partial) workaround
The other issue is that the altivec builtins are not marked, so we think they can clobber what the pointers point to.
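To illustrate what such marking buys, here is a rough sketch using an ordinary function and the GCC `const` function attribute. This is an analogy only: it does not show how the altivec builtins are declared inside the compiler, and mix_plain, mix_const, and the sum functions are hypothetical names. In real code the two mix functions would live in another translation unit so that the compiler cannot see their bodies.

```cpp
#include <cassert>

// Unmarked: the compiler must assume a call may write to any memory,
// including whatever 'p' points to below.
unsigned mix_plain(unsigned x, unsigned y);

// GCC extension: 'const' promises that the result depends only on the
// arguments and that the call has no effect on memory.
__attribute__((const)) unsigned mix_const(unsigned x, unsigned y);

// With the unmarked callee, *p may have to be reloaded on every iteration.
unsigned sum_with_plain(const unsigned* p, int n) {
    unsigned total = 0;
    for (int i = 0; i < n; i++) {
        total += mix_plain(*p, static_cast<unsigned>(i));
    }
    return total;
}

// With the marked callee, the load of *p is a candidate for hoisting
// out of the loop.
unsigned sum_with_const(const unsigned* p, int n) {
    unsigned total = 0;
    for (int i = 0; i < n; i++) {
        total += mix_const(*p, static_cast<unsigned>(i));
    }
    return total;
}

// Definitions are included here only to keep the sketch self-contained.
unsigned mix_plain(unsigned x, unsigned y) { return x * 31u + y; }
unsigned mix_const(unsigned x, unsigned y) { return x * 31u + y; }
```

Both loops return the same value; the difference is only in what the optimizer is allowed to assume about memory across the call.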
The SRA rewrite for GCC 4.6 probably fixes the SRA part of this bug report (at last!). Can someone with a powerpc box have a look?