This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
I've noticed while tinkering with 3.4 and 4.1 that some code sequences turn out much better in 4.1. However, other code sequences turn out substantially worse in 4.1.
The most frustrating is the reduction in use of postmodify addressing modes. It looks like tree-ssa-loop-ivopts converts a loop like:
    for (i = 0; i < MAX; i++) {
        sum += a[i];
    }

into:

    for (ivtmp = 0; ivtmp < MAX*4; ivtmp += 4)
    {
        sum += *(a+ivtmp);
    }

which is fine, except that by the time we get to RTL, GCC 3.4 converts the load in the first loop form into a load with postincrement, whereas in 4.1 the RTL optimizers turn the address in the second form into an add of ivtmp and 4, an add of ivtmp and a, and a load.
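To make the comparison concrete, here is a minimal sketch of the loop forms as standalone C functions. The function names are hypothetical and not from the attached test. The byte-offset form only approximates what ivopts produces internally, and the pointer-walk form is the shape that most directly matches a post-increment load. Whether a given GCC version and target actually selects post-increment for any of them is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Form 1: indexed loop, as written in the source. */
int sum_indexed(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Form 2: approximation of the ivopts result. The induction variable
 * counts bytes and is added to the base address on every iteration. */
int sum_byte_offset(const int *a, size_t n)
{
    int sum = 0;
    const char *base = (const char *)a;
    for (size_t ivtmp = 0; ivtmp < n * sizeof(int); ivtmp += sizeof(int))
        sum += *(const int *)(base + ivtmp);
    return sum;
}

/* Form 3: pointer walk. Each iteration loads and then advances the
 * pointer, which is the pattern a post-increment load implements. */
int sum_pointer_walk(const int *a, size_t n)
{
    int sum = 0;
    const int *end = a + n;
    while (a != end)
        sum += *a++;
    return sum;
}

/* All three forms compute the same value; only the addressing differs. */
void check_forms_agree(const int *a, size_t n)
{
    assert(sum_indexed(a, n) == sum_byte_offset(a, n));
    assert(sum_indexed(a, n) == sum_pointer_walk(a, n));
}
```

Compiling each function with -O2 -S under both compilers and comparing the inner loops is one way to see which form keeps the post-increment load.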
Similarly, I can do mulsidi3 fast but muldi3 not so fast. If I have a code sequence like:
    #define SEQ(X) { \
        c1 = *coef; coef++; \
        c2 = *coef; coef++; \
        vLo = *(vb1+(X)); \
        vHi = *(vb1+(23-(X))); \
        sum1L = MAC(sum1L,vLo,c1); \
        sum2L = MAC(sum2L,vLo,c2); \
        sum1L = MAC(sum1L,vHi,-c2); \
        sum2L = MAC(sum2L,vHi,c1); \
        vLo = *(vb1+32+(X)); \
        vHi = *(vb1+32+(23-(X))); \
        sum1R = MAC(sum1R,vLo,c1); \
        sum2R = MAC(sum2R,vLo,c2); \
        sum1R = MAC(sum1R,vHi,-c2); \
        sum2R = MAC(sum2R,vHi,c1); \
    }

    foo(const int *coef, int *vb1, short *out) {
        int vLo, vHi, c1, c2;
        Word64 sum1L = 0, sum2L = 0;
        Word64 sum1R = 0, sum2R = 0;

        SEQ(0); SEQ(1); SEQ(2); SEQ(3);
        SEQ(4); SEQ(5); SEQ(6); SEQ(7);

        out[0] = sat64_16(sum1L+sum2L);
        out[1] = sat64_16(sum1R+sum2R);
    }
In GCC 3.4, the optimizer has no problem knowing that every multiply is a mulsidi3. In GCC 4.1, the tree optimizer decides that the sign extension to DI for c1, c2, vLo, and vHi should be done into a DImode temporary that is fed to the MAC patterns, and combine doesn't convert them. Indeed, if I don't have a define_insn_and_split for DImode it doesn't even have a chance, because the RTL expander has already converted the DImode multiply into various SImode instructions.
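For readers without the attachment, here is a hedged sketch of the kind of multiply-accumulate at issue. The real definitions of MAC, Word64, and sat64_16 are in simple.c and are not shown in this message, so the names and bodies below are assumptions for illustration only. The sketch shows that a 32x32->64 widening multiply and a full 64x64 multiply on pre-extended operands give the same result; what differs is which RTL pattern (mulsidi3 or muldi3) the compiler ends up matching.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the message's Word64 type. */
typedef int64_t word64_t;

/* Widening form: both 32-bit operands are sign-extended right at the
 * multiply, the shape a mulsidi3-style pattern expects to see. */
word64_t mac_widening(word64_t acc, int32_t a, int32_t b)
{
    return acc + (word64_t)a * (word64_t)b;
}

/* Pre-extended form: the operands are already 64-bit temporaries, so
 * the multiply looks like a generic 64x64 (muldi3-style) operation. */
word64_t mac_preextended(word64_t acc, word64_t a, word64_t b)
{
    return acc + a * b;
}

/* Hypothetical saturating narrow to 16 bits; the real sat64_16 may
 * also scale or round before clamping. */
int16_t saturate_to_16(word64_t x)
{
    if (x > INT16_MAX)
        return INT16_MAX;
    if (x < INT16_MIN)
        return INT16_MIN;
    return (int16_t)x;
}

/* Both forms agree whenever the operands fit in 32 bits. */
void check_mac_forms_agree(word64_t acc, int32_t a, int32_t b)
{
    assert(mac_widening(acc, a, b) == mac_preextended(acc, a, b));
}
```

Because the two forms are arithmetically identical for 32-bit inputs, any difference in the generated code comes from where the sign extension is placed before instruction selection, not from the source semantics.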
So... how do I coax GCC 4.1 into liking postmodify and mulsidi again? I've tried fiddling with rtx_costs for postmodify and multiply, and they should be accurate, but I get no love. What other things can I try to play with? Or is this sort of thing a known deficiency in 4.1 that I should try to work around?
I've attached a test for the latter case and the 3.4(.2) and 4.1(.1) assembly outputs for ARM, which exhibits this behavior. Note particularly the smull's and smlal's.
Attachments:
simple.c
simple-3.4.s
simple-4.1.s