This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

[Bug middle-end/41082] [4.5 Regression] FAIL: gfortran.fortran-torture/execute/where_2.f90 execution, -O3 -g with -m64

From: "irar at il dot ibm dot com" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: 10 Jan 2010 13:43:29 -0000
Subject: [Bug middle-end/41082] [4.5 Regression] FAIL: gfortran.fortran-torture/execute/where_2.f90 execution, -O3 -g with -m64
References: <bug-41082-12313@http.gcc.gnu.org/bugzilla/>
Reply-to: gcc-bugzilla at gcc dot gnu dot org


------- Comment #43 from irar at il dot ibm dot com  2010-01-10 13:43 -------
Since -O2 -ftree-vectorize doesn't cause bad code, it has to be some other
optimization on top of vectorized code that causes the problem.

Bad code is generated when the alignment of 'reduce' is forced and the
reduction 'sum(reduce)' is vectorized. However, the result of the reduction is
correct, and the vector store element does not do any damage (as far as I can
see in debugger). So, the vector stores don't corrupt anything.

The part that goes wrong is in the scalar code that implements the decision on
whether to add the (correctly computed) reduction value to temp[9] and
temp[10]. The code that sets the condition, (which, by the way, is not using
any vectorized code) is not using the values of  reduce[9] and reduce[10], even
though the value of the condition depends on them:

reduce(1:3) = -1
reduce(4:6) = 0
reduce(7:8) = 5
reduce(9:10) = 10
...
WHERE (reduce > 6) temp = temp + sum(reduce) 


Here is the code for adding the result of the "sum(reduce)" to temp[9]:

L29:
        lbz r11,152(r1)  # **
        cmpwi cr7,r11,0  # reduce > 6 ?
        beq cr7,L30
        lwz r11,240(r1)  # load temp[9]
        add r11,r11,r9   # temp[9] + sum(reduce) 
        stw r11,240(r1)  # store temp[9]

** - The calculation of 152(r1) is based only on the value of reduce[8]! The
values of reduce[9] and reduce[10] are only used in the reduction calculation
and not compared to 6 at all.

In case we don't vectorize (but force the alignment), there is cmpwi cr7,r29,6
instruction, where r29 is reduce[9] (and the code is correct). The same happens
when the alignment of reduce is not forced and the reduction is vectorized
using peeling. 

I.e., as far as I can see, in the bad code, the comparison of reduce[9] and
reduce[10] with 6 do not exist. I wonder which optimization can be responsible
for that?


Also, some values of reduce are copied to a temporal array and are further
compared with 6. In  the version with peeling the values that are copied are
reduce[4:8]: there is no need to keep the first three and the last two are kept
in registers and compared to 6 (and also used in reduction epilogue). While in
the bad version the kept values are reduce[3:8] and reduce[8] is put before the
values of reduce[3:7] (reduce[3:7] are in 276(r1) to 292(r1), and reduce[8] is
in 272(r1)). (And in the bad code the last two values reduce[9] and reduce[10]
are only used in reduction epilogue).


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41082

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]