This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/83920] [nvptx] bad predicate reset
- From: "cesar at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 18 Jan 2018 19:10:25 +0000
- Subject: [Bug target/83920] [nvptx] bad predicate reset
- Auto-submitted: auto-generated
- References: <bug-83920-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83920
--- Comment #8 from cesar at gcc dot gnu.org ---
I tweaked your proposed fix as follows:
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 55c7e3cbf90..24625cd303f 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4104,8 +4104,11 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
mov.u32 %x,%tid.x;
setp.ne.u32 %rnotvzero,%x,0;
}
+ .reg.pred %rcond2; // Scratch copy of the original rcond.
+ mov.pred %rcond2, %rcond;
@%rnotvzero bra Lskip;
+ mov.pred %rcond, %rcond2;
setp.<op>.<type> %rcond,op1,op2;
Lskip:
selp.u32 %rcondu32,1,0,%rcond;
@@ -4126,8 +4129,11 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
There is nothing in the PTX spec to suggest that this is wrong, or
to explain why the extra initialization is needed. So, we classify
it as a JIT bug, and the extra initialization as workaround. */
- emit_insn_before (gen_movbi (pvar, const0_rtx),
- bb_first_real_insn (from));
+ rtx_insn *from_insn = bb_first_real_insn (from);
+ rtx ptmp = gen_reg_rtx (GET_MODE (pvar));
+ emit_insn_before (gen_rtx_SET (ptmp, pvar), from_insn);
+ emit_insn_before (gen_movbi (pvar, const0_rtx), from_insn);
+ emit_insn_before (gen_rtx_SET (pvar, ptmp), tail);
#endif
emit_insn_before (nvptx_gen_vcast (pvar), tail);
}
This generates the following assembly code for gemm.f90:
$L34:
$L11:
mov.pred %r413, %r314;
setp.eq.u32 %r314, 1, 0;
@%r402 bra $L33;
$L33:
mov.pred %r314, %r413;
selp.u32 %r414, 1, 0, %r314;
shfl.idx.b32 %r414, %r414, 0, 31;
setp.ne.u32 %r314, %r414, 0;
@!%r314 bra.uni $L22;
bra $L3;
$L12:
I'm not sure what's going on here: this patch causes illegal memory access
errors in lsdalton. Any thoughts?
Maybe a more involved workaround would be to leave %r314 alone and use the
scratch register %r413 as the predicate. But then, wouldn't that prevent the
PRE code-hoisting optimization that moved the computation of %r314 out of the
loop in the first place?
Is the original PTX JIT bug still present in current Nvidia drivers? You
mentioned that this problem first appeared in 381.22; I wonder whether it has
been resolved in 387.