[Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse

tschwinge at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Feb 8 11:57:27 GMT 2022


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #7 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
All three of your patches combined still don't help resolve the problem.
And, what I realized: they don't even change the Nvidia/CUDA Driver's reported
"used [...] registers".
Does that mean that the Driver "doesn't care" that we feed it simple PTX
code, using fewer PTX registers -- it seems able to do these optimizations all
by itself?  :-O
(At least regarding the number of registers used -- I didn't verify the SASS
code generated.)
(Emitting cleaner, "de-cluttered" code in GCC/nvptx is still very valuable,
of course: if only for our own visual consumption!  ... at least as long as it
doesn't make 'nvptx.md' etc. much more complex...)

For "good" vs. "bad"/"not-so-good" (before vs. after "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1"), the only code generation difference is in
the '__muldc3' function ('nvptx-none/libgcc/_muldc3.o'), and that one is the
only '.extern' dependency aside from the '__reduction_lock' global variable
('nvptx-none/libgcc/reduction.o').
(In the following, working around <https://gcc.gnu.org/PR104416> "'lto-wrapper'
invoking 'mkoffload's with duplicated command-line options".)

This means I can conveniently create a minimal nvptx 'libgcc.a' by hand:

    $ cp build-gcc-offload-nvptx-none/nvptx-none/libgcc/_muldc3.o ./
    $ rm -f libgcc.a && ar q libgcc.a _muldc3.o \
        build-gcc-offload-nvptx-none/nvptx-none/libgcc/reduction.o

..., and compile 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c' with
'-foffload=nvptx-none=-Wl,-L.'.  (Via 'GOMP_DEBUG=1', I verified that identical
PTX code is loaded onto the GPU.)
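
(For reference, a sketch of these two steps; the '-fopenacc' and '-O2' options
are my assumptions here, not necessarily the testsuite's exact flags:)

    # Link against the minimal 'libgcc.a' in the current directory:
    $ gcc -fopenacc -O2 reduction-cplx-dbl.c -foffload=nvptx-none=-Wl,-L.
    # Dump the PTX code actually loaded onto the GPU, for comparison:
    $ GOMP_DEBUG=1 ./a.out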

Then, hand-modify '_muldc3.o', re-create 'libgcc.a', re-compile, re-execute.
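
(That is, roughly the following loop; I'm assuming here that the nvptx object
files are plain PTX assembly text, and thus directly editable:)

    $ $EDITOR _muldc3.o        # hand-edit the PTX
    $ rm -f libgcc.a && ar q libgcc.a _muldc3.o \
        build-gcc-offload-nvptx-none/nvptx-none/libgcc/reduction.o
    $ gcc -fopenacc -O2 reduction-cplx-dbl.c -foffload=nvptx-none=-Wl,-L.
    $ GOMP_DEBUG=1 ./a.out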

I verified that from before "nvptx: Transition nvptx backend to
STORE_FLAG_VALUE = 1", '__muldc3' (attached '_muldc3-good.o') works fine, and
that from after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1",
'__muldc3' (attached '_muldc3-bad.o') does show the problem originally reported
here.

I then gradually morphed the former into the latter (beginning by eliding
simple changes like renumbered registers, etc.), until only one last change was
necessary to turn "good" into "bad" (attached '_muldc3-WIP.o'); the following
shows the "still-good" state:

    [...]
    @@ -1716,8 +1718,16 @@
     cvt.rn.f64.s32 %r84,%r27;
     copysign.f64 %r61,%r61,%r84;
     .loc 2 1981 32
    +// Current/after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
    +/*
     selp.u32 %r86,1,0,%r138;
     mov.u32 %r85,%r86;
    +*/
    +// Before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
    +set.u32.leu.f64 %r86,%r57,0d7fefffffffffffff;
    +neg.s32 %r87,%r86;
    +mov.u32 %r85,%r87;
    +//
     cvt.u16.u8 %r140,%r85;
     mov.u16 %r89,%r140;
     xor.b16 %r88,%r89,1;
    @@ -1770,8 +1780,16 @@
    [...]

That is, things go "bad" if we here re-use the '%r138' that was computed (and
used) earlier in the code, and things are "good" if we re-compute the value
locally.  Same for '%r139' in the second code block.
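
(For clarity, my reading of the two variants, with comments added; '%r138' is
presumably a '.pred' register holding the result of the same 'leu' comparison,
computed much earlier:)

    // After the patch: re-use the comparison result computed earlier;
    // '%r138' has to stay live all the way down to this point:
    selp.u32 %r86,1,0,%r138;    // %r86 = %r138 ? 1 : 0
    mov.u32 %r85,%r86;

    // Before the patch: re-compute the comparison locally; 'set' yields
    // 0xffffffff for true, 'neg' then normalizes -1 to +1:
    set.u32.leu.f64 %r86,%r57,0d7fefffffffffffff;    // %r57 <= DBL_MAX, or NaN
    neg.s32 %r87,%r86;
    mov.u32 %r85,%r87;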

(Interestingly, it is tolerated if one of the long-lived registers is used,
but things go "bad" only if *both* are used.  But that's probably just a
distraction?  And again, I have not inspected the actual SASS code, but just
looked at the JIT-time "used [...] registers".)

Do we thus conclude that the "nvptx: Transition nvptx backend to
STORE_FLAG_VALUE = 1" changes here enable an "optimization" in GCC such that
values that have previously been computed may be re-used later in the code,
without re-computing them?  But on the flip side, of course, this means that
the values have to be kept live in (SASS) registers.  (That's just my theory; I
haven't verified the actual SASS.)

In other words: at least in this case here, it seems preferable to re-compute
values instead of keeping registers occupied.  (But I'm of course not claiming
that to be a simple yes/no decision...)

It seems we're now in the territory of tuning CPU vs. GPU code generation?

Certainly, GCC has not seen much care for the latter (GPU code generation):
I mean, reviewing the GCC pass pipeline generally, and the parameterization of
individual passes, for GPU code generation.
Impossible to get "right" in all cases, of course, but maybe some heuristics
for CPU vs. GPU code generation could be discovered and implemented?
I'm sure there must be some literature on that topic?

All that is complicated by the fact that, with the (several different versions
of the) Nvidia/CUDA Driver's PTX -> SASS translation/optimization, we have
another moving part...

