First, I do acknowledge that commit beed3f8f60492289ca6211d86c54a2254a642035 "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" generally does improve nvptx code generation -- thanks!  I've however run into one case where it causes a regression:

PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 execution test
PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 execution test

libgomp: The Nvidia accelerator has insufficient resources to launch 'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the program with 'num_workers = x and vector_length = y' on that offloaded region or '-fopenacc-dim=:x:y' where x * y <= 896.

Same for C++.

That's with a Nvidia Tesla K20c, Driver Version: 346.46 -- so, rather old.

By the way: the subsequent commit 659f8161f61d3f75c3a47cf646147e8f7b4dcb34 "nvptx: Add support for PTX's cnot instruction" is not helpful or even relevant here; there are no 'cnot's appearing in the PTX code loaded to the GPU (per 'GOMP_DEBUG=1' execution).

Per 'diff' of 'GOMP_DEBUG=1' execution we indeed see *more* registers used after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" than before.  For '-O0':

[...]
 Link log info : 4 bytes gmem
 info : Function properties for 'gang$_omp_fn$0':
-info : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
+info : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
 info : Function properties for 'worker$_omp_fn$0':
-info : used 51 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
+info : used 68 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
 info : Function properties for 'vector$_omp_fn$0':
-info : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
+info : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
 GOMP_OFFLOAD_openacc_exec: prepare mappings
   nvptx_exec: kernel vector$_omp_fn$0: launch gangs=1, workers=1, vectors=32
   nvptx_exec: kernel vector$_omp_fn$0: finished
-GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb30, kinds=0x60bb48
+GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7fff99653530, size=0x60bad0, kinds=0x60bae8
 GOMP_OFFLOAD_openacc_exec: prepare mappings
+
+libgomp: The Nvidia accelerator has insufficient resources to launch 'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the program with 'num_workers = x and vector_length = y' on that offloaded region or '-fopenacc-dim=:x:y' where x * y <= 896.
-  nvptx_exec: kernel worker$_omp_fn$0: launch gangs=1, workers=32, vectors=32
-  nvptx_exec: kernel worker$_omp_fn$0: finished
-GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb50, kinds=0x60bb68
-GOMP_OFFLOAD_openacc_exec: prepare mappings
-  nvptx_exec: kernel gang$_omp_fn$0: launch gangs=32, workers=1, vectors=32
-  nvptx_exec: kernel gang$_omp_fn$0: finished

Similar for '-O2', just with less stack usage.

Cross-checking with a more recent Driver Version: 450.119.03, I'm only seeing slightly increased register usage; 52 registers after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" compared to 51 registers before:

[...]
 Link log info : 4 bytes gmem
 info : Function properties for 'vector$_omp_fn$0':
 info : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
 info : Function properties for 'worker$_omp_fn$0':
 info : used [-51-]{+52+} registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
 info : Function properties for 'gang$_omp_fn$0':
 info : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
[...]

This suggests that compared to before, after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" GCC is generating certain PTX code sequences that the Driver/JIT fails to understand/optimize?  While not ideal, the code still executes fine (with newish Driver/JIT), and I'm thus OK if we classify that as not worth looking into -- but I at least wanted to report my findings: maybe there's a way to tune the GCC/nvptx code generation to the PTX -> SASS compiler's liking?

Possibly (but that's just guessing!), the reason might be around the following PTX code change:

[...]
-setp.leu.f64 %r82,%r25,0d7fefffffffffffff;
-@ ! %r82 bra $L3;
+@ %r78 bra $L20;
+setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
+bra $L3;
+$L20:
 .loc 2 1976 21
-setp.leu.f64 %r83,%r57,0d7fefffffffffffff;
-@ %r83 bra $L19;
+setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
+@ %r138 bra $L19;
 $L3:
[...]

From a quick look, I read this to mean that the originally ("before") unconditional 'setp.leu.f64 %r82,%r25,0d7fefffffffffffff;' is now ("after") done conditionally.

Maybe related, maybe not: when curiously 'diff'ing the before vs. after nvptx-none target libraries, I noticed amongst all the "noise" (improved code generation):

'nvptx-none/libatomic/gcas.o':

[...]
 atom.cas.b32 %r137,[%r34],%r136,%r139;
 setp.eq.u32 %r140,%r137,%r136;
 selp.u32 %r138,1,0,%r140;
-setp.ne.u32 %r141,%r138,0;
-@ %r141 bra $L21;
+@ %r140 bra $L18;
 st.u32 [%r201],%r137;
-bra $L19;
+$L18:
+setp.eq.u32 %r142,%r138,0;
+@ %r142 bra $L19;
 $L21:
[...]

... which again looks like a pattern where an originally ("before") unconditional 'setp.ne.u32 %r141,%r138,0;' is now ("after") done conditionally.  Similar in other files -- but I certainly didn't look in detail, and I'm certainly not claiming this to be/cause any actual problem.
And, I've spotted a few cases where we're generating "maybe worse" code:

'nvptx-none/libgomp/openacc.o' (complete 'diff'):

@@ -25,6 +25,7 @@
 .reg .u64 %r28;
 .reg .u32 %r29;
 .reg .u32 %r30;
+.reg .pred %r31;
 mov.u64 %r27,%ar0;
 st.u64 [%frame+16],%r27;
 ld.u64 %r28,[%frame+16];
@@ -38,8 +39,8 @@
 ld.param.u32 %r30,[%value_in];
 }
 mov.u32 %r23,%r30;
-set.u32.ne.u32 %r24,%r23,0;
-neg.s32 %r24,%r24;
+setp.ne.u32 %r31,%r23,0;
+selp.u32 %r24,1,0,%r31;
 st.u32 [%frame],%r24;
 ld.u32 %r25,[%frame];
 mov.u32 %r26,%r25;

'nvptx-none/newlib/libc/reent/lib_a-renamer.o' (complete 'diff'):

@@ -28,6 +28,8 @@
 .reg .u32 %r32;
 .reg .pred %r33;
 .reg .u32 %r36;
+.reg .u32 %r38;
+.reg .pred %r39;
 mov.u64 %r26,%ar0;
 mov.u64 %r27,%ar1;
 mov.u64 %r28,%ar2;
@@ -58,7 +60,9 @@
 ld.param.u32 %r36,[%value_in];
 }
 .loc 2 57 6
-set.u32.eq.u32 %r25,%r36,-1;
+setp.eq.u32 %r39,%r36,-1;
+selp.u32 %r38,1,0,%r39;
+neg.s32 %r25,%r38;
 $L1:
 .loc 2 64 1
 mov.u32 %value,%r25;

'nvptx-none/newlib/libc/stdio/lib_a-remove.o' (complete 'diff'):

@@ -24,6 +24,8 @@
 .reg .u64 %r26;
 .reg .u64 %r27;
 .reg .u32 %r30;
+.reg .u32 %r33;
+.reg .pred %r34;
 mov.u64 %r26,%ar0;
 mov.u64 %r27,%ar1;
 .loc 2 65 7
@@ -37,7 +39,9 @@
 ld.param.u32 %r30,[%value_in];
 }
 .loc 2 65 6
-set.u32.eq.u32 %value,%r30,-1;
+setp.eq.u32 %r34,%r30,-1;
+selp.u32 %r33,1,0,%r34;
+neg.s32 %value,%r33;
 .loc 2 69 1
 st.param.u32 [%value_out],%value;
 ret;
@@ -51,6 +55,8 @@
 .reg .u64 %r27;
 .reg .u64 %r30;
 .reg .u32 %r31;
+.reg .u32 %r34;
+.reg .pred %r35;
 mov.u64 %r27,%ar0;
 .loc 2 65 7
 ld.global.u64 %r30,[_impure_ptr];
@@ -64,7 +70,9 @@
 ld.param.u32 %r31,[%value_in];
 }
 .loc 2 65 6
-set.u32.eq.u32 %value,%r31,-1;
+setp.eq.u32 %r35,%r31,-1;
+selp.u32 %r34,1,0,%r35;
+neg.s32 %value,%r34;
 .loc 2 77 1
 st.param.u32 [%value_out],%value;
 ret;

'nvptx-none/newlib/libm/common/lib_a-s_rint.o' (complete 'diff'):

@@ -80,6 +80,7 @@
 .reg .u32 %r119;
 .reg .pred %r120;
 .reg .u32 %r122;
+.reg .pred %r123;
 .reg .u32 %r124;
 .reg .u32 %r125;
 .reg .u32 %r126;
@@ -197,7 +198,8 @@
 setp.eq.u32 %r120,%r41,0;
 @ %r120 bra $L5;
 .loc 2 114 9
-set.u32.eq.u32 %r122,%r58,19;
+setp.eq.u32 %r123,%r58,19;
+selp.u32 %r122,1,0,%r123;
 shl.b32 %r41,%r122,31;
 .loc 2 115 17
 not.b32 %r124,%r64;

I haven't checked whether that's "actually worse" in SASS, or just "maybe worse" in the intermediate PTX representation.  (... and it is most certainly not related to the regression mentioned before.)

It seems unlikely, but I'll report back in case anything here changes due to Tom's several GCC/nvptx back end commits from yesterday.

So, please close this PR as "won't fix" unless you see something here that you'd like to look into.
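(For illustration of the "maybe worse" pattern above, here's a reduced C sketch -- a hypothetical example, not the actual newlib or libgomp source -- of the kind of 0/-1 Boolean result involved.  Under the previous STORE_FLAG_VALUE = -1 convention this could be a single 'set.u32.eq.u32', whose "true" value is already all-ones; with STORE_FLAG_VALUE = 1 it now takes a 'setp'/'selp' pair producing 0/1 plus an explicit 'neg.s32', and an extra predicate register.)

/* Hypothetical reduced example -- not the actual newlib code.
   Returns a 0/-1 value derived from a comparison, i.e. -(x == -1).
   Before: a single 'set.u32.eq.u32 %value,%r30,-1;'.
   After: 'setp.eq.u32' + 'selp.u32 ...,1,0,...' + 'neg.s32'.  */
int
eq_minus_one_mask (int x)
{
  return -(x == -1);
}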
Hi Thomas,

Thanks for the bug report.  The STORE_FLAG_VALUE=1 patch was one of a series to dramatically improve the quality of nvptx code.  Alas, not all of them have yet been reviewed/approved, and it's likely these later improvements address the quality regression you're seeing.

The other patches in the "nvptx Boolean" series are:

patchq3: nvptx: Expand QI mode operations using SI mode instructions.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587999.html

patchq4: nvptx: Fix and use BI mode logic instructions (e.g. and.pred).
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588555.html

[and purely for reference, my other outstanding nvptx patches are]

patchn: nvptx: Improved support for HFmode including neghf2 and abshf2.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587949.html

patchw: nvptx: Add support for 64-bit mul.hi (and other) instructions.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588453.html

One other related patch: a middle-end SUBREG patch intended to improve code generation on nvptx is also pending at:

patchs: Simplify paradoxical subreg extensions of TRUNCATE
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578848.html

My guess is that patchq3+patchq4 above should (hopefully) resolve this particular regression.  If you could give them a spin on your system to see whether they reduce register pressure sufficiently for this case, that would be greatly appreciated.  As you can read in the above postings, the total number of instructions/registers (after all of these changes) should be dramatically reduced.  I'll see what I can do from my end.
(In reply to Roger Sayle from comment #1)
> The other patches in the "nvptx Boolean" series are:
>
> patchq3: nvptx: Expand QI mode operations using SI mode instructions.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587999.html
>
> patchq4: nvptx: Fix and use BI mode logic instructions (e.g. and.pred).
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588555.html
>
> [and purely for reference, my other outstanding nvptx patches are]
>
> patchn: nvptx: Improved support for HFmode including neghf2 and abshf2.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587949.html
>
> patchw: nvptx: Add support for 64-bit mul.hi (and other) instructions.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588453.html

FYI, I'm currently testing these.
Additional patch proposed:
https://gcc.gnu.org/pipermail/gcc-patches/2022-February/589802.html

I need to figure out how to (re)produce Thomas' "used N registers" reports.  If someone could summarize the effect of this patch (and previous patches) on register usage, that would be much appreciated (and help reviewers).
Created attachment 52370: '_muldc3-good.o'
Created attachment 52371: '_muldc3-bad.o'
Created attachment 52372: '_muldc3-WIP.o'
All three of your patches combined still don't resolve the problem.  And, what I realized: they don't even change the Nvidia/CUDA Driver reported "used [...] registers".  Does that mean that the Driver "doesn't care" that we feed it simpler PTX code, using fewer PTX registers -- it seems able to do these optimizations all by itself?  :-O  (At least regarding the number of registers used -- I didn't verify the SASS code generated.)

(Emitting cleaner, "de-cluttered" code in GCC/nvptx is still very much valuable, of course: if only for our own visual consumption!  ... at least as long as it doesn't make 'nvptx.md' etc. much more complex...)

For "good" vs. "bad"/"not-so-good" (before vs. after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1"), the only code generation difference is in the '__muldc3' function ('nvptx-none/libgcc/_muldc3.o'), and that one is the only '.extern' dependency aside from the '__reduction_lock' global variable ('nvptx-none/libgcc/reduction.o').

(In the following, working around <https://gcc.gnu.org/PR104416> "'lto-wrapper' invoking 'mkoffload's with duplicated command-line options".)

This means I can conveniently manually create a minimal nvptx 'libgcc.a':

$ cp build-gcc-offload-nvptx-none/nvptx-none/libgcc/_muldc3.o ./
$ rm -f libgcc.a && ar q libgcc.a _muldc3.o build-gcc-offload-nvptx-none/nvptx-none/libgcc/reduction.o

..., and compile 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c' with '-foffload=nvptx-none=-Wl,-L.'.  (Via 'GOMP_DEBUG=1' I verified that identical PTX code is loaded to the GPU.)

Then, hand-modify '_muldc3.o', re-create 'libgcc.a', re-compile, re-execute.

Verified that before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1", '__muldc3' (attached '_muldc3-good.o') works fine, and after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1", '__muldc3' (attached '_muldc3-bad.o') does show the problem originally reported here.

I then gradually morphed the former into the latter (beginning with eliding simple changes like renumbered registers etc.), until only one last change was necessary to turn "good" into "bad" (attached '_muldc3-WIP.o'); showing the "still-good" state:

[...]
@@ -1716,8 +1718,16 @@
 cvt.rn.f64.s32 %r84,%r27;
 copysign.f64 %r61,%r61,%r84;
 .loc 2 1981 32
+// Current/after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
+/*
 selp.u32 %r86,1,0,%r138;
 mov.u32 %r85,%r86;
+*/
+// Before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
+set.u32.leu.f64 %r86,%r57,0d7fefffffffffffff;
+neg.s32 %r87,%r86;
+mov.u32 %r85,%r87;
+//
 cvt.u16.u8 %r140,%r85;
 mov.u16 %r89,%r140;
 xor.b16 %r88,%r89,1;
@@ -1770,8 +1780,16 @@
[...]

That is, things go "bad" if we here use the '%r138' that was computed (and used) earlier in the code, and things are "good" if we re-compute it locally.  Same for '%r139' in the second code block.

(Interestingly, it is tolerated if only one of the long-lived registers is used; things go "bad" only if *both* are used.  But that's probably just a distraction?  And again, I have not inspected the actual SASS code, but just looked at the JIT-time "used [...] registers".)

Do we thus conclude that the "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" changes here enable an "optimization" in GCC such that values that have previously been computed may be re-used later in the code, without re-computing them?  But on the flip side, of course, this means that the values have to be kept live in (SASS) registers.  (That's just my theory; I haven't verified the actual SASS.)
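(Side note, a guess rather than something verified against the libgcc sources: '0d7fefffffffffffff' is DBL_MAX, so those '%r138'/'%r139' predicates look like the isinf-style finiteness tests in '__muldc3''s NaN/Inf recovery path; roughly the check sketched below, with a made-up function name.)

/* Rough sketch only -- not the actual libgcc __muldc3 code.  PTX
   'setp.leu.f64 %p,%x,0d7fefffffffffffff' ("less-or-equal or unordered")
   is true when x <= DBL_MAX or x is NaN, i.e. !(x > DBL_MAX).  */
#include <float.h>

int
not_above_dbl_max (double x)
{
  return !(x > DBL_MAX);
}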
In other words: at least in this case here, it seems preferable to re-compute instead of keeping registers occupied.  (But I'm of course not claiming that to be a simple yes/no decision...)

It seems we're now in the territory of tuning CPU vs. GPU code generation?  Certainly, GCC has not seen much care for the latter (GPU code generation).  I mean: verifying the GCC pass pipeline generally, and the parameterization of individual passes, for GPU code generation.  Impossible to get "right", of course, but maybe some heuristics for CPU vs. GPU may be discovered and implemented?  I'm sure there must be some literature on that topic?

All that is complicated by the fact that, with the (several different versions of the) Nvidia/CUDA Driver's PTX -> SASS translation/optimization, we have another moving part...
Well. Here's another problem. Re-testing things using a "bad" '__muldc3' from a build with your three patches applied, I notice that my '_muldc3-WIP.o' "old"/replacement code uses a 'set.u32.leu.f64', 'neg.s32', 'mov.u32' sequence. That works. However, if I replace that with a localized (its own '.reg .pred %r138_;') 'setp.leu.f64', 'selp.u32' sequence (as per your code generation changes), then it doesn't in fact work! Huh. Will later continue looking into this.
OK! Putting your "nvptx: Add support for 64-bit mul.hi (and other) instructions" on top, that considerably changes (simplifies!) the generated '__muldc3' PTX code; the regression disappears. :-) (I have, so far, only manually tested 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c'. I'll report later in the unlikely case that any other/new issues should appear.) (And, will later test your "nvptx: Tweak constraints on copysign instructions" on top, too.)
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:9bacd7af2e3bba9ddad17e7de4e2d299419d819d

commit r12-7167-g9bacd7af2e3bba9ddad17e7de4e2d299419d819d
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Feb 4 04:13:53 2022 +0100

    PR target/104345: Use nvptx "set" instruction for cond ? -1 : 0

    This patch addresses the "increased register pressure" regression on
    nvptx-none caused by my change to transition the backend to a
    STORE_FLAG_VALUE = 1 target.  This improved code generation for the more
    common case of producing 0/1 Boolean values, but unfortunately made things
    marginally worse when a 0/-1 mask value is desired.  Unfortunately, nvptx
    kernels are extremely sensitive to changes in register usage, which was
    observable in the reported PR.

    This patch provides optimizations for -(cond ? 1 : 0), effectively
    simplifying this into cond ? -1 : 0, where these ternary operators are
    provided by nvptx's selp instruction, and for the specific case of SImode,
    using (restoring) nvptx's "set" instruction (which avoids the need for a
    predicate register).

    This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
    with a "make" and "make -k check" with no new failures.  Unfortunately,
    the exact register usage of a nvptx kernel depends upon the version of
    the Cuda drivers being used (and the hardware), but I believe this change
    should resolve the PR (for Thomas) by improving code generation for the
    cases that regressed.

    gcc/ChangeLog:
        PR target/104345
        * config/nvptx/nvptx.md (sel_true<mode>): Fix indentation.
        (sel_false<mode>): Likewise.
        (define_code_iterator eqne): New code iterator for EQ and NE.
        (*selp<mode>_neg_<code>): New define_insn_and_split to optimize
        the negation of a selp instruction.
        (*selp<mode>_not_<code>): New define_insn_and_split to optimize
        the bitwise not of a selp instruction.
        (*setcc_int<mode>): Use set instruction for neg:SI of a selp.

    gcc/testsuite/ChangeLog:
        PR target/104345
        * gcc.target/nvptx/neg-selp.c: New test case.
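(For illustration, a minimal C function exhibiting the -(cond ? 1 : 0) pattern this commit targets; an assumed example, not copied from the new gcc.target/nvptx/neg-selp.c test case.)

/* Illustrative only.  With this change, the 0/-1 mask below can again be
   produced via nvptx's SImode "set" instruction instead of a
   setp + selp + neg sequence that ties up a predicate register.  */
int
lt_mask (int x, int y)
{
  return -(x < y);  /* cond ? -1 : 0 */
}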
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:6d98e83b2c919bd9fba2c61333d613bafc37357f

commit r12-7168-g6d98e83b2c919bd9fba2c61333d613bafc37357f
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Tue Feb 8 20:56:55 2022 +0100

    nvptx: Tweak constraints on copysign instructions

    Many thanks to Thomas Schwinge for confirming my hypothesis that the
    register usage regression, PR target/104345, is solely due to libgcc's
    _muldc3 function.  In addition to the isinf functionality in the
    previously proposed nvptx patch at
    https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588453.html
    which significantly reduces the number of instructions in _muldc3, the
    patch below further reduces both the number of instructions and the
    number of explicitly declared registers, by permitting floating point
    constant immediate operands in nvptx's copysign instruction.

    Fingers-crossed, the combination with all of the previous proposed nvptx
    patches improves things.  Ultimately, increasing register usage from 50
    to 51 registers, reducing the number of concurrent threads by ~2%, can
    easily be countered if we're now executing significantly fewer
    instructions in each kernel, for a net performance win.

    This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
    with a "make" and "make -k check" with no new failures.

    gcc/ChangeLog:
        * config/nvptx/nvptx.md (copysign<mode>3): Allow immediate
        floating point constants as operands 1 and/or 2.
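(For illustration, the kind of source that benefits from the copysign change; an assumed example, not taken from the patch or from libgcc.)

/* Illustrative only.  With this change, the constant 1.0 can appear
   directly as an immediate operand of nvptx's copysign instruction,
   rather than first being loaded into a register.  */
#include <math.h>

double
sign_of (double x)
{
  return copysign (1.0, x);
}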
This should now be fixed on mainline.