First, I do acknowledge that commit beed3f8f60492289ca6211d86c54a2254a642035 "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" generally does improve nvptx code generation -- thanks!  I've however run into one case where it causes a regression:

PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 execution test
PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 (test for excess errors)
[-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O2 execution test

libgomp: The Nvidia accelerator has insufficient resources to launch 'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the program with 'num_workers = x and vector_length = y' on that offloaded region or '-fopenacc-dim=:x:y' where x * y <= 896.

Same for C++.

That's with a Nvidia Tesla K20c, Driver Version: 346.46 -- so, rather old.

By the way: the subsequent commit 659f8161f61d3f75c3a47cf646147e8f7b4dcb34 "nvptx: Add support for PTX's cnot instruction" is not helpful or even relevant here; there are no 'cnot's appearing in the PTX code loaded to the GPU (per 'GOMP_DEBUG=1' execution).

Per 'diff' of 'GOMP_DEBUG=1' execution we indeed see *more* registers used after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" than before.  For '-O0':

[...]
 Link log info : 4 bytes gmem
 info : Function properties for 'gang$_omp_fn$0':
-info : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
+info : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
 info : Function properties for 'worker$_omp_fn$0':
-info : used 51 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
+info : used 68 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
 info : Function properties for 'vector$_omp_fn$0':
-info : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
+info : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8 bytes cmem[2], 0 bytes lmem
 GOMP_OFFLOAD_openacc_exec: prepare mappings
   nvptx_exec: kernel vector$_omp_fn$0: launch gangs=1, workers=1, vectors=32
   nvptx_exec: kernel vector$_omp_fn$0: finished
-GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb30, kinds=0x60bb48
+GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7fff99653530, size=0x60bad0, kinds=0x60bae8
 GOMP_OFFLOAD_openacc_exec: prepare mappings
+
+libgomp: The Nvidia accelerator has insufficient resources to launch 'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the program with 'num_workers = x and vector_length = y' on that offloaded region or '-fopenacc-dim=:x:y' where x * y <= 896.
-  nvptx_exec: kernel worker$_omp_fn$0: launch gangs=1, workers=32, vectors=32
-  nvptx_exec: kernel worker$_omp_fn$0: finished
-GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb50, kinds=0x60bb68
-GOMP_OFFLOAD_openacc_exec: prepare mappings
-  nvptx_exec: kernel gang$_omp_fn$0: launch gangs=32, workers=1, vectors=32
-  nvptx_exec: kernel gang$_omp_fn$0: finished

Similar for '-O2', just with less stack usage.

Cross-checking with a more recent Driver Version: 450.119.03, I'm only seeing slightly increased register usage; 52 registers after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" compared to 51 registers before:

[...]
 Link log info : 4 bytes gmem
 info : Function properties for 'vector$_omp_fn$0':
 info : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
 info : Function properties for 'worker$_omp_fn$0':
 info : used [-51-]{+52+} registers, 112 stack, 136 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
 info : Function properties for 'gang$_omp_fn$0':
 info : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 16 bytes cmem[2], 0 bytes lmem
[...]

This suggests that compared to before, after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" GCC is generating certain PTX code sequences that the Driver/JIT fails to understand/optimize?  While not ideal, the code still executes fine (with newish Driver/JIT), and I'm thus OK if we classify that as not worth looking into -- but I at least wanted to report my findings: maybe there's a way to tune the GCC/nvptx code generation to the PTX -> SASS compiler's liking?

Possibly (but that's just guessing!), the reason might be around the following PTX code change:

[...]
-setp.leu.f64 %r82,%r25,0d7fefffffffffffff;
-@ ! %r82 bra $L3;
+@ %r78 bra $L20;
+setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
+bra $L3;
+$L20:
 .loc 2 1976 21
-setp.leu.f64 %r83,%r57,0d7fefffffffffffff;
-@ %r83 bra $L19;
+setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
+@ %r138 bra $L19;
 $L3:
[...]

From a quick look, I read this to mean that the originally ("before") unconditional 'setp.leu.f64 %r82,%r25,0d7fefffffffffffff;' is now ("after") done conditionally.

Maybe related, maybe not: when curiously 'diff'ing the before vs. after nvptx-none target libraries, I noticed amongst all the "noise" (improved code generation):

'nvptx-none/libatomic/gcas.o':

[...]
 atom.cas.b32 %r137,[%r34],%r136,%r139;
 setp.eq.u32 %r140,%r137,%r136;
 selp.u32 %r138,1,0,%r140;
-setp.ne.u32 %r141,%r138,0;
-@ %r141 bra $L21;
+@ %r140 bra $L18;
 st.u32 [%r201],%r137;
-bra $L19;
+$L18:
+setp.eq.u32 %r142,%r138,0;
+@ %r142 bra $L19;
 $L21:
[...]

... which again looks like a pattern where an originally ("before") unconditional 'setp.ne.u32 %r141,%r138,0;' is now ("after") done conditionally.  Similar in other files -- but I certainly didn't look in detail, and I'm certainly not claiming this to be/cause any actual problem.
And, I've spotted a few cases where we're generating "maybe worse" code:

'nvptx-none/libgomp/openacc.o' (complete 'diff'):

@@ -25,6 +25,7 @@
 .reg .u64 %r28;
 .reg .u32 %r29;
 .reg .u32 %r30;
+.reg .pred %r31;
 mov.u64 %r27,%ar0;
 st.u64 [%frame+16],%r27;
 ld.u64 %r28,[%frame+16];
@@ -38,8 +39,8 @@
 ld.param.u32 %r30,[%value_in];
 }
 mov.u32 %r23,%r30;
-set.u32.ne.u32 %r24,%r23,0;
-neg.s32 %r24,%r24;
+setp.ne.u32 %r31,%r23,0;
+selp.u32 %r24,1,0,%r31;
 st.u32 [%frame],%r24;
 ld.u32 %r25,[%frame];
 mov.u32 %r26,%r25;

'nvptx-none/newlib/libc/reent/lib_a-renamer.o' (complete 'diff'):

@@ -28,6 +28,8 @@
 .reg .u32 %r32;
 .reg .pred %r33;
 .reg .u32 %r36;
+.reg .u32 %r38;
+.reg .pred %r39;
 mov.u64 %r26,%ar0;
 mov.u64 %r27,%ar1;
 mov.u64 %r28,%ar2;
@@ -58,7 +60,9 @@
 ld.param.u32 %r36,[%value_in];
 }
 .loc 2 57 6
-set.u32.eq.u32 %r25,%r36,-1;
+setp.eq.u32 %r39,%r36,-1;
+selp.u32 %r38,1,0,%r39;
+neg.s32 %r25,%r38;
 $L1:
 .loc 2 64 1
 mov.u32 %value,%r25;

'nvptx-none/newlib/libc/stdio/lib_a-remove.o' (complete 'diff'):

@@ -24,6 +24,8 @@
 .reg .u64 %r26;
 .reg .u64 %r27;
 .reg .u32 %r30;
+.reg .u32 %r33;
+.reg .pred %r34;
 mov.u64 %r26,%ar0;
 mov.u64 %r27,%ar1;
 .loc 2 65 7
@@ -37,7 +39,9 @@
 ld.param.u32 %r30,[%value_in];
 }
 .loc 2 65 6
-set.u32.eq.u32 %value,%r30,-1;
+setp.eq.u32 %r34,%r30,-1;
+selp.u32 %r33,1,0,%r34;
+neg.s32 %value,%r33;
 .loc 2 69 1
 st.param.u32 [%value_out],%value;
 ret;
@@ -51,6 +55,8 @@
 .reg .u64 %r27;
 .reg .u64 %r30;
 .reg .u32 %r31;
+.reg .u32 %r34;
+.reg .pred %r35;
 mov.u64 %r27,%ar0;
 .loc 2 65 7
 ld.global.u64 %r30,[_impure_ptr];
@@ -64,7 +70,9 @@
 ld.param.u32 %r31,[%value_in];
 }
 .loc 2 65 6
-set.u32.eq.u32 %value,%r31,-1;
+setp.eq.u32 %r35,%r31,-1;
+selp.u32 %r34,1,0,%r35;
+neg.s32 %value,%r34;
 .loc 2 77 1
 st.param.u32 [%value_out],%value;
 ret;

'nvptx-none/newlib/libm/common/lib_a-s_rint.o' (complete 'diff'):

@@ -80,6 +80,7 @@
 .reg .u32 %r119;
 .reg .pred %r120;
 .reg .u32 %r122;
+.reg .pred %r123;
 .reg .u32 %r124;
 .reg .u32 %r125;
 .reg .u32 %r126;
@@ -197,7 +198,8 @@
 setp.eq.u32 %r120,%r41,0;
 @ %r120 bra $L5;
 .loc 2 114 9
-set.u32.eq.u32 %r122,%r58,19;
+setp.eq.u32 %r123,%r58,19;
+selp.u32 %r122,1,0,%r123;
 shl.b32 %r41,%r122,31;
 .loc 2 115 17
 not.b32 %r124,%r64;

I haven't checked whether that's "actually worse" in SASS, or just "maybe worse" in the intermediate PTX representation.  (... and it is most certainly not related to the regression mentioned before.)

It seems unlikely, but I'll report back in case anything here changes due to Tom's several GCC/nvptx back end commits from yesterday.

So, please close this PR as "won't fix" unless you see something here that you'd like to look into.
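(For illustration of the "maybe worse" pattern above, here's a reduced C sketch -- a hypothetical example, not the actual newlib or libgomp source -- of the kind of 0/-1 Boolean result involved.  Under the previous STORE_FLAG_VALUE = -1 convention this could be a single 'set.u32.eq.u32', whose "true" value is already all-ones; with STORE_FLAG_VALUE = 1 it now takes a 'setp'/'selp' pair producing 0/1 plus an explicit 'neg.s32', and an extra predicate register.)

/* Hypothetical reduced example -- not the actual newlib code.
   Returns a 0/-1 value derived from a comparison, i.e. -(x == -1).
   Before: a single 'set.u32.eq.u32 %value,%r30,-1;'.
   After: 'setp.eq.u32' + 'selp.u32 ...,1,0,...' + 'neg.s32'.  */
int
eq_minus_one_mask (int x)
{
  return -(x == -1);
}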
Hi Thomas,

Thanks for the bug report.  The STORE_FLAG_VALUE=1 patch was one of a series to dramatically improve the quality of nvptx code.  Alas, not all of them have yet been reviewed/approved, and it's likely these later improvements address the quality regression you're seeing.

The other patches in the "nvptx Boolean" series are:

patchq3: nvptx: Expand QI mode operations using SI mode instructions.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587999.html

patchq4: nvptx: Fix and use BI mode logic instructions (e.g. and.pred).
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588555.html

[and purely for reference, my other outstanding nvptx patches are]

patchn: nvptx: Improved support for HFmode including neghf2 and abshf2.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587949.html

patchw: nvptx: Add support for 64-bit mul.hi (and other) instructions.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588453.html

One other related patch: a middle-end SUBREG patch intended to improve code generation on nvptx is also pending at:

patchs: Simplify paradoxical subreg extensions of TRUNCATE
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578848.html

My guess is that patchq3+patchq4 above should (hopefully) resolve this particular regression.  If you could give them a spin on your system to see whether they reduce register pressure sufficiently for this case, that would be greatly appreciated.  As you can read in the above postings, the total number of instructions/registers (after all of these changes) should be dramatically reduced.  I'll see what I can do from my end.
(In reply to Roger Sayle from comment #1)
> The other patches in the "nvptx Boolean" series are:
>
> patchq3: nvptx: Expand QI mode operations using SI mode instructions.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587999.html
>
> patchq4: nvptx: Fix and use BI mode logic instructions (e.g. and.pred).
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588555.html
>
> [and purely for reference, my other outstanding nvptx patches are]
>
> patchn: nvptx: Improved support for HFmode including neghf2 and abshf2.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587949.html
>
> patchw: nvptx: Add support for 64-bit mul.hi (and other) instructions.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588453.html

FYI, I'm currently testing these.
Additional patch proposed:
https://gcc.gnu.org/pipermail/gcc-patches/2022-February/589802.html

I need to figure out how to (re)produce Thomas' "used N registers" reports.  If someone could summarize the effect of this patch (and previous patches) on register usage, that would be much appreciated (and help reviewers).
Created attachment 52370: '_muldc3-good.o'
Created attachment 52371: '_muldc3-bad.o'
Created attachment 52372: '_muldc3-WIP.o'
All three of your patches combined still don't resolve the problem.  And, what I realized: they don't even change the Nvidia/CUDA Driver reported "used [...] registers".  Does that mean that the Driver "doesn't care" that we feed it simpler PTX code, using fewer PTX registers -- it seems able to do these optimizations all by itself?  :-O  (At least regarding the number of registers used -- I didn't verify the SASS code generated.)

(Emitting cleaner, "de-cluttered" code in GCC/nvptx is still very much valuable, of course: if only for our own visual consumption!  ... at least as long as it doesn't make 'nvptx.md' etc. much more complex...)

For "good" vs. "bad"/"not-so-good" (before vs. after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1"), the only code generation difference is in the '__muldc3' function ('nvptx-none/libgcc/_muldc3.o'), and that one is the only '.extern' dependency aside from the '__reduction_lock' global variable ('nvptx-none/libgcc/reduction.o').

(In the following, working around <https://gcc.gnu.org/PR104416> "'lto-wrapper' invoking 'mkoffload's with duplicated command-line options".)

This means I can conveniently manually create a minimal nvptx 'libgcc.a':

$ cp build-gcc-offload-nvptx-none/nvptx-none/libgcc/_muldc3.o ./
$ rm -f libgcc.a && ar q libgcc.a _muldc3.o build-gcc-offload-nvptx-none/nvptx-none/libgcc/reduction.o

..., and compile 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c' with '-foffload=nvptx-none=-Wl,-L.'.  (Via 'GOMP_DEBUG=1' I verified that identical PTX code is loaded to the GPU.)

Then, hand-modify '_muldc3.o', re-create 'libgcc.a', re-compile, re-execute.

Verified that before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1", '__muldc3' (attached '_muldc3-good.o') works fine, and after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1", '__muldc3' (attached '_muldc3-bad.o') does show the problem originally reported here.

I then gradually morphed the former into the latter (beginning with eliding simple changes like renumbered registers etc.), until only one last change was necessary to turn "good" into "bad" (attached '_muldc3-WIP.o'); showing the "still-good" state:

[...]
@@ -1716,8 +1718,16 @@
 cvt.rn.f64.s32 %r84,%r27;
 copysign.f64 %r61,%r61,%r84;
 .loc 2 1981 32
+// Current/after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
+/*
 selp.u32 %r86,1,0,%r138;
 mov.u32 %r85,%r86;
+*/
+// Before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
+set.u32.leu.f64 %r86,%r57,0d7fefffffffffffff;
+neg.s32 %r87,%r86;
+mov.u32 %r85,%r87;
+//
 cvt.u16.u8 %r140,%r85;
 mov.u16 %r89,%r140;
 xor.b16 %r88,%r89,1;
@@ -1770,8 +1780,16 @@
[...]

That is, things go "bad" if we here use the '%r138' that was computed (and used) earlier in the code, and things are "good" if we re-compute it locally.  Same for '%r139' in the second code block.

(Interestingly, it is tolerated if only one of the long-lived registers is used; things go "bad" only if *both* are used.  But that's probably just a distraction?  And again, I have not inspected the actual SASS code, but just looked at the JIT-time "used [...] registers".)

Do we thus conclude that the "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" changes here enable an "optimization" in GCC such that values that have previously been computed may be re-used later in the code, without re-computing them?  But on the flip side, of course, this means that the values have to be kept live in (SASS) registers.  (That's just my theory; I haven't verified the actual SASS.)
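(Side note, a guess rather than something verified against the libgcc sources: '0d7fefffffffffffff' is DBL_MAX, so those '%r138'/'%r139' predicates look like the isinf-style finiteness tests in '__muldc3''s NaN/Inf recovery path; roughly the check sketched below, with a made-up function name.)

/* Rough sketch only -- not the actual libgcc __muldc3 code.  PTX
   'setp.leu.f64 %p,%x,0d7fefffffffffffff' ("less-or-equal or unordered")
   is true when x <= DBL_MAX or x is NaN, i.e. !(x > DBL_MAX).  */
#include <float.h>

int
not_above_dbl_max (double x)
{
  return !(x > DBL_MAX);
}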
In other words: at least in this case here, it seems preferable to re-compute instead of keeping registers occupied.  (But I'm of course not claiming that to be a simple yes/no decision...)

It seems we're now in the territory of tuning CPU vs. GPU code generation?  Certainly, GCC has not seen much care for the latter (GPU code generation).  I mean: verifying the GCC pass pipeline generally, and the parameterization of individual passes, for GPU code generation.  Impossible to get "right", of course, but maybe some heuristics for CPU vs. GPU may be discovered and implemented?  I'm sure there must be some literature on that topic?

All that is complicated by the fact that, with the (several different versions of the) Nvidia/CUDA Driver's PTX -> SASS translation/optimization, we have another moving part...
Well. Here's another problem. Re-testing things using a "bad" '__muldc3' from a build with your three patches applied, I notice that my '_muldc3-WIP.o' "old"/replacement code uses a 'set.u32.leu.f64', 'neg.s32', 'mov.u32' sequence. That works. However, if I replace that with a localized (its own '.reg .pred %r138_;') 'setp.leu.f64', 'selp.u32' sequence (as per your code generation changes), then it doesn't in fact work! Huh. Will later continue looking into this.
OK! Putting your "nvptx: Add support for 64-bit mul.hi (and other) instructions" on top, that considerably changes (simplifies!) the generated '__muldc3' PTX code; the regression disappears. :-) (I have, so far, only manually tested 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c'. I'll report later in the unlikely case that any other/new issues should appear.) (And, will later test your "nvptx: Tweak constraints on copysign instructions" on top, too.)
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:9bacd7af2e3bba9ddad17e7de4e2d299419d819d

commit r12-7167-g9bacd7af2e3bba9ddad17e7de4e2d299419d819d
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Feb 4 04:13:53 2022 +0100

    PR target/104345: Use nvptx "set" instruction for cond ? -1 : 0

    This patch addresses the "increased register pressure" regression on
    nvptx-none caused by my change to transition the backend to a
    STORE_FLAG_VALUE = 1 target.  This improved code generation for the more
    common case of producing 0/1 Boolean values, but unfortunately made things
    marginally worse when a 0/-1 mask value is desired.  Unfortunately, nvptx
    kernels are extremely sensitive to changes in register usage, which was
    observable in the reported PR.

    This patch provides optimizations for -(cond ? 1 : 0), effectively
    simplifying this into cond ? -1 : 0, where these ternary operators are
    provided by nvptx's selp instruction, and for the specific case of SImode,
    using (restoring) nvptx's "set" instruction (which avoids the need for a
    predicate register).

    This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
    with a "make" and "make -k check" with no new failures.  Unfortunately,
    the exact register usage of a nvptx kernel depends upon the version of
    the Cuda drivers being used (and the hardware), but I believe this change
    should resolve the PR (for Thomas) by improving code generation for the
    cases that regressed.

    gcc/ChangeLog:
        PR target/104345
        * config/nvptx/nvptx.md (sel_true<mode>): Fix indentation.
        (sel_false<mode>): Likewise.
        (define_code_iterator eqne): New code iterator for EQ and NE.
        (*selp<mode>_neg_<code>): New define_insn_and_split to optimize
        the negation of a selp instruction.
        (*selp<mode>_not_<code>): New define_insn_and_split to optimize
        the bitwise not of a selp instruction.
        (*setcc_int<mode>): Use set instruction for neg:SI of a selp.

    gcc/testsuite/ChangeLog:
        PR target/104345
        * gcc.target/nvptx/neg-selp.c: New test case.
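(For illustration, a minimal C function exhibiting the -(cond ? 1 : 0) pattern this commit targets; an assumed example, not copied from the new gcc.target/nvptx/neg-selp.c test case.)

/* Illustrative only.  With this change, the 0/-1 mask below can again be
   produced via nvptx's SImode "set" instruction instead of a
   setp + selp + neg sequence that ties up a predicate register.  */
int
lt_mask (int x, int y)
{
  return -(x < y);  /* cond ? -1 : 0 */
}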
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:6d98e83b2c919bd9fba2c61333d613bafc37357f

commit r12-7168-g6d98e83b2c919bd9fba2c61333d613bafc37357f
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Tue Feb 8 20:56:55 2022 +0100

    nvptx: Tweak constraints on copysign instructions

    Many thanks to Thomas Schwinge for confirming my hypothesis that the
    register usage regression, PR target/104345, is solely due to libgcc's
    _muldc3 function.  In addition to the isinf functionality in the
    previously proposed nvptx patch at
    https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588453.html
    which significantly reduces the number of instructions in _muldc3, the
    patch below further reduces both the number of instructions and the
    number of explicitly declared registers, by permitting floating point
    constant immediate operands in nvptx's copysign instruction.

    Fingers-crossed, the combination with all of the previous proposed nvptx
    patches improves things.  Ultimately, increasing register usage from 50
    to 51 registers, reducing the number of concurrent threads by ~2%, can
    easily be countered if we're now executing significantly fewer
    instructions in each kernel, for a net performance win.

    This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
    with a "make" and "make -k check" with no new failures.

    gcc/ChangeLog:
        * config/nvptx/nvptx.md (copysign<mode>3): Allow immediate
        floating point constants as operands 1 and/or 2.
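(For illustration, the kind of source that benefits from the copysign change; an assumed example, not taken from the patch or from libgcc.)

/* Illustrative only.  With this change, the constant 1.0 can appear
   directly as an immediate operand of nvptx's copysign instruction,
   rather than first being loaded into a register.  */
#include <math.h>

double
sign_of (double x)
{
  return copysign (1.0, x);
}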
This should now be fixed on mainline.