Bug 115682 - nvptx vs. "fwprop: invoke change_is_worthwhile to judge if a replacement is worthwhile"
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 15.0
Importance: P3 minor
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-27 12:44 UTC by Thomas Schwinge
Modified: 2024-06-27 12:44 UTC
2 users

See Also:
Host:
Target: nvptx
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Thomas Schwinge 2024-06-27 12:44:33 UTC
I've observed/bisected that "fwprop: invoke change_is_worthwhile to judge if a replacement is worthwhile" (a) slightly improves and, in some cases, (b) slightly regresses nvptx target code generation (supposedly; not actually benchmarked).  That is, in a few cases it introduces additional register usage together with either (a) a different choice of PTX instructions (presumably beneficial) or (b) redundant computations (presumably bad).  It's possible that the PTX JIT is later able to re-optimize this, but we can't be sure -- and we would like GCC to emit good PTX code, if only for our own reading pleasure.

I've only looked at PTX code generated for GCC target libraries, and only looked at those where the 'diff' was less than one screenful, roughly.

A few examples of category (a):

'nvptx-none/libgfortran/generated/_dim_i16.o:_gfortran_specific__dim_i16':

    [...]
    +.reg .u64 %r42;
    [...]
    -mov.u64 %r54,%r32;
    -sub.u64 %r55,%r39,%r35;
    -setp.ge.s64 %r45,%r55,0;
    -@ %r45 bra $L2;
    -mov.u64 %r54,0;
    -mov.u64 %r55,%r54;
    -$L2:
    +sub.u64 %r42,%r39,%r35;
    +setp.ge.s64 %r45,%r42,0;
    +selp.u64 %r54,%r32,0,%r45;
    +selp.u64 %r55,%r42,0,%r45;
    [...]

That is, the conditional 'bra' is replaced by two 'selp's.

Similar, 'nvptx-none/newlib/libc/stdio/libc_a-makebuf.o:__swhatbuf_r':

    [...]
     setp.ne.u16 %r41,%r39,0;
    -@ %r41 bra $L22;
     mov.u32 %r23,0;
    -mov.u64 %r31,1024;
    +selp.u64 %r31,64,1024,%r41;
     mov.u32 %r32,%r23;
     bra $L20;
     $L19:
    @@ -325,11 +324,6 @@
     .loc 2 123 10
     mov.u64 %r31,1024;
     mov.u32 %r32,2048;
    -bra $L20;
    -$L22:
    -mov.u32 %r23,0;
    -mov.u64 %r31,64;
    -mov.u32 %r32,%r23;
     $L20:
    [...]

Differently, 'nvptx-none/newlib/libc/search/libc_a-hash_page.o:__addel':

    [...]
    -add.u64 %r38,%r37,%r37;
    +shl.b64 %r38,%r37,1;
    [...]

A few examples of category (b):

'nvptx-none/newlib/libc/stdlib/libc_a-gdtoa-gdtoa.o:__gdtoa':

    [...]
    +.reg .u64 %r263;
    [...]
    -add.u64 %r337,%r391,24;
    +add.u64 %r263,%r391,24;
    +mov.u64 %r337,%r263;
    [...]

New unnecessary intermediate 'u64 %r263'.

Similarly, 'nvptx-none/newlib/libc/stdlib/libc_a-mprec.o:__d2b':

    [...]
    +.reg .u32 %r89;
    [...]
    -cvt.u64.u32 %r90,%r39;
    +mov.u32 %r89,%r39;
    +cvt.u64.u32 %r90,%r89;
    [...]

'nvptx-none/libgfortran/generated/cshift0_c4.o:_gfortrani_cshift0_c4' (full diff with a bit of unchanged context added):

    [...]
     .reg .u64 %r131;
    [...]
     .reg .u32 %r177;
    [...]
    +.reg .u32 %r266;
     .reg .u64 %r267;
    [...]
     mov.u32 %r177,%ar3;
    [...]
     cvt.s64.s32 %r131,%r177;
    [...]
    -cvt.u64.u32 %r267,%r177;
    +cvt.u32.u64 %r266,%r131;
    +cvt.u64.u32 %r267,%r266;
    [...]

In the new code, the 'u32' '%r177' is interpreted as 's32', converted into 's64', and stored into 'u64' '%r131' (which is present also in the old code for other reasons; I didn't try to understand that in more detail).  The 'u64' '%r131' is then converted into 'u32' and stored in the new 'u32 %r266', and then again converted into 'u64', and stored in '%r267'.

In the old code, the 'u32' '%r177' was converted into 'u64', and stored into the 'u64' '%r267' directly.

A lot more instances of this in other files.

'nvptx-none/libgfortran/generated/matmul_i1.o:_gfortran_matmul_i1' (full diff with a bit of unchanged context added):

    [...]
     .reg .u64 %r299;
    [...]
     .reg .u64 %r347;
     .reg .u64 %r349;
    [...]
     .reg .u64 %r432;
    [...]
     .reg .u64 %r1004;
    [...]
     ['%r299' initialized via different code paths]
    [...]
     ld.u64 %r347,[%r733];
    [...]
     add.u64 %r1004,%r299,1;
    [...]
     sub.u64 %r349,%r347,%r1004;
    [...]
    -mov.u64 %r432,%r347;
    +add.u64 %r432,%r349,%r1004;
    [...]

That is, instead of just using '%r347' ('mov') for '%r432', we now prefer re-computing it ('add').

Similarly in 'nvptx-none/libgfortran/io/transfer.o:bswap_array' -- even more concisely:

    [...]
     add.u64 %r68,%r92,%r65;
    -mov.u64 %r53,%r92;
    +sub.u64 %r53,%r68,%r65;
    [...]

..., and seemingly both of these items in combination in 'nvptx-none/newlib/libc/search/libc_a-hash.o:__expand_table', for example, where we now use two additional registers, plus an additional 'sub' and an additional 'cvt'.

Now, I suppose, we'll quickly conclude this is due to instruction costing (..., and, in particular, its non-existence in GCC/nvptx...)?