96401 – [nvptx] Take advantage of subword ld/st/cvt

Bug 96401 - [nvptx] Take advantage of subword ld/st/cvt

Summary: [nvptx] Take advantage of subword ld/st/cvt

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	11.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2020-07-31 13:36 UTC by Tom de Vries
Modified:	2020-07-31 14:14 UTC (History)
CC List:	0 users

See Also:
Host:
Target:	nvptx
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Tom de Vries 2020-07-31 13:36:03 UTC

Consider test-case test.c:
...
void
foo (void)
{
  volatile unsigned int v;
  volatile unsigned short v2;
  v2 = v;
}
...

With the current compiler, we have:
...
$ gcc test.c -S -o- -O2
  ...
        .reg.u32 %r22;
        .reg.u16 %r24;
                ld.u32  %r22, [%frame];
                cvt.u16.u32     %r24, %r22;
                st.u16  [%frame+4], %r24;
}
...

As it happens, the nvptx manual states at 5.2.2 "Restricted Use of Sub-Word Sizes":
...
For convenience, ld, st, and cvt instructions permit source and destination data operands to be wider than the instruction-type size, so that narrow values may be loaded, stored, and converted using regular-width registers. For example, 8-bit or 16-bit values may be held directly in 32-bit or 64-bit registers when being loaded, stored, or converted to other types and sizes.
...

In other words, we may emit instead:
...
        .reg.u32 %r22;
                ld.u32  %r22, [%frame];
                st.u16  [%frame+4], %r22;
...

Comment 1 Tom de Vries 2020-07-31 13:49:34 UTC

(In reply to Tom de Vries from comment #0)
> In other words, we may emit instead:
> ...
>         .reg.u32 %r22;
>                 ld.u32  %r22, [%frame];
>                 st.u16  [%frame+4], %r22;
> ...

So, why don't we?

Using -dP we see the respective insns:
...
//(insn 5 2 6 2
//    (set (reg:SI 22 [ v$0_1 ])
//         (mem/v/c:SI (reg/f:DI 2 %frame) [1 v+0 S4 A128]))
//     "test.c":7:6 6 {*movsi_insn}
//     (nil))
                ld.u32  %r22, [%frame]; // 5    [c=4]  *movsi_insn/1

//(insn 6 5 9 2
//    (set (reg:HI 24 [ v$0_1 ])
//         (subreg:HI (reg:SI 22 [ v$0_1 ]) 0))
//     "test.c":7:6 5 {*movhi_insn}
//     (expr_list:REG_DEAD (reg:SI 22 [ v$0_1 ])
//     (nil)))
                cvt.u16.u32     %r24, %r22;     // 6    [c=12]  *movhi_insn/0

//(insn 9 6 12 2
//    (set (mem/v/c:HI (plus:DI (reg/f:DI 2 %frame)
//                     (const_int 4 [0x4])) [2 v2+0 S2 A32])
//         (reg:HI 24 [ v$0_1 ]))
//     "test.c":7:6 5 {*movhi_insn}
//     (expr_list:REG_DEAD (reg:HI 24 [ v$0_1 ])
//     (nil)))
                st.u16  [%frame+4], %r24;       // 9    [c=4]  *movhi_insn/2
...

I went to investigate why combine doesn't combine insns 6 and 9, that is, why doesn't it generate:
...
//(insn 9 6 12 2
//    (set (mem/v/c:HI (plus:DI (reg/f:DI 2 %frame)
//                     (const_int 4 [0x4])) [2 v2+0 S2 A32])
//         (subreg:HI (reg:SI 22 [ v$0_1 ]) 0))
//     "test.c":7:6 5 {*movhi_insn}
//     (expr_list:REG_DEAD (reg:HI 22 [ v$0_1 ])
//     (nil)))
...

Part of the required changes is to make the movhi_insn store alternative work for subreg source operand:
...
@@ -229,8 +234,8 @@
 
 (define_insn "*mov<mode>_insn"
   [(set (match_operand:QHSDIM 0 "nonimmediate_operand" "=R,R,m")
-       (match_operand:QHSDIM 1 "general_operand" "Ri,m,R"))]
-  "!MEM_P (operands[0]) || REG_P (operands[1])"
+       (match_operand:QHSDIM 1 "general_operand" "Ri,m,Q"))]
+  "!MEM_P (operands[0]) || REG_P (operands[1]) || SUBREG_P (operands[1])"
 {
   if (which_alternative == 1)
     return "%.\\tld%A1%u1\\t%0, %1;";
...
which required me to define:
...
+(define_constraint "Q"
+  "A pseudo register or subreg."
+  (ior (match_code "reg")
+      (match_code "subreg")))
+
...
[ Note that this constraint is an oddity, like the R constraint: it's not a register constraint. ]

After debugging I found that I needed this as well:
...
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index d2f321fcbcc..2234edad53b 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -6444,7 +6444,7 @@ nvptx_data_alignment (const_tree type, unsigned int basic_align)
 static bool
 nvptx_modes_tieable_p (machine_mode, machine_mode)
 {
-  return false;
+  return true;
 }
 
 /* Implement TARGET_HARD_REGNO_NREGS.  */
...
due to this bit in combine.c:subst():
...
                  /* In general, don't install a subreg involving two                 
                     modes not tieable.  It can worsen register                       
                     allocation, and can even make invalid reload                     
                     insns, since the reg inside may need to be copied                
                     from in the outside mode, and that may be invalid                
                     if it is an fp reg copied in integer mode.                       
                                                                                      ...

Using these changes, I get the desired:
...
        .reg.u32 %r22;
                ld.u32  %r22, [%frame];
                st.u16  [%frame+4], %r22;
...

Comment 2 Tom de Vries 2020-07-31 14:03:56 UTC

(In reply to Tom de Vries from comment #1)
> Using these changes, I get the desired:
> ...
>         .reg.u32 %r22;
>                 ld.u32  %r22, [%frame];
>                 st.u16  [%frame+4], %r22;
> ...

And to be precise about it, that's starting at fwprop1 that we have two insns:
...
(insn 5 2 9 2
    (set (reg:SI 22 [ v$0_1 ])
         (mem/v/c:SI (reg/f:DI 2 %frame) [1 v+0 S4 A128]))
    "test.c":7:6 6 {*movsi_insn}
    (nil))
(insn 9 5 0 2
    (set (mem/v/c:HI (plus:DI (reg/f:DI 2 %frame)
                              (const_int 4 [0x4])) [2 v2+0 S2 A32])
         (subreg:HI (reg:SI 22 [ v$0_1 ]) 0))
    "test.c":7:6 5 {*movhi_insn}
    (expr_list:REG_DEAD (reg:SI 23 [ _2 ])
    (nil)))
...

Which is a bit earlier (at 247r) than combine (at 271r).

Comment 3 Tom de Vries 2020-07-31 14:14:02 UTC

Note that with the proposed TARGET_TRULY_NOOP_TRUNCATION -> false change ( https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549896.html ), we start out with the same ptx insns, but with the cvt.u16.u32 a truncate instead of a subreg move:
...
//(insn 5 2 6 2
//    (set (reg:SI 22 [ v$0_1 ])
//         (mem/v/c:SI (reg/f:DI 2 %frame) [1 v+0 S4 A128]))
//     "test.c":7:6 6 {*movsi_insn}
//     (nil))
                ld.u32  %r22, [%frame]; // 5    [c=4]  *movsi_insn/1

//(insn 6 5 9 2
//    (set (reg:HI 24 [ v$0_1 ])
//         (truncate:HI (reg:SI 22 [ v$0_1 ])))
       "test.c":7:6 30 {truncsihi2}
//     (expr_list:REG_DEAD (reg:SI 22 [ v$0_1 ])
//     (nil)))
                cvt.u16.u32     %r24, %r22;     // 6    [c=4]  truncsihi2/0

//(insn 9 6 12 2
//    (set (mem/v/c:HI (plus:DI (reg/f:DI 2 %frame)
//                              (const_int 4 [0x4])) [2 v2+0 S2 A32])
//         (reg:HI 24 [ v$0_1 ])) "test.c":7:6 5 {*movhi_insn}
//     (expr_list:REG_DEAD (reg:HI 24 [ v$0_1 ])
//        (nil)))
                st.u16  [%frame+4], %r24;       // 9    [c=4]  *movhi_insn/2
...

Still, with the changes in comment 1 enabled we end up with the desired two insns, though a bit later, at cse2 (265r), and not using movhi_insn:
...
(insn 9 5 0 2 (set (mem/v/c:HI (plus:DI (reg/f:DI 2 %frame)
                (const_int 4 [0x4])) [2 v2+0 S2 A32])
        (truncate:HI (reg:SI 22 [ v$0_1 ]))) "test.c":7:6 30 {truncsihi2}
     (expr_list:REG_DEAD (reg:HI 24 [ v$0_1 ])
        (nil)))
...
so we might get this just with the nvptx_modes_tieable_p change.