Bug 91753 - Bad register allocation of multi-register types
Summary: Bad register allocation of multi-register types
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P3 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks: 47562 95958 98877
Reported: 2019-09-12 12:57 UTC by Wilco
Modified: 2021-08-12 08:09 UTC (History)
4 users

See Also:
Host:
Target: aarch64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-01-20 00:00:00


Attachments

Description Wilco 2019-09-12 12:57:50 UTC
The following example shows that register allocation of types that require multiple registers is quite suboptimal:

#include <stdint.h>

#include <arm_neon.h>
void neon_transform_nada(const uint8x16x4_t table, uint8_t * values, int volume) {
  uint8x16_t x1 = vld1q_u8(values + 0);
  uint8x16_t x2 = vld1q_u8(values + 16);
  uint8x16_t x3 = vld1q_u8(values + 16*2);
  uint8x16_t x4 = vld1q_u8(values + 16*3);
  for (int i = 0; i < volume; i++) {
    x1 = vqtbx4q_u8(x1, table, x1);
    x2 = vqtbx4q_u8(x2, table, x2);
    x3 = vqtbx4q_u8(x3, table, x3);
    x4 = vqtbx4q_u8(x4, table, x4);
  }
  vst1q_u8(values + 0,    x1);
  vst1q_u8(values + 16,   x2);
  vst1q_u8(values + 16*2, x3);
  vst1q_u8(values + 16*3, x4);
}

With -O2/-O3:

neon_transform_nada:
	cmp	w1, 0
	ldp	q31, q30, [x0]
	ldp	q29, q28, [x0, 32]
	ble	.L2
	mov	v27.16b, v1.16b
	mov	w2, 0
	mov	v26.16b, v3.16b
	mov	v25.16b, v0.16b
	mov	v24.16b, v2.16b
	.p2align 3,,7
.L3:
	mov	v0.16b, v25.16b
	add	w2, w2, 1
	mov	v20.16b, v25.16b
	cmp	w1, w2
	mov	v16.16b, v25.16b
	mov	v4.16b, v25.16b
	mov	v1.16b, v27.16b
	mov	v21.16b, v27.16b
	mov	v17.16b, v27.16b
	mov	v5.16b, v27.16b
	mov	v2.16b, v24.16b
	mov	v22.16b, v24.16b
	mov	v18.16b, v24.16b
	mov	v6.16b, v24.16b
	mov	v3.16b, v26.16b
	mov	v23.16b, v26.16b
	mov	v19.16b, v26.16b
	mov	v7.16b, v26.16b
	tbx	v31.16b, {v0.16b - v3.16b}, v31.16b
	tbx	v30.16b, {v20.16b - v23.16b}, v30.16b
	tbx	v29.16b, {v16.16b - v19.16b}, v29.16b
	tbx	v28.16b, {v4.16b - v7.16b}, v28.16b
	bne	.L3
.L2:
	stp	q31, q30, [x0]
	stp	q29, q28, [x0, 32]
	ret

With -O1 it looks a lot better, but there are still 4 redundant moves:

neon_transform_nada:
	ldr	q19, [x0]
	ldr	q18, [x0, 16]
	ldr	q17, [x0, 32]
	ldr	q16, [x0, 48]
	cmp	w1, 0
	ble	.L2
	mov	w2, 0
.L3:
	mov	v4.16b, v0.16b
	mov	v5.16b, v1.16b
	mov	v6.16b, v2.16b
	mov	v7.16b, v3.16b
	tbx	v19.16b, {v4.16b - v7.16b}, v19.16b
	tbx	v18.16b, {v4.16b - v7.16b}, v18.16b
	tbx	v17.16b, {v4.16b - v7.16b}, v17.16b
	tbx	v16.16b, {v4.16b - v7.16b}, v16.16b
	add	w2, w2, 1
	cmp	w1, w2
	bne	.L3
.L2:
	str	q19, [x0]
	str	q18, [x0, 16]
	str	q17, [x0, 32]
	str	q16, [x0, 48]
	ret
Comment 1 Andrew Pinski 2019-09-12 19:36:57 UTC
lower-subreg should have been able to help here.  I wonder why it did not ...
Comment 2 Wilco 2019-09-12 21:49:14 UTC
(In reply to Andrew Pinski from comment #1)
> lower-subreg should have been able to help here.  I wonder why it did not ...

I'm not sure how it can help. When you write a part of a multi-register mode using subreg, you get incorrect liveness info. This is why splitting 64-bit types on 32-bit targets before register allocation gives such a huge gain.

The approach I used in other compilers was to generate multi-register virtual registers using a single create operation rather than via multiple subreg writes.
Comment 3 Andrew Pinski 2019-09-12 23:29:18 UTC
(In reply to Wilco from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > lower-subreg should have been able to help here.  I wonder why it did not ...
> 
> I'm not sure how it can help. 

I think you misunderstood what this pass does.
It does exactly what you think it should do:
/* Decompose multi-word pseudo-registers into individual
   pseudo-registers when possible and profitable.  This is possible
   when all the uses of a multi-word register are via SUBREG, or are
   copies of the register to another location.  Breaking apart the
   register permits more CSE and permits better register allocation.

The only difference is the creating part, which is missing.
Comment 4 Wilco 2019-09-13 12:04:57 UTC
(In reply to Andrew Pinski from comment #3)
> (In reply to Wilco from comment #2)
> > (In reply to Andrew Pinski from comment #1)
> > > lower-subreg should have been able to help here.  I wonder why it did not ...
> > 
> > I'm not sure how it can help. 
> 
> I think you misunderstood what this pass does.
> It does exactly what you think it should do:
> /* Decompose multi-word pseudo-registers into individual
>    pseudo-registers when possible and profitable.  This is possible
>    when all the uses of a multi-word register are via SUBREG, or are
>    copies of the register to another location.  Breaking apart the
>    register permits more CSE and permits better register allocation.
> 
> The only difference is the creating part, which is missing.

Yes, but the issue is that you can't remove all the subregs, since the TBX instructions really need a 512-bit register. The slim dump for x1 = vqtbx4q_u8(x1, table, x1):

   30: r94:XI#0=r105:V16QI
   31: r95:XI=r94:XI
      REG_DEAD r94:XI
   32: r95:XI#16=r101:V16QI
   33: r96:XI=r95:XI
      REG_DEAD r95:XI
   34: r96:XI#32=r102:V16QI
   35: r97:XI=r96:XI
      REG_DEAD r96:XI
   36: r97:XI#48=r106:V16QI
   38: r100:V16QI=unspec[r100:V16QI,r97:XI,r100:V16QI] 186
      REG_DEAD r97:XI

As you can see, it creates the 512-bit XI register via a complex sequence of 4 subreg lvalues.
Comment 5 Andrew Pinski 2020-01-20 08:27:39 UTC
The biggest problem is that set_qreg* is being used here.

This is the RTX that is produced:
(insn 38 37 39 4 (set (reg/v:XI 125 [ __o ])
        (reg/v:XI 124 [ __o ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25556:9 3404 {*aarch64_movxi}
     (expr_list:REG_DEAD (reg/v:XI 124 [ __o ])
        (nil)))
(insn 39 38 40 4 (set (subreg:V16QI (reg/v:XI 125 [ __o ]) 0)
        (reg:V16QI 134 [ _122 ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25556:9 1198 {*aarch64_simd_movv16qi}
     (nil))
(insn 40 39 41 4 (set (reg/v:XI 126 [ __o ])
        (reg/v:XI 125 [ __o ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25557:9 3404 {*aarch64_movxi}
     (expr_list:REG_DEAD (reg/v:XI 125 [ __o ])
        (nil)))
(insn 41 40 42 4 (set (subreg:V16QI (reg/v:XI 126 [ __o ]) 16)
        (reg:V16QI 96 [ _6 ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25557:9 1198 {*aarch64_simd_movv16qi}
     (nil))
(insn 42 41 43 4 (set (reg/v:XI 127 [ __o ])
        (reg/v:XI 126 [ __o ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25558:9 3404 {*aarch64_movxi}
     (expr_list:REG_DEAD (reg/v:XI 126 [ __o ])
        (nil)))
(insn 43 42 44 4 (set (subreg:V16QI (reg/v:XI 127 [ __o ]) 32)
        (reg:V16QI 135 [ _123 ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25558:9 1198 {*aarch64_simd_movv16qi}
     (nil))
(insn 44 43 45 4 (set (reg/v:XI 128 [ __o ])
        (reg/v:XI 127 [ __o ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25559:9 3404 {*aarch64_movxi}
     (expr_list:REG_DEAD (reg/v:XI 127 [ __o ])
        (nil)))
(insn 45 44 47 4 (set (subreg:V16QI (reg/v:XI 128 [ __o ]) 48)
        (reg:V16QI 95 [ _5 ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25559:9 1198 {*aarch64_simd_movv16qi}
     (nil))

Notice the move instruction in between.  That confuses everything here.

If we had a way to generate XImode directly from 4 V16QI, and only generate one move statement, then the register allocator would act better.
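
In the slim-dump notation from comment 4, such a single create operation might look like this (hypothetical; this multi-input unspec pattern does not exist in GCC and is only a sketch):

   30: r97:XI=unspec[r105:V16QI,r101:V16QI,r102:V16QI,r106:V16QI] <build>
   38: r100:V16QI=unspec[r100:V16QI,r97:XI,r100:V16QI] 186

The XI pseudo would then be written exactly once, so the allocator sees correct liveness instead of four chained partial writes.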
Comment 6 Andrew Pinski 2020-01-20 14:15:21 UTC
(In reply to Andrew Pinski from comment #5)
> If we had a way to generate XImode directly from 4 V16QI, and only generate
> one move statement, then the register allocator would act better.

Either that, or split the XI register moves into 4 V16QI/V4SI moves and generate the subreg only for the final move.  I think this latter option is really the best one, and it is what the lower-subreg.c pass should be doing but is not for some reason ....
Comment 7 Andrew Pinski 2020-01-20 22:15:42 UTC
(In reply to Wilco from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > lower-subreg should have been able to help here.  I wonder why it did not ...
> 
> I'm not sure how it can help. When you write a part of a multi-register mode
> using subreg, you get incorrect liveness info. This is why splitting 64-bit
> types on 32-bit targets before register allocation gives such a huge gain.

Actually no, lower-subreg was not supposed to fix incorrect liveness info but rather to improve other things.

Also, lower-subreg does not handle OImode/CImode/XImode as arrays of vectors but rather as integer modes.  You can see that effect by changing the FORCE_LOWERING define to 1 and observing how it decomposes the registers.  Really, we want to decompose the registers to TImode in this case rather than DImode.  I have not looked into how we could enhance lower-subreg to do that.  That would fix the issue even better.
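
For illustration, TImode decomposition would give each 128-bit piece of the XI pseudo its own pseudo register, so the chained partial writes shown in comment 4 would become independent copies, roughly (hypothetical slim dump; pseudo numbers invented):

   30: r200:V16QI=r105:V16QI
   32: r201:V16QI=r101:V16QI
   34: r202:V16QI=r102:V16QI
   36: r203:V16QI=r106:V16QI

DImode decomposition would instead split each vector in half, which does not correspond to the hardware's 128-bit registers.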
Comment 8 ktkachov 2021-01-29 09:12:39 UTC
The issue with the many moves is still there; however, for GCC 11 at least they're hoisted outside the loop.
Comment 9 Tamar Christina 2021-08-12 08:09:49 UTC
Fixed in GCC 12, where we generate at -O1/-O2/-O3:

neon_transform_nada(uint8x16x4_t, unsigned char*, int):
        ldp     q7, q6, [x0]
        ldp     q5, q4, [x0, 32]
        cmp     w1, 0
        ble     .L2
        mov     w2, 0
.L3:
        add     w2, w2, 1
        tbx     v7.16b, {v0.16b - v3.16b}, v7.16b
        tbx     v6.16b, {v0.16b - v3.16b}, v6.16b
        tbx     v5.16b, {v0.16b - v3.16b}, v5.16b
        tbx     v4.16b, {v0.16b - v3.16b}, v4.16b
        cmp     w1, w2
        bne     .L3
.L2:
        stp     q7, q6, [x0]
        stp     q5, q4, [x0, 32]
        ret