The following example shows that register allocation of types which require multiple registers is quite suboptimal:

#include <stdint.h>
#include <arm_neon.h>

void neon_transform_nada(const uint8x16x4_t table, uint8_t *values, int volume)
{
    uint8x16_t x1 = vld1q_u8(values + 0);
    uint8x16_t x2 = vld1q_u8(values + 16);
    uint8x16_t x3 = vld1q_u8(values + 16*2);
    uint8x16_t x4 = vld1q_u8(values + 16*3);

    for (int i = 0; i < volume; i++)
    {
        x1 = vqtbx4q_u8(x1, table, x1);
        x2 = vqtbx4q_u8(x2, table, x2);
        x3 = vqtbx4q_u8(x3, table, x3);
        x4 = vqtbx4q_u8(x4, table, x4);
    }

    vst1q_u8(values + 0, x1);
    vst1q_u8(values + 16, x2);
    vst1q_u8(values + 16*2, x3);
    vst1q_u8(values + 16*3, x4);
}

With -O2/-O3:

neon_transform_nada:
        cmp     w1, 0
        ldp     q31, q30, [x0]
        ldp     q29, q28, [x0, 32]
        ble     .L2
        mov     v27.16b, v1.16b
        mov     w2, 0
        mov     v26.16b, v3.16b
        mov     v25.16b, v0.16b
        mov     v24.16b, v2.16b
        .p2align 3,,7
.L3:
        mov     v0.16b, v25.16b
        add     w2, w2, 1
        mov     v20.16b, v25.16b
        cmp     w1, w2
        mov     v16.16b, v25.16b
        mov     v4.16b, v25.16b
        mov     v1.16b, v27.16b
        mov     v21.16b, v27.16b
        mov     v17.16b, v27.16b
        mov     v5.16b, v27.16b
        mov     v2.16b, v24.16b
        mov     v22.16b, v24.16b
        mov     v18.16b, v24.16b
        mov     v6.16b, v24.16b
        mov     v3.16b, v26.16b
        mov     v23.16b, v26.16b
        mov     v19.16b, v26.16b
        mov     v7.16b, v26.16b
        tbx     v31.16b, {v0.16b - v3.16b}, v31.16b
        tbx     v30.16b, {v20.16b - v23.16b}, v30.16b
        tbx     v29.16b, {v16.16b - v19.16b}, v29.16b
        tbx     v28.16b, {v4.16b - v7.16b}, v28.16b
        bne     .L3
.L2:
        stp     q31, q30, [x0]
        stp     q29, q28, [x0, 32]
        ret

With -O1 it looks a lot better, but there are still 4 redundant moves:

neon_transform_nada:
        ldr     q19, [x0]
        ldr     q18, [x0, 16]
        ldr     q17, [x0, 32]
        ldr     q16, [x0, 48]
        cmp     w1, 0
        ble     .L2
        mov     w2, 0
.L3:
        mov     v4.16b, v0.16b
        mov     v5.16b, v1.16b
        mov     v6.16b, v2.16b
        mov     v7.16b, v3.16b
        tbx     v19.16b, {v4.16b - v7.16b}, v19.16b
        tbx     v18.16b, {v4.16b - v7.16b}, v18.16b
        tbx     v17.16b, {v4.16b - v7.16b}, v17.16b
        tbx     v16.16b, {v4.16b - v7.16b}, v16.16b
        add     w2, w2, 1
        cmp     w1, w2
        bne     .L3
.L2:
        str     q19, [x0]
        str     q18, [x0, 16]
        str     q17, [x0, 32]
        str     q16, [x0, 48]
        ret
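For readers unfamiliar with the intrinsic: vqtbx4q_u8 maps to the TBX table-lookup-extension instruction, which consumes a 64-byte table held in four consecutive vector registers (this is what forces the 4-register XImode value discussed in the comments below). A scalar C sketch of the per-lane semantics, with illustrative names (not GCC or arm_neon.h code):

```c
#include <stdint.h>

/* Scalar model of one TBX step with a 4-register (64-byte) table:
   for each of the 16 lanes, an in-range index replaces the
   destination byte with table[index]; an out-of-range index
   leaves the destination byte unchanged. */
static void tbx64_scalar(uint8_t dest[16], const uint8_t table[64],
                         const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        if (idx[i] < 64)
            dest[i] = table[idx[i]];
}
```

The "leave unchanged on out-of-range" behaviour is why TBX takes the destination as an input operand as well, which matters for the register-allocation discussion.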
lower-subreg should have been able to help here. I wonder why it did not ...
(In reply to Andrew Pinski from comment #1)
> lower-subreg should have been able to help here. I wonder why it did not ...

I'm not sure how it can help. When you write a part of a multi-register mode using subreg, you get incorrect liveness info. This is why splitting 64-bit types on 32-bit targets before register allocation gives such a huge gain.

The approach I used in other compilers was to generate multi-register virtual registers using a single create operation rather than via multiple subreg writes.
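The "single create vs. multiple subreg writes" distinction can be sketched with a loose source-level analogy (this is not GCC's IR, just an illustration of the dataflow difference):

```c
#include <stdint.h>

/* Stand-in for a value that occupies 4 registers, like XImode. */
typedef struct { uint64_t q[4]; } wide4;

/* Piecewise construction: each store writes only part of 'w', so a
   dataflow analysis must model every store as a read-modify-write of
   the whole aggregate -- the analogue of writing one XI register
   through four subreg lvalues. */
wide4 build_piecewise(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    wide4 w;
    w.q[0] = a; w.q[1] = b; w.q[2] = c; w.q[3] = d;
    return w;
}

/* Single create: the whole value is defined at once, so its liveness
   starts at exactly one point -- the analogue of one
   "create XI from 4 x V16QI" operation. */
wide4 build_at_once(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    return (wide4){ { a, b, c, d } };
}
```

Both functions compute the same value; the difference is purely in how many partial definitions of the aggregate the compiler has to reason about.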
(In reply to Wilco from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > lower-subreg should have been able to help here. I wonder why it did not ...
>
> I'm not sure how it can help.

I think you misunderstood what this pass does. It does exactly what you think it should do:

/* Decompose multi-word pseudo-registers into individual
   pseudo-registers when possible and profitable.  This is possible
   when all the uses of a multi-word register are via SUBREG, or are
   copies of the register to another location.  Breaking apart the
   register permits more CSE and permits better register allocation.

The only difference is the creating part, which is missing.
(In reply to Andrew Pinski from comment #3)
> I think you misunderstood what this pass does.
> It does exactly what you think it should do:
>
> The only difference is the creating part, which is missing.

Yes, but the issue is that you can't remove all the subregs, since the TBX instructions really need a 512-bit register. The slim dump for x1 = vqtbx4q_u8(x1, table, x1):

   30: r94:XI#0=r105:V16QI
   31: r95:XI=r94:XI
      REG_DEAD r94:XI
   32: r95:XI#16=r101:V16QI
   33: r96:XI=r95:XI
      REG_DEAD r95:XI
   34: r96:XI#32=r102:V16QI
   35: r97:XI=r96:XI
      REG_DEAD r96:XI
   36: r97:XI#48=r106:V16QI
   38: r100:V16QI=unspec[r100:V16QI,r97:XI,r100:V16QI] 186
      REG_DEAD r97:XI

As you can see, it creates the 512-bit XI register via a complex sequence of 4 subreg lvalues.
The biggest problem is that set_qreg* is being used here. This is the RTX that is produced:

(insn 38 37 39 4 (set (reg/v:XI 125 [ __o ])
        (reg/v:XI 124 [ __o ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25556:9 3404 {*aarch64_movxi}
     (expr_list:REG_DEAD (reg/v:XI 124 [ __o ])
        (nil)))
(insn 39 38 40 4 (set (subreg:V16QI (reg/v:XI 125 [ __o ]) 0)
        (reg:V16QI 134 [ _122 ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25556:9 1198 {*aarch64_simd_movv16qi}
     (nil))
(insn 40 39 41 4 (set (reg/v:XI 126 [ __o ])
        (reg/v:XI 125 [ __o ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25557:9 3404 {*aarch64_movxi}
     (expr_list:REG_DEAD (reg/v:XI 125 [ __o ])
        (nil)))
(insn 41 40 42 4 (set (subreg:V16QI (reg/v:XI 126 [ __o ]) 16)
        (reg:V16QI 96 [ _6 ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25557:9 1198 {*aarch64_simd_movv16qi}
     (nil))
(insn 42 41 43 4 (set (reg/v:XI 127 [ __o ])
        (reg/v:XI 126 [ __o ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25558:9 3404 {*aarch64_movxi}
     (expr_list:REG_DEAD (reg/v:XI 126 [ __o ])
        (nil)))
(insn 43 42 44 4 (set (subreg:V16QI (reg/v:XI 127 [ __o ]) 32)
        (reg:V16QI 135 [ _123 ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25558:9 1198 {*aarch64_simd_movv16qi}
     (nil))
(insn 44 43 45 4 (set (reg/v:XI 128 [ __o ])
        (reg/v:XI 127 [ __o ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25559:9 3404 {*aarch64_movxi}
     (expr_list:REG_DEAD (reg/v:XI 127 [ __o ])
        (nil)))
(insn 45 44 47 4 (set (subreg:V16QI (reg/v:XI 128 [ __o ]) 48)
        (reg:V16QI 95 [ _5 ])) "/bajas/pinskia/src/toolchain-10/marvell-tools/lib/gcc/aarch64-marvell-linux-gnu/10.0.0/include/arm_neon.h":25559:9 1198 {*aarch64_simd_movv16qi}
     (nil))

Notice the full XImode move instruction in between each subreg write. That confuses everything here. If we had a way to generate XImode directly from 4 V16QI, and only generate one move statement, then the register allocator would behave better.
(In reply to Andrew Pinski from comment #5)
> If we had a way to generate XImode directly from 4 V16QI, and only generate
> one move statement, then the register allocator would behave better.

That, or split the XImode register move into 4 V16QI/V4SI moves and generate the subreg only for the final move. I think the latter is really the best option, and it is what the lower-subreg.c pass should be doing, but is not for some reason ...
(In reply to Wilco from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > lower-subreg should have been able to help here. I wonder why it did not ...
>
> I'm not sure how it can help. When you write a part of a multi-register mode
> using subreg, you get incorrect liveness info. This is why splitting 64-bit
> types on 32-bit targets before register allocation gives such a huge gain.

Actually no, lower-subreg was not supposed to fix incorrect liveness info but rather to improve other things. Also, lower-subreg does not handle OImode/CImode/XImode as arrays of vectors but rather as integer modes. You can see that effect by changing the FORCE_LOWERING define to 1 and observing how it decomposes the registers.

Really, we want to decompose the registers to TImode in this case rather than DImode. I have not looked into how we could enhance lower-subreg to do that. This would fix the issue even better.
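The granularity point can be made concrete with the mode sizes involved (a hedged size sketch; the byte counts are the AArch64 values for these GCC machine modes): TImode pieces line up exactly with the 128-bit q registers, while word-sized DImode pieces each cover only half a vector register and would still need subregs to reassemble one.

```c
/* Byte sizes of the AArch64 machine modes discussed above. */
enum {
    XI_BYTES = 64, /* XImode: 4 x 128-bit vector registers */
    TI_BYTES = 16, /* TImode: 128 bits, exactly one q register */
    DI_BYTES = 8   /* DImode: 64 bits, half a q register */
};

/* How many pieces a decomposition at the given granularity produces. */
static int pieces(int whole_bytes, int part_bytes)
{
    return whole_bytes / part_bytes;
}
```

Decomposing at TImode granularity gives the four natural register-sized pieces; DImode would give eight half-register pieces.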
The issue with the many moves is still there; however, for GCC 11 at least they're hoisted outside the loop.
Fixed in GCC 12, where at -O1/-O2/-O3 we generate:

neon_transform_nada(uint8x16x4_t, unsigned char*, int):
        ldp     q7, q6, [x0]
        ldp     q5, q4, [x0, 32]
        cmp     w1, 0
        ble     .L2
        mov     w2, 0
.L3:
        add     w2, w2, 1
        tbx     v7.16b, {v0.16b - v3.16b}, v7.16b
        tbx     v6.16b, {v0.16b - v3.16b}, v6.16b
        tbx     v5.16b, {v0.16b - v3.16b}, v5.16b
        tbx     v4.16b, {v0.16b - v3.16b}, v4.16b
        cmp     w1, w2
        bne     .L3
.L2:
        stp     q7, q6, [x0]
        stp     q5, q4, [x0, 32]
        ret