Summary: | Inefficient neon intrinsic code sequence | ||
---|---|---|---|
Product: | gcc | Reporter: | Carrot <carrot> |
Component: | target | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | NEW --- | ||
Severity: | normal | CC: | clyon, egallager, linux, mkuvyrkov, ramana, rsandifo |
Priority: | P3 | Keywords: | missed-optimization |
Version: | 4.7.0 | ||
Target Milestone: | --- | ||
See Also: | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65375 | ||
Host: | | Target: | arm-linux-androideabi, arm-linux-gnueabi |
Build: | | Known to work: | |
Known to fail: | | Last reconfirmed: | 2011-12-12 00:00:00 |
Bug Depends on: | |||
Bug Blocks: | 47562 |
Description
Carrot
2011-12-12 07:25:34 UTC
At least part of the problem here is the uninitialised variable in the vld4 call. GCC tries to create a zero initialisation of "x" before the vld4, so that the other lanes have defined values. Obviously we could be doing that much better than we are, and perhaps we should have some kind of special case so that uninitialised NEON vectors are never zero-initialised (e.g. use a plain clobber instead). But uninitialised variables aren't really ideal either way. Something like:

    x = vld4_dup_u8(src);
    y.val[0][0] = x.val[1][0];
    y.val[1][0] = x.val[2][0];
    vst2_lane_u8(dst, y, 0);

would be better in principle. Unfortunately, we don't generate good code for that either. Part of the problem is introduced by lower-subreg, but it's not good even with -fno-split-wide-types.

FWIW,

    uint8x8x4_t x;
    uint8x8x2_t y;
    x = vld4_dup_u8(src);
    y.val[0] = x.val[1];
    y.val[1] = x.val[2];
    vst2_lane_u8(dst, y, 0);

does give the expected output. I.e. the remaining inefficiency from comment #1 is in the uninitialised parts of y.

Richard

With -fno-split-wide-types I can end up getting identical output to what is expected in this case with FSF trunk. I suspect this might be another of those costs with lower-subreg issues.

Ramana

Kugan, would you please check if your patch for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65375 also affects this one?

Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for armv7. Charles, would you please look at this?

I have run into a similar problem with vld3 and vst4.

    uint8x16x3_t tmp = vld3q_u8(src);
    vst4q_u8((uint8_t *)dst, {tmp.val[2], tmp.val[1], tmp.val[0], fullVector});

produces:

    70: 4cdf4061 ld3 {v1.16b-v3.16b}, [x3], #48
    74: 4e083c04 mov x4, v0.d[0]
    78: 4e183c05 mov x5, v0.d[1]
    7c: 6f000400 mvni v0.4s, #0x0
    80: 4e083c4a mov x10, v2.d[0]
    84: 4e183c4b mov x11, v2.d[1]
    88: aa0403e2 mov x2, x4
    8c: aa0503e1 mov x1, x5
    90: 4e083c24 mov x4, v1.d[0]
    94: 4e183c25 mov x5, v1.d[1]
    98: a90007e2 stp x2, x1, [sp]
    9c: 3d800fe0 str q0, [sp,#48]
    a0: a9012fea stp x10, x11, [sp,#16]
    a4: aa0403e6 mov x6, x4
    a8: a90217e6 stp x6, x5, [sp,#32]
    ac: 4c4023e0 ld1 {v0.16b-v3.16b}, [sp]
    b0: 4c9f0000 st4 {v0.16b-v3.16b}, [x0], #64

But if I add -fno-split-wide-types it compiles to:

    68: 4cdf4064 ld3 {v4.16b-v6.16b}, [x3], #48
    6c: 4f000400 movi v0.4s, #0x0
    70: 6f000403 mvni v3.4s, #0x0
    74: 4ea51ca1 mov v1.16b, v5.16b
    78: 4ea41c82 mov v2.16b, v4.16b
    7c: 4c9f0000 st4 {v0.16b-v3.16b}, [x0], #64

This happens with both 4.9 and 5.1 that I have tried.

(In reply to Maxim Kuvyrkov from comment #5)
> Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
> armv7. Charles, would you please look at this?

Should Charles still remain the assignee for this?

(In reply to Eric Gallager from comment #7)
> (In reply to Maxim Kuvyrkov from comment #5)
> > Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
> > armv7. Charles, would you please look at this?
>
> Should Charles still remain the assignee for this?

I'm afraid not: Charles no longer works with us.
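
For anyone trying to reproduce this, the two snippets discussed in this thread can be wrapped into a standalone file roughly as follows. This is only a sketch under a few assumptions: the function names `copy_middle_pair` and `swizzle3_to_4` and the `HAVE_NEON` macro are made up for illustration, and `fullVector` is assumed to be an all-0xFF vector (guessed from the `mvni` in the listings, not stated in the report). The scalar branches are just a reference model of what each intrinsic sequence should store, so the file also builds on hosts without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HAVE_NEON is a hypothetical helper macro for this sketch only. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Reads src[0..3] and stores src[1], src[2] to dst[0], dst[1].
   The NEON branch is the fully initialised variant from the thread. */
void copy_middle_pair(const uint8_t *src, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
#if HAVE_NEON
    uint8x8x4_t x = vld4_dup_u8(src);  /* val[i] = src[i] in every lane */
    uint8x8x2_t y;
    y.val[0] = x.val[1];
    y.val[1] = x.val[2];
    vst2_lane_u8(dst, y, 0);           /* stores lane 0 of val[0], val[1] */
#else
    /* Scalar reference model of the sequence above. */
    dst[0] = src[1];
    dst[1] = src[2];
#endif
}

/* Reads 16 three-byte elements (48 bytes) and writes 16 four-byte
   elements (64 bytes) with the first three channels reversed.
   Assumption: the fourth channel ("fullVector") is all 0xFF. */
void swizzle3_to_4(const uint8_t *src, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
#if HAVE_NEON
    uint8x16x3_t tmp = vld3q_u8(src);  /* de-interleave 3 channels */
    uint8x16x4_t out;
    out.val[0] = tmp.val[2];
    out.val[1] = tmp.val[1];
    out.val[2] = tmp.val[0];
    out.val[3] = vdupq_n_u8(0xFF);     /* assumed value of fullVector */
    vst4q_u8(dst, out);                /* re-interleave 4 channels */
#else
    /* Scalar reference model of the sequence above. */
    for (size_t i = 0; i < 16; i++) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
        dst[4 * i + 3] = 0xFF;
    }
#endif
}
```

Building this for an ARM target with and without -fno-split-wide-types and comparing the disassembly should show whether the extra moves and stack traffic reported above are still present on a given GCC version.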