make check-gcc RUNTESTFLAGS="complex.exp=fast-math-complex-mls-*.c"

FAIL: gcc.dg/vect/complex/fast-math-complex-mls-double.c scan-tree-dump vect "Found COMPLEX_FMA"
FAIL: gcc.dg/vect/complex/fast-math-complex-mls-float.c scan-tree-dump vect "Found COMPLEX_FMA"
Created attachment 58977 [details]
Reduced testcase

options: -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 -ffast-math -march=armv8.3-a
Created attachment 58978 [details]
Better testcase

Before the patch fms180snd could be detected but fms180snd_1 could not, BUT both are the same function, differing only in when the multiply by I happens. fms180snd_1 represents what happens after the patch for fms180snd.
+Tamar since he wrote the original Complex vectorization support.
Created attachment 58979 [details]
Full testcase

Before the change fms180snd_2a and fms180snd_1 could not be detected even though they are all the same. Note I think fms180snd_2a is more representative of what is done after the patch for fms180snd than fms180snd_1 is.
Yeah, this is because they generate different gimple sequences and thus different SLP trees. The core of the problem is that there's no canonical form here, and a missing gimple simplification rule:

  _33 = IMAGPART_EXPR <*_3> + ((REALPART_EXPR <*_5> * IMAGPART_EXPR <*_7>) + (IMAGPART_EXPR <*_5> * REALPART_EXPR <*_7>));

vs

  _37 = IMAGPART_EXPR <*_3> - ((REALPART_EXPR <*_5> * -IMAGPART_EXPR <*_7>) + (IMAGPART_EXPR <*_5> * -REALPART_EXPR <*_7>));

i.e.

  a - ((b * -c) + (d * -e)) == a + (b * c) + (d * e)

So probably in match.pd we should fold _37 into _33, which is a simpler form of the same thing and is better on scalar as well.

It would be better to finally introduce a vectorizer canonical form. For instance the real part generates:

  _36 = (_31 - _30) + REALPART_EXPR <*_3>;

vs

  _32 = REALPART_EXPR <*_3> + (_26 - _27);

and this already is an additional thing to check, so it would be better if SLP build always put complex parts consistently on one side of commutative operations so we don't have to swap operands to check.

In any case, I have some patches in this area and can take a look when I'm back, but I think the new expression should be simplified back into the old one.
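As a scalar illustration of that identity (a hypothetical example, not one of the attached testcases): distributing the negates shows the two functions below compute the same value up to reassociation, which is exactly the freedom -ffast-math grants, and the second is the simpler form the suggested fold would produce.

  /* Hypothetical sketch of a - ((b * -c) + (d * -e)) == a + (b * c) + (d * e);
     equivalent up to FP reassociation, i.e. under -ffast-math.  */
  double sub_of_negated_products (double a, double b, double c, double d, double e)
  {
    return a - ((b * -c) + (d * -e));
  }

  double sum_of_products (double a, double b, double c, double d, double e)
  {
    return a + (b * c) + (d * e);
  }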
I think

  a - ((b * -c) + (d * -e)) -> a + (b * c) + (d * e)

is a good simplification to be made, but it's difficult to do this with canonicalization only. Like a * -b -> -(a * b), as the negate might combine with other negates both down- and upstream. But for a*-b + c * -d it might be more obvious to turn that into -a*b - c*d.

Maybe reassoc can be of help here - IIRC it turns b * -c into b * c * -1, undistribute_ops_list might get that.

Note one issue is that complex lowering leaves around dead stmts, confusing reassoc and forwprop; in particular

  _10 = COMPLEX_EXPR <_18, _6>;

stays around until reassoc. Scheduling dce for testing shows reassoc does something.

It's update_complex_assignment that replaces existing complex stmts with COMPLEX_EXPRs; we should possibly resort to simple_dce_from_worklist to clean those up. Let me try to do that.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:de1923f9f4d5344694c22ca883aeb15caf635734

commit r15-3128-gde1923f9f4d5344694c22ca883aeb15caf635734
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Aug 23 13:44:29 2024 +0200

    tree-optimization/116463 - complex lowering leaves around dead stmts

    Complex lowering generally replaces existing complex defs with
    COMPLEX_EXPRs but those might be dead when it can always refer to
    components from the lattice.  This in turn can pessimize followup
    transforms like forwprop and reassoc, the following makes sure to get
    rid of dead COMPLEX_EXPRs generated by using simple_dce_from_worklist.

            PR tree-optimization/116463
            * tree-complex.cc: Include tree-ssa-dce.h.
            (dce_worklist): New global.
            (update_complex_assignment): Add SSA def to the DCE worklist.
            (tree_lower_complex): Perform DCE.
As of r15-3128-gde1923f9f4d534 now

FAIL: gcc.target/i386/avx512fp16-vector-complex-float.c scan-assembler-not vfmadd[123]*ph[ \\\\t]
FAIL: gcc.target/i386/avx512fp16-vector-complex-float.c scan-assembler-times vfmaddcph[ \\\\t] 1
FAIL: gcc.target/i386/part-vect-complexhf.c scan-assembler-times vfmaddcph[ \\\\t] 1

fail, which look similar to the aarch64 fails (I have no idea if the patch helped for those).

For the first test it's fma0 which is no longer vectorized as

        vmovdqu16       (%rdx), %zmm0
        vmovdqu16       (%rsi), %zmm1
        vfmaddcph       (%rdi), %zmm1, %zmm0
        vmovdqu16       %zmm0, (%rdx)

but

        vmovdqu16       (%rsi), %zmm0
        vmovdqu16       (%rdi), %zmm2
        movl    $1431655765, %eax
        kmovd   %eax, %k1
        vpshufb .LC1(%rip), %zmm0, %zmm1
        vfmadd213ph     (%rdx), %zmm2, %zmm1
        vpshufb .LC2(%rip), %zmm0, %zmm0
        vpshufb .LC0(%rip), %zmm2, %zmm3
        vmovdqa64       %zmm0, %zmm2
        vfmadd132ph     %zmm3, %zmm1, %zmm2
        vfnmadd132ph    %zmm3, %zmm1, %zmm0
        vpblendmw       %zmm0, %zmm2, %zmm0{%k1}
        vmovdqu16       %zmm0, (%rdx)

where instead of

  note:   Found COMPLEX_FMA pattern in SLP tree

we have

  note:   Found VEC_ADDSUB pattern in SLP tree
  note:   Target does not support VEC_ADDSUB for vector type vector(32) _Float16

with the IL difference being (- is good, + is bad)

   _12 = REALPART_EXPR <*_3>;
   _11 = IMAGPART_EXPR <*_3>;
 ...
 @@ -46,10 +46,10 @@
   _27 = _19 * _25;
   _28 = _20 * _25;
   _29 = _19 * _24;
-  _30 = _26 - _27;
-  _31 = _28 + _29;
-  _32 = _12 + _30;
-  _33 = _11 + _31;
+  _9 = _12 + _26;
+  _10 = _11 + _28;
+  _32 = _9 - _27;
+  _33 = _10 + _29;
   REALPART_EXPR <*_3> = _32;
   IMAGPART_EXPR <*_3> = _33;
   i_18 = i_21 + 1;

which is a different association, enabled by deleting dead uses that confuse reassoc.
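To make the "different association" concrete, here is a hypothetical scalar sketch (names are made up) of the two groupings in the diff above: the good form adds the complete product difference to the accumulator as one unit, while the bad form folds the accumulator in first, so the complex-multiply shape the pattern matcher looks for is gone.

  /* Sketch of the association difference shown in the IL diff above;
     same value up to FP reassociation, different grouping.  */
  double good_assoc (double acc, double ar, double br, double ai, double bi)
  {
    return acc + (ar * br - ai * bi);   /* _30 = _26 - _27;  _32 = _12 + _30  */
  }

  double bad_assoc (double acc, double ar, double br, double ai, double bi)
  {
    return (acc + ar * br) - ai * bi;   /* _9 = _12 + _26;  _32 = _9 - _27  */
  }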
(In reply to Richard Biener from comment #8)
> fail, which look similar to the aarch64 fails (I have no idea if the patch
> helped for those).

The aarch64 ones still fail. And yes, they look very similar.
The current failures/XPASSes for aarch64 are:

XPASS: gcc.dg/vect/complex/fast-math-complex-add-half-float.c scan-tree-dump-times vect "stmt.*COMPLEX_ADD_ROT270" 1
XPASS: gcc.dg/vect/complex/fast-math-complex-add-half-float.c scan-tree-dump-times vect "stmt.*COMPLEX_ADD_ROT90" 1
FAIL: gcc.dg/vect/complex/fast-math-complex-mls-double.c scan-tree-dump vect "Found COMPLEX_ADD_ROT270"
FAIL: gcc.dg/vect/complex/fast-math-complex-mls-float.c scan-tree-dump vect "Found COMPLEX_ADD_ROT270"
FAIL: gcc.dg/vect/complex/fast-math-complex-mls-half-float.c scan-tree-dump vect "Found COMPLEX_ADD_ROT270"
(In reply to Richard Biener from comment #6)
> I think
>
>   a - ((b * -c) + (d * -e)) -> a + (b * c) + (d * e)
>
> is a good simplification to be made, but it's difficult to do this with
> canonicalization only. Like a * -b -> -(a * b), as the negate might
> combine with other negates both down- and upstream. But for
> a*-b + c * -d it might be more obvious to turn that into
> -a*b - c*d.

Yeah, my expectation was that this would be an easier transform to avoid the sharing problem we discussed before, and that indeed the transform looks at the entire chain, not just transforming a * -b.

  a*-b + c * -d -> -a*b - c*d

has the property of still maintaining the FMS and FMNS chains, and can get further simplified in the above case.

> Maybe reassoc can be of help here - IIRC it turns b * -c into
> b * c * -1, undistribute_ops_list might get that.

Hmm, I see, but don't we have a higher chance that folding will just fold it back into the multiply?

For this to work we'd have to do

  (b * -c) + (d * -e) -> -(b * c + d * e)

in one transformation, no? Since I'd imagine

  (b * c * -1) + (d * e * -1)

would just be undone by match.pd?

> Note one issue is that complex lowering leaves around dead stmts,
> confusing reassoc and forwprop; in particular
>
>   _10 = COMPLEX_EXPR <_18, _6>;
>
> stays around until reassoc. Scheduling dce for testing shows reassoc
> does something.
>
> It's update_complex_assignment that replaces existing complex stmts with
> COMPLEX_EXPRs; we should possibly resort to simple_dce_from_worklist to
> clean those up. Let me try to do that.

Thanks!
(In reply to Tamar Christina from comment #11)
> Yeah, my expectation was that this would be an easier transform to avoid
> the sharing problem we discussed before, and that indeed the transform
> looks at the entire chain, not just transforming a * -b.
>
>   a*-b + c * -d -> -a*b - c*d
>
> has the property of still maintaining the FMS and FMNS chains, and can
> get further simplified in the above case.
>
> Hmm, I see, but don't we have a higher chance that folding will just fold
> it back into the multiply?
>
> For this to work we'd have to do
>
>   (b * -c) + (d * -e) -> -(b * c + d * e)
>
> in one transformation, no? Since I'd imagine
>
>   (b * c * -1) + (d * e * -1)
>
> would just be undone by match.pd?

The * -1 is something reassoc does only internally; it then distributes that back to generate an outer plus or minus.

Note for the x86 testcases there isn't any such simplification opportunity, but the reassoc heuristics correctly mangle the expression so that it no longer matches the expected SLP complex patterns. There's also the re-association of chains done by SLP discovery itself, which could be a problem.

I'd say fixing this fallout is quite low priority at the moment. The simple cases could be re-associated by reassoc into a recognizable complex op order, but even there it's a bit difficult as the operations span two "chains" (a multiplication and an addition chain) which reassoc looks at separately (apart from undistribution).
*** Bug 117071 has been marked as a duplicate of this bug. ***
*** Bug 117075 has been marked as a duplicate of this bug. ***
Those tests FAIL also on the 14 branch, starting with r14-10683-g12c00048d9f3598e57b98ec7723f7356bd255d04.
Do we want to backport r15-3128 or just xfail the parts of the tests?
(In reply to Jakub Jelinek from comment #15)
> Those tests FAIL also on the 14 branch, starting with
> r14-10683-g12c00048d9f3598e57b98ec7723f7356bd255d04.
> Do we want to backport r15-3128 or just xfail the parts of the tests?

I don't think r15-3128 on its own fixes the tests? If it does, sure, we can backport it.
It doesn't.

With r14-10682 there are 2 "Found COMPLEX_FMS pattern in SLP tree" and 2 "Found COMPLEX_ADD_ROT270 pattern in SLP tree" messages in fast-math-complex-mls-double.c, plus 4 "Found COMPLEX_FMA pattern in SLP tree" and 4 "Found COMPLEX_FMS_CONJ pattern in SLP tree".

With r14-10683 the 4 "Found COMPLEX_FMA pattern in SLP tree" are gone; the rest remains the same.

With r14-10683 + backported r15-3128 the 4 "Found COMPLEX_FMA pattern in SLP tree" are back, but the 2 COMPLEX_FMS and 2 COMPLEX_ADD_ROT270 messages are gone.
I can provide a GCC 14 fix during Stage 3 while I fix the GCC 15 one if that's easier.
The failing tests on the branch are:

#include <complex.h>

#define TYPE double
#define N 200

void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  for (int i = 0; i < N; i++)
    c[i] -= a[i] * (b[i] * I * I);
}

void fms180fst (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  for (int i = 0; i < N; i++)
    c[i] -= (a[i] * I * I) * b[i];
}

The issue is just a small difference in commutative operations. We look for {R,R} * {R,I} but found {R,I} * {R,R}, which is a bit fragile. Since the DF analysis is cached, we should be able to swap operands and retry for multiply.

For the addition we can't, as it's part of the TWO_OPERANDS and we require them to be in a particular order to have the right values.

Testing a patch that fixes the tests on the gcc-14 branch. Will work on the one for master next.
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8

commit r15-5585-ga9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Thu Nov 21 15:10:24 2024 +0000

    middle-end:For multiplication try swapping operands when matching complex multiply [PR116463]

    This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c
    on the GCC 14 branch and some of the ones on the master.

    The current matching just looks for one order for multiplication and was
    relying on canonicalization to always give the right order because of
    the TWO_OPERANDS.  However when it comes to the multiplication trying
    only one order is a bit fragile as they can be flipped.

    The failing tests on the branch are:

    void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                    _Complex TYPE c[restrict N])
    {
      for (int i = 0; i < N; i++)
        c[i] -= a[i] * (b[i] * I * I);
    }

    void fms180fst (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                    _Complex TYPE c[restrict N])
    {
      for (int i = 0; i < N; i++)
        c[i] -= (a[i] * I * I) * b[i];
    }

    The issue is just a small difference in commutative operations.
    we look for {R,R} * {R,I} but found {R,I} * {R,R}.

    Since the DF analysis is cached, we should be able to swap operands and
    retry for multiply cheaply.

    There is a constraint being checked by vect_validate_multiplication for
    the data flow of the operands feeding the multiplications.  So e.g.
    between the nodes:

    note:   node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double
    note:   op template: _27 = _10 * _25;
    note:       stmt 0 _27 = _10 * _25;
    note:       stmt 1 _29 = _11 * _25;
    note:   node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double
    note:   op template: _26 = _11 * _24;
    note:       stmt 0 _26 = _11 * _24;
    note:       stmt 1 _28 = _10 * _24;

    we require the lanes to come from the same source which
    vect_validate_multiplication checks.  As such it doesn't make sense to
    flip them individually because that would invalidate the earlier
    linear_loads_p checks which have validated that the arguments all come
    from the same datarefs.

    This patch thus flips the operands in unison to still maintain this
    invariant, but also honor the commutative nature of multiplication.

    gcc/ChangeLog:

            PR tree-optimization/116463
            * tree-vect-slp-patterns.cc (complex_mul_pattern::matches,
            complex_fms_pattern::matches): Try swapping operands on multiply.
Will wait a week or so before backporting to GCC 14; looking at the complex add failures on trunk now. Have an idea, but need to work out on paper if it's sound.
Ok, so the problem with the ones on trunk isn't necessarily the canonicalization itself, but that our externals handling is a bit shallow.

On externals we determine that we have no information on the DF and return TOP. This is because DR analysis doesn't try to handle externals, since they're not part of the loop.

However, all we need to know for complex numbers is whether the externals are loaded from the same place and in what order.

Concretely, the loop pre-header is:

  <bb 2> [local count: 10737416]:
  b$real_11 = REALPART_EXPR <b_15(D)>;
  b$imag_10 = IMAGPART_EXPR <b_15(D)>;
  _53 = -b$imag_10;

and the loop body:

  <bb 3> [local count: 1063004408]:
  ...
  _23 = REALPART_EXPR <*_5>;
  _24 = IMAGPART_EXPR <*_5>;
  _27 = _24 * _53;
  _28 = _23 * _53;

codegen before:

  {_24, _23} * { _53, _53 }

and after:

  { _24, _24 } * { _53, b$real_11 }

Before, we were able to easily tell that the order for the multiply would be IMAG, REAL. In the after (GCC 15) case that information is there, but requires us to follow the externals.

Richi, what do you think about extending externals handling in linear_loads_p to follow all external ops and, if they load from the same memref, figure out the "virtual lane permute"?

We can store the info in a new externals cache (to avoid re-walking externals we already walked, as perm_cache stores slp nodes) and the permute for the node in the perm_cache as we do for any cached lookups today?

This would also fix the other tests Andrew added in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463#c4
> On 23.11.2024 at 13:20, tnfchris at gcc dot gnu.org <gcc-bugzilla@gcc.gnu.org> wrote:
>
> Richi, what do you think about extending externals handling in linear_loads_p
> to follow all external ops and, if they load from the same memref, figure out
> the "virtual lane permute"?

Externs do not have a permute as we build them from scalars. So any permute can be trivially imposed on them - rather than TOP they should be BOTTOM. Of course there's also no advantage of imposing a permute on them.

> We can store the info in a new externals cache (to avoid re-walking externals
> we already walked, as perm_cache stores slp nodes) and the permute for the
> node in the perm_cache as we do for any cached lookups today?
>
> This would also fix the other tests Andrew added in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463#c4
(In reply to rguenther@suse.de from comment #23)
> Externs do not have a permute as we build them from scalars. So any permute
> can be trivially imposed on them - rather than TOP they should be BOTTOM.
> Of course there's also no advantage of imposing a permute on them.

But the scalars can access memory that we can tell what they are.

My point with the above was that it doesn't make sense to me that we know that {a[0],a[1]} reads a linearly, but that with

  a1 = a[0]
  a2 = a[1]

{a1,a2} we say "sorry, we know nothing about you".

Yes, they're externals, but they have a defined order of use in the SLP tree. This isn't about imposing a permute. I said virtual permute since linear_loads_p uses the lane permutes on loads to determine the memory access order.

We DO already impose any order on them, but the other operand is oddodd, so the overall order ends up being oddodd because any known permute overrides unknown ones.

So the question is, can we not follow externals in a constructor to figure out whether, given how they are used, they all read from the same base and in which order?
(In reply to Tamar Christina from comment #24)
> We DO already impose any order on them, but the other operand is oddodd, so
> the overall order ends up being oddodd because any known permute overrides
> unknown ones.

So what's the desired outcome? I guess PERM_UNKNOWN? I guess it's the "other operand" of an add? What's the (bad) effect of classifying it as ODDODD (optimistically)?

> So the question is, can we not follow externals in a constructor to figure
> out whether, given how they are used, they all read from the same base and
> in which order?

I don't see how it makes sense to do this. For the above example, what's the testcase exhibiting this (and on which arch)?
The releases/gcc-14 branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:f01f01f0ebf8f5207096cb9650354210d890fe0d

commit r14-11053-gf01f01f0ebf8f5207096cb9650354210d890fe0d
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Thu Nov 21 15:10:24 2024 +0000

    middle-end:For multiplication try swapping operands when matching complex multiply [PR116463]

    This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c
    on the GCC 14 branch and some of the ones on the master.

    The current matching just looks for one order for multiplication and was
    relying on canonicalization to always give the right order because of
    the TWO_OPERANDS.  However when it comes to the multiplication trying
    only one order is a bit fragile as they can be flipped.

    The failing tests on the branch are:

    void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                    _Complex TYPE c[restrict N])
    {
      for (int i = 0; i < N; i++)
        c[i] -= a[i] * (b[i] * I * I);
    }

    void fms180fst (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                    _Complex TYPE c[restrict N])
    {
      for (int i = 0; i < N; i++)
        c[i] -= (a[i] * I * I) * b[i];
    }

    The issue is just a small difference in commutative operations.
    we look for {R,R} * {R,I} but found {R,I} * {R,R}.

    Since the DF analysis is cached, we should be able to swap operands and
    retry for multiply cheaply.

    There is a constraint being checked by vect_validate_multiplication for
    the data flow of the operands feeding the multiplications.  So e.g.
    between the nodes:

    note:   node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double
    note:   op template: _27 = _10 * _25;
    note:       stmt 0 _27 = _10 * _25;
    note:       stmt 1 _29 = _11 * _25;
    note:   node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double
    note:   op template: _26 = _11 * _24;
    note:       stmt 0 _26 = _11 * _24;
    note:       stmt 1 _28 = _10 * _24;

    we require the lanes to come from the same source which
    vect_validate_multiplication checks.  As such it doesn't make sense to
    flip them individually because that would invalidate the earlier
    linear_loads_p checks which have validated that the arguments all come
    from the same datarefs.

    This patch thus flips the operands in unison to still maintain this
    invariant, but also honor the commutative nature of multiplication.

    gcc/ChangeLog:

            PR tree-optimization/116463
            * tree-vect-slp-patterns.cc (complex_mul_pattern::matches,
            complex_fms_pattern::matches): Try swapping operands on multiply.

    (cherry picked from commit a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8)
> > We DO already impose any order on them, but the other operand is oddodd, so
> > the overall order ends up being oddodd because any known permute overrides
> > unknown ones.
>
> So what's the desired outcome? I guess PERM_UNKNOWN? I guess it's the
> "other operand" of an add? What's the (bad) effect of classifying it as
> ODDODD (optimistically)?
>
> > So the question is, can we not follow externals in a constructor to figure
> > out whether, given how they are used, they all read from the same base and
> > in which order?
>
> I don't see how it makes sense to do this. For the above example, what's
> the testcase exhibiting this (and on which arch)?

I've been working on a fix from a different angle for this, which also covers another GCC 14 regression that went unnoticed. I'll post after regressions finish.
(In reply to Tamar Christina from comment #27)
> I've been working on a fix from a different angle for this, which also
> covers another GCC 14 regression that went unnoticed. I'll post after
> regressions finish.

So I've formalized the handling of TOP a bit better, which gets it to recognize the pattern again; however, it will be dropped as it's not profitable.

The reason it's not profitable is the canonicalization issue mentioned above: it has split the imaginary and real nodes into different computations.

So no matter what you do in the SLP tree, the attached digraph won't make the loads of _5 linear. Are you ok with me trying that, Richi?
Created attachment 59779 [details]
pattern.dot

digraph of resulting pattern
(In reply to Tamar Christina from comment #29)
> So I've formalized the handling of TOP a bit better, which gets it to
> recognize the pattern again; however, it will be dropped as it's not
> profitable.
>
> The reason it's not profitable is the canonicalization issue mentioned
> above: it has split the imaginary and real nodes into different
> computations.
>
> So no matter what you do in the SLP tree, the attached digraph won't make
> the loads of _5 linear. Are you ok with me trying that, Richi?

I can't make sense of that graph - the node feeding the store seems to have wrong scalar stmts?

What's the testcase for this (and on what arch)?

But yes, the loads of *5 won't get linear here, but at least the permute node feeding the complex-add-rot270 can be elided, eventually even making the external _53, b$real_11 match the other with different order (though we don't model that, cost-wise).
(In reply to Richard Biener from comment #31)
> What's the testcase for this (and on what arch)?

void fms_elemconjsnd (_Complex TYPE a[restrict N], _Complex TYPE b,
                      _Complex TYPE c[restrict N])
{
  for (int i = 0; i < N; i++)
    c[i] -= a[i] * ~b;
}

compiled with -Ofast -march=armv8.3-a.

#define TYPE double
#define I 1.0i
#define N 200

void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * I * I);
}

void fms180snd_1 (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                  _Complex TYPE c[restrict N])
{
  _Complex TYPE t = I;
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * t * t);
}

is another one, where they are the same thing, but the first one is matched and the second one isn't.

> But yes, the loads of *5 won't get linear here, but at least the permute
> node feeding the complex-add-rot270 can be elided, eventually even making
> the external _53, b$real_11 match the other with different order (though
> we don't model that, cost-wise).

But without the loads getting linearized the match will never work, as multi-lane SLP will be immediately cancelled because it assumed load-lanes is cheaper (it's not, but load-lanes doesn't get costed), and that's why there's a load permute optimization step after complex pattern matching.

The point is, however, that no permute is needed, *not even for the loads*. GCC 13 generated:

fms_elemconjsnd:
        fneg    d1, d1
        mov     x2, 0
        dup     v4.2d, v0.d[0]
        dup     v3.2d, v1.d[0]
.L2:
        ldr     q1, [x0, x2]
        ldr     q0, [x1, x2]
        fmul    v2.2d, v3.2d, v1.2d
        fcadd   v0.2d, v0.2d, v2.2d, #270
        fmls    v0.2d, v4.2d, v1.2d
        str     q0, [x1, x2]
        add     x2, x2, 16
        cmp     x2, 3200
        bne     .L2
        ret

which was the optimal sequence.
I see. Note it's SLP discovery's association code that figures out a SLP graph; disabling this ends up with single-lane (store-lanes) from the start. The association that "succeeds" first wins, and it's an unfortunate one (for SLP pattern detection). The thing is that the re-association greedily figures the best operand order as well. We start with

t.c:3:21: note:   pre-sorted chains of plus_expr
plus_expr _19 plus_expr _27 minus_expr _26
plus_expr _18 minus_expr _29 minus_expr _28

and if we'd start with

plus_expr _19 plus_expr _27 minus_expr _26
plus_expr _18 minus_expr _28 minus_expr _29

instead, we get the desired SLP pattern match, but store-lanes still seems to be preferred (not sure how we got away with no store-lanes in GCC 13). We could simply refuse to override the SLP graph with load/store-lanes when patterns were found:

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3892e1be3f2..4fb57a76f85 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -5064,7 +5065,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size,
      to cancel SLP when this applied to all instances in a loop but now we
      decide this per SLP instance.  It's important to do this only after
      SLP pattern recognition.  */
-  if (is_a <loop_vec_info> (vinfo))
+  if (!pattern_found && is_a <loop_vec_info> (vinfo))
     FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
       if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_store
          && !SLP_INSTANCE_TREE (instance)->ldst_lanes)

When starting with the swapped ops above we then get the desired code again. I've hacked that in with

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3892e1be3f2..4fb57a76f85 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2275,6 +2275,7 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
   /* 1. pre-sort according to def_type and operation.  */
   for (unsigned lane = 0; lane < group_size; ++lane)
     chains[lane].stablesort (dt_sort_cmp, vinfo);
+  std::swap (chains[1][2], chains[1][1]);
   if (dump_enabled_p ())
     {
       dump_printf_loc (MSG_NOTE, vect_location,

It happens that in this specific case the optimal operand order matches stmt order, so the following produces that - but I'm not positively sure that's always good (though the 'stablesort' also tries to not disturb order - but in this case it's the DFS order collecting the scalar ops). In reality there's not enough info on the op or its definition to locally decide a better order for future pattern matching.

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3892e1be3f2..f21e8b909ff 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1684,7 +1684,12 @@ dt_sort_cmp (const void *op1_, const void *op2_, void *)
   auto *op2 = (const chain_op_t *) op2_;
   if (op1->dt != op2->dt)
     return (int)op1->dt - (int)op2->dt;
-  return (int)op1->code - (int)op2->code;
+  if ((int)op1->code != (int)op2->code)
+    return (int)op1->code - (int)op2->code;
+  if (TREE_CODE (op1->op) == SSA_NAME && TREE_CODE (op2->op) == SSA_NAME)
+    return (gimple_uid (SSA_NAME_DEF_STMT (op1->op))
+            - gimple_uid (SSA_NAME_DEF_STMT (op2->op)));
+  return 0;
 }

 /* Linearize the associatable expression chain at START with the

That said, I don't have a good idea on how to make this work better, not even after re-doing SLP discovery. Maybe SLP patterns need to work on the initial single-lane SLP graph? But then we'd have to find lane-matches on two unconnected SLP sub-graphs, which complicates the pattern matching part.
We basically form SLP nodes from two sets of (two lanes) plus/minus ops (three each) but we of course try to avoid SLP build of all 3! permutations possible and stop at the first one that succeeds.
Possibly first computing a lattice value for each SSA name, whether its origin is a "real" or an "imag" component of a complex load, could get us meta, but even then the individual sorting which determines the initial association to SLP nodes would only be possible to adjust "cross-lane" (and to what? I guess combine real+imag parts? Hopefully of the same entity). Into vect we get

  _19 = REALPART_EXPR <*_3>;
  _18 = IMAGPART_EXPR <*_3>;
  _5 = a_14(D) + _2;
  _23 = REALPART_EXPR <*_5>;   // real
  _24 = IMAGPART_EXPR <*_5>;   // imag
  _26 = b$real_11 * _23;       // real?
  _27 = _24 * _53;             // imag?
  _28 = _23 * _53;             // mixed? but fed into imag
  _29 = b$real_11 * _24;       // mixed?
  _7 = _18 - _28;              // mixed? or imag?
  _22 = _27 - _26;             // mixed?
  _32 = _19 + _22;             // mixed? or real?
  _33 = _7 - _29;              // mixed? but fed into real?
  REALPART_EXPR <*_3> = _32;
  IMAGPART_EXPR <*_3> = _33;

so I'm not sure that will help. That we'd like to have full load groups is unfortunately only visible a node deeper. We could also fill a lattice with group IDs, but we'd need to know parts to identify lane duplicates vs. full groups. It's also a lot of hassle for not much gain and very special cases?
(In reply to Richard Biener from comment #34)
> so I'm not sure that will help. That we'd like to have full load groups is
> unfortunately only visible a node deeper. We could also fill a lattice with
> group IDs, but we'd need to know parts to identify lane duplicates vs. full
> groups. It's also a lot of hassle for not much gain and very special cases?

That should help, because all it's after is that the final loads be permuted.

The reason I'm keen to fix this is that it's not that niche. complex-add, since the operation is just +/- with a permute, is by far the most common one. Not recognizing it is e.g. 10% on fotonik in SPECCPU2017 FP, which is also a regression I'm trying to fix.

I can try to reduce a testcase for that to see if maybe that specific one is easier to fix. I'm just wondering if we can't do better in the future; e.g. LLVM recognizes both fms180snd cases, for instance.

If it's easier, I could see if we can just have another pattern to discover fmul + fcadd? Could maybe work and fix the SPEC regression... need to make a testcase.
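For reference, a minimal sketch of the shape the complex-add pattern targets (a hypothetical example, not one of this PR's testcases; it assumes the usual a + b*I form that maps to COMPLEX_ADD_ROT90, with a - b*I giving the ROT270 variant discussed above). The only work besides the lane-wise add/sub is a permute of the rotated operand, which is why the pattern comes up so often:

  #include <complex.h>
  #define N 200

  /* Hypothetical example: add a complex operand rotated by 90 degrees
     (multiplied by I); with -ffast-math and an FCADD-capable target this
     is the COMPLEX_ADD_ROT90 shape.  */
  void cadd90 (_Complex double a[restrict N], _Complex double b[restrict N],
               _Complex double c[restrict N])
  {
    for (int i = 0; i < N; i++)
      c[i] = a[i] + b[i] * I;
  }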
So when I was looking into PR 119393, I found that any slight change in SSA names sometimes causes g++.dg/vect/simd-complex-num-null-node.cc to fail. I wonder if that is the issue mentioned in comment #33 here. Plus I suspect that one and this one have a similar sensitivity to SSA name differences now.