Bug 116463 - [15/16 Regression] complex multiply vectorizer detection failures after r15-3087-gb07f8a301158e5
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 15.0
Importance: P2 normal
Target Milestone: 15.0
Assignee: Tamar Christina
URL:
Keywords: missed-optimization, testsuite-fail
Duplicates: 117071 117075
Depends on:
Blocks: vectorizer
 
Reported: 2024-08-22 19:58 UTC by Andrew Pinski
Modified: 2025-04-17 12:33 UTC
CC List: 6 users

See Also:
Host:
Target: aarch64-*-* x86_64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-08-22 00:00:00


Attachments
Reduced testcase (160 bytes, text/plain), 2024-08-22 20:09 UTC, Andrew Pinski
Better testcase (182 bytes, text/plain), 2024-08-22 20:18 UTC, Andrew Pinski
Full testcase (214 bytes, text/plain), 2024-08-22 20:43 UTC, Andrew Pinski
pattern.dot (467 bytes, text/plain), 2024-12-03 23:26 UTC, Tamar Christina

Description Andrew Pinski 2024-08-22 19:58:45 UTC
make check-gcc RUNTESTFLAGS="complex.exp=fast-math-complex-mls-*.c"
FAIL: gcc.dg/vect/complex/fast-math-complex-mls-double.c scan-tree-dump vect "Found COMPLEX_FMA"
FAIL: gcc.dg/vect/complex/fast-math-complex-mls-float.c scan-tree-dump vect "Found COMPLEX_FMA"
Comment 1 Andrew Pinski 2024-08-22 20:09:52 UTC Comment hidden (obsolete)
Comment 2 Andrew Pinski 2024-08-22 20:18:05 UTC
Created attachment 58978 [details]
Better testcase

Before the patch fms180snd could be detected but fms180snd_1 could not,
even though both are the same function, just changing when the multiply by I happens.

fms180snd_1 represents what happens after the patch for fms180snd.
Comment 3 Andrew Pinski 2024-08-22 20:20:28 UTC
+Tamar
since he wrote the original Complex vectorization support.
Comment 4 Andrew Pinski 2024-08-22 20:43:06 UTC
Created attachment 58979 [details]
Full testcase

Before the change fms180snd_2a and fms180snd_1 could not be detected even though they are all the same function.

Note that I think fms180snd_2a is more representative of what is done after the patch for fms180snd than fms180snd_1 is.
Comment 5 Tamar Christina 2024-08-22 23:03:51 UTC
Yeah, this is because they generate different gimple sequences and thus different SLP trees.
The core of the problem is that there's no canonical form here, and that a gimple simplification rule is missing:

  _33 = IMAGPART_EXPR <*_3> + ((REALPART_EXPR <*_5> * IMAGPART_EXPR <*_7>) + (IMAGPART_EXPR <*_5> * REALPART_EXPR <*_7>));
vs
  _37 = IMAGPART_EXPR <*_3> - ((REALPART_EXPR <*_5> * -IMAGPART_EXPR <*_7>) + (IMAGPART_EXPR <*_5> * -REALPART_EXPR <*_7>));

i.e. a - ((b * -c) + (d * -e)) == a + (b * c) + (d * e)

So probably in match.pd we should fold _37 into _33, which is a simpler form of the same thing, and it's better on scalar code as well.
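
As a minimal scalar sketch of that fold (illustrative function names, not from the testsuite), the proposed simplification would turn the body of f_new into the body of f_old:

/* The identity a - ((b * -c) + (d * -e)) == a + (b * c) + (d * e);
   f_old roughly matches the simpler shape (_33 above), f_new the new one (_37).  */
double f_old (double a, double b, double c, double d, double e)
{
  return a + (b * c) + (d * e);
}

double f_new (double a, double b, double c, double d, double e)
{
  return a - ((b * -c) + (d * -e));
}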

It would be better to finally introduce a vectorizer canonical form; for instance, the real part generates:

  _36 = (_31 - _30) + REALPART_EXPR <*_3>;
vs
  _32 = REALPART_EXPR <*_3> + (_26 - _27);

and this is already an additional thing to check, so it would be better if SLP build always put the complex parts consistently on one side of commutative operations so we don't have to swap operands to check.

In any case, I have some patches in this area and can take a look when I'm back, but I think the new expression should be simplified back into the old one.
Comment 6 Richard Biener 2024-08-23 11:43:37 UTC
I think

  a - ((b * -c) + (d * -e))  ->  a + (b * c) + (d * e)

is a good simplification to make, but it's difficult to do this with
canonicalization only.  Take a * -b -> -(a * b): the negate might
combine with other negates both downstream and upstream.  But for
a*-b + c * -d it might be more obvious to turn that into
-a*b - c*d.

Maybe reassoc can be of help here - IIRC it turns b * -c into
b * c * -1, undistribute_ops_list might get that.

Note one issue is that complex lowering leaves around dead stmts,
confusing reassoc and forwprop, in particular

-  _10 = COMPLEX_EXPR <_18, _6>;

stay around until reassoc.  Scheduling DCE for testing shows reassoc
does something.

It's update_complex_assignment that replaces existing complex
stmts with COMPLEX_EXPRs; we should possibly resort to simple_dce_from_worklist
to clean those up.  Let me try to do that.
Comment 7 GCC Commits 2024-08-23 12:37:55 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:de1923f9f4d5344694c22ca883aeb15caf635734

commit r15-3128-gde1923f9f4d5344694c22ca883aeb15caf635734
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Aug 23 13:44:29 2024 +0200

    tree-optimization/116463 - complex lowering leaves around dead stmts
    
    Complex lowering generally replaces existing complex defs with
    COMPLEX_EXPRs but those might be dead when it can always refer to
    components from the lattice.  This in turn can pessimize followup
    transforms like forwprop and reassoc, the following makes sure to
    get rid of dead COMPLEX_EXPRs generated by using
    simple_dce_from_worklist.
    
            PR tree-optimization/116463
            * tree-complex.cc: Include tree-ssa-dce.h.
            (dce_worklist): New global.
            (update_complex_assignment): Add SSA def to the DCE worklist.
            (tree_lower_complex): Perform DCE.
Comment 8 Richard Biener 2024-08-23 12:46:55 UTC
As of r15-3128-gde1923f9f4d534 the following now fail:

FAIL: gcc.target/i386/avx512fp16-vector-complex-float.c scan-assembler-not vfmadd[123]*ph[ \\\\t]
FAIL: gcc.target/i386/avx512fp16-vector-complex-float.c scan-assembler-times vfmaddcph[ \\\\t] 1
FAIL: gcc.target/i386/part-vect-complexhf.c scan-assembler-times vfmaddcph[ \\\\t] 1

These look similar to the aarch64 fails (I have no idea if the patch
helped for those).

For the first test it's fma0 which is no longer vectorized as

        vmovdqu16       (%rdx), %zmm0
        vmovdqu16       (%rsi), %zmm1
        vfmaddcph       (%rdi), %zmm1, %zmm0
        vmovdqu16       %zmm0, (%rdx)

but

        vmovdqu16       (%rsi), %zmm0
        vmovdqu16       (%rdi), %zmm2
        movl    $1431655765, %eax
        kmovd   %eax, %k1
        vpshufb .LC1(%rip), %zmm0, %zmm1
        vfmadd213ph     (%rdx), %zmm2, %zmm1
        vpshufb .LC2(%rip), %zmm0, %zmm0
        vpshufb .LC0(%rip), %zmm2, %zmm3
        vmovdqa64       %zmm0, %zmm2
        vfmadd132ph     %zmm3, %zmm1, %zmm2
        vfnmadd132ph    %zmm3, %zmm1, %zmm0
        vpblendmw       %zmm0, %zmm2, %zmm0{%k1}
        vmovdqu16       %zmm0, (%rdx)

where instead of

note:    Found COMPLEX_FMA pattern in SLP tree

we have

note:    Found VEC_ADDSUB pattern in SLP tree
note:    Target does not support VEC_ADDSUB for vector type vector(32) _Float16 

with the IL difference being (- is good, + is bad)

  _12 = REALPART_EXPR <*_3>;
  _11 = IMAGPART_EXPR <*_3>;
...
@@ -46,10 +46,10 @@
   _27 = _19 * _25;
   _28 = _20 * _25;
   _29 = _19 * _24;
-  _30 = _26 - _27;
-  _31 = _28 + _29;
-  _32 = _12 + _30;
-  _33 = _11 + _31;
+  _9 = _12 + _26;
+  _10 = _11 + _28;
+  _32 = _9 - _27;
+  _33 = _10 + _29;
   REALPART_EXPR <*_3> = _32;
   IMAGPART_EXPR <*_3> = _33;
   i_18 = i_21 + 1;

which is a different association, enabled by deleting the dead uses that confuse
reassoc.
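
As a scalar illustration of the two associations (made-up names; both compute the same value): the first keeps the two products combined before the loaded part is added, which appears to be what the COMPLEX_FMA detection needs here, while the second folds the loaded part in first.

/* good: _30 = _26 - _27; _32 = _12 + _30;  */
double assoc_good (double re, double ac, double bd)
{
  return re + (ac - bd);
}

/* bad: _9 = _12 + _26; _32 = _9 - _27;  */
double assoc_bad (double re, double ac, double bd)
{
  return (re + ac) - bd;
}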
Comment 9 Andrew Pinski 2024-08-23 23:03:08 UTC
(In reply to Richard Biener from comment #8)
> fail which look similar to the aarch64 fails (I have no idea if the patch
> helped for those).

The aarch64 ones still fail. And yes they look very similar.
Comment 10 Andrew Pinski 2024-08-25 20:21:44 UTC
The current failures/XPASSes for aarch64 are:
XPASS: gcc.dg/vect/complex/fast-math-complex-add-half-float.c scan-tree-dump-times vect "stmt.*COMPLEX_ADD_ROT270" 1
XPASS: gcc.dg/vect/complex/fast-math-complex-add-half-float.c scan-tree-dump-times vect "stmt.*COMPLEX_ADD_ROT90" 1
FAIL: gcc.dg/vect/complex/fast-math-complex-mls-double.c scan-tree-dump vect "Found COMPLEX_ADD_ROT270"
FAIL: gcc.dg/vect/complex/fast-math-complex-mls-float.c scan-tree-dump vect "Found COMPLEX_ADD_ROT270"
FAIL: gcc.dg/vect/complex/fast-math-complex-mls-half-float.c scan-tree-dump vect "Found COMPLEX_ADD_ROT270"
Comment 11 Tamar Christina 2024-08-28 08:06:43 UTC
(In reply to Richard Biener from comment #6)
> I think
> 
>   a - ((b * -c) + (d * -e))  ->  a + (b * c) + (d * e)
> 
> is a good simplification to be made, but it's difficult to do this with
> canonicalization only.  Like a * -b -> -(a * b) as the negate might
> combine with both other negates down and upstream.  But for
> a*-b + c * -d it might be more obvious to turn that into
> -a*b - c*d.

Yeah, my expectation was that this would be an easier transform, avoiding
the sharing problem we discussed before, and that the transform indeed
looks at the entire chain instead of just transforming a * -b.

a*-b + c * -d -> -a*b - c*d

has the property of still maintaining the FMS and FMNS chains and can
get further simplified in the above case.

> 
> Maybe reassoc can be of help here - IIRC it turns b * -c into
> b * c * -1, undistribute_ops_list might get that.

Hmm, I see, but don't we have a higher chance that folding will just
fold it back into the multiply?

For this to work we'd have to do

  (b * -c) + (d * -e) -> -(b * c + d * e)

in one transformation, no?  Since I'd imagine

  (b * c * -1) + (d * e * -1)

would just be undone by match.pd?

> 
> Note one issue is that complex lowering leaves around dead stmts,
> confusing reassoc and forwprop, in particular
> 
> -  _10 = COMPLEX_EXPR <_18, _6>;
> 
> stay around until reassoc.  scheduling dce for testing shows reassoc
> does something.
> 
> It's update_complex_assignment who replaces existing complex
> stmts with COMPLEX_EXPRs, we should possibly resort do
> simple_dce_from_worklist
> to clean those.  Let me try to do that.

Thanks!
Comment 12 Richard Biener 2024-08-28 08:13:00 UTC
(In reply to Tamar Christina from comment #11)
> (In reply to Richard Biener from comment #6)
> > I think
> > 
> >   a - ((b * -c) + (d * -e))  ->  a + (b * c) + (d * e)
> > 
> > is a good simplification to be made, but it's difficult to do this with
> > canonicalization only.  Like a * -b -> -(a * b) as the negate might
> > combine with both other negates down and upstream.  But for
> > a*-b + c * -d it might be more obvious to turn that into
> > -a*b - c*d.
> 
> Yeah, my expectation was that this would be an easier transform to avoid
> the sharing problem we discussed before and that indeed the transform
> looks at the entire chain not just transforming a * -b.
> 
> a*-b + c * -d -> -a*b - c*d
> 
> has the property of still maintaining the FMS and FMNS chains and can
> get further simplified in the above case.
> 
> > 
> > Maybe reassoc can be of help here - IIRC it turns b * -c into
> > b * c * -1, undistribute_ops_list might get that.
> 
> hmm I see, but don't we have a higher chance that folding will just
> fold it back into the multiply?
> 
> For this to work we'd have to do
> 
>   (b * -c) + (d * -e) -> -(b * c + d * e)
> 
> in one transformation no? since I'd imagine
> 
>   (b * c * -1) + (d * e * -1)
> 
> would just be undone by match.pd?

The * -1 is something reassoc does only internally; it then distributes
that back to generate an outer plus or minus.

Note for the x86 testcases there isn't any such simplification opportunity,
but the reassoc heuristics correctly mangle the expression to no longer
match the expected SLP complex patterns.  There's also the re-association
of chains done by SLP discovery itself which could be a problem.

I'd say fixing this fallout is quite low priority at the moment.  The
simple cases could be re-associated by reassoc into a recognizable
complex-op order, but even there it's a bit difficult as the operations
span two "chains" (a multiplication and an addition chain) which reassoc
looks at separately (apart from undistribution).
Comment 13 Andrew Pinski 2024-10-11 00:02:24 UTC
*** Bug 117071 has been marked as a duplicate of this bug. ***
Comment 14 Andrew Pinski 2024-10-11 00:09:40 UTC
*** Bug 117075 has been marked as a duplicate of this bug. ***
Comment 15 Jakub Jelinek 2024-11-12 09:48:29 UTC
Those tests FAIL also on the 14 branch, starting with r14-10683-g12c00048d9f3598e57b98ec7723f7356bd255d04
Do we want to backport r15-3128 or just xfail the parts of the tests?
Comment 16 Richard Biener 2024-11-12 09:59:16 UTC
(In reply to Jakub Jelinek from comment #15)
> Those tests FAIL also on the 14 branch, starting with
> r14-10683-g12c00048d9f3598e57b98ec7723f7356bd255d04
> Do we want to backport r15-3128 or just xfail the parts of the tests?

I don't think r15-3128 on its own fixes the tests?  If it does, sure, we
can backport it.
Comment 17 Jakub Jelinek 2024-11-12 10:29:48 UTC
It doesn't.
With r14-10682 there are 2
Found COMPLEX_FMS pattern in SLP tree
and 2
Found COMPLEX_ADD_ROT270 pattern in SLP tree
messages in fast-math-complex-mls-double.c, and 4
Found COMPLEX_FMA pattern in SLP tree
and 4
Found COMPLEX_FMS_CONJ pattern in SLP tree
messages.
With r14-10683 the 4
Found COMPLEX_FMA pattern in SLP tree
are gone; the rest remains the same.
With r14-10683 + backported r15-3128 the 4
Found COMPLEX_FMA pattern in SLP tree
are back, but the 2 COMPLEX_FMS and 2 COMPLEX_ADD_ROT270 messages are gone.
Comment 18 Tamar Christina 2024-11-12 10:31:14 UTC
I can provide a GCC 14 fix during Stage 3 while I fix the GCC 15 one if that's easier.
Comment 19 Tamar Christina 2024-11-20 22:47:50 UTC
The failing tests on the branch are:

#include <complex.h>

#define TYPE double
#define N 200

void fms180snd(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
               _Complex TYPE c[restrict N]) {
  for (int i = 0; i < N; i++)
    c[i] -= a[i] * (b[i] * I * I);
}

void fms180fst(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
               _Complex TYPE c[restrict N]) {
  for (int i = 0; i < N; i++)
    c[i] -= (a[i] * I * I) * b[i];
}

The issue is just a small difference in commutative operations:
we look for {R,R} * {R,I} but found {R,I} * {R,R}, which is a bit fragile.

Since the DF analysis is cached, we should be able to swap operands and retry
for the multiply.  For the addition we can't, as it's part of the TWO_OPERANDS and we require the operands to be in a particular order to have the right values.

Testing a patch that fixes the tests on the gcc-14 branch.

Will work on the one for master next.
Comment 20 GCC Commits 2024-11-22 08:07:17 UTC
The master branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8

commit r15-5585-ga9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Thu Nov 21 15:10:24 2024 +0000

    middle-end:For multiplication try swapping operands when matching complex multiply [PR116463]
    
    This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c on the
    GCC 14 branch and some of the ones on the master.
    
    The current matching just looks for one order for multiplication and was relying
    on canonicalization to always give the right order because of the TWO_OPERANDS.
    
    However when it comes to the multiplication trying only one order is a bit
    fragile as they can be flipped.
    
    The failing tests on the branch are:
    
    void fms180snd(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                   _Complex TYPE c[restrict N]) {
      for (int i = 0; i < N; i++)
        c[i] -= a[i] * (b[i] * I * I);
    }
    
    void fms180fst(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                   _Complex TYPE c[restrict N]) {
      for (int i = 0; i < N; i++)
        c[i] -= (a[i] * I * I) * b[i];
    }
    
    The issue is just a small difference in commutative operations.
    we look for {R,R} * {R,I} but found {R,I} * {R,R}.
    
    Since the DF analysis is cached, we should be able to swap operands and retry
    for multiply cheaply.
    
    There is a constraint being checked by vect_validate_multiplication for the data
    flow of the operands feeding the multiplications.  So e.g.
    
    between the nodes:
    
    note:   node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double
    note:   op template: _27 = _10 * _25;
    note:      stmt 0 _27 = _10 * _25;
    note:      stmt 1 _29 = _11 * _25;
    note:   node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double
    note:   op template: _26 = _11 * _24;
    note:      stmt 0 _26 = _11 * _24;
    note:      stmt 1 _28 = _10 * _24;
    
    we require the lanes to come from the same source which
    vect_validate_multiplication checks.  As such it doesn't make sense to flip them
    individually because that would invalidate the earlier linear_loads_p checks
    which have validated that the arguments all come from the same datarefs.
    
    This patch thus flips the operands in unison to still maintain this invariant,
    but also honor the commutative nature of multiplication.
    
    gcc/ChangeLog:
    
            PR tree-optimization/116463
            * tree-vect-slp-patterns.cc (complex_mul_pattern::matches,
            complex_fms_pattern::matches): Try swapping operands on multiply.
Comment 21 Tamar Christina 2024-11-22 08:13:52 UTC
Will wait a week or so before backporting to GCC 14; looking at the complex add failures on trunk now.  Have an idea, but I need to work out on paper whether it's sound.
Comment 22 Tamar Christina 2024-11-23 12:20:02 UTC
Ok, so the problem with the ones on trunk isn't necessarily the canonicalization itself but that our externals handling is a bit shallow.

On externals we determine that we have no information on the DF and return TOP.
This is because DR analysis doesn't try to handle externals since they're not part of the loop.

However, all we need to know for complex numbers is whether the externals are loaded from the same place, and in what order.

Concretely, the loop pre-header is:

  <bb 2> [local count: 10737416]:
  b$real_11 = REALPART_EXPR <b_15(D)>;
  b$imag_10 = IMAGPART_EXPR <b_15(D)>;
  _53 = -b$imag_10;

and the loop body:

  <bb 3> [local count: 1063004408]:
  ...

  _23 = REALPART_EXPR <*_5>;
  _24 = IMAGPART_EXPR <*_5>;
  _27 = _24 * _53;
  _28 = _23 * _53;

codegen before:

{_24, _23} * { _53, _53 }

and after

{ _24, _24 } * { _53, b$real_11 }

Before we were able to easily tell that the order for the multiply would be IMAG, REAL.
In the after (GCC 15) case that information is there, but requires us to follow the externals.

Richi, what do you think about extending the externals handling in linear_loads_p to follow all external ops and, if they load from the same memref, figure out the "virtual lane permute"?

We could store the info in a new externals cache (to avoid re-walking externals we already walked, as perm_cache stores SLP nodes) and store the permute for the node in the perm_cache as we do for any cached lookups today?

This would also fix the other tests Andrew added in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463#c4
Comment 23 rguenther@suse.de 2024-11-24 12:08:07 UTC
> Am 23.11.2024 um 13:20 schrieb tnfchris at gcc dot gnu.org <gcc-bugzilla@gcc.gnu.org>:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463
> 
> --- Comment #22 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> Ok, so the problem with the ones on trunk isn't necessarily the
> canonicalization itself but that our externals handling is a bit shallow.
> 
> On externals we determine that we have no information on the DF and return TOP.
> This is because DR analysis doesn't try to handle externals since they're not
> part of the loop.
> 
> However all we need to know for complex numbers is whether the externals are
> loaded from the same place and the order of them.
> 
> concretely the loop pre-header is:
> 
>  <bb 2> [local count: 10737416]:
>  b$real_11 = REALPART_EXPR <b_15(D)>;
>  b$imag_10 = IMAGPART_EXPR <b_15(D)>;
>  _53 = -b$imag_10;
> 
> and the loop body:
> 
>  <bb 3> [local count: 1063004408]:
>  ...
> 
>  _23 = REALPART_EXPR <*_5>;
>  _24 = IMAGPART_EXPR <*_5>;
>  _27 = _24 * _53;
>  _28 = _23 * _53;
> 
> codegen before after:
> 
> {_24, _23} * { _53, _53 }
> 
> and after
> 
> { _24, _24 } * { _53, b$real_11 }
> 
> Before we were able to easily tell that the order for the multiply would be
> IMAG, REAL.
> In the after (GCC 15) case that information is there, but requires us to follow
> the externals.
> 
> Richi what do you think about extending externals handling in linear_loads_p to
> follow all external ops and if they load from the same memref to figure out the
> "virtual lane permute"?

Externs do not have a permute as we build them from scalars.  So any permute can be trivially imposed on them - rather than TOP they should be BOTTOM.  Of course there’s also no advantage of imposing a permute on them.

> We can store the info in a new externals cache (to avoid re-walking externals
> we already walked, as perm_cache stores slp nodes)  and the permute for the
> node in the perm_cache as we do for any cached lookups today?
> 
> This would also fix the other tests Andrew added in
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463#c4
Comment 24 Tamar Christina 2024-11-24 15:07:03 UTC
(In reply to rguenther@suse.de from comment #23)
> > Am 23.11.2024 um 13:20 schrieb tnfchris at gcc dot gnu.org <gcc-bugzilla@gcc.gnu.org>:
> > 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463
> > 
> > --- Comment #22 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > Ok, so the problem with the ones on trunk isn't necessarily the
> > canonicalization itself but that our externals handling is a bit shallow.
> > 
> > On externals we determine that we have no information on the DF and return TOP.
> > This is because DR analysis doesn't try to handle externals since they're not
> > part of the loop.
> > 
> > However all we need to know for complex numbers is whether the externals are
> > loaded from the same place and the order of them.
> > 
> > concretely the loop pre-header is:
> > 
> >  <bb 2> [local count: 10737416]:
> >  b$real_11 = REALPART_EXPR <b_15(D)>;
> >  b$imag_10 = IMAGPART_EXPR <b_15(D)>;
> >  _53 = -b$imag_10;
> > 
> > and the loop body:
> > 
> >  <bb 3> [local count: 1063004408]:
> >  ...
> > 
> >  _23 = REALPART_EXPR <*_5>;
> >  _24 = IMAGPART_EXPR <*_5>;
> >  _27 = _24 * _53;
> >  _28 = _23 * _53;
> > 
> > codegen before after:
> > 
> > {_24, _23} * { _53, _53 }
> > 
> > and after
> > 
> > { _24, _24 } * { _53, b$real_11 }
> > 
> > Before we were able to easily tell that the order for the multiply would be
> > IMAG, REAL.
> > In the after (GCC 15) case that information is there, but requires us to follow
> > the externals.
> > 
> > Richi what do you think about extending externals handling in linear_loads_p to
> > follow all external ops and if they load from the same memref to figure out the
> > "virtual lane permute"?
> 
> Externs do not have a permute as we build them from scalars.  So any permute
> can be trivially imposed on them - rather than TOP they should be BOTTOM. 
> Of course there’s also no advantage of imposing a permute on them.
> 

But the scalars can access memory, and we can tell what those accesses are.

My point with the above was that it doesn't make sense to me that we know that {a[0],a[1]} reads a linearly but that with 

a1 = a[0]
a2 = a[1]

{a1,a2} we say "sorry we know nothing about you". 

Yes, they're externals, but they have a defined order of use in the SLP tree. This isn't about imposing a permute; I said "virtual" permute since linear_loads_p uses the lane permutes on loads to determine the memory access order.

We DO already impose any order on them, but the other operand is oddodd, so the overall order ends up being oddodd because any known permute overrides unknown ones.

So the question is: can we not follow the externals in a constructor to figure out, from how they are used, whether they all read from the same base and in which order?
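
A small self-contained sketch of that contrast (GNU C vector extension, names made up; standalone functions rather than the loop case, so purely illustrative):

typedef double v2df __attribute__ ((vector_size (16)));

v2df direct (double *a)
{
  /* Constructor built straight from memory: the access order is visible.  */
  return (v2df) { a[0], a[1] };
}

v2df via_scalars (double *a)
{
  /* Same data, but routed through scalar defs: to the SLP build these are
     just externals, even though they still read a[] linearly.  */
  double a1 = a[0];
  double a2 = a[1];
  return (v2df) { a1, a2 };
}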
Comment 25 Richard Biener 2024-11-25 09:12:02 UTC
(In reply to Tamar Christina from comment #24)
> (In reply to rguenther@suse.de from comment #23)
> > > Am 23.11.2024 um 13:20 schrieb tnfchris at gcc dot gnu.org <gcc-bugzilla@gcc.gnu.org>:
> > > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463
> > > 
> > > --- Comment #22 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > > Ok, so the problem with the ones on trunk isn't necessarily the
> > > canonicalization itself but that our externals handling is a bit shallow.
> > > 
> > > On externals we determine that we have no information on the DF and return TOP.
> > > This is because DR analysis doesn't try to handle externals since they're not
> > > part of the loop.
> > > 
> > > However all we need to know for complex numbers is whether the externals are
> > > loaded from the same place and the order of them.
> > > 
> > > concretely the loop pre-header is:
> > > 
> > >  <bb 2> [local count: 10737416]:
> > >  b$real_11 = REALPART_EXPR <b_15(D)>;
> > >  b$imag_10 = IMAGPART_EXPR <b_15(D)>;
> > >  _53 = -b$imag_10;
> > > 
> > > and the loop body:
> > > 
> > >  <bb 3> [local count: 1063004408]:
> > >  ...
> > > 
> > >  _23 = REALPART_EXPR <*_5>;
> > >  _24 = IMAGPART_EXPR <*_5>;
> > >  _27 = _24 * _53;
> > >  _28 = _23 * _53;
> > > 
> > > codegen before after:
> > > 
> > > {_24, _23} * { _53, _53 }
> > > 
> > > and after
> > > 
> > > { _24, _24 } * { _53, b$real_11 }
> > > 
> > > Before we were able to easily tell that the order for the multiply would be
> > > IMAG, REAL.
> > > In the after (GCC 15) case that information is there, but requires us to follow
> > > the externals.
> > > 
> > > Richi what do you think about extending externals handling in linear_loads_p to
> > > follow all external ops and if they load from the same memref to figure out the
> > > "virtual lane permute"?
> > 
> > Externs do not have a permute as we build them from scalars.  So any permute
> > can be trivially imposed on them - rather than TOP they should be BOTTOM. 
> > Of course there’s also no advantage of imposing a permute on them.
> > 
> 
> But the scalars can access memory that we can tell what they are. 
> 
> My point with the above was that it doesn't make sense to me that we know
> that {a[0],a[1]} reads a linearly but that with 
> 
> a1 = a[0]
> a2 = a[1]
> 
> {a1,a2} we say "sorry we know nothing about you". 
> 
> Yes they're externals but they have a defined order of use in the SLP tree.
> This isn't about imposing a permute. I said virtual permute since
> linear_load_p uses the lane permutes on loads to determine the memory access
> order.
> 
> We DO already impose any order on them, but the other operand is oddodd, so
> the overall order ends up being oddodd because any known permute overrides
> unknown ones.

So what's the desired outcome?  I guess PERM_UNKNOWN?  I guess it's
the "other operand" of an add?  What's the (bad) effect of classifying
it as ODDODD (optimistically)?

> So the question is, can we not follow externals in a constructor to figure
> out if how they are used they all read from the same base and in which order?

I don't see how it makes sense to do this.  For the above example, what's
the testcase exhibiting this (and on which arch)?
Comment 26 GCC Commits 2024-12-03 16:20:53 UTC
The releases/gcc-14 branch has been updated by Tamar Christina <tnfchris@gcc.gnu.org>:

https://gcc.gnu.org/g:f01f01f0ebf8f5207096cb9650354210d890fe0d

commit r14-11053-gf01f01f0ebf8f5207096cb9650354210d890fe0d
Author: Tamar Christina <tamar.christina@arm.com>
Date:   Thu Nov 21 15:10:24 2024 +0000

    middle-end:For multiplication try swapping operands when matching complex multiply [PR116463]
    
    This commit fixes the failures of complex.exp=fast-math-complex-mls-*.c on the
    GCC 14 branch and some of the ones on the master.
    
    The current matching just looks for one order for multiplication and was relying
    on canonicalization to always give the right order because of the TWO_OPERANDS.
    
    However when it comes to the multiplication trying only one order is a bit
    fragile as they can be flipped.
    
    The failing tests on the branch are:
    
    void fms180snd(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                   _Complex TYPE c[restrict N]) {
      for (int i = 0; i < N; i++)
        c[i] -= a[i] * (b[i] * I * I);
    }
    
    void fms180fst(_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                   _Complex TYPE c[restrict N]) {
      for (int i = 0; i < N; i++)
        c[i] -= (a[i] * I * I) * b[i];
    }
    
    The issue is just a small difference in commutative operations.
    we look for {R,R} * {R,I} but found {R,I} * {R,R}.
    
    Since the DF analysis is cached, we should be able to swap operands and retry
    for multiply cheaply.
    
    There is a constraint being checked by vect_validate_multiplication for the data
    flow of the operands feeding the multiplications.  So e.g.
    
    between the nodes:
    
    note:   node 0x4d1d210 (max_nunits=2, refcnt=3) vector(2) double
    note:   op template: _27 = _10 * _25;
    note:      stmt 0 _27 = _10 * _25;
    note:      stmt 1 _29 = _11 * _25;
    note:   node 0x4d1d060 (max_nunits=2, refcnt=2) vector(2) double
    note:   op template: _26 = _11 * _24;
    note:      stmt 0 _26 = _11 * _24;
    note:      stmt 1 _28 = _10 * _24;
    
    we require the lanes to come from the same source which
    vect_validate_multiplication checks.  As such it doesn't make sense to flip them
    individually because that would invalidate the earlier linear_loads_p checks
    which have validated that the arguments all come from the same datarefs.
    
    This patch thus flips the operands in unison to still maintain this invariant,
    but also honor the commutative nature of multiplication.
    
    gcc/ChangeLog:
    
            PR tree-optimization/116463
            * tree-vect-slp-patterns.cc (complex_mul_pattern::matches,
            complex_fms_pattern::matches): Try swapping operands on multiply.
    
    (cherry picked from commit a9473f9c6f2d755d2eb79dbd30877e64b4bc6fc8)
Comment 27 Tamar Christina 2024-12-03 16:22:23 UTC
> > 
> > We DO already impose any order on them, but the other operand is oddodd, so
> > the overall order ends up being oddodd because any known permute overrides
> > unknown ones.
> 
> So what's the desired outcome?  I guess PERM_UNKNOWN?  I guess it's
> the "other operand" of an add?  What's the (bad) effect of classifying
> it as ODDODD (optimistically)?
> 
> > So the question is, can we not follow externals in a constructor to figure
> > out if how they are used they all read from the same base and in which order?
> 
> I don't see how it makes sense to do this.  For the above example, what's
> the testcase exhibiting this (and on which arch)?

I've been working on a fix from a different angle for this, which also covers another GCC 14 regression that went unnoticed. I'll post after regressions finish.
Comment 29 Tamar Christina 2024-12-03 23:25:09 UTC
(In reply to Tamar Christina from comment #27)
> > > 
> > > We DO already impose any order on them, but the other operand is oddodd, so
> > > the overall order ends up being oddodd because any known permute overrides
> > > unknown ones.
> > 
> > So what's the desired outcome?  I guess PERM_UNKNOWN?  I guess it's
> > the "other operand" of an add?  What's the (bad) effect of classifying
> > it as ODDODD (optimistically)?
> > 
> > > So the question is, can we not follow externals in a constructor to figure
> > > out if how they are used they all read from the same base and in which order?
> > 
> > I don't see how it makes sense to do this.  For the above example, what's
> > the testcase exhibiting this (and on which arch)?
> 
> I've been working on a fix from a different angle for this, which also
> covers another GCC 14 regression that went unnoticed. I'll post after
> regressions finish.

So I've formalized the handling of TOP a bit better, which gets it to recognize the pattern again; however, it will be dropped as it's not profitable.

The reason it's not profitable is the canonicalization issue mentioned above, which has split the imaginary and real nodes into different computations.

So no matter what you do in the SLP tree, the attached digraph won't make the loads of _5 linear.  Are you OK with me trying that, Richi?
Comment 30 Tamar Christina 2024-12-03 23:26:08 UTC
Created attachment 59779 [details]
pattern.dot

digraph of resulting pattern
Comment 31 Richard Biener 2024-12-04 08:08:48 UTC
(In reply to Tamar Christina from comment #29)
> (In reply to Tamar Christina from comment #27)
> > > > 
> > > > We DO already impose any order on them, but the other operand is oddodd, so
> > > > the overall order ends up being oddodd because any known permute overrides
> > > > unknown ones.
> > > 
> > > So what's the desired outcome?  I guess PERM_UNKNOWN?  I guess it's
> > > the "other operand" of an add?  What's the (bad) effect of classifying
> > > it as ODDODD (optimistically)?
> > > 
> > > > So the question is, can we not follow externals in a constructor to figure
> > > > out if how they are used they all read from the same base and in which order?
> > > 
> > > I don't see how it makes sense to do this.  For the above example, what's
> > > the testcase exhibiting this (and on which arch)?
> > 
> > I've been working on a fix from a different angle for this, which also
> > covers another GCC 14 regression that went unnoticed. I'll post after
> > regressions finish.
> 
> So I've formalized the handling of TOP a bit better.  Which gets it to
> recognize it again, however, it will be dropped as it's not profitable.
> 
> The reason it's not profitable is the canonicalization issue mentioned
> above.  This has split the imaginary and real nodes into different
> computations.
>
> So no matter what you do in the SLP tree, the attached digraph won't make
> the loads of _5 linear.  Are you ok with me trying that Richi?

I can't make sense of that graph - the node feeding the store seems to have
wrong scalar stmts?

What's the testcase for this (and on what arch)?

But yes, the loads of *_5 won't get linear here, but at least the
permute node feeding the complex-add-rot270 can be elided, eventually
even making the external _53, b$real_11 match the other with different
order (though we don't model that, cost-wise).
Comment 32 Tamar Christina 2024-12-04 09:56:43 UTC
(In reply to Richard Biener from comment #31)
> (In reply to Tamar Christina from comment #29)
> > (In reply to Tamar Christina from comment #27)
> > > > > 
> > > > > We DO already impose any order on them, but the other operand is oddodd, so
> > > > > the overall order ends up being oddodd because any known permute overrides
> > > > > unknown ones.
> > > > 
> > > > So what's the desired outcome?  I guess PERM_UNKNOWN?  I guess it's
> > > > the "other operand" of an add?  What's the (bad) effect of classifying
> > > > it as ODDODD (optimistically)?
> > > > 
> > > > > So the question is, can we not follow externals in a constructor to figure
> > > > > out if how they are used they all read from the same base and in which order?
> > > > 
> > > > I don't see how it makes sense to do this.  For the above example, what's
> > > > the testcase exhibiting this (and on which arch)?
> > > 
> > > I've been working on a fix from a different angle for this, which also
> > > covers another GCC 14 regression that went unnoticed. I'll post after
> > > regressions finish.
> > 
> > So I've formalized the handling of TOP a bit better.  Which gets it to
> > recognize it again, however, it will be dropped as it's not profitable.
> > 
> > The reason it's not profitable is the canonicalization issue mentioned
> > above.  This has split the imaginary and real nodes into different
> > computations.
> >
> > So no matter what you do in the SLP tree, the attached digraph won't make
> > the loads of _5 linear.  Are you ok with me trying that Richi?
> 
> I can't make sense of that graph - the node feeding the store seems to have
> wrong scalar stmts?
> 
> What's the testcase for this (and on what arch?).
> 

void fms_elemconjsnd(_Complex TYPE a[restrict N], _Complex TYPE b,
                     _Complex TYPE c[restrict N]) {
  for (int i = 0; i < N; i++)
    c[i] -= a[i] * ~b;
}

compiled with -Ofast -march=armv8.3-a

#define TYPE double
#define I 1.0i
#define N 200
void fms180snd (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * I * I);
}

void fms180snd_1 (_Complex TYPE a[restrict N], _Complex TYPE b[restrict N],
                _Complex TYPE c[restrict N])
{
  _Complex TYPE t = I;
  for (int i=0; i < N; i++)
    c[i] -= a[i] * (b[i] * t * t);
}

is another one where the two functions are the same thing, but the first one is matched and the second one isn't.

> But yes, the loads of *5 won't get linear here, but at least the
> permute node feeding the complex-add-rot270 can be elided, eventually
> even making the external _53, b$real_11 match the other with different
> order (though we don't model that, cost-wise).

But without the loads getting linearized the match will never work, as multi-lane SLP will be immediately cancelled because it assumes load-lanes is cheaper (it's not, but load-lanes doesn't get costed); that's why there's a load permute optimization step after complex pattern matching.

The point, however, is that no permute is needed, *not even for the loads*.

GCC 13 generated:

fms_elemconjsnd:
        fneg    d1, d1
        mov     x2, 0
        dup     v4.2d, v0.d[0]
        dup     v3.2d, v1.d[0]
.L2:
        ldr     q1, [x0, x2]
        ldr     q0, [x1, x2]
        fmul    v2.2d, v3.2d, v1.2d
        fcadd   v0.2d, v0.2d, v2.2d, #270
        fmls    v0.2d, v4.2d, v1.2d
        str     q0, [x1, x2]
        add     x2, x2, 16
        cmp     x2, 3200
        bne     .L2
        ret

which was the optimal sequence.
Comment 33 Richard Biener 2024-12-04 12:54:25 UTC
I see.  Note it's SLP discovery's association code that figures out an SLP graph; disabling this ends up with single-lane (store-lanes) from the start.  The association that "succeeds" first wins, and it's an unfortunate one (for
SLP pattern detection).

The thing is that the re-association greedily figures the best operand
order as well.  We start with

t.c:3:21: note:   pre-sorted chains of plus_expr
plus_expr _19 plus_expr _27 minus_expr _26 
plus_expr _18 minus_expr _29 minus_expr _28

and if we'd start with

plus_expr _19 plus_expr _27 minus_expr _26 
plus_expr _18 minus_expr _28 minus_expr _29 

instead, we get the desired SLP pattern match, but it seems store-lanes is
still preferred (not sure how we got away with no store-lanes in GCC 13).
We could simply refuse to override the SLP graph with load/store-lanes
when patterns were found:

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3892e1be3f2..4fb57a76f85 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -5064,7 +5065,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size,
      to cancel SLP when this applied to all instances in a loop but now
      we decide this per SLP instance.  It's important to do this only
      after SLP pattern recognition.  */
-  if (is_a <loop_vec_info> (vinfo))
+  if (!pattern_found && is_a <loop_vec_info> (vinfo))
     FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
       if (SLP_INSTANCE_KIND (instance) == slp_inst_kind_store
          && !SLP_INSTANCE_TREE (instance)->ldst_lanes)

when starting with the swapped ops above we then get the desired code
again.  I've hacked that in with

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3892e1be3f2..4fb57a76f85 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2275,6 +2275,7 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
          /* 1. pre-sort according to def_type and operation.  */
          for (unsigned lane = 0; lane < group_size; ++lane)
            chains[lane].stablesort (dt_sort_cmp, vinfo);
+         std::swap (chains[1][2], chains[1][1]);
          if (dump_enabled_p ())
            {
              dump_printf_loc (MSG_NOTE, vect_location,

it happens that in this specific case the optimal operand order matches
stmt order so the following produces that - but I'm not positively sure
that's always good (though the 'stablesort' also tries to not disturb
order - but in this case it's the DFS order collecting the scalar ops).
In reality there's not enough info on the op or its definition to locally
decide a better order for future pattern matching.

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 3892e1be3f2..f21e8b909ff 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1684,7 +1684,12 @@ dt_sort_cmp (const void *op1_, const void *op2_, void *)
   auto *op2 = (const chain_op_t *) op2_;
   if (op1->dt != op2->dt)
     return (int)op1->dt - (int)op2->dt;
-  return (int)op1->code - (int)op2->code;
+  if ((int)op1->code != (int)op2->code)
+    return (int)op1->code - (int)op2->code;
+  if (TREE_CODE (op1->op) == SSA_NAME && TREE_CODE (op2->op) == SSA_NAME)
+    return (gimple_uid (SSA_NAME_DEF_STMT (op1->op))
+           - gimple_uid (SSA_NAME_DEF_STMT (op2->op)));
+  return 0;
 }
 
 /* Linearize the associatable expression chain at START with the


That said, I don't have a good idea on how to make this work better, not
even after re-doing SLP discovery.  Maybe SLP patterns need to work on the
initial single-lane SLP graph?  But then we'd have to find lane-matches
on two unconnected SLP sub-graphs which complicates the pattern matching
part.

We basically form SLP nodes from two sets of (two lanes) plus/minus ops
(three each) but we of course try to avoid SLP build of all 3! permutations
possible and stop at the first one that succeeds.
Comment 34 Richard Biener 2024-12-04 13:06:07 UTC
Possibly first computing a lattice value for each SSA name, recording whether its origin is a "real" or an "imag" component of a complex load, could get us such meta information, but even then the individual sorting which determines the initial association to SLP nodes would only be adjustable "cross-lane" (and to what?  I guess combine real+imag parts?  Hopefully of the same entity).  Into vect we get with

  _19 = REALPART_EXPR <*_3>;
  _18 = IMAGPART_EXPR <*_3>;
  _5 = a_14(D) + _2;
  _23 = REALPART_EXPR <*_5>; // real
  _24 = IMAGPART_EXPR <*_5>; // imag
  _26 = b$real_11 * _23; // real?
  _27 = _24 * _53; // imag?
  _28 = _23 * _53;  // mixed?  but fed into imag
  _29 = b$real_11 * _24; // mixed?
  _7 = _18 - _28;  // mixed? or imag?
  _22 = _27 - _26;  // mixed?
  _32 = _19 + _22;  // mixed?  or real?
  _33 = _7 - _29; // mixed?  but fed into real?
  REALPART_EXPR <*_3> = _32;
  IMAGPART_EXPR <*_3> = _33;

so not sure if that will help.  That we'd like to have full load groups
is unfortunately only visible a node deeper.  We could also fill a lattice
with group IDs but we'd need to know parts to identify lane duplicates vs.
full groups.  It's also a lot of hassle for not much gain and very special
cases?
Comment 35 Tamar Christina 2024-12-04 14:32:58 UTC
(In reply to Richard Biener from comment #34)
> Possibly first computing a lattice val for each SSA name whether its origin
> is a "real" or a "imag" component of a complex load could get us meta but
> even then the individual sorting which determines the initial association to
> SLP nodes would be only possible to adjust "cross-lane" (and to what?  I
> guess combine real+imag parts?  Hopefully of the same entity).  Into vect we
> get with
> 
>   _19 = REALPART_EXPR <*_3>;
>   _18 = IMAGPART_EXPR <*_3>;
>   _5 = a_14(D) + _2;
>   _23 = REALPART_EXPR <*_5>; // real
>   _24 = IMAGPART_EXPR <*_5>; // imag
>   _26 = b$real_11 * _23; // real?
>   _27 = _24 * _53; // imag?
>   _28 = _23 * _53;  // mixed?  but fed into imag
>   _29 = b$real_11 * _24; // mixed?
>   _7 = _18 - _28;  // mixed? or imag?
>   _22 = _27 - _26;  // mixed?
>   _32 = _19 + _22;  // mixed?  or real?
>   _33 = _7 - _29; // mixed?  but fed into real?
>   REALPART_EXPR <*_3> = _32;
>   IMAGPART_EXPR <*_3> = _33;
> 
> so not sure if that will help.  That we'd like to have full load groups
> is unfortunately only visible a node deeper.  We could also fill a lattice
> with group IDs but we'd need to know parts to identify lane duplicates vs.
> full groups.  It's also a lot of hassle for not much gain and very special
> cases?

That should help, because all it's after is that the final loads be permuted.
The reason I'm keen to fix this is that it's not that niche: complex-add, the operation being just +/- with a permute, is by far the most common one.

Not recognizing this costs e.g. 10% on fotonik in SPECCPU2017 FP, which is also a regression I'm trying to fix.

I can try to reduce a testcase for that to see if maybe that specific one is easier to fix.  I'm just wondering if we can't do better in the future; e.g. LLVM recognizes both fms180snd cases.

If it's easier, I could see if we can just have another pattern to discover fmul + fcadd?

That could maybe work and fix the SPEC regression... I need to make a testcase.
Comment 36 Andrew Pinski 2025-04-11 01:00:13 UTC
So when I was looking into PR 119393, I found that any slight change in SSA names can sometimes cause g++.dg/vect/simd-complex-num-null-node.cc to fail. I wonder if that is the issue mentioned in comment #33 here. Plus I suspect that that bug and this one have a similar sensitivity to SSA name differences now.