Bug 114440 - Fail to recognize a chain of lane-reduced operations for loop reduction vect
Summary: Fail to recognize a chain of lane-reduced operations for loop reduction vect
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 14.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2024-03-23 10:45 UTC by Feng Xue
Modified: 2024-07-19 09:24 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-03-25 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Feng Xue 2024-03-23 10:45:35 UTC
In a loop reduction path containing a lane-reduced operation (DOT_PROD/SAD/WIDEN_SUM), current vectorizer could not handle the pattern if there are other operations, which might be a normal or another lane-reduced one. A pseudo example is represented as:

   char *d0, *d1;
   char *s0, *s1;
   char *w;
   int *n;

   ...
   int sum = 0;
 
   for (i) {
     ...
     sum += d0[i] * d1[i];       /* DOT_PROD */
     ...
     sum += abs(s0[i] - s1[i]);  /* SAD */
     ...
     sum += w[i];                /* WIDEN_SUM */
     ...
     sum += n[i];                /* Normal */
     ...
   }

   ... = sum;

For the case, reduction vectype would vary with operations, and this causes mismatch on count of vectorized defs and uses, a possible means might be fixing that by generating extra trivial pass-through copies. Given a concrete example as:

   sum = 0; 
   for (i) {
     sum += d0[i] * d1[i];       /* 16*char -> 4*int */
     sum += n[i];                /*   4*int -> 4*int */
   }

Final vetorized statements could be:

   sum_v0 = { 0, 0, 0, 0 };
   sum_v1 = { 0, 0, 0, 0 };
   sum_v2 = { 0, 0, 0, 0 };
   sum_v3 = { 0, 0, 0, 0 };

   for (i / 16) {
     sum_v0 += DOT_PROD (v_d0[i: 0 .. 15], v_d1[i: 0 .. 15]);
     sum_v1 += 0;  // copy
     sum_v2 += 0;  // copy
     sum_v3 += 0;  // copy

     sum_v0 += v_n[i:  0 .. 3];
     sum_v1 += v_n[i:  4 .. 7];
     sum_v2 += v_n[i:  8 .. 11];
     sum_v3 += v_n[i: 12 .. 15]; 
   }

   sum = REDUC_PLUS(sum_v0 + sum_v1 + sum_v2 + sum_v3);

In the above sequence, one summation statement simply forms one pattern. Though, we could easily compose a somewhat more complicated variant that gets into the similar situation. That is, a chain of lane-reduced operations comes from the non-reduction addend in one summation statement, like:

   sum += d0[i] * d1[i] + abs(s0[i] - s1[i]) + n[i];

Probably, this requires some extension in the vector pattern formation stage to split the patterns.
Comment 1 Richard Biener 2024-03-25 09:04:43 UTC
It looks like this should be possible, but of course this performs more "weird" re-association which might be a problem with floating-point (though I don't
know of any lane-reducing ops implemented for FP).

Representing the "scalar" side in the vectorizer IL is tricky though, this
is why the current handling is separated and not integrated with the rest
(so it doesn't compose as you noticed).

Note we lack SLP recognition for dot_prod and sad also because we do not
specify which lanes are combined, so the optabs are a black-box and only
useful when the resulting lanes are reduced.
Comment 2 GCC Commits 2024-07-17 13:58:11 UTC
The master branch has been updated by Feng Xue <fxue@gcc.gnu.org>:

https://gcc.gnu.org/g:178cc419512f7e358f88dfe2336625aa99cd7438

commit r15-2096-g178cc419512f7e358f88dfe2336625aa99cd7438
Author: Feng Xue <fxue@os.amperecomputing.com>
Date:   Wed May 29 17:22:36 2024 +0800

    vect: Support multiple lane-reducing operations for loop reduction [PR114440]
    
    For lane-reducing operation(dot-prod/widen-sum/sad) in loop reduction, current
    vectorizer could only handle the pattern if the reduction chain does not
    contain other operation, no matter the other is normal or lane-reducing.
    
    This patches removes some constraints in reduction analysis to allow multiple
    arbitrary lane-reducing operations with mixed input vectypes in a loop
    reduction chain. For example:
    
       int sum = 1;
       for (i)
         {
           sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
           sum += w[i];               // widen-sum <vector(16) char>
           sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
         }
    
    The vector size is 128-bit vectorization factor is 16. Reduction statements
    would be transformed as:
    
       vector<4> int sum_v0 = { 0, 0, 0, 1 };
       vector<4> int sum_v1 = { 0, 0, 0, 0 };
       vector<4> int sum_v2 = { 0, 0, 0, 0 };
       vector<4> int sum_v3 = { 0, 0, 0, 0 };
    
       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
    
           sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
    
           sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
           sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
         }
    
        sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0 + sum_v1
    
    2024-03-22 Feng Xue <fxue@os.amperecomputing.com>
    
    gcc/
            PR tree-optimization/114440
            * tree-vectorizer.h (vectorizable_lane_reducing): New function
            declaration.
            * tree-vect-stmts.cc (vect_analyze_stmt): Call new function
            vectorizable_lane_reducing to analyze lane-reducing operation.
            * tree-vect-loop.cc (vect_model_reduction_cost): Remove cost computation
            code related to emulated_mixed_dot_prod.
            (vectorizable_lane_reducing): New function.
            (vectorizable_reduction): Allow multiple lane-reducing operations in
            loop reduction. Move some original lane-reducing related code to
            vectorizable_lane_reducing.
            (vect_transform_reduction): Adjust comments with updated example.
    
    gcc/testsuite/
            PR tree-optimization/114440
            * gcc.dg/vect/vect-reduc-chain-1.c
            * gcc.dg/vect/vect-reduc-chain-2.c
            * gcc.dg/vect/vect-reduc-chain-3.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-1.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-2.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
            * gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
            * gcc.dg/vect/vect-reduc-dot-slp-1.c
Comment 3 GCC Commits 2024-07-17 13:58:15 UTC
The master branch has been updated by Feng Xue <fxue@gcc.gnu.org>:

https://gcc.gnu.org/g:db3c8c9726d0bafbb9f85b6d7027fe83602643e7

commit r15-2097-gdb3c8c9726d0bafbb9f85b6d7027fe83602643e7
Author: Feng Xue <fxue@os.amperecomputing.com>
Date:   Wed May 29 17:28:14 2024 +0800

    vect: Optimize order of lane-reducing operations in loop def-use cycles
    
    When transforming multiple lane-reducing operations in a loop reduction chain,
    originally, corresponding vectorized statements are generated into def-use
    cycles starting from 0. The def-use cycle with smaller index, would contain
    more statements, which means more instruction dependency. For example:
    
       int sum = 1;
       for (i)
         {
           sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
           sum += w[i];               // widen-sum <vector(16) char>
           sum += abs(s0[i] - s1[i]); // sad <vector(8) short>
           sum += n[i];               // normal <vector(4) int>
         }
    
    Original transformation result:
    
       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
    
           sum_v0 = WIDEN_SUM (w_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
    
           sum_v0 = SAD (s0_v0[i: 0 ~ 7 ], s1_v0[i: 0 ~ 7 ], sum_v0);
           sum_v1 = SAD (s0_v1[i: 8 ~ 15], s1_v1[i: 8 ~ 15], sum_v1);
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
    
           ...
         }
    
    For a higher instruction parallelism in final vectorized loop, an optimal
    means is to make those effective vector lane-reducing ops be distributed
    evenly among all def-use cycles. Transformed as the below, DOT_PROD,
    WIDEN_SUM and SADs are generated into disparate cycles, instruction
    dependency among them could be eliminated.
    
       for (i / 16)
         {
           sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
           sum_v1 = sum_v1;  // copy
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
    
           sum_v0 = sum_v0;  // copy
           sum_v1 = WIDEN_SUM (w_v1[i: 0 ~ 15], sum_v1);
           sum_v2 = sum_v2;  // copy
           sum_v3 = sum_v3;  // copy
    
           sum_v0 = sum_v0;  // copy
           sum_v1 = sum_v1;  // copy
           sum_v2 = SAD (s0_v2[i: 0 ~ 7 ], s1_v2[i: 0 ~ 7 ], sum_v2);
           sum_v3 = SAD (s0_v3[i: 8 ~ 15], s1_v3[i: 8 ~ 15], sum_v3);
    
           ...
         }
    
    2024-03-22 Feng Xue <fxue@os.amperecomputing.com>
    
    gcc/
            PR tree-optimization/114440
            * tree-vectorizer.h (struct _stmt_vec_info): Add a new field
            reduc_result_pos.
            * tree-vect-loop.cc (vect_transform_reduction): Generate lane-reducing
            statements in an optimized order.