Bug 56717 - Enhance Dot-product pattern recognition to avoid mult widening.
Summary: Enhance Dot-product pattern recognition to avoid mult widening.
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.9.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2013-03-25 10:01 UTC by Yuri Rumyantsev
Modified: 2021-07-21 03:19 UTC
CC List: 1 user

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Yuri Rumyantsev 2013-03-25 10:01:05 UTC
Comparing the performance of the icc and gcc compilers, we found that for one important benchmark from the EEMBC 1.1 suite gcc produces very poor code in comparison with icc. The deficiency can be illustrated by the following simple example:

typedef signed short s16;
typedef signed long  s32;
void bar (s16 *in1, s16 *in2, s16 *out, int n, s16 scale)
{
  int i;
  s32 acc = 0;
  for (i=0; i<n; i++)
    acc += ((s32) in1[i] * (s32) in2[i]) >> scale;
  *out = (s16) acc;
}
gcc performs a widening-multiply conversion for it, which does not look reasonable and leads to suboptimal code, at least for x86.

I assume that dot-product pattern recognition can simply be enhanced to accept such a case by allowing the following statements:

     type x_t, y_t;
     TYPE1 prod;
     TYPE2 sum = init;
   loop:
     sum_0 = phi <init, sum_1>
     S1  x_t = ...
     S2  y_t = ...
     S3  x_T = (TYPE1) x_t;
     S4  y_T = (TYPE1) y_t;
     S5  prod = x_T * y_T;
     [S6  prod = (TYPE2) prod;  #optional]
     S6' prod1 = prod <bin-op> <opnd>
     S7  sum_1 = prod1 + sum_0;

where S6' is vectorizable.
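
For illustration only, the statements above can be mapped onto the example loop as follows. This is a sketch under assumptions: the function name dot_shift is hypothetical, int16_t stands in for "type", int32_t for both TYPE1 and TYPE2 (so the optional S6 is omitted), ">>" is the <bin-op> and scale is the <opnd>.

```c
#include <stdint.h>

/* Hypothetical scalar instance of the proposed pattern (not GCC code).
   Right-shifting a negative value is implementation-defined in C;
   the usual targets shift arithmetically. */
int32_t dot_shift(const int16_t *in1, const int16_t *in2, int n, int scale)
{
    int32_t sum = 0;                     /* TYPE2 sum = init          */
    for (int i = 0; i < n; i++) {
        int16_t x_t = in1[i];            /* S1                        */
        int16_t y_t = in2[i];            /* S2                        */
        int32_t x_T = (int32_t) x_t;     /* S3                        */
        int32_t y_T = (int32_t) y_t;     /* S4                        */
        int32_t prod = x_T * y_T;        /* S5                        */
        int32_t prod1 = prod >> scale;   /* S6': <bin-op> is >>       */
        sum = prod1 + sum;               /* S7                        */
    }
    return sum;
}
```

Because each product is shifted before it is accumulated, the result generally differs from shifting the final sum, which is why the loop is not a plain dot product and needs the extra S6' statement in the pattern.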
Comment 1 Cong Hou 2013-11-08 19:44:07 UTC
ICC's approach is not related to dot-product recognition. It simply finds a smart way to implement the widening multiply (s16 to s32) using PMADDWD.
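
The exact instruction sequence ICC emits is not given here, so the following is only a sketch of one way PMADDWD can produce a widening multiply. It is a portable scalar model rather than real SIMD code: pmaddwd_model and widen_mult4 are hypothetical names, and interleaving each input with zeros (as an unpack against a zero register would) is an assumption, not a confirmed description of ICC's output.

```c
#include <stdint.h>

/* Scalar model of PMADDWD on one 128-bit register: multiply eight signed
   16-bit lanes pairwise and add each adjacent pair of 32-bit products.
   The addition is done in unsigned arithmetic to model the instruction's
   wrap-around without signed overflow. */
void pmaddwd_model(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        int32_t p0 = (int32_t) a[2 * i] * b[2 * i];
        int32_t p1 = (int32_t) a[2 * i + 1] * b[2 * i + 1];
        out[i] = (int32_t) ((uint32_t) p0 + (uint32_t) p1);
    }
}

/* Widening multiply of four s16 pairs: with every odd lane set to zero,
   each PMADDWD result is x[i]*y[i] + 0*0, i.e. the full 32-bit product. */
void widen_mult4(const int16_t x[4], const int16_t y[4], int32_t out[4])
{
    int16_t xa[8], ya[8];
    for (int i = 0; i < 4; i++) {
        xa[2 * i] = x[i];  xa[2 * i + 1] = 0;   /* interleave with zero */
        ya[2 * i] = y[i];  ya[2 * i + 1] = 0;
    }
    pmaddwd_model(xa, ya, out);
}
```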

I will try to make a patch on this issue.


thanks,
Cong
Comment 2 Cong Hou 2013-11-08 23:34:22 UTC
I examined the GCC-generated code and found that the main problem is that the load of 'scale' (the rhs operand of >>) into an xmm register sits inside the loop body, although it could be moved outside.

This happens during the RTL reload pass. For the following code, by contrast, the load of scale stays outside the loop body.


void foo(short* a, short scale, int n) {
  int i;
  for (i=0; i<n; i++)
    a[i] = a[i] >> scale;
}


But for your code here, it does not. I suspect there may be an issue in that pass.

By the way, my tests show that using PMADDWD is no faster than the approach GCC uses now.