Bug 100696 - mult_highpart is not vectorized

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	12.0

Importance:	P3 normal
Target Milestone:	12.0
Assignee:	Kewen Lin

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	vectorizer

Reported:	2021-05-20 09:26 UTC by Uroš Bizjak
Modified:	2021-08-25 03:01 UTC
CC List:	3 users

See Also:
Host:
Target:	x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2021-05-20 00:00:00

Description Uroš Bizjak 2021-05-20 09:26:18 UTC

The following testcases:

--cut here--
#define N 4

short r[N], a[N], b[N];
unsigned short ur[N], ua[N], ub[N];

void mul (void)
{
  int i;

  for (i = 0; i < N; i++)
    r[i] = a[i] * b[i];
}

/* { dg-final { scan-assembler "pmullw" } } */

void mulhi (void)
{
  int i;

  for (i = 0; i < N; i++)
    r[i] = ((int) a[i] * b[i]) >> 16;
}

/* { dg-final { scan-assembler "pmulhw" } } */

void umulhi (void)
{
  int i;

  for (i = 0; i < N; i++)
    ur[i] = ((unsigned int) ua[i] * ub[i]) >> 16;
}

/* { dg-final { scan-assembler "pmulhuw" } } */

void smulhrs (void)
{
  int i;

  for (i = 0; i < N; i++)
    r[i] = ((((int) a[i] * b[i]) >> 14) + 1) >> 1;
}

/* { dg-final { scan-assembler "pmulhrsw" } } */
--cut here--

should all vectorize on x86_64 with "-O3 -mssse3" to the vector instructions named in their scan-assembler directives.

Currently the compiler vectorizes only pmullw and the much more complex pmulhrsw, but not pmulhw and pmulhuw.

For N = 2 (SLP vectorization?), the compiler manages to vectorize only mul and none of the other testcases.
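
For reference, the per-element arithmetic that mulhi and umulhi ask for can be written as two scalar helpers. This is only a sketch: the names mulhi_ref and umulhi_ref are hypothetical and not part of the testcase, and the signed version assumes that >> on a negative int is an arithmetic shift, which GCC provides but ISO C leaves implementation-defined.

```c
#include <stdint.h>

/* Hypothetical reference helper: signed 16x16 -> 32 multiply, keep the
   upper 16 bits.  This is what one lane of pmulhw computes.  Assumes an
   arithmetic right shift for negative values (true for GCC). */
static int16_t mulhi_ref(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * b;   /* |product| <= 2^30, no overflow */
    return (int16_t)(product >> 16);
}

/* Hypothetical reference helper: unsigned 16x16 -> 32 multiply, keep the
   upper 16 bits.  This is what one lane of pmulhuw computes. */
static uint16_t umulhi_ref(uint16_t a, uint16_t b)
{
    uint32_t product = (uint32_t)a * b; /* at most 0xFFFE0001, fits */
    return (uint16_t)(product >> 16);
}
```

For example, mulhi_ref(0x4000, 0x4000) yields 4096 and umulhi_ref(0xFFFF, 0xFFFF) yields 0xFFFE.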

Comment 1 Richard Biener 2021-05-20 11:14:26 UTC

We only have a widening multiplication pattern, not a separate high-part one.
When you increase N to 8, you'll see that we need a VF of 8 to do

  <bb 2> [local count: 119292720]:
  vect__1.24_26 = MEM <vector(8) short int> [(short int *)&a];
  vect__3.27_23 = MEM <vector(8) short int> [(short int *)&b];
  vect_patt_29.28_22 = WIDEN_MULT_LO_EXPR <vect__3.27_23, vect__1.24_26>;
  vect_patt_29.28_21 = WIDEN_MULT_HI_EXPR <vect__3.27_23, vect__1.24_26>;
  vect__6.29_20 = vect_patt_29.28_22 >> 16;
  vect__6.29_19 = vect_patt_29.28_21 >> 16;
  vect__7.30_18 = VEC_PACK_TRUNC_EXPR <vect__6.29_20, vect__6.29_19>;
  MEM <vector(8) short int> [(short int *)&r] = vect__7.30_18;

resulting in

mulhi:
.LFB1:
        .cfi_startproc
        movdqa  b(%rip), %xmm0
        pmullw  a(%rip), %xmm0
        movdqa  %xmm0, %xmm1
        movdqa  b(%rip), %xmm2
        pmulhw  a(%rip), %xmm2
        punpcklwd       %xmm2, %xmm1
        punpckhwd       %xmm2, %xmm0
        psrad   $16, %xmm1
        psrad   $16, %xmm0
        pshufb  .LC0(%rip), %xmm1
        pshufb  .LC1(%rip), %xmm0
        por     %xmm1, %xmm0
        movaps  %xmm0, r(%rip)
        ret
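
The dump above widens, shifts and packs, whereas pmulhw alone already gives the upper half of each product. A small scalar model, assuming 8 lanes of 16-bit elements as in the N = 8 case, shows that both routes produce the same lanes. All function names here are hypothetical, and the code relies on GCC's behaviour for signed right shifts and for narrowing conversions to int16_t.

```c
#include <stdint.h>

#define LANES 8

/* Route A: model of the GIMPLE above.  Widen each product to 32 bits
   (WIDEN_MULT_LO/HI_EXPR), shift right by 16, then truncate back to
   16 bits (VEC_PACK_TRUNC_EXPR). */
static void mulhi_widen_shift_pack(const int16_t *a, const int16_t *b,
                                   int16_t *r)
{
    int32_t wide[LANES];
    for (int i = 0; i < LANES; i++)
        wide[i] = (int32_t)a[i] * b[i];
    for (int i = 0; i < LANES; i++)
        wide[i] >>= 16;                 /* arithmetic shift assumed */
    for (int i = 0; i < LANES; i++)
        r[i] = (int16_t)wide[i];
}

/* Route B: take the upper 16 bits of each product directly, which is
   what a single high-part multiply per lane would deliver. */
static void mulhi_direct(const int16_t *a, const int16_t *b, int16_t *r)
{
    for (int i = 0; i < LANES; i++) {
        uint32_t bits = (uint32_t)((int32_t)a[i] * b[i]);
        r[i] = (int16_t)(uint16_t)(bits >> 16);
    }
}

/* Returns 1 when both routes agree on every lane, 0 otherwise. */
static int mulhi_paths_agree(const int16_t *a, const int16_t *b)
{
    int16_t ra[LANES], rb[LANES];
    mulhi_widen_shift_pack(a, b, ra);
    mulhi_direct(a, b, rb);
    for (int i = 0; i < LANES; i++)
        if (ra[i] != rb[i])
            return 0;
    return 1;
}
```

Since the two routes agree lane by lane, the shift and pack steps add no information beyond what the high-part multiply already provides.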

For smulhrs there's a special pattern:

t.c:40:17: note:   widen_mult pattern recognized: patt_37 = _1 w* _3;
t.c:40:17: note:   vect_recog_mulhs_pattern: detected: _8 = _7 >> 1;
t.c:40:17: note:   created pattern stmt: patt_36 = .MULHRS (_1, _3);
t.c:40:17: note:   mult_high pattern recognized: patt_35 = (int) patt_36;
t.c:40:17: note:   extra pattern stmt: patt_36 = .MULHRS (_1, _3);
t.c:40:17: note:   vect_is_simple_use: operand _7 >> 1, type of def: internal
t.c:40:17: note:   vect_is_simple_use: operand .MULHRS (_1, _3), type of def: internal
t.c:40:17: note:   vect_recog_cast_forwprop_pattern: detected: _9 = (short int) _8;
t.c:40:17: note:   cast_forwprop pattern recognized: patt_34 = (short int) patt_36;

so we are missing something similar for the [u]mulhi cases.
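
To make the difference between the recognized .MULHRS form and a plain high-part multiply concrete, here is a scalar sketch of both. The helper names are hypothetical; the expressions are taken from the smulhrs and mulhi testcases in the description, and an arithmetic right shift on negative values is assumed, as GCC provides.

```c
#include <stdint.h>

/* Hypothetical helper: the smulhrs expression from the testcase.
   Shifting by 14, adding 1 and shifting by 1 rounds the product to the
   nearest multiple of 2^15 before scaling, so the net shift is 15. */
static int16_t smulhrs_ref(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * b;
    return (int16_t)(((product >> 14) + 1) >> 1);
}

/* Hypothetical helper: the plain high part from the mulhi testcase.
   No rounding step, and the shift is 16 rather than 15. */
static int16_t smulhi_ref(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * b;
    return (int16_t)(product >> 16);
}
```

With a = b = 0x4000, smulhrs_ref returns 8192 while smulhi_ref returns 4096. With a = 3 and b = 0x2000, smulhrs_ref rounds 0.75 up to 1, while smulhi_ref returns 0.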

Comment 2 Uroš Bizjak 2021-05-20 18:14:18 UTC

Related: PR89386.

CC added.

Comment 3 GCC Commits 2021-07-20 03:19:19 UTC

The master branch has been updated by Kewen Lin <linkw@gcc.gnu.org>:

https://gcc.gnu.org/g:a1d27560770818c514ad1ad6683f89e1e1bcd0ec

commit r12-2404-ga1d27560770818c514ad1ad6683f89e1e1bcd0ec
Author: Kewen Lin <linkw@linux.ibm.com>
Date:   Mon Jul 19 20:49:17 2021 -0500

    vect: Recog mul_highpart pattern [PR100696]
    
    This patch is to extend the existing pattern mulhs handlings
    to cover normal multiply highpart pattern recognization, it
    introduces one new internal function IFN_MULH for 1:1 map to
    [su]mul_highpart optab.  Since it covers MULT_HIGHPART_EXPR
    with optab support, i386 part change is to ensure it follows
    the consistent costing path.
    
    Bootstrapped & regtested on powerpc64le-linux-gnu P9,
    x86_64-redhat-linux and aarch64-linux-gnu.
    
    gcc/ChangeLog:
    
            PR tree-optimization/100696
            * internal-fn.c (first_commutative_argument): Add info for IFN_MULH.
            * internal-fn.def (IFN_MULH): New internal function.
            * tree-vect-patterns.c (vect_recog_mulhs_pattern): Add support to
            recog normal multiply highpart as IFN_MULH.
            * config/i386/i386.c (ix86_add_stmt_cost): Adjust for combined
            function CFN_MULH.
    
    gcc/testsuite/ChangeLog:
    
            PR tree-optimization/100696
            * gcc.target/i386/pr100637-3w.c: Adjust for mul_highpart recog.

Comment 4 Kewen Lin 2021-07-20 03:21:25 UTC

Should be fixed on trunk.