typedef double v2df __attribute__((vector_size(16)));

v2df move_sd(v2df a, v2df b)
{
    v2df result = a;
    result[1] = b[1];
    return result;
}

With `-O3 -msse4.1`, LLVM gives:

move_sd(double __vector(2), double __vector(2)): # @move_sd(double __vector(2), double __vector(2))
        blendps xmm0, xmm1, 12          # xmm0 = xmm0[0,1],xmm1[2,3]
        ret

GCC gives:

move_sd(double __vector(2), double __vector(2)):
        unpckhpd xmm1, xmm1
        unpcklpd xmm0, xmm1
        ret
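For reference, the blend LLVM picks is what a hand-written SSE4.1 intrinsic version maps to directly (a sketch, not part of the original report; `move_sd_intrin` is a hypothetical name):

#include <immintrin.h>

__m128d move_sd_intrin(__m128d a, __m128d b)
{
    /* Immediate 2 sets mask bit 1: take lane 0 from a, lane 1 from b.
       With SSE4.1 this compiles to a single blendpd/blendps.  */
    return _mm_blend_pd(a, b, 2);
}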
Similar to PR94864. I'll note that x86 might fare better if on GIMPLE, instead of

  _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
  result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;
  return result_4;

we had a VEC_PERM, but IIRC for two-element vectors this regressed some cases. Note that for this case the IL looks like the above from the start, so pattern-matching an insert of an element from another vector into a permute might be a possibility as well.
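A sketch of the permute form for the v2df case above (assuming the usual concat-style mask, where indices 0-1 select from the first operand and 2-3 from the second):

  _1 = VEC_PERM_EXPR <a_2(D), b_3(D), { 0, 3 }>;
  return _1;

Index 0 keeps a[0] and index 3 selects b[1], which is exactly the insert-of-extract above expressed as a single permute.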
Missing match.pd patterns also include a no-op combination, insertion of an extracted element at the same position:

  (simplify
   (bit_insert @0 (BIT_FIELD_REF @0 @size @pos) @pos)
   (if (size matches)
    @0))

in addition to the requested

  (simplify
   (bit_insert @0 (BIT_FIELD_REF @1 @rsize @rpos) @ipos)
   (if (@0 and @1 are vectors compatible for a vec_perm)
    (vec_perm @0 @1 { shuffle-mask })))
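To make the no-op case concrete, a minimal reproducer in the style of the original report (a hypothetical test; `noop_insert` is an invented name):

typedef double v2df __attribute__((vector_size(16)));

v2df noop_insert(v2df a)
{
    v2df result = a;
    result[1] = a[1];   /* extract at lane 1, reinsert at lane 1 */
    return result;
}

With the first pattern above, this should fold to simply returning `a`.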
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:27de9aa152141e7f3ee66372647d0f2cd94c4b90

commit r14-3381-g27de9aa152141e7f3ee66372647d0f2cd94c4b90
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jul 12 15:01:47 2023 +0200

    tree-optimization/94864 - vector insert of vector extract simplification

    The PRs ask for optimizing of

      _1 = BIT_FIELD_REF <b_3(D), 64, 64>;
      result_4 = BIT_INSERT_EXPR <a_2(D), _1, 64>;

    to a vector permutation.  The following implements this as match.pd
    pattern, improving code generation on x86_64.

    On the RTL level we face the issue that backend patterns inconsistently
    use vec_merge and vec_select of vec_concat to represent permutes.

    I think using a (supported) permute is almost always better than an
    extract plus insert, maybe excluding the case we extract element zero
    and that's aliased to a register that can be used directly for
    insertion (not sure how to query that).

    The patch FAILs one case in gcc.target/i386/avx512fp16-vmovsh-1a.c
    where we now expand from

      __A_28 = VEC_PERM_EXPR <x2.8_9, x1.9_10, { 0, 9, 10, 11, 12, 13, 14, 15 }>;

    instead of

      _28 = BIT_FIELD_REF <x2.8_9, 16, 0>;
      __A_29 = BIT_INSERT_EXPR <x1.9_10, _28, 0>;

    producing a vpblendw instruction instead of the expected vmovsh.
    That's either a missed vec_perm_const expansion optimization or even
    better, an improvement - Zen4 for example has 4 ports to execute
    vpblendw but only 3 for executing vmovsh and both instructions have
    the same size.  The patch XFAILs the sub-testcase.

            PR tree-optimization/94864
            PR tree-optimization/94865
            PR tree-optimization/93080
            * match.pd (bit_insert @0 (BIT_FIELD_REF @1 ..) ..): New pattern
            for vector insertion from vector extraction.
            * gcc.target/i386/pr94864.c: New testcase.
            * gcc.target/i386/pr94865.c: Likewise.
            * gcc.target/i386/avx512fp16-vmovsh-1a.c: XFAIL.
            * gcc.dg/tree-ssa/forwprop-40.c: Likewise.
            * gcc.dg/tree-ssa/forwprop-41.c: Likewise.
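The new pr94864.c testcase presumably pairs the original reproducer with an assembler scan along these lines (a sketch; the dg- directives and the scanned mnemonic are assumptions, not copied from the commit):

/* { dg-do compile } */
/* { dg-options "-O2 -msse4.1" } */

typedef double v2df __attribute__((vector_size(16)));

v2df move_sd(v2df a, v2df b)
{
    v2df result = a;
    result[1] = b[1];
    return result;
}

/* The insert-of-extract should now become a single blend rather than
   the unpckhpd+unpcklpd pair from the original report.  */
/* { dg-final { scan-assembler-not "unpckhpd" } } */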
Fixed.