84101 – [8 Regression] -O3 and -ftree-vectorize trying too hard for function returning trivial pair-of-uint64_t-structure

Bug 84101 - [8 Regression] -O3 and -ftree-vectorize trying too hard for function returning trivial pair-of-uint64_t-structure

Summary: [8 Regression] -O3 and -ftree-vectorize trying too hard for function returnin...

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization (show other bugs)
Version:	7.1.0

Importance:	P2 normal
Target Milestone:	9.0
Assignee:	Richard Biener

URL:
Keywords:	missed-optimization

Duplicates (3):	87062 89739 89849 (view as bug list)
Depends on:
Blocks:	vectorizer 89582
	Show dependency tree / graph

Reported:	2018-01-29 13:06 UTC by Chris Hall
Modified:	2021-05-14 10:30 UTC (History)
CC List:	9 users (show)

See Also:
Host:
Target:	x86_64--, i?86--
Build:
Known to work:	6.3.0, 9.0
Known to fail:	7.1.0
Last reconfirmed:	2018-01-30 00:00:00

Attachments
prototype patch (1.05 KB, patch) 2018-01-30 09:03 UTC, Richard Biener	Details \| Diff
updated vectorizer patch (1.76 KB, patch) 2019-03-28 12:14 UTC, Richard Biener	Details \| Diff
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Chris Hall 2018-01-29 13:06:40 UTC

The following:

  typedef struct uint64_pair uint64_pair_t ;
  struct uint64_pair
  {
    uint64_t  w0 ;
    uint64_t  w1 ;
  } ;

  uint64_pair_t pair(int num)
  {
    uint64_pair_t p ;

    p.w0 = num << 1 ;
    p.w1 = num >> 1 ;

    return p ;
  }

for recent x86_64, under v7.1.0, using "-O3", compiles to:

  pair:
   lea    (%rdi,%rdi,1),%eax
   sar    %edi
   movslq %edi,%rdi
   cltq   
   mov    %rax,-0x18(%rsp)
   movq   -0x18(%rsp),%xmm0
   mov    %rdi,-0x18(%rsp)
   movhps -0x18(%rsp),%xmm0
   movaps %xmm0,-0x18(%rsp)
   mov    -0x18(%rsp),%rax
   mov    -0x10(%rsp),%rdx
   retq

using "-O3 -fno-tree-vectorize", compiles to:

  pair:
   lea    (%rdi,%rdi,1),%eax
   sar    %edi
   movslq %edi,%rdx
   cltq   
   retq

I note that v6.3 produces the shorter code without the "-fno-tree-vectorize".

Comment 1 Richard Biener 2018-01-30 08:50:14 UTC

.optimized:

pair (int num)
{
  struct uint64_pair_t D.1958;
  int _1;
  long unsigned int _2;
  int _3;
  long unsigned int _4;
  vector(2) long unsigned int _9;

  <bb 2> [local count: 1073741825]:
  _1 = num_5(D) << 1;
  _2 = (long unsigned int) _1;
  _3 = num_5(D) >> 1;
  _4 = (long unsigned int) _3;
  _9 = {_2, _4};
  MEM[(struct uint64_pair *)&D.1958] = _9;
  return D.1958;
}

there's (plenty?) of duplicates with the vectorizer making mistakes with respect to ABI details which are not exposed at vectorization time.  Note we don't spill
at expansion time either:

(insn 10 9 11 2 (set (reg:V2DI 95)
        (vec_concat:V2DI (reg:DI 97)
            (reg:DI 99))) "t.c":15 -1
     (nil))
(insn 11 10 12 2 (set (reg:TI 92 [ D.1958 ])
        (subreg:TI (reg:V2DI 95) 0)) "t.c":15 -1
     (nil))
(insn 12 11 16 2 (set (reg:TI 93 [ <retval> ])
        (reg:TI 92 [ D.1958 ])) "t.c":15 -1
     (nil))
(insn 16 12 17 2 (set (reg/i:TI 0 ax)
        (reg:TI 93 [ <retval> ])) "t.c":16 -1
     (nil))
(insn 17 16 0 2 (use (reg/i:TI 0 ax)) "t.c":16 -1
     (nil))

but it's at LRA time the 'ax' TImode reg (register pair!) gets exposed.
From

(insn 10 9 16 2 (set (reg:V2DI 95)
        (vec_concat:V2DI (reg:DI 97)
            (reg:DI 99))) "t.c":15 3744 {vec_concatv2di}
     (expr_list:REG_DEAD (reg:DI 99)
        (expr_list:REG_DEAD (reg:DI 97)
            (nil))))
(insn 16 10 17 2 (set (reg/i:TI 0 ax)
        (subreg:TI (reg:V2DI 95) 0)) "t.c":16 84 {*movti_internal}
     (expr_list:REG_DEAD (reg:V2DI 95)
        (nil)))

we go to (after first spilling the DImode components):

(insn 22 9 24 2 (set (reg:DI 21 xmm0 [95])
        (mem/c:DI (plus:DI (reg/f:DI 7 sp)
                (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S8 A128])) "t.c":15 85 {*movdi_internal}
     (nil))
(insn 24 22 10 2 (set (mem/c:DI (plus:DI (reg/f:DI 7 sp)
                (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S8 A128])
        (reg:DI 5 di [99])) "t.c":15 85 {*movdi_internal}
     (nil))
(insn 10 24 23 2 (set (reg:V2DI 21 xmm0 [95])
        (vec_concat:V2DI (reg:DI 21 xmm0 [95])
            (mem/c:DI (plus:DI (reg/f:DI 7 sp)
                    (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S8 A128]))) "t.c":15 3744 {vec_concatv2di}
     (nil))
(insn 23 10 16 2 (set (mem/c:V2DI (plus:DI (reg/f:DI 7 sp)
                (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S16 A128])
        (reg:V2DI 21 xmm0 [95])) "t.c":15 1255 {movv2di_internal}
     (nil))
(insn 16 23 17 2 (set (reg/i:TI 0 ax)
        (mem/c:TI (plus:DI (reg/f:DI 7 sp)
                (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S16 A128])) "t.c":16 84 {*movti_internal}
     (nil))


This is really hard to avoid in the vectorizer given the decl we return
isn't a RESULT_DECL but a regular VAR_DECL so we have no idea it is
literally returned.

Note the RTL when not vectorizing isn't too different:

(insn 10 9 11 2 (set (reg:DI 97)
        (sign_extend:DI (reg:SI 96))) "t.c":13 -1
     (nil))
(insn 11 10 12 2 (set (subreg:DI (reg:TI 91 [ D.1958 ]) 8)
        (reg:DI 97)) "t.c":15 -1
     (nil))
(insn 12 11 16 2 (set (reg:TI 92 [ <retval> ])
        (reg:TI 91 [ D.1958 ])) "t.c":15 -1
     (nil))
(insn 16 12 17 2 (set (reg/i:TI 0 ax)
        (reg:TI 92 [ <retval> ])) "t.c":16 -1
     (nil))
(insn 17 16 0 2 (use (reg/i:TI 0 ax)) "t.c":16 -1
     (nil))

here it is the subreg1 pass that exposes the register pair and lowers the
subreg:

(insn 10 9 11 2 (set (reg:DI 97)
        (sign_extend:DI (reg:SI 96))) "t.c":13 149 {*extendsidi2_rex64}
     (nil))
(insn 11 10 19 2 (set (reg:DI 100 [ D.1958+8 ])
        (reg:DI 97)) "t.c":15 85 {*movdi_internal}
     (nil))
(insn 19 11 20 2 (set (reg:DI 101 [ <retval> ])
        (reg:DI 99 [ D.1958 ])) "t.c":15 85 {*movdi_internal}
     (nil))
(insn 20 19 21 2 (set (reg:DI 102 [ <retval>+8 ])
        (reg:DI 100 [ D.1958+8 ])) "t.c":15 85 {*movdi_internal}
     (nil))
(insn 21 20 22 2 (set (reg:DI 0 ax)
        (reg:DI 101 [ <retval> ])) "t.c":16 85 {*movdi_internal}
     (nil))
(insn 22 21 17 2 (set (reg:DI 1 dx [+8 ])
        (reg:DI 102 [ <retval>+8 ])) "t.c":16 85 {*movdi_internal}
     (nil))
(insn 17 22 0 2 (use (reg/i:TI 0 ax)) "t.c":16 -1
     (nil))

I imagine it could be made recognizing the (subreg (vec_concat ..)) case
as well...  but would that be a hack?

Comment 2 Richard Biener 2018-01-30 09:03:22 UTC

Created attachment 43287 [details]
prototype patch

So the following would be a hack to avoid vectorizing this.  Hack because it
would be confused by intermediate aggregate copies before returning (that
will eventually be elided).  Hack because for CONCATs we _might_ be able to
undo this at expansion, avoiding the spill dance.  Likewise for PARALLELs.
Likewise properly sized integer regs could be handled with a reg-reg move or be
aliased to vector regs.

Thus some RTL foo is needed here to sanitize the hard_function_value usage.
Anybody?

Comment 3 Richard Biener 2018-08-23 09:23:15 UTC

*** Bug 87062 has been marked as a duplicate of this bug. ***

Comment 4 Trass3r 2018-10-16 10:40:26 UTC

Also happens with pairs of floats:
https://godbolt.org/z/QrP0VD

Comment 5 Jakub Jelinek 2018-12-18 17:16:12 UTC

I actually don't think the right way to fix this is not to SLP vectorize it, but rather be able to undo it during RTL optimizations.
The user could have written it that way already, e.g. as:
#include <string.h>

struct S { long long a, b; };
typedef long long v2di __attribute__((vector_size (16)));

struct S
foo (int x)
{
  struct S p;
  v2di q = { x << 1, x >> 1 };
  memcpy (&p, &q, sizeof (p));
  return p;
}

or similar.

We have:
(insn 10 9 11 2 (set (reg:V2DI 92)
        (vec_concat:V2DI (reg:DI 94)
            (reg:DI 96))) "pr84101-3.c":8:8 4110 {vec_concatv2di}
     (nil))
(insn 11 10 12 2 (set (reg/v:TI 89 [ p ])
        (subreg:TI (reg:V2DI 92) 0)) "pr84101-3.c":9:3 65 {*movti_internal}
     (nil))
(insn 12 11 13 2 (set (reg:TI 88 [ D.1915 ])
        (reg/v:TI 89 [ p ])) "pr84101-3.c":10:10 65 {*movti_internal}
     (nil))
(insn 13 12 14 2 (clobber (reg/v:TI 89 [ p ])) -1
     (nil))
(insn 14 13 18 2 (set (reg:TI 90 [ <retval> ])
        (reg:TI 88 [ D.1915 ])) "pr84101-3.c":10:10 65 {*movti_internal}
     (nil))
(insn 18 14 19 2 (set (reg/i:TI 0 ax)
        (reg:TI 90 [ <retval> ])) "pr84101-3.c":11:1 65 {*movti_internal}
     (nil))
(insn 19 18 0 2 (use (reg/i:TI 0 ax)) "pr84101-3.c":11:1 -1
     (nil))
and because there aren't any half of TImode subregs, the subreg1 pass does nothing.  Combiner already sees
(insn 10 9 18 2 (set (reg:V2DI 92)
        (vec_concat:V2DI (reg:DI 94)
            (reg:DI 96))) "pr84101-3.c":8:8 4110 {vec_concatv2di}
     (expr_list:REG_DEAD (reg:DI 96)
        (expr_list:REG_DEAD (reg:DI 94)
            (nil))))
(insn 18 10 19 2 (set (reg/i:TI 0 ax)
        (subreg:TI (reg:V2DI 92) 0)) "pr84101-3.c":11:1 65 {*movti_internal}
     (expr_list:REG_DEAD (reg:V2DI 92)
        (nil)))
but (because of the hard register destination?) decides not to combine anything into insn 18.  The RA isn't able to cope with this because V2DImode is not HARD_REGNO_MODE_OK in GPRs (but TImode is).
So, where do you think we should deal with it?

Comment 6 Segher Boessenkool 2018-12-18 17:30:44 UTC

I think subreg should deal with this.  What do you mean "there aren't
any half of TImode subregs"?  insn 10 has it split at the start, and
insn 18 at the end wants it split, too.

Comment 7 Jakub Jelinek 2018-12-18 17:34:05 UTC

#c4 has another thing:
struct S { float a, b; };
struct S
foo (float x)
{
  return (struct S) { x, x };
}
where we generate bad code mainly because we have DImode as DECL_MODE of the RESULT_DECL, even when it is a structure containing two floats.  Not sure what we'd get e.g. if we treated it like V2SFmode instead, but that is problematic because it is MMX-ish mode, or if it would be possible to combine it into a vec_concatv2sf somehow.  Though, that should be really tracked in a different PR.

Comment 8 Jakub Jelinek 2018-12-18 17:43:12 UTC

(In reply to Segher Boessenkool from comment #6)
> I think subreg should deal with this.  What do you mean "there aren't
> any half of TImode subregs"?  insn 10 has it split at the start, and
> insn 18 at the end wants it split, too.

By that I mean that there aren't any (subreg:DI (reg:TI) [08]) destinations or similar.  Admittedly lower-subreg also handles shifts, and a few others, but not vec_concat etc.
Plus
V2DI move: original cost 4, split cost 4 * 2
so it decides not to do anything about V2DImode (generally the right thing).

So, we'd need to pattern recognize a vec_concat from scalar modes (integral only) and the result used (single use) in a subreg to twice that big integral mode and
handle it as if it was setting the two halves of the wider reg.

Comment 9 Segher Boessenkool 2018-12-18 17:49:21 UTC

But that is exactly the kind of thing lower-subreg is supposed to do?

Comment 10 Jakub Jelinek 2019-01-31 13:38:05 UTC

lower-subreg.c doesn't consider this for multiple reasons: 1) it doesn't have VEC_CONCAT handling, but that could be easily added 2) V2DImode isn't considered, because its move cost is the same as scalar move cost 3) in
(insn 10 9 11 2 (set (reg:V2DI 90)
        (vec_concat:V2DI (reg:DI 92)
            (reg:DI 94))) "pr84101.c":9:10 4128 {vec_concatv2di}
     (nil))
(insn 11 10 12 2 (set (reg:TI 87 [ D.1913 ])
        (subreg:TI (reg:V2DI 90) 0)) "pr84101.c":9:10 65 {*movti_internal}
     (nil))
(insn 12 11 16 2 (set (reg:TI 88 [ <retval> ])
        (reg:TI 87 [ D.1913 ])) "pr84101.c":9:10 65 {*movti_internal}
     (nil))
(insn 16 12 17 2 (set (reg/i:TI 0 ax)
        (reg:TI 88 [ <retval> ])) "pr84101.c":10:1 65 {*movti_internal}
     (nil))
there aren't any reasons to make the pseudos 87 or 88 decomposable, the result is only used as TImode, not in DImode subregs thereof.
So, right now pseudo 90 ends up in non_decomposable_context (something could be done about that), but as nothing ends up being in decomposable_context, nothing is done anyway.

Now, I guess the reason why we should split somewhere this V2DI appart is mainly the high cost of moving the (2!) integer registers from GPRs to SSE registers and move the result back, maybe lower-subreg.c would need to treat it differently based on the costs of VEC_CONCAT with integral operands (though, x86 rtx_cost claims it is very cheap).

Unfortunately, HARD_REGNO_MODE_OK doesn't allow V2DImode to live in a pair of GPRs, so the RA can't solve this say through having the vec_concat V2DI pattern have a =r,r,r alternative.

Comment 11 Richard Biener 2019-03-18 09:33:06 UTC

*** Bug 89739 has been marked as a duplicate of this bug. ***

Comment 12 Richard Biener 2019-03-18 09:33:31 UTC

Other testcase:

using u64 = unsigned long long;
struct u128 {u64 a, b;};

inline u64 load8(void* ptr) {
    u64 out;
    __builtin_memcpy(&out, ptr, 8);
    return out;
}

u128 load(char* basep, u64 n) {
    return {load8(basep), load8(basep+n-8)};
}

Comment 13 Richard Biener 2019-03-27 10:54:00 UTC

Not really working on this.

Comment 14 Richard Biener 2019-03-27 11:09:45 UTC

Just looking at what we feed combine:

(insn 9 8 15 2 (set (reg:V2DI 89)
        (vec_concat:V2DI (reg:DI 90 [ num ])
            (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di}
     (expr_list:REG_DEAD (reg:DI 92)
        (expr_list:REG_DEAD (reg:DI 90 [ num ])
            (nil))))
(insn 15 9 16 2 (set (reg/i:TI 0 ax)
        (subreg:TI (reg:V2DI 89) 0)) "t.c":10:1 65 {*movti_internal}
     (expr_list:REG_DEAD (reg:V2DI 89)
        (nil)))

I wonder why we can't "simplify" this into individual sets of the
hardreg pair?  fwprop sees the same thing so that's another possible
fixing point.  Not sure if the backend in the end would like to
see the above TImode set decomposed though...

Comment 15 Jakub Jelinek 2019-03-27 11:16:00 UTC

(In reply to Richard Biener from comment #14)
> Just looking at what we feed combine:
> 
> (insn 9 8 15 2 (set (reg:V2DI 89)
>         (vec_concat:V2DI (reg:DI 90 [ num ])
>             (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di}
>      (expr_list:REG_DEAD (reg:DI 92)
>         (expr_list:REG_DEAD (reg:DI 90 [ num ])
>             (nil))))
> (insn 15 9 16 2 (set (reg/i:TI 0 ax)
>         (subreg:TI (reg:V2DI 89) 0)) "t.c":10:1 65 {*movti_internal}
>      (expr_list:REG_DEAD (reg:V2DI 89)
>         (nil)))
> 
> I wonder why we can't "simplify" this into individual sets of the
> hardreg pair?  fwprop sees the same thing so that's another possible
> fixing point.  Not sure if the backend in the end would like to
> see the above TImode set decomposed though...

We surely want to decompose it in these testcases.  The big question is find out in which pass to do that (which has a reasonable infrastructure), what cost and what not to check etc.  The testcase show something that is clearly undesirable without any costs, vec_concating scalar regs into a vector only to subreg it into a scalar hard reg...  But now, if it wasn't into a GPR reg, but just TImode in some pseudo that it would be beneficial to reload into a vector reg and then operate in vector reg, it wouldn't be a win.  On the other side, if we don't get rid of those vector modes before reload, RA will choose vector registers for those.

Comment 16 Richard Biener 2019-03-27 11:36:42 UTC

(In reply to Jakub Jelinek from comment #15)
> (In reply to Richard Biener from comment #14)
> > Just looking at what we feed combine:
> > 
> > (insn 9 8 15 2 (set (reg:V2DI 89)
> >         (vec_concat:V2DI (reg:DI 90 [ num ])
> >             (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di}
> >      (expr_list:REG_DEAD (reg:DI 92)
> >         (expr_list:REG_DEAD (reg:DI 90 [ num ])
> >             (nil))))
> > (insn 15 9 16 2 (set (reg/i:TI 0 ax)
> >         (subreg:TI (reg:V2DI 89) 0)) "t.c":10:1 65 {*movti_internal}
> >      (expr_list:REG_DEAD (reg:V2DI 89)
> >         (nil)))
> > 
> > I wonder why we can't "simplify" this into individual sets of the
> > hardreg pair?  fwprop sees the same thing so that's another possible
> > fixing point.  Not sure if the backend in the end would like to
> > see the above TImode set decomposed though...
> 
> We surely want to decompose it in these testcases.  The big question is find
> out in which pass to do that (which has a reasonable infrastructure), what
> cost and what not to check etc.  The testcase show something that is clearly
> undesirable without any costs, vec_concating scalar regs into a vector only
> to subreg it into a scalar hard reg...  But now, if it wasn't into a GPR
> reg, but just TImode in some pseudo that it would be beneficial to reload
> into a vector reg and then operate in vector reg, it wouldn't be a win.  On
> the other side, if we don't get rid of those vector modes before reload, RA
> will choose vector registers for those.

I wonder if we should turn
  (subreg:TI (vec_concat:... ))
into
  (set (subreg:DI (reg:TI ... 0)))
  (set (subreg:DI (reg:TI ... 8)))
which is what we handle nicely it sems.  That means sth has to split
out the subreg into a separate instruction again or we need to make
fwprop1 not convert

(insn 9 8 10 2 (set (reg:V2DI 89)
        (vec_concat:V2DI (reg:DI 90)
            (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di}
     (nil))
(insn 10 9 11 2 (set (reg:TI 86 [ D.1921 ])
        (subreg:TI (reg:V2DI 89) 0)) "t.c":9:12 65 {*movti_internal}
     (nil))
(insn 11 10 15 2 (set (reg:TI 87 [ <retval> ])
        (reg:TI 86 [ D.1921 ])) "t.c":9:12 65 {*movti_internal}
     (nil))
(insn 15 11 16 2 (set (reg/i:TI 0 ax)
        (reg:TI 87 [ <retval> ])) "t.c":10:1 65 {*movti_internal}
     (nil))

into

(insn 9 8 15 2 (set (reg:V2DI 89)
        (vec_concat:V2DI (reg:DI 90 [ num ])
            (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di}
     (expr_list:REG_DEAD (reg:DI 92)
        (expr_list:REG_DEAD (reg:DI 90 [ num ])
            (nil))))
(insn 15 9 16 2 (set (reg/i:TI 0 ax)
        (subreg:TI (reg:V2DI 89) 0)) "t.c":10:1 65 {*movti_internal}
     (expr_list:REG_DEAD (reg:V2DI 89)
        (nil)))

but instead massage it into the above suggested form.

Comment 17 Richard Biener 2019-03-27 11:40:26 UTC

Oh, it's CSE forwarding the subreg already.

Comment 18 Richard Biener 2019-03-27 12:16:05 UTC

The following "fixes"

struct ciao { long a; long b; };
struct ciao square(int num) {
    struct ciao beta;
    beta.a = num;
    beta.b = num*num;
    return beta;
}

producing wrong code though, somehow forgetting the upper half.  The idea
was to give the constructor expansion an idea of the target (TImode reg)
so it can optimize for that.  Not sure what goes wrong - I seem to get
all things twice but still somewhat correct...

Index: gcc/expr.c
===================================================================
--- gcc/expr.c  (revision 269960)
+++ gcc/expr.c  (working copy)
@@ -7018,7 +7018,9 @@ store_field (rtx target, poly_int64 bits
            }
        }
 
-      temp = expand_normal (exp);
+      temp = expand_expr (exp, target, VOIDmode, EXPAND_NORMAL);//normal (exp);
+      if (temp == target)
+       return const0_rtx;
 
       /* We don't support variable-sized BLKmode bitfields, since our
         handling of BLKmode is bound up with the ability to break
@@ -7641,6 +7643,8 @@ safe_from_p (const_rtx x, tree exp, int
              return 0;
          return 1;
        }
+      else if (TREE_CODE (exp) == SSA_NAME)
+       return 1;
       else if (TREE_CODE (exp) == ERROR_MARK)
        return 1;       /* An already-visited SAVE_EXPR? */
       else

Comment 19 Andrew Pinski 2019-03-27 12:18:57 UTC

*** Bug 89849 has been marked as a duplicate of this bug. ***

Comment 20 Richard Biener 2019-03-27 12:27:47 UTC

Ah.

Index: gcc/expr.c
===================================================================
--- gcc/expr.c  (revision 269960)
+++ gcc/expr.c  (working copy)
@@ -7018,7 +7018,11 @@ store_field (rtx target, poly_int64 bits
            }
        }
 
-      temp = expand_normal (exp);
+      temp = expand_expr (exp,
+                         known_eq (bitpos, 0) ? target : NULL_RTX,
+                         VOIDmode, EXPAND_NORMAL);
+      if (temp == target)
+       return const0_rtx;
 
       /* We don't support variable-sized BLKmode bitfields, since our
         handling of BLKmode is bound up with the ability to break
@@ -7641,6 +7645,8 @@ safe_from_p (const_rtx x, tree exp, int
              return 0;
          return 1;
        }
+      else if (TREE_CODE (exp) == SSA_NAME)
+       return 1;
       else if (TREE_CODE (exp) == ERROR_MARK)
        return 1;       /* An already-visited SAVE_EXPR? */
       else

Comment 21 Richard Biener 2019-03-27 13:14:31 UTC

Needs more defensiveness.  Also the safe_from_p change might not be safe
in case we ever TER sth like

 _1 = BIT_FIELD_REF<vector, ...>;
 _2 = BIT_FIELD_REF<vector, ...>;
 vector = { _2, _1 };

which we do...

typedef double v2df __attribute__((vector_size(16)));
v2df v;
void foo()
{
  v = (v2df){v[1], v[0]};
}

and expand to

(insn 7 6 8 (set (reg:DF 87)
        (mem/j/c:DF (plus:DI (reg/f:DI 86)
                (const_int 8 [0x8])) [1 v+8 S8 A64])) "t4.c":5:5 -1
     (nil))

(insn 8 7 9 (set (mem/c:DF (reg/f:DI 85) [1 v+0 S8 A128])
        (reg:DF 87)) "t4.c":5:5 -1
     (nil))

...

(insn 11 10 12 (set (reg:DF 90)
        (mem/j/c:DF (reg/f:DI 89) [1 v+0 S8 A128])) "t4.c":5:5 -1
     (nil))

(insn 12 11 0 (set (mem/c:DF (reg/f:DI 88) [1 v+8 S8 A64])
        (reg:DF 90)) "t4.c":5:5 -1
     (nil))

after that change :/  So much for this nice trick.  The safe_from_p
change would be safe for SSA names we do not TER but for the testcases
we do TER.  There's no existing simple test in the C testsuite so I'll
add the above one.

Comment 22 Richard Biener 2019-03-27 13:25:17 UTC

But we can side-step this issue with changing the way expand_constructor
works:

Index: gcc/expr.c
===================================================================
--- gcc/expr.c  (revision 269963)
+++ gcc/expr.c  (working copy)
@@ -7018,7 +7018,16 @@ store_field (rtx target, poly_int64 bits
            }
        }
 
-      temp = expand_normal (exp);
+      if (REG_P (target)
+         && known_eq (bitpos, 0)
+         && known_eq (bitsize, GET_MODE_BITSIZE (GET_MODE (target))))
+       {
+         temp = expand_expr (exp, target, VOIDmode, EXPAND_NORMAL);
+         if (temp == target)
+           return const0_rtx;
+       }
+      else
+       temp = expand_normal (exp);
 
       /* We don't support variable-sized BLKmode bitfields, since our
         handling of BLKmode is bound up with the ability to break
@@ -8191,7 +8202,15 @@ expand_constructor (tree exp, rtx target
       if (avoid_temp_mem)
        return NULL_RTX;
 
-      target = assign_temp (type, TREE_ADDRESSABLE (exp), 1);
+      if (target && REG_P (target))
+       {
+         rtx tem = gen_reg_rtx (GET_MODE (target));
+         store_constructor (exp, tem, 0, int_expr_size (exp), false);
+         emit_move_insn (target, tem);
+         return target;
+       }
+      else
+       target = assign_temp (type, TREE_ADDRESSABLE (exp), 1);
     }
 
   store_constructor (exp, target, 0, int_expr_size (exp), false);

Comment 23 Richard Biener 2019-03-28 10:56:16 UTC

Fallout is:

FAIL: gcc.dg/pr85195.c (internal compiler error)

where we handle V1TI = {_2} with _2 = (__int128) int_1; this way and
end up calling convert_move from SImode to V1TImode (instead of TImode).
Looks like a pre-existing bug to me, not quickly sure where to fix.

FAIL: c-c++-common/torture/builtin-convertvector-1.c   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  (internal compiler error)

Similar issue, we end up in expand_fix to TImode from V4SFmode.  The generic
tree code expanders in expand_expr_real_2 simply pass down target (here
to expand_fix).  Not sure what constraints we have on target for expand_expr,
the docs just say "The value may be stored in TARGET if TARGET is nonzero.
TARGET is just a suggestion; callers must assume that the rtx returned may
not be the same as TARGET." and further down "Note that TARGET may have neither TMODE nor MODE.  In that case, it probably will not be used."  So it looks
to me these high-level expanders need to be more careful in what they
pass to functions like expand_fix or convert_move.

FAIL: gfortran.dg/allocatable_function_8.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  (internal compiler error)
FAIL: gfortran.fortran-torture/execute/entry_4.f90,  -O1  (internal compiler error)

Same issue for expand_float.

That's all issues in the testsuite but with those it may look a little too
risky for GCC 9.  At least it would solve the DImode pair issue which
might happen quite often in practice :/

The following guards the above three cases:

@@ -8543,7 +8560,9 @@ expand_expr_real_2 (sepops ops, rtx targ
        op0 = gen_rtx_fmt_e (TYPE_UNSIGNED (TREE_TYPE (treeop0))
                             ? ZERO_EXTEND : SIGN_EXTEND, mode, op0);
 
-      else if (target == 0)
+      else if (target == 0
+              || (VECTOR_MODE_P (GET_MODE (target))
+                  != VECTOR_MODE_P (GET_MODE (op0))))
        op0 = convert_to_mode (mode, op0,
                               TYPE_UNSIGNED (TREE_TYPE
                                              (treeop0)));
@@ -9019,14 +9038,20 @@ expand_expr_real_2 (sepops ops, rtx targ
 
     case FIX_TRUNC_EXPR:
       op0 = expand_normal (treeop0);
-      if (target == 0 || modifier == EXPAND_STACK_PARM)
+      if (target == 0 || modifier == EXPAND_STACK_PARM
+         || (target
+             && (VECTOR_MODE_P (GET_MODE (target))
+                 != VECTOR_MODE_P (GET_MODE (op0)))))
        target = gen_reg_rtx (mode);
       expand_fix (target, op0, unsignedp);
       return target;
 
     case FLOAT_EXPR:
       op0 = expand_normal (treeop0);
-      if (target == 0 || modifier == EXPAND_STACK_PARM)
+      if (target == 0 || modifier == EXPAND_STACK_PARM
+         || (target
+             && (VECTOR_MODE_P (GET_MODE (target))
+                 != VECTOR_MODE_P (GET_MODE (op0)))))
        target = gen_reg_rtx (mode);
       /* expand_float can't figure out what to do if FROM has VOIDmode.
         So give it the correct mode.  With -O, cse will optimize this.  */

but then the testcases run into other similar issues all through RTL
expansion :/

Comment 24 Richard Biener 2019-03-28 12:14:59 UTC

Created attachment 46044 [details]
updated vectorizer patch

I've updated the vectorizer patch that was posted to the mailing list last year.
I changed it to also apply to the non-SLP case and instead of looking for a
non-vector-mode return RTX (which would pessimize word_mode vectorization)
look at hard_regno_nregs and only pessimize multi-reg return locations.

I've removed CONCAT and PARALLEL handling with PARALLEL left as todo
(not sure what to do there, I'd need to have a target / testcase combo
to look at - I suppose if any of the return locations is "compatible"
with the vector result we want to not pessimize).

Comment 25 Richard Biener 2019-04-03 12:31:15 UTC

Author: rguenth
Date: Wed Apr  3 12:30:16 2019
New Revision: 270123

URL: https://gcc.gnu.org/viewcvs?rev=270123&root=gcc&view=rev
Log:
2019-04-03  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/84101
	* tree-vect-stmts.c: Include explow.h for hard_function_value,
	regs.h for hard_regno_nregs.
	(cfun_returns): New helper.
	(vect_model_store_cost): When vectorizing a store to a decl
	we return and the function ABI returns in a multi-reg location
	account for the possible spilling that will happen.

	* gcc.target/i386/pr84101.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr84101.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c

Comment 26 Richard Biener 2019-04-03 12:31:37 UTC

Mitigation installed on trunk (sofar).  Note it's likely not a full solution.

Comment 27 Richard Biener 2019-11-14 07:53:41 UTC

The GCC 7 branch is being closed, re-targeting to GCC 8.4.

Comment 28 Jakub Jelinek 2020-03-04 09:34:55 UTC

GCC 8.4.0 has been released, adjusting target milestone.

Comment 29 Jakub Jelinek 2021-05-14 10:30:00 UTC

The GCC 8 branch is being closed, mitigated for GCC 9.1.