The following:

  typedef struct uint64_pair uint64_pair_t;

  struct uint64_pair {
    uint64_t w0;
    uint64_t w1;
  };

  uint64_pair_t pair(int num)
  {
    uint64_pair_t p;
    p.w0 = num << 1;
    p.w1 = num >> 1;
    return p;
  }

targeting recent x86_64, with GCC 7.1.0 at -O3, compiles to:

  pair:
    lea    (%rdi,%rdi,1),%eax
    sar    %edi
    movslq %edi,%rdi
    cltq
    mov    %rax,-0x18(%rsp)
    movq   -0x18(%rsp),%xmm0
    mov    %rdi,-0x18(%rsp)
    movhps -0x18(%rsp),%xmm0
    movaps %xmm0,-0x18(%rsp)
    mov    -0x18(%rsp),%rax
    mov    -0x10(%rsp),%rdx
    retq

With -O3 -fno-tree-vectorize it compiles to:

  pair:
    lea    (%rdi,%rdi,1),%eax
    sar    %edi
    movslq %edi,%rdx
    cltq
    retq

Note that GCC 6.3 produces the shorter code without -fno-tree-vectorize.
.optimized:

  pair (int num)
  {
    struct uint64_pair_t D.1958;
    int _1;
    long unsigned int _2;
    int _3;
    long unsigned int _4;
    vector(2) long unsigned int _9;

    <bb 2> [local count: 1073741825]:
    _1 = num_5(D) << 1;
    _2 = (long unsigned int) _1;
    _3 = num_5(D) >> 1;
    _4 = (long unsigned int) _3;
    _9 = {_2, _4};
    MEM[(struct uint64_pair *)&D.1958] = _9;
    return D.1958;
  }

There are (plenty of?) duplicates of this, with the vectorizer making mistakes with respect to ABI details that are not exposed at vectorization time. Note we don't spill at expansion time either:

(insn 10 9 11 2 (set (reg:V2DI 95) (vec_concat:V2DI (reg:DI 97) (reg:DI 99))) "t.c":15 -1 (nil))
(insn 11 10 12 2 (set (reg:TI 92 [ D.1958 ]) (subreg:TI (reg:V2DI 95) 0)) "t.c":15 -1 (nil))
(insn 12 11 16 2 (set (reg:TI 93 [ <retval> ]) (reg:TI 92 [ D.1958 ])) "t.c":15 -1 (nil))
(insn 16 12 17 2 (set (reg/i:TI 0 ax) (reg:TI 93 [ <retval> ])) "t.c":16 -1 (nil))
(insn 17 16 0 2 (use (reg/i:TI 0 ax)) "t.c":16 -1 (nil))

but it's at LRA time that the 'ax' TImode reg (a register pair!) gets exposed.
From

(insn 10 9 16 2 (set (reg:V2DI 95) (vec_concat:V2DI (reg:DI 97) (reg:DI 99))) "t.c":15 3744 {vec_concatv2di} (expr_list:REG_DEAD (reg:DI 99) (expr_list:REG_DEAD (reg:DI 97) (nil))))
(insn 16 10 17 2 (set (reg/i:TI 0 ax) (subreg:TI (reg:V2DI 95) 0)) "t.c":16 84 {*movti_internal} (expr_list:REG_DEAD (reg:V2DI 95) (nil)))

we go to (after first spilling the DImode components):

(insn 22 9 24 2 (set (reg:DI 21 xmm0 [95]) (mem/c:DI (plus:DI (reg/f:DI 7 sp) (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S8 A128])) "t.c":15 85 {*movdi_internal} (nil))
(insn 24 22 10 2 (set (mem/c:DI (plus:DI (reg/f:DI 7 sp) (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S8 A128]) (reg:DI 5 di [99])) "t.c":15 85 {*movdi_internal} (nil))
(insn 10 24 23 2 (set (reg:V2DI 21 xmm0 [95]) (vec_concat:V2DI (reg:DI 21 xmm0 [95]) (mem/c:DI (plus:DI (reg/f:DI 7 sp) (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S8 A128]))) "t.c":15 3744 {vec_concatv2di} (nil))
(insn 23 10 16 2 (set (mem/c:V2DI (plus:DI (reg/f:DI 7 sp) (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S16 A128]) (reg:V2DI 21 xmm0 [95])) "t.c":15 1255 {movv2di_internal} (nil))
(insn 16 23 17 2 (set (reg/i:TI 0 ax) (mem/c:TI (plus:DI (reg/f:DI 7 sp) (const_int -24 [0xffffffffffffffe8])) [3 %sfp+-16 S16 A128])) "t.c":16 84 {*movti_internal} (nil))

This is really hard to avoid in the vectorizer given that the decl we return isn't a RESULT_DECL but a regular VAR_DECL, so we have no idea it is literally returned.
Note the RTL when not vectorizing isn't too different:

(insn 10 9 11 2 (set (reg:DI 97) (sign_extend:DI (reg:SI 96))) "t.c":13 -1 (nil))
(insn 11 10 12 2 (set (subreg:DI (reg:TI 91 [ D.1958 ]) 8) (reg:DI 97)) "t.c":15 -1 (nil))
(insn 12 11 16 2 (set (reg:TI 92 [ <retval> ]) (reg:TI 91 [ D.1958 ])) "t.c":15 -1 (nil))
(insn 16 12 17 2 (set (reg/i:TI 0 ax) (reg:TI 92 [ <retval> ])) "t.c":16 -1 (nil))
(insn 17 16 0 2 (use (reg/i:TI 0 ax)) "t.c":16 -1 (nil))

Here it is the subreg1 pass that exposes the register pair and lowers the subreg:

(insn 10 9 11 2 (set (reg:DI 97) (sign_extend:DI (reg:SI 96))) "t.c":13 149 {*extendsidi2_rex64} (nil))
(insn 11 10 19 2 (set (reg:DI 100 [ D.1958+8 ]) (reg:DI 97)) "t.c":15 85 {*movdi_internal} (nil))
(insn 19 11 20 2 (set (reg:DI 101 [ <retval> ]) (reg:DI 99 [ D.1958 ])) "t.c":15 85 {*movdi_internal} (nil))
(insn 20 19 21 2 (set (reg:DI 102 [ <retval>+8 ]) (reg:DI 100 [ D.1958+8 ])) "t.c":15 85 {*movdi_internal} (nil))
(insn 21 20 22 2 (set (reg:DI 0 ax) (reg:DI 101 [ <retval> ])) "t.c":16 85 {*movdi_internal} (nil))
(insn 22 21 17 2 (set (reg:DI 1 dx [+8 ]) (reg:DI 102 [ <retval>+8 ])) "t.c":16 85 {*movdi_internal} (nil))
(insn 17 22 0 2 (use (reg/i:TI 0 ax)) "t.c":16 -1 (nil))

I imagine it could be made to recognize the (subreg (vec_concat ..)) case as well... but would that be a hack?
Created attachment 43287 [details]
prototype patch

So the following would be a hack to avoid vectorizing this. Hack because it would be confused by intermediate aggregate copies before returning (ones that will eventually be elided). Hack because for CONCATs we _might_ be able to undo this at expansion, avoiding the spill dance. Likewise for PARALLELs. Likewise, properly sized integer regs could be handled with a reg-reg move or be aliased to vector regs. Thus some RTL foo is needed here to sanitize the hard_function_value usage. Anybody?
*** Bug 87062 has been marked as a duplicate of this bug. ***
Also happens with pairs of floats: https://godbolt.org/z/QrP0VD
I actually don't think the right way to fix this is to not SLP vectorize it, but rather to be able to undo it during RTL optimizations. The user could have written it that way already, e.g. as:

  #include <string.h>

  struct S { long long a, b; };
  typedef long long v2di __attribute__((vector_size (16)));

  struct S foo (int x)
  {
    struct S p;
    v2di q = { x << 1, x >> 1 };
    memcpy (&p, &q, sizeof (p));
    return p;
  }

or similar. We have:

(insn 10 9 11 2 (set (reg:V2DI 92) (vec_concat:V2DI (reg:DI 94) (reg:DI 96))) "pr84101-3.c":8:8 4110 {vec_concatv2di} (nil))
(insn 11 10 12 2 (set (reg/v:TI 89 [ p ]) (subreg:TI (reg:V2DI 92) 0)) "pr84101-3.c":9:3 65 {*movti_internal} (nil))
(insn 12 11 13 2 (set (reg:TI 88 [ D.1915 ]) (reg/v:TI 89 [ p ])) "pr84101-3.c":10:10 65 {*movti_internal} (nil))
(insn 13 12 14 2 (clobber (reg/v:TI 89 [ p ])) -1 (nil))
(insn 14 13 18 2 (set (reg:TI 90 [ <retval> ]) (reg:TI 88 [ D.1915 ])) "pr84101-3.c":10:10 65 {*movti_internal} (nil))
(insn 18 14 19 2 (set (reg/i:TI 0 ax) (reg:TI 90 [ <retval> ])) "pr84101-3.c":11:1 65 {*movti_internal} (nil))
(insn 19 18 0 2 (use (reg/i:TI 0 ax)) "pr84101-3.c":11:1 -1 (nil))

and because there aren't any half-of-TImode subregs, the subreg1 pass does nothing. The combiner already sees

(insn 10 9 18 2 (set (reg:V2DI 92) (vec_concat:V2DI (reg:DI 94) (reg:DI 96))) "pr84101-3.c":8:8 4110 {vec_concatv2di} (expr_list:REG_DEAD (reg:DI 96) (expr_list:REG_DEAD (reg:DI 94) (nil))))
(insn 18 10 19 2 (set (reg/i:TI 0 ax) (subreg:TI (reg:V2DI 92) 0)) "pr84101-3.c":11:1 65 {*movti_internal} (expr_list:REG_DEAD (reg:V2DI 92) (nil)))

but (because of the hard register destination?) decides not to combine anything into insn 18. The RA isn't able to cope with this because V2DImode is not HARD_REGNO_MODE_OK in GPRs (but TImode is).

So, where do you think we should deal with it?
I think subreg should deal with this. What do you mean "there aren't any half of TImode subregs"? insn 10 has it split at the start, and insn 18 at the end wants it split, too.
#c4 has another thing:

  struct S { float a, b; };

  struct S foo (float x)
  {
    return (struct S) { x, x };
  }

where we generate bad code mainly because we have DImode as the DECL_MODE of the RESULT_DECL, even when it is a structure containing two floats. Not sure what we'd get e.g. if we treated it like V2SFmode instead, but that is problematic because it is an MMX-ish mode, or if it would be possible to combine it into a vec_concatv2sf somehow. Though that should really be tracked in a different PR.
(In reply to Segher Boessenkool from comment #6)
> I think subreg should deal with this.  What do you mean "there aren't
> any half of TImode subregs"?  insn 10 has it split at the start, and
> insn 18 at the end wants it split, too.

By that I mean that there aren't any (subreg:DI (reg:TI) [08]) destinations or similar. Admittedly, lower-subreg also handles shifts and a few others, but not vec_concat etc. Plus:

  V2DI move: original cost 4, split cost 4 * 2

so it decides not to do anything about V2DImode (generally the right thing).

So, we'd need to pattern-recognize a vec_concat from scalar modes (integral only) whose result is used (single use) in a subreg to the twice-as-big integral mode, and handle it as if it was setting the two halves of the wider reg.
But that is exactly the kind of thing lower-subreg is supposed to do?
lower-subreg.c doesn't consider this for multiple reasons:

1) it doesn't have VEC_CONCAT handling, but that could be easily added

2) V2DImode isn't considered, because its move cost is the same as the scalar move cost

3) in

(insn 10 9 11 2 (set (reg:V2DI 90) (vec_concat:V2DI (reg:DI 92) (reg:DI 94))) "pr84101.c":9:10 4128 {vec_concatv2di} (nil))
(insn 11 10 12 2 (set (reg:TI 87 [ D.1913 ]) (subreg:TI (reg:V2DI 90) 0)) "pr84101.c":9:10 65 {*movti_internal} (nil))
(insn 12 11 16 2 (set (reg:TI 88 [ <retval> ]) (reg:TI 87 [ D.1913 ])) "pr84101.c":9:10 65 {*movti_internal} (nil))
(insn 16 12 17 2 (set (reg/i:TI 0 ax) (reg:TI 88 [ <retval> ])) "pr84101.c":10:1 65 {*movti_internal} (nil))

there aren't any reasons to make the pseudos 87 or 88 decomposable; the result is only used as TImode, not in DImode subregs thereof. So right now pseudo 90 ends up in non_decomposable_context (something could be done about that), but as nothing ends up in decomposable_context, nothing is done anyway.

Now, I guess the reason why we should split this V2DI apart somewhere is mainly the high cost of moving the (2!) integer registers from GPRs to SSE registers and moving the result back; maybe lower-subreg.c would need to treat it differently based on the costs of VEC_CONCAT with integral operands (though x86 rtx_cost claims it is very cheap). Unfortunately, HARD_REGNO_MODE_OK doesn't allow V2DImode to live in a pair of GPRs, so the RA can't solve this, say, through having the vec_concat V2DI pattern have a =r,r,r alternative.
*** Bug 89739 has been marked as a duplicate of this bug. ***
Other testcase:

  using u64 = unsigned long long;
  struct u128 { u64 a, b; };

  inline u64 load8(void* ptr)
  {
    u64 out;
    __builtin_memcpy(&out, ptr, 8);
    return out;
  }

  u128 load(char* basep, u64 n)
  {
    return {load8(basep), load8(basep + n - 8)};
  }
Not really working on this.
Just looking at what we feed combine:

(insn 9 8 15 2 (set (reg:V2DI 89) (vec_concat:V2DI (reg:DI 90 [ num ]) (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di} (expr_list:REG_DEAD (reg:DI 92) (expr_list:REG_DEAD (reg:DI 90 [ num ]) (nil))))
(insn 15 9 16 2 (set (reg/i:TI 0 ax) (subreg:TI (reg:V2DI 89) 0)) "t.c":10:1 65 {*movti_internal} (expr_list:REG_DEAD (reg:V2DI 89) (nil)))

I wonder why we can't "simplify" this into individual sets of the hardreg pair? fwprop sees the same thing, so that's another possible fixing point. Not sure if the backend in the end would like to see the above TImode set decomposed though...
(In reply to Richard Biener from comment #14)
> Just looking at what we feed combine:
>
> (insn 9 8 15 2 (set (reg:V2DI 89)
>         (vec_concat:V2DI (reg:DI 90 [ num ])
>             (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di}
>      (expr_list:REG_DEAD (reg:DI 92)
>         (expr_list:REG_DEAD (reg:DI 90 [ num ])
>             (nil))))
> (insn 15 9 16 2 (set (reg/i:TI 0 ax)
>         (subreg:TI (reg:V2DI 89) 0)) "t.c":10:1 65 {*movti_internal}
>      (expr_list:REG_DEAD (reg:V2DI 89)
>         (nil)))
>
> I wonder why we can't "simplify" this into individual sets of the
> hardreg pair?  fwprop sees the same thing so that's another possible
> fixing point.  Not sure if the backend in the end would like to
> see the above TImode set decomposed though...

We surely want to decompose it in these testcases. The big question is to find out in which pass to do that (which one has reasonable infrastructure), what costs and what not to check, etc. The testcases show something that is clearly undesirable under any cost model: vec_concat-ing scalar regs into a vector only to subreg it into a scalar hard reg... But now, if it wasn't into a GPR, but just TImode in some pseudo that it would be beneficial to reload into a vector reg and then operate on in a vector reg, it wouldn't be a win. On the other side, if we don't get rid of those vector modes before reload, the RA will choose vector registers for those.
(In reply to Jakub Jelinek from comment #15)
> (In reply to Richard Biener from comment #14)
> > Just looking at what we feed combine:
> >
> > (insn 9 8 15 2 (set (reg:V2DI 89)
> >         (vec_concat:V2DI (reg:DI 90 [ num ])
> >             (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di}
> >      (expr_list:REG_DEAD (reg:DI 92)
> >         (expr_list:REG_DEAD (reg:DI 90 [ num ])
> >             (nil))))
> > (insn 15 9 16 2 (set (reg/i:TI 0 ax)
> >         (subreg:TI (reg:V2DI 89) 0)) "t.c":10:1 65 {*movti_internal}
> >      (expr_list:REG_DEAD (reg:V2DI 89)
> >         (nil)))
> >
> > I wonder why we can't "simplify" this into individual sets of the
> > hardreg pair?  fwprop sees the same thing so that's another possible
> > fixing point.  Not sure if the backend in the end would like to
> > see the above TImode set decomposed though...
>
> We surely want to decompose it in these testcases.  The big question is find
> out in which pass to do that (which has a reasonable infrastructure), what
> cost and what not to check etc.  The testcase show something that is clearly
> undesirable without any costs, vec_concating scalar regs into a vector only
> to subreg it into a scalar hard reg...  But now, if it wasn't into a GPR
> reg, but just TImode in some pseudo that it would be beneficial to reload
> into a vector reg and then operate in vector reg, it wouldn't be a win.  On
> the other side, if we don't get rid of those vector modes before reload, RA
> will choose vector registers for those.

I wonder if we should turn

  (subreg:TI (vec_concat:... ))

into

  (set (subreg:DI (reg:TI ... 0)))
  (set (subreg:DI (reg:TI ... 8)))

which is what we handle nicely, it seems.
That means something has to split the subreg out into a separate instruction again, or we need to make fwprop1 not convert

(insn 9 8 10 2 (set (reg:V2DI 89) (vec_concat:V2DI (reg:DI 90) (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di} (nil))
(insn 10 9 11 2 (set (reg:TI 86 [ D.1921 ]) (subreg:TI (reg:V2DI 89) 0)) "t.c":9:12 65 {*movti_internal} (nil))
(insn 11 10 15 2 (set (reg:TI 87 [ <retval> ]) (reg:TI 86 [ D.1921 ])) "t.c":9:12 65 {*movti_internal} (nil))
(insn 15 11 16 2 (set (reg/i:TI 0 ax) (reg:TI 87 [ <retval> ])) "t.c":10:1 65 {*movti_internal} (nil))

into

(insn 9 8 15 2 (set (reg:V2DI 89) (vec_concat:V2DI (reg:DI 90 [ num ]) (reg:DI 92))) "t.c":9:12 4182 {vec_concatv2di} (expr_list:REG_DEAD (reg:DI 92) (expr_list:REG_DEAD (reg:DI 90 [ num ]) (nil))))
(insn 15 9 16 2 (set (reg/i:TI 0 ax) (subreg:TI (reg:V2DI 89) 0)) "t.c":10:1 65 {*movti_internal} (expr_list:REG_DEAD (reg:V2DI 89) (nil)))

but instead massage it into the form suggested above.
Oh, it's CSE forwarding the subreg already.
The following "fixes"

  struct ciao { long a; long b; };

  struct ciao square(int num)
  {
    struct ciao beta;
    beta.a = num;
    beta.b = num*num;
    return beta;
  }

producing wrong code though, somehow forgetting the upper half. The idea was to give the constructor expansion an idea of the target (TImode reg) so it can optimize for that. Not sure what goes wrong; I seem to get all things twice but still somewhat correct...

Index: gcc/expr.c
===================================================================
--- gcc/expr.c	(revision 269960)
+++ gcc/expr.c	(working copy)
@@ -7018,7 +7018,9 @@ store_field (rtx target, poly_int64 bits
 	}
     }
 
-  temp = expand_normal (exp);
+  temp = expand_expr (exp, target, VOIDmode, EXPAND_NORMAL);//normal (exp);
+  if (temp == target)
+    return const0_rtx;
 
   /* We don't support variable-sized BLKmode bitfields, since our
      handling of BLKmode is bound up with the ability to break
@@ -7641,6 +7643,8 @@ safe_from_p (const_rtx x, tree exp, int
 	      return 0;
 	  return 1;
 	}
+      else if (TREE_CODE (exp) == SSA_NAME)
+	return 1;
       else if (TREE_CODE (exp) == ERROR_MARK)
 	return 1;	/* An already-visited SAVE_EXPR? */
       else
*** Bug 89849 has been marked as a duplicate of this bug. ***
Ah.

Index: gcc/expr.c
===================================================================
--- gcc/expr.c	(revision 269960)
+++ gcc/expr.c	(working copy)
@@ -7018,7 +7018,11 @@ store_field (rtx target, poly_int64 bits
 	}
     }
 
-  temp = expand_normal (exp);
+  temp = expand_expr (exp,
+		      known_eq (bitpos, 0) ? target : NULL_RTX,
+		      VOIDmode, EXPAND_NORMAL);
+  if (temp == target)
+    return const0_rtx;
 
   /* We don't support variable-sized BLKmode bitfields, since our
      handling of BLKmode is bound up with the ability to break
@@ -7641,6 +7645,8 @@ safe_from_p (const_rtx x, tree exp, int
 	      return 0;
 	  return 1;
 	}
+      else if (TREE_CODE (exp) == SSA_NAME)
+	return 1;
       else if (TREE_CODE (exp) == ERROR_MARK)
 	return 1;	/* An already-visited SAVE_EXPR? */
       else
Needs more defensiveness. Also, the safe_from_p change might not be safe in case we ever TER something like

  _1 = BIT_FIELD_REF <vector, ...>;
  _2 = BIT_FIELD_REF <vector, ...>;
  vector = { _2, _1 };

which we do...

  typedef double v2df __attribute__((vector_size(16)));
  v2df v;

  void foo()
  {
    v = (v2df){v[1], v[0]};
  }

expands to

(insn 7 6 8 (set (reg:DF 87) (mem/j/c:DF (plus:DI (reg/f:DI 86) (const_int 8 [0x8])) [1 v+8 S8 A64])) "t4.c":5:5 -1 (nil))
(insn 8 7 9 (set (mem/c:DF (reg/f:DI 85) [1 v+0 S8 A128]) (reg:DF 87)) "t4.c":5:5 -1 (nil))
...
(insn 11 10 12 (set (reg:DF 90) (mem/j/c:DF (reg/f:DI 89) [1 v+0 S8 A128])) "t4.c":5:5 -1 (nil))
(insn 12 11 0 (set (mem/c:DF (reg/f:DI 88) [1 v+8 S8 A64]) (reg:DF 90)) "t4.c":5:5 -1 (nil))

after that change :/ So much for this nice trick. The safe_from_p change would be safe for SSA names we do not TER, but for these testcases we do TER. There's no existing simple test in the C testsuite, so I'll add the above one.
But we can side-step this issue by changing the way expand_constructor works:

Index: gcc/expr.c
===================================================================
--- gcc/expr.c	(revision 269963)
+++ gcc/expr.c	(working copy)
@@ -7018,7 +7018,16 @@ store_field (rtx target, poly_int64 bits
 	}
     }
 
-  temp = expand_normal (exp);
+  if (REG_P (target)
+      && known_eq (bitpos, 0)
+      && known_eq (bitsize, GET_MODE_BITSIZE (GET_MODE (target))))
+    {
+      temp = expand_expr (exp, target, VOIDmode, EXPAND_NORMAL);
+      if (temp == target)
+	return const0_rtx;
+    }
+  else
+    temp = expand_normal (exp);
 
   /* We don't support variable-sized BLKmode bitfields, since our
      handling of BLKmode is bound up with the ability to break
@@ -8191,7 +8202,15 @@ expand_constructor (tree exp, rtx target
       if (avoid_temp_mem)
 	return NULL_RTX;
 
-      target = assign_temp (type, TREE_ADDRESSABLE (exp), 1);
+      if (target && REG_P (target))
+	{
+	  rtx tem = gen_reg_rtx (GET_MODE (target));
+	  store_constructor (exp, tem, 0, int_expr_size (exp), false);
+	  emit_move_insn (target, tem);
+	  return target;
+	}
+      else
+	target = assign_temp (type, TREE_ADDRESSABLE (exp), 1);
     }
 
   store_constructor (exp, target, 0, int_expr_size (exp), false);
Fallout is:

FAIL: gcc.dg/pr85195.c (internal compiler error)

where we handle

  V1TI = {_2} with _2 = (__int128) int_1;

this way and end up calling convert_move from SImode to V1TImode (instead of TImode). Looks like a pre-existing bug to me; not quickly sure where to fix it.

FAIL: c-c++-common/torture/builtin-convertvector-1.c -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (internal compiler error)

Similar issue: we end up in expand_fix to TImode from V4SFmode. The generic tree code expanders in expand_expr_real_2 simply pass down target (here to expand_fix). Not sure what constraints we have on target for expand_expr; the docs just say "The value may be stored in TARGET if TARGET is nonzero. TARGET is just a suggestion; callers must assume that the rtx returned may not be the same as TARGET." and further down "Note that TARGET may have neither TMODE nor MODE. In that case, it probably will not be used." So it looks to me like these high-level expanders need to be more careful in what they pass to functions like expand_fix or convert_move.

FAIL: gfortran.dg/allocatable_function_8.f90 -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (internal compiler error)
FAIL: gfortran.fortran-torture/execute/entry_4.f90, -O1 (internal compiler error)

Same issue for expand_float.

That's all the issues in the testsuite, but with those it may look a little too risky for GCC 9. At least it would solve the DImode pair issue, which might happen quite often in practice :/

The following guards the above three cases:

@@ -8543,7 +8560,9 @@ expand_expr_real_2 (sepops ops, rtx targ
 	    op0 = gen_rtx_fmt_e (TYPE_UNSIGNED (TREE_TYPE (treeop0)) ?
 				 ZERO_EXTEND : SIGN_EXTEND, mode, op0);
-	  else if (target == 0)
+	  else if (target == 0
+		   || (VECTOR_MODE_P (GET_MODE (target))
+		       != VECTOR_MODE_P (GET_MODE (op0))))
 	    op0 = convert_to_mode (mode, op0,
 				   TYPE_UNSIGNED (TREE_TYPE (treeop0)));
@@ -9019,14 +9038,20 @@ expand_expr_real_2 (sepops ops, rtx targ
     case FIX_TRUNC_EXPR:
       op0 = expand_normal (treeop0);
-      if (target == 0 || modifier == EXPAND_STACK_PARM)
+      if (target == 0 || modifier == EXPAND_STACK_PARM
+	  || (target
+	      && (VECTOR_MODE_P (GET_MODE (target))
+		  != VECTOR_MODE_P (GET_MODE (op0)))))
 	target = gen_reg_rtx (mode);
       expand_fix (target, op0, unsignedp);
       return target;
 
     case FLOAT_EXPR:
       op0 = expand_normal (treeop0);
-      if (target == 0 || modifier == EXPAND_STACK_PARM)
+      if (target == 0 || modifier == EXPAND_STACK_PARM
+	  || (target
+	      && (VECTOR_MODE_P (GET_MODE (target))
+		  != VECTOR_MODE_P (GET_MODE (op0)))))
 	target = gen_reg_rtx (mode);
       /* expand_float can't figure out what to do if FROM has VOIDmode.  So
 	  give it the correct mode.  With -O, cse will optimize this.  */

but then the testcases run into other similar issues all through RTL expansion :/
Created attachment 46044 [details]
updated vectorizer patch

I've updated the vectorizer patch that was posted to the mailing list last year. I changed it to also apply to the non-SLP case, and instead of looking for a non-vector-mode return RTX (which would pessimize word_mode vectorization) to look at hard_regno_nregs and only pessimize multi-reg return locations. I've removed the CONCAT and PARALLEL handling, with PARALLEL left as a todo (not sure what to do there; I'd need a target/testcase combo to look at. I suppose if any of the return locations is "compatible" with the vector result we want to not pessimize).
Author: rguenth
Date: Wed Apr  3 12:30:16 2019
New Revision: 270123

URL: https://gcc.gnu.org/viewcvs?rev=270123&root=gcc&view=rev

Log:
2019-04-03  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/84101
	* tree-vect-stmts.c: Include explow.h for hard_function_value,
	regs.h for hard_regno_nregs.
	(cfun_returns): New helper.
	(vect_model_store_cost): When vectorizing a store to a decl
	we return and the function ABI returns in a multi-reg location
	account for the possible spilling that will happen.

	* gcc.target/i386/pr84101.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr84101.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-stmts.c
Mitigation installed on trunk (so far). Note it's likely not a full solution.
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
The GCC 8 branch is being closed, mitigated for GCC 9.1.