103941 – uavgv2qi3_ceil is not used (SLP costing and patterns vs live stmts)

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	12.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Richard Biener

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	vectorizer 104240

Reported:	2022-01-07 15:31 UTC by Uroš Bizjak
Modified:	2022-04-19 14:44 UTC
CC List:	0 users

See Also:
Host:
Target:	x86_64-*-* i?86-*-*
Build:
Known to work:	12.0
Known to fail:
Last reconfirmed:	2022-01-10 00:00:00

Description Uroš Bizjak 2022-01-07 15:31:03 UTC

The following testcase:

unsigned char ur[16], ua[16], ub[16];

void avgu_v2qi (void)
{
  int i;

  for (i = 0; i < 2; i++)
    ur[i] = (ua[i] + ub[i] + 1) >> 1;
}

does not vectorize on x86_64-linux-gnu with -O2 -ftree-vectorize.
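
The loop body is the rounding-up unsigned average idiom, which is what the uavgv2qi3_ceil pattern named in the summary is meant to cover. As a minimal sketch (the function names below are hypothetical and not part of GCC or the testcase), the following checks that the widened form used in the testcase agrees with an 8-bit-only formulation for every pair of inputs:

```c
#include <assert.h>
#include <stdint.h>

/* Rounding-up average as written in the testcase: the operands are
   promoted to int, so a + b + 1 cannot overflow.  */
static uint8_t avg_ceil_wide (uint8_t a, uint8_t b)
{
  return (uint8_t) ((a + b + 1) >> 1);
}

/* The same value computed without needing more than 8 bits, using
   a + b == 2 * (a | b) - (a ^ b).  */
static uint8_t avg_ceil_narrow (uint8_t a, uint8_t b)
{
  return (uint8_t) ((a | b) - ((a ^ b) >> 1));
}

/* Exhaustively compare both forms; returns the number of mismatches.  */
static int count_mismatches (void)
{
  int bad = 0;
  for (int a = 0; a < 256; a++)
    for (int b = 0; b < 256; b++)
      if (avg_ceil_wide ((uint8_t) a, (uint8_t) b)
          != avg_ceil_narrow ((uint8_t) a, (uint8_t) b))
        bad++;
  return bad;
}

/* Illustrative self-check, meant to be called from a test driver.  */
static void self_check (void)
{
  assert (count_mismatches () == 0);
}
```

Because the result always fits in 8 bits, the operation can be carried out on narrow vector elements, which is why a missed vectorization here is a missed optimization rather than a correctness issue.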

Comment 1 Richard Biener 2022-01-10 10:53:41 UTC

t.c:8:11: note: Costing subgraph:
t.c:8:11: note: node 0x409a000 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: ur[0] = _23;
t.c:8:11: note:         stmt 0 ur[0] = _23;
t.c:8:11: note:         stmt 1 ur[1] = _35;
t.c:8:11: note:         children 0x409a088
t.c:8:11: note: node 0x409a088 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: patt_58 = (unsigned char) patt_56;
t.c:8:11: note:         stmt 0 patt_58 = (unsigned char) patt_56;
t.c:8:11: note:         stmt 1 patt_71 = (unsigned char) patt_69;
t.c:8:11: note:         children 0x409a110
t.c:8:11: note: node 0x409a110 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: patt_56 = .AVG_CEIL (_16, _18);
t.c:8:11: note:         stmt 0 patt_56 = .AVG_CEIL (_16, _18);
t.c:8:11: note:         stmt 1 patt_69 = .AVG_CEIL (_28, _30);
t.c:8:11: note:         children 0x409a220 0x409a198
t.c:8:11: note: node 0x409a220 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: _16 = ua[0];
t.c:8:11: note:         stmt 0 _16 = ua[0];
t.c:8:11: note:         stmt 1 _28 = ua[1];
t.c:8:11: note: node 0x409a198 (max_nunits=2, refcnt=1)
t.c:8:11: note: op template: _18 = ub[0];
t.c:8:11: note:         stmt 0 _18 = ub[0];
t.c:8:11: note:         stmt 1 _30 = ub[1];
t.c:8:11: note: Cost model analysis:
_23 1 times scalar_store costs 12 in body
_35 1 times scalar_store costs 12 in body
(unsigned char) _22 1 times scalar_stmt costs 4 in body
(unsigned char) _34 1 times scalar_stmt costs 4 in body
ua[0] 1 times vector_load costs 12 in body
ub[0] 1 times vector_load costs 12 in body
.AVG_CEIL (_16, _18) 1 times vector_stmt costs 4 in body
_23 1 times vector_store costs 12 in body
ua[0] 1 times vec_to_scalar costs 4 in epilogue
ua[1] 1 times vec_to_scalar costs 4 in epilogue
ub[0] 1 times vec_to_scalar costs 4 in epilogue
ub[1] 1 times vec_to_scalar costs 4 in epilogue
t.c:8:11: note: Cost model analysis for part in loop 0:
  Vector cost: 56
  Scalar cost: 32
t.c:8:11: missed: not vectorized: vectorization is not profitable.

It looks like the scalar costing is off and the scalar loads from
ua and ub are considered live, possibly as an artifact of patterns.

It's vectorized fine with -fno-vect-cost-model.
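
To make the arithmetic in the dump explicit, here is a small sketch that just sums the per-entry costs printed above (the array and function names are hypothetical; the numbers are copied from the dump). It reproduces the totals 32 and 56, and also shows what the vector side would add up to without the four vec_to_scalar entries:

```c
#include <assert.h>
#include <stddef.h>

/* Per-entry costs copied from the dump above.  */
static const int scalar_entries[] = { 12, 12, 4, 4 };        /* 2 stores, 2 conversions */
static const int vector_body_entries[] = { 12, 12, 4, 12 };  /* 2 loads, .AVG_CEIL, store */
static const int vector_epilogue_entries[] = { 4, 4, 4, 4 }; /* 4 vec_to_scalar */

#define COUNT(a) (sizeof (a) / sizeof ((a)[0]))

static int sum (const int *v, size_t n)
{
  int total = 0;
  for (size_t i = 0; i < n; i++)
    total += v[i];
  return total;
}

static int scalar_cost (void)
{
  return sum (scalar_entries, COUNT (scalar_entries));
}

static int vector_cost (void)
{
  return sum (vector_body_entries, COUNT (vector_body_entries))
         + sum (vector_epilogue_entries, COUNT (vector_epilogue_entries));
}

/* Vector cost if no lane had to be extracted back to a scalar.  */
static int vector_cost_without_extracts (void)
{
  return sum (vector_body_entries, COUNT (vector_body_entries));
}

/* Illustrative self-check against the totals printed in the dump.  */
static void check_against_dump (void)
{
  assert (scalar_cost () == 32);
  assert (vector_cost () == 56);
}
```

Dropping the extracts alone gives 40, which is still above 32. The scalar side of the dump lists only the two stores and the two conversions, so the scalar loads and arithmetic are not counted there either; both effects are consistent with lanes being treated as live.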

I will have a look, though possibly not for GCC 12.

Comment 2 Richard Biener 2022-01-10 12:49:45 UTC

I think I've seen this before - the use in the conversion is elided in the vector path via recognizing a pattern of a pattern - that makes it not part of the SLP
tree and thus left as SLP_TYPE (..) = loop_vect, fooling the live computation.

vect_detect_hybrid_slp now does this in a more correct way but the original
worklist seeding has to be done differently for BB SLP.

Comment 3 GCC Commits 2022-01-12 19:57:49 UTC

The master branch has been updated by Uros Bizjak <uros@gcc.gnu.org>:

https://gcc.gnu.org/g:cb46559cea1d554cef1138db5bfbdd0647ffbc0d

commit r12-6535-gcb46559cea1d554cef1138db5bfbdd0647ffbc0d
Author: Uros Bizjak <ubizjak@gmail.com>
Date:   Wed Jan 12 20:57:12 2022 +0100

    testsuite: Compile gcc.target/i386/pr103861-3.c with -fno-vect-cost-model [PR103941]
    
    2022-01-12  Uroš Bizjak  <ubizjak@gmail.com>
    
    gcc/testsuite/ChangeLog:
    
            PR target/103941
            * gcc.target/i386/pr103861-3.c (dg-options): Add -fno-vect-cost-model.

Comment 4 Richard Biener 2022-01-26 12:42:42 UTC

Another testcase where this occurs:

void foo (int *c, float *x, float *y)
{
  c[0] = x[0] < y[0];
  c[1] = x[1] < y[1];
  c[2] = x[2] < y[2];
  c[3] = x[3] < y[3];
}
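
For comparing a scalar build against a vectorized one, a small reference check for this function may be useful. This is only a sketch: check_foo and its input values are hypothetical and not part of the GCC testsuite.

```c
#include <assert.h>

/* Testcase from the comment above, unchanged.  */
void foo (int *c, float *x, float *y)
{
  c[0] = x[0] < y[0];
  c[1] = x[1] < y[1];
  c[2] = x[2] < y[2];
  c[3] = x[3] < y[3];
}

/* Returns 1 if foo produces the expected 0/1 results for a fixed input
   covering the less-than, greater-than and equal cases.  */
static int check_foo (void)
{
  float x[4] = { 1.0f, 5.0f, 3.0f, 0.0f };
  float y[4] = { 2.0f, 4.0f, 3.0f, 1.0f };
  int c[4] = { -1, -1, -1, -1 };

  foo (c, x, y);
  return c[0] == 1 && c[1] == 0 && c[2] == 0 && c[3] == 1;
}

/* Illustrative self-check, meant to be called from a test driver.  */
static void self_check_foo (void)
{
  assert (check_foo () == 1);
}
```

The result should be the same whether or not the four comparisons are vectorized; only the generated code differs.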

Comment 5 GCC Commits 2022-04-19 14:42:48 UTC

The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:353434b65ef7972172597d232ae17022d9a57244

commit r12-8195-g353434b65ef7972172597d232ae17022d9a57244
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Apr 13 13:49:45 2022 +0200

    tree-optimization/104010 - fix SLP scalar costing with patterns
    
    When doing BB vectorization the scalar cost compute is derailed
    by patterns, causing lanes to be considered live and thus not
    costed on the scalar side.  For the testcase in PR104010 this
    prevents vectorization which was done by GCC 11.  PR103941
    shows similar cases of missed optimizations that are fixed by
    this patch.
    
    2022-04-13  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/104010
            PR tree-optimization/103941
            * tree-vect-slp.cc (vect_bb_slp_scalar_cost): When
            we run into stmts in patterns continue walking those
            for uses outside of the vectorized region instead of
            marking the lane live.
    
            * gcc.target/i386/pr103941-1.c: New testcase.
            * gcc.target/i386/pr103941-2.c: Likewise.

Comment 6 Richard Biener 2022-04-19 14:44:32 UTC

Fixed on trunk via the PR104010 regression fix.