Bug 92834 - misssed SLP vectorization in LightPixel
Summary: misssed SLP vectorization in LightPixel
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 10.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2019-12-05 20:18 UTC by Jan Hubicka
Modified: 2023-09-21 12:26 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-01-30 00:00:00


Attachments
simplified testcase (1.78 KB, text/x-csrc)
2019-12-05 20:18 UTC, Jan Hubicka
Details
Clang assembly from perf (1.17 KB, text/plain)
2019-12-06 12:31 UTC, Jan Hubicka
Details
gcc10-pr92834.patch (1.21 KB, patch)
2019-12-06 14:38 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Jan Hubicka 2019-12-05 20:18:13 UTC
Created attachment 47431 [details]
simplified testcase

Clang is able to vectorize LightPixel which leads to about 10% improvements in rasterflood-svg Firefox benchmark.
Comment 1 Richard Biener 2019-12-06 11:24:33 UTC
t.C:260:16: missed:   Build SLP failed: incompatible vector types for: patt_288 = (unsigned char) patt_290;
t.C:260:16: note:       old vector type: vector(8) unsigned char
t.C:260:16: note:       new vector type: vector(4) unsigned char

we're also re-trying for all vector sizes even though the above issue
will prevail.

But one of the main issues is likely that we don't pattern match

static inline unsigned umax(unsigned a, unsigned b) {
  return a - ((a - b) & -(a < b));
}

static inline unsigned umin(unsigned a, unsigned b) {
  return a - ((a - b) & -(a > b));
}

I wonder if they desperately searched for sth not handled by GCC ;)

Btw, my copy of clang (OK, 7.0...) doesn't vectorize the color.components[]
compute but only the leading three-component compute involving reverse
sqrt, not vectorized by us because we don't handle BB vectorization of
reductions.  There's duplicate PRs about this.

Which part is important to vectorize?

Also - what flags did you use?  (tried -Ofast -mavx2)
Comment 2 Jan Hubicka 2019-12-06 12:31:40 UTC
Created attachment 47436 [details]
Clang assembly from perf

It is clang9 build https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d7ee02817ab1ea39a6415862ab7889f5e416598&selectedJob=278948829
it has full logs and binary, too

/builds/worker/fetches/sccache/sccache /builds/worker/fetches/clang/bin/clang++ -o Unified_cpp_gfx_2d2.o -c -flto=thin -I/builds/worker/workspace/build/src/obj-firefox/dist/system_wrappers -include /builds/worker/workspace/build/src/config/gcc_hidden.h -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -DMOZILLA_CLIENT -include /builds/worker/workspace/build/src/obj-firefox/mozilla-config.h -Qunused-arguments -Qunused-arguments -Wall -Wbitfield-enum-conversion -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wshadow-field-in-constructor-modified -Wsign-compare -Wtype-limits -Wunreachable-code -Wunreachable-code-return -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wloop-analysis -Wc++1z-compat -Wc++2a-compat -Wcomma -Wimplicit-fallthrough -Werror=non-literal-null-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-enum-zero-compare -Wtautological-unsigned-zero-compare -Wno-error=tautological-type-limit-compare -Wno-inline-new-delete -Wno-error=type-limits -Wno-error=pessimizing-move -Wno-error=nonnull -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=backend-plugin -Wno-error=return-std-move -Wno-error=atomic-alignment -Wformat -Wformat-security -Wno-gnu-zero-variadic-macro-arguments -Wno-unknown-warning-option -Wno-return-type-c-linkage -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation -fno-aligned-new -fcrash-diagnostics-dir=/builds/worker/artifacts -fno-strict-aliasing -fno-strict-aliasing -fno-exceptions -fno-rtti -fno-exceptions -fno-math-errno -pthread -pipe -I/builds/worker/workspace/build/src/obj-firefox/dist/stl_wrappers -DNDEBUG=1 -DTRIMMED=1 -DUSE_SSE2 -DOS_POSIX=1 -DOS_LINUX=1 -DUSE_CAIRO -DMOZ2D_HAS_MOZ_CAIRO -DMOZ_ENABLE_FREETYPE -DSTATIC_EXPORTABLE_JS_API -DMOZ_HAS_MOZGLUE -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL -I/builds/worker/workspace/build/src/gfx/2d -I/builds/worker/workspace/build/src/obj-firefox/gfx/2d -I/builds/worker/workspace/build/src/obj-firefox/ipc/ipdl/_ipdlheaders -I/builds/worker/workspace/build/src/ipc/chromium/src -I/builds/worker/workspace/build/src/ipc/glue -I/builds/worker/workspace/build/src/gfx/skia -I/builds/worker/workspace/build/src/gfx/skia/skia -I/builds/worker/workspace/build/src/obj-firefox/dist/include -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nspr -I/builds/worker/workspace/build/src/obj-firefox/dist/include/nss -fPIC -g -Xclang -load -Xclang /builds/worker/workspace/build/src/obj-firefox/build/clang-plugin/libclang-plugin.so -Xclang -add-plugin -Xclang moz-check -O2 -fno-omit-frame-pointer -funwind-tables -Werror -Wno-error=shadow -I/builds/worker/workspace/build/src/obj-firefox/dist/include/cairo -I/usr/include/freetype2  -MD -MP -MF .deps/Unified_cpp_gfx_2d2.o.pp   Unified_cpp_gfx_2d2.cpp
Comment 3 Jakub Jelinek 2019-12-06 14:38:04 UTC
Created attachment 47437 [details]
gcc10-pr92834.patch

This untested patch just undoes the fancy way of writing blend operation and turns it into ?: that we are able to optimize better.
Comment 4 Jakub Jelinek 2019-12-09 10:13:50 UTC
Author: jakub
Date: Mon Dec  9 10:13:18 2019
New Revision: 279113

URL: https://gcc.gnu.org/viewcvs?rev=279113&root=gcc&view=rev
Log:
	PR tree-optimization/92834
	* match.pd (A - ((A - B) & -(C cmp D)) -> (C cmp D) ? B : A,
	A + ((B - A) & -(C cmp D)) -> (C cmp D) ? B : A): New simplifications.

	* gcc.dg/tree-ssa/pr92834.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr92834.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog
Comment 5 Andrew Pinski 2021-08-25 08:00:22 UTC
I think this is now the standard GCC SLP does not vectorize starting at loads but rather starts at stores issue.
Comment 6 Richard Biener 2021-08-25 09:30:58 UTC
So there is a store and we vectorize it:

  _188 = {_10, _14, _18, _22};
  MEM[(union ._anon_7 *)&color] = _188;
  _43 = color.bgra;
  color ={v} {CLOBBER};
  return _43;

but we fail to vectorize more because there's a mix of min/max operations
here.  In fact we even recognize the reduction on the 'A' component but
again we fail to be clever enough to exploit it (it's also 3 lanes and
thus difficult to handle).