Neither the command-line flag -ftree-loop-vectorize nor -fopenmp combined with "#pragma omp simd" works when -Os is active. It seems that it when specified manually vectorization should be work even in -Os mode. I can almost see why -ftree-loop-vectorize wouldn't work, which is why I tried the manual marking of loops to vectorize, but the latter didn't work either. I would suggest documenting this behavior and fix at least vectorizing manually marked loops.
IIRC the issue is that -Os inhibits most loop-header copying. Can you provide a testcase that shows the issue please? Can you check if the following patch fixes things for you? Index: gcc/tree-ssa-loop-ch.c =================================================================== --- gcc/tree-ssa-loop-ch.c (revision 258380) +++ gcc/tree-ssa-loop-ch.c (working copy) @@ -257,8 +257,7 @@ public: /* opt_pass methods: */ virtual bool gate (function *fun) { - return flag_tree_ch != 0 - && (flag_tree_loop_vectorize != 0 || fun->has_force_vectorize_loops); + return flag_tree_loop_vectorize != 0 || fun->has_force_vectorize_loops; } /* Just copy headers, no initialization/finalization of loop structures. */
Hmm, patch can't help. Instead try the following which should make the omp simd case work. Index: gcc/tree-ssa-loop-ch.c =================================================================== --- gcc/tree-ssa-loop-ch.c (revision 258380) +++ gcc/tree-ssa-loop-ch.c (working copy) @@ -57,7 +57,8 @@ should_duplicate_loop_header_p (basic_bl be true, since quite often it is possible to verify that the condition is satisfied in the first iteration and therefore to eliminate it. Jump threading handles these cases now. */ - if (optimize_loop_for_size_p (loop)) + if (optimize_loop_for_size_p (loop) + && !loop->force_vectorize) { if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file,
FDO might also help given important loops should show up as hot.
I will try the patch. I just tried -fopt-info-vec-missed and the message reported for every loop was: note: not vectorized: latch block not empty. note: bad loop form.
On Fri, 9 Mar 2018, linux at carewolf dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777 > > --- Comment #4 from Allan Jensen <linux at carewolf dot com> --- > I will try the patch. I just tried -fopt-info-vec-missed and the message > reported for every loop was: > > note: not vectorized: latch block not empty. > note: bad loop form. Yeah, that's the effect for while () / for () style loops that haven't been transformed to do {} while () style by loop header copying.
Great. Your patch worked with 90% of the marked loops! The remaining report things like this with -fopt-info-vec-missed: note: not vectorized: relevant stmt not supported: idisty.872_437 = (unsigned int) idisty_386; note: bad operation or unsupported loop bound. But the result is already pretty good for -fopenmp with manually marked loops.
On Fri, 9 Mar 2018, linux at carewolf dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777 > > --- Comment #6 from Allan Jensen <linux at carewolf dot com> --- > Great. Your patch worked with 90% of the marked loops! Good! > The remaining report things like this with -fopt-info-vec-missed: > > note: not vectorized: relevant stmt not supported: idisty.872_437 = (unsigned > int) idisty_386; > note: bad operation or unsupported loop bound. > > But the result is already pretty good for -fopenmp with manually marked loops. So is it any better if you use -O2 rather than -Os? Do you really need -Os? GCCs -O2 isn't as excessive code-size wise as competitors like ICC.
Yes, those I say are missing are compared to -O2. I was investigating this in relation to Qt. We either build these files with -O3, or with -Os for customer that are binary size sensitive. Since some of the image handling routines are quite heavy and have been written for auto-vectorization I was just checking if I could get it to work and the results with your patch are quite good: Normal sizes of qdrawhelper.o with -O3/-O2/-Os: 277704 / 198984 / 168440 With -O2 -ftree-vectorize: 242224 With -O2 -fopenmp: 219536 With -Os -ftree-loop-vectorize: 168440 (no change) With -Os -fopenmp: 177144 (with your patch) So most of the -Os benefit and still many of the central draw loops auto-vectorized. Haven't benchmarked it yet though.
On Fri, 9 Mar 2018, linux at carewolf dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777 > > --- Comment #8 from Allan Jensen <linux at carewolf dot com> --- > Yes, those I say are missing are compared to -O2. I was investigating this in > relation to Qt. We either build these files with -O3, or with -Os for customer > that are binary size sensitive. Since some of the image handling routines are > quite heavy and have been written for auto-vectorization I was just checking if > I could get it to work and the results with your patch are quite good: > > Normal sizes of qdrawhelper.o with -O3/-O2/-Os: > 277704 / 198984 / 168440 > > With -O2 -ftree-vectorize: 242224 > With -O2 -fopenmp: 219536 > With -Os -ftree-loop-vectorize: 168440 (no change) > With -Os -fopenmp: 177144 (with your patch) > > So most of the -Os benefit and still many of the central draw loops > auto-vectorized. That looks indeed good. We have enough infrastructure already to support a #pragma GCC vectorize as well (it was added for Ada) just nobody bothered to add C/C++ support. > Haven't benchmarked it yet though. I will test and post the patch, I think it makes sense.
GCC 8 has the patch now.
Author: rguenth Date: Mon Mar 12 08:45:54 2018 New Revision: 258444 URL: https://gcc.gnu.org/viewcvs?rev=258444&root=gcc&view=rev Log: 2018-03-12 Richard Biener <rguenther@suse.de> PR tree-optimization/84777 * tree-ssa-loop-ch.c (should_duplicate_loop_header_p): For force-vectorize loops ignore whether we are optimizing for size. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-loop-ch.c
Richi: Planning to backport or should we close it?
I will backport.
Author: rguenth Date: Tue Nov 20 14:47:49 2018 New Revision: 266318 URL: https://gcc.gnu.org/viewcvs?rev=266318&root=gcc&view=rev Log: 2018-11-20 Richard Biener <rguenther@suse.de> Backport from mainline 2018-03-12 Richard Biener <rguenther@suse.de> PR tree-optimization/84777 * tree-ssa-loop-ch.c (should_duplicate_loop_header_p): For force-vectorize loops ignore whether we are optimizing for size. 2018-01-26 Richard Biener <rguenther@suse.de> PR rtl-optimization/84003 * dse.c (record_store): Only record redundant stores when the earlier store aliases at least all accesses the later one does. * g++.dg/torture/pr77745.C: Mark foo noinline to trigger latent bug in DSE if NOINLINE is appropriately defined. * g++.dg/torture/pr77745-2.C: New testcase including pr77745.C and defining NOINLINE. Added: branches/gcc-7-branch/gcc/testsuite/g++.dg/torture/pr77745-2.C Modified: branches/gcc-7-branch/gcc/ChangeLog branches/gcc-7-branch/gcc/dse.c branches/gcc-7-branch/gcc/testsuite/ChangeLog branches/gcc-7-branch/gcc/testsuite/g++.dg/torture/pr77745.C branches/gcc-7-branch/gcc/tree-ssa-loop-ch.c
Fixed.