Bug 84777 - -Os inhibits all vectorization
Summary: -Os inhibits all vectorization
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 8.0
: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-03-09 10:41 UTC by Allan Jensen
Modified: 2018-11-20 14:48 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work: 7.3.1, 8.1.0
Known to fail:
Last reconfirmed: 2018-11-20 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Allan Jensen 2018-03-09 10:41:25 UTC
Neither the command-line flag -ftree-loop-vectorize nor -fopenmp combined with "#pragma omp simd" works when -Os is active.

It seems that it when specified manually vectorization should be work even in -Os mode. I can almost see why -ftree-loop-vectorize wouldn't work, which is why I tried the manual marking of loops to vectorize, but the latter didn't work either.

I would suggest documenting this behavior and fix at least vectorizing manually marked loops.
Comment 1 Richard Biener 2018-03-09 11:06:23 UTC
IIRC the issue is that -Os inhibits most loop-header copying.

Can you provide a testcase that shows the issue please?

Can you check if the following patch fixes things for you?

Index: gcc/tree-ssa-loop-ch.c
===================================================================
--- gcc/tree-ssa-loop-ch.c      (revision 258380)
+++ gcc/tree-ssa-loop-ch.c      (working copy)
@@ -257,8 +257,7 @@ public:
   /* opt_pass methods: */
   virtual bool gate (function *fun)
   {
-    return flag_tree_ch != 0
-          && (flag_tree_loop_vectorize != 0 || fun->has_force_vectorize_loops);
+    return flag_tree_loop_vectorize != 0 || fun->has_force_vectorize_loops;
   }
   
   /* Just copy headers, no initialization/finalization of loop structures.  */
Comment 2 Richard Biener 2018-03-09 11:16:36 UTC
Hmm, patch can't help.  Instead try the following which should make the omp simd
case work.

Index: gcc/tree-ssa-loop-ch.c
===================================================================
--- gcc/tree-ssa-loop-ch.c      (revision 258380)
+++ gcc/tree-ssa-loop-ch.c      (working copy)
@@ -57,7 +57,8 @@ should_duplicate_loop_header_p (basic_bl
      be true, since quite often it is possible to verify that the condition is
      satisfied in the first iteration and therefore to eliminate it.  Jump
      threading handles these cases now.  */
-  if (optimize_loop_for_size_p (loop))
+  if (optimize_loop_for_size_p (loop)
+      && !loop->force_vectorize)
     {
       if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file,
Comment 3 Richard Biener 2018-03-09 11:17:27 UTC
FDO might also help given important loops should show up as hot.
Comment 4 Allan Jensen 2018-03-09 11:59:43 UTC
I will try the patch. I just tried -fopt-info-vec-missed and the message reported for every loop was:

note: not vectorized: latch block not empty.
note: bad loop form.
Comment 5 rguenther@suse.de 2018-03-09 12:02:54 UTC
On Fri, 9 Mar 2018, linux at carewolf dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777
> 
> --- Comment #4 from Allan Jensen <linux at carewolf dot com> ---
> I will try the patch. I just tried -fopt-info-vec-missed and the message
> reported for every loop was:
> 
> note: not vectorized: latch block not empty.
> note: bad loop form.

Yeah, that's the effect for while () / for () style loops that haven't
been transformed to do {} while () style by loop header copying.
Comment 6 Allan Jensen 2018-03-09 12:11:16 UTC
Great. Your patch worked with 90% of the marked loops!

The remaining report things like this with -fopt-info-vec-missed:

note: not vectorized: relevant stmt not supported: idisty.872_437 = (unsigned int) idisty_386;
note: bad operation or unsupported loop bound.

But the result is already pretty good for -fopenmp with manually marked loops.
Comment 7 rguenther@suse.de 2018-03-09 12:14:06 UTC
On Fri, 9 Mar 2018, linux at carewolf dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777
> 
> --- Comment #6 from Allan Jensen <linux at carewolf dot com> ---
> Great. Your patch worked with 90% of the marked loops!

Good!

> The remaining report things like this with -fopt-info-vec-missed:
> 
> note: not vectorized: relevant stmt not supported: idisty.872_437 = (unsigned
> int) idisty_386;
> note: bad operation or unsupported loop bound.
> 
> But the result is already pretty good for -fopenmp with manually marked loops.

So is it any better if you use -O2 rather than -Os?  Do you really need
-Os?  GCCs -O2 isn't as excessive code-size wise as competitors like ICC.
Comment 8 Allan Jensen 2018-03-09 12:43:01 UTC
Yes, those I say are missing are compared to -O2. I was investigating this in relation to Qt. We either build these files with -O3, or with -Os for customer that are binary size sensitive. Since some of the image handling routines are quite heavy and have been written for auto-vectorization I was just checking if I could get it to work and the results with your patch are quite good:

Normal sizes of qdrawhelper.o with -O3/-O2/-Os: 
277704 / 198984 / 168440

With -O2 -ftree-vectorize: 242224
With -O2 -fopenmp: 219536
With -Os -ftree-loop-vectorize: 168440 (no change)
With -Os -fopenmp: 177144 (with your patch)

So most of the -Os benefit and still many of the central draw loops auto-vectorized.

Haven't benchmarked it yet though.
Comment 9 rguenther@suse.de 2018-03-09 13:16:15 UTC
On Fri, 9 Mar 2018, linux at carewolf dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84777
> 
> --- Comment #8 from Allan Jensen <linux at carewolf dot com> ---
> Yes, those I say are missing are compared to -O2. I was investigating this in
> relation to Qt. We either build these files with -O3, or with -Os for customer
> that are binary size sensitive. Since some of the image handling routines are
> quite heavy and have been written for auto-vectorization I was just checking if
> I could get it to work and the results with your patch are quite good:
> 
> Normal sizes of qdrawhelper.o with -O3/-O2/-Os: 
> 277704 / 198984 / 168440
> 
> With -O2 -ftree-vectorize: 242224
> With -O2 -fopenmp: 219536
> With -Os -ftree-loop-vectorize: 168440 (no change)
> With -Os -fopenmp: 177144 (with your patch)
> 
> So most of the -Os benefit and still many of the central draw loops
> auto-vectorized.

That looks indeed good.  We have enough infrastructure already to
support a #pragma GCC vectorize as well (it was added for Ada) just
nobody bothered to add C/C++ support.

> Haven't benchmarked it yet though.

I will test and post the patch, I think it makes sense.
Comment 10 Richard Biener 2018-03-12 08:46:17 UTC
GCC 8 has the patch now.
Comment 11 Richard Biener 2018-03-12 08:46:29 UTC
Author: rguenth
Date: Mon Mar 12 08:45:54 2018
New Revision: 258444

URL: https://gcc.gnu.org/viewcvs?rev=258444&root=gcc&view=rev
Log:
2018-03-12  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/84777
	* tree-ssa-loop-ch.c (should_duplicate_loop_header_p): For
	force-vectorize loops ignore whether we are optimizing for size.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-loop-ch.c
Comment 12 Martin Liška 2018-11-20 08:20:23 UTC
Richi: Planning to backport or should we close it?
Comment 13 Richard Biener 2018-11-20 08:39:34 UTC
I will backport.
Comment 14 Richard Biener 2018-11-20 14:48:25 UTC
Author: rguenth
Date: Tue Nov 20 14:47:49 2018
New Revision: 266318

URL: https://gcc.gnu.org/viewcvs?rev=266318&root=gcc&view=rev
Log:
2018-11-20  Richard Biener  <rguenther@suse.de>

	Backport from mainline
	2018-03-12  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/84777
	* tree-ssa-loop-ch.c (should_duplicate_loop_header_p): For
	force-vectorize loops ignore whether we are optimizing for size.

	2018-01-26  Richard Biener  <rguenther@suse.de>

	PR rtl-optimization/84003
	* dse.c (record_store): Only record redundant stores when
	the earlier store aliases at least all accesses the later one does.

	* g++.dg/torture/pr77745.C: Mark foo noinline to trigger
	latent bug in DSE if NOINLINE is appropriately defined.
	* g++.dg/torture/pr77745-2.C: New testcase including pr77745.C
	and defining NOINLINE.

Added:
    branches/gcc-7-branch/gcc/testsuite/g++.dg/torture/pr77745-2.C
Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/dse.c
    branches/gcc-7-branch/gcc/testsuite/ChangeLog
    branches/gcc-7-branch/gcc/testsuite/g++.dg/torture/pr77745.C
    branches/gcc-7-branch/gcc/tree-ssa-loop-ch.c
Comment 15 Richard Biener 2018-11-20 14:48:36 UTC
Fixed.