[Note that MP_PROP_DESIGN is also discussed on the gcc-graphite mailing list, albeit more with regard to automatic parallelization.]

The Polyhedron benchmark (2011 version) is available at http://www.polyhedron.com/polyhedron_benchmark_suite0html, namely http://www.polyhedron.com/web_images/documents/pb11.zip
(The original program, which also contains a ready-to-go benchmark, is at http://propdesign.weebly.com/; note that you may have to rename some input *.txt files to *TXT.)

The program takes twice as long with GCC as with ifort. It is just 502 lines long (without comments) and contains no subroutines or functions. It mainly consists of loops and some math functions (sin, cos, pow, tan, atan, acos, exp).

[Results on CentOS 5.7, x86-64-gnu-linux, Intel Xeon X3430 @ 2.40GHz]

Using GCC 4.8.0 20120622 (experimental) [trunk revision 188871], I get:

$ gfortran -Ofast -funroll-loops -fwhole-program -march=native mp_prop_design.f90
$ time ./a.out > /dev/null
real    2m47.138s
user    2m46.808s
sys     0m0.236s

Using Intel's ifort on Intel(R) 64, Version 12.1 Build 20120212:

$ ifort -fast mp_prop_design.f90
$ time ./a.out > /dev/null
real    1m25.906s
user    1m25.598s
sys     0m0.244s

With Intel's libimf preloaded (LD_PRELOAD=.../libimf.so), GCC has:

real    2m0.524s
user    1m59.809s
sys     0m0.689s

The code features expressions like a**2.0D0, but those are converted by GCC to a*a. Using -mveclibabi=svml (and no preloading) gives the same timings as without (or slightly worse); it just calls vmldAtan2.

Vectorizer: I haven't profiled this part, but I want to note that ifort vectorizes more. GCC vectorizes:

662: LOOP VECTORIZED.
1032: LOOP VECTORIZED.
1060: LOOP VECTORIZED.

while ifort has:

mp_prop_design.f90(271): (col. 10) remark: LOOP WAS VECTORIZED.  (Loop "m1 = 2, 45" with conditional jump out of the loop)
mp_prop_design.f90(552): (col. 16) remark: LOOP WAS VECTORIZED.  (Loop with condition)
mp_prop_design.f90(576): (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.  (Loop with two IF blocks)
mp_prop_design.f90(639): (col. 16) remark: LOOP WAS VECTORIZED.  (Rather simple loop)
mp_prop_design.f90(662): (col. 2) remark: LOOP WAS VECTORIZED.  (Vectorized by GCC)
mp_prop_design.f90(677): (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.  (Line number points to the outermost of the three loops; there are also conditional jumps)
mp_prop_design.f90(818): (col. 16) remark: LOOP WAS VECTORIZED.  (Nested "if" blocks)
mp_prop_design.f90(1032): (col. 2) remark: LOOP WAS VECTORIZED.
mp_prop_design.f90(1060): (col. 2) remark: LOOP WAS VECTORIZED.  (The last two are handled by GCC)
On trunk we do vectorize the loop at 552, but I'm not sure that unconditionally calling vmldAtan2 is profitable. That is, trunk for me has (-Ofast -mveclibabi=svml):

552: LOOP VECTORIZED.
576: LOOP VECTORIZED.
662: LOOP VECTORIZED.
1032: LOOP VECTORIZED.
1060: LOOP VECTORIZED.

The loop at 639 is converted to two memset calls.

mp_prop_design.f90(677): (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.  (Line number points to the outermost of the three loops; there are also conditional jumps)

seems to be the important one to tackle.

For the loop at 818 we fail to if-convert the nested IF

  IF ( j.EQ.1 ) THEN
     tempa(j) = ZERO
  ELSE
     arg1 = -vefz(j)
     arg2 = vefphi(j)
     IF ( (arg2.LT.ZERO) .OR. (arg2.GT.ZERO) ) THEN
        tempa(j) = ATAN(arg1/arg2) - theta(j)
     ELSE
        tempa(j) = -theta(j)
     ENDIF
  ENDIF

where we also fail to apply store motion of tempa(j). The if (j == 1) conditional code also makes the loop a candidate for peeling. That said, this loop is worth analyzing further as well.
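[Editorial note: the following is a hand-written sketch of what an if-converted, branchless form of that loop body could look like. It is not compiler output or code from PROP_DESIGN, and it assumes -Ofast-style semantics (non-trapping IEEE division), because MERGE evaluates both arms and so executes the division even when arg2 == ZERO.]

  program ifconv_sketch
    implicit none
    integer, parameter :: n = 8
    real(8), parameter :: zero = 0.0d0
    real(8) :: vefz(n), vefphi(n), theta(n), tempa(n)
    real(8) :: arg1, arg2
    integer :: j
    ! arbitrary test data, just so the sketch runs
    call random_number(vefz)
    call random_number(vefphi)
    call random_number(theta)
    do j = 1, n
       arg1 = -vefz(j)
       arg2 = vefphi(j)
       ! both arms are computed; the unused result is discarded
       tempa(j) = merge(atan(arg1/arg2) - theta(j), -theta(j), arg2 /= zero)
       ! the j == 1 special case, folded the same way instead of peeled
       tempa(j) = merge(zero, tempa(j), j == 1)
    end do
    print *, tempa
  end program ifconv_sketch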
All of the time is spent in the loop nest starting at lines 677, 683, 694, and 696; for all of them we claim they are in bad loop form.
The issue seems to be that the front end uses two induction variables, one signed and one unsigned, for

  DO i = 1 , 1 + NINT(2.0D0*PI*trns/dphit) , &
     & NINT(ainc/(dphit*(180.0D0/PI)))
  ...
  END DO

<bb 78>:
  # i_5 = PHI <[mp_prop_design.f90 : 697:0] 1(77), [mp_prop_design.f90 : 696:0] i_621(79)>
  # countm1.38_32 = PHI <[mp_prop_design.f90 : 696:0] countm1.38_466(77), [mp_prop_design.f90 : 696:0] countm1.38_622(79)>
  # prephitmp.386_3285 = PHI <pretmp.385_3284(77), D.2618_614(79)>
  # prephitmp.386_3287 = PHI <pretmp.385_3286(77), D.2620_620(79)>
  ...
  [mp_prop_design.f90 : 696:0] i_621 = i_5 + pretmp.378_3242;
  [mp_prop_design.f90 : 696:0] # DEBUG i => i_621
  [mp_prop_design.f90 : 696:0] if (countm1.38_32 == 0)
    goto <bb 80>;
  else
    goto <bb 79>;

<bb 79>:
  [mp_prop_design.f90 : 696:0] countm1.38_622 = countm1.38_32 + 4294967295;
  [mp_prop_design.f90 : 696:0] goto <bb 78>;

and the "decrement" of countm1 happens in the loop latch block. It would be better to have this similar to other loops I see:

  bool flag = end-value == i;
  i = i + 1;
  if (flag) goto loop_exit;
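[Editorial note: for reference, a minimal self-contained loop of the same shape, my own reduction with made-up bounds rather than code from mp_prop_design.f90. Every counted DO loop is lowered through the countm1 scheme quoted from gfc_trans_do in the next comment; a run-time, possibly non-unit step like the one above is what makes a simple "compare dovar against to" exit test harder.]

  program dual_iv
    implicit none
    integer :: i, n, step
    real(8) :: s
    n = 1000
    step = 3              ! run-time step; its sign is not known at compile time
    s = 0.0d0
    do i = 1, n, step
       s = s + real(i, 8)*0.5d0
    end do
    print *, s
  end program dual_iv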
(In reply to comment #3)
> It would be better to have this similar to other loops I see,
>
>   bool flag = end-value == i;
>   i = i + 1;
>   if (flag) goto loop_exit;

That's not that simple, as one might not reach the end value due to the step. If "step" is (plus or minus) unity and the loop variable is an integer (and not a real, added in Fortran 77 and later deleted from the standard), it is simple.

But if abs(step) != 1, or if the loop variable is not an integer, one either needs to calculate the number of trips beforehand or has to use ">" or "<" rather than "==". The problem with "<" / ">" is that one has to do another comparison, unless the sign of "step" is known:

  if (step > 0 ? dovar > to : dovar < to)
    goto exit_label;

I don't see whether that version is better than the current one. Suggestions or comments?

The current code is (comment from trans-stmt.c's gfc_trans_do):

------------<cut>-----------------
   We translate a do loop from:

   DO dovar = from, to, step
      body
   END DO

   to:

   [evaluate loop bounds and step]
   empty = (step > 0 ? to < from : to > from);
   countm1 = (to - from) / step;
   dovar = from;
   if (empty) goto exit_label;
   for (;;)
     {
       body;
cycle_label:
       dovar += step
       if (countm1 == 0) goto exit_label;
       countm1--;
     }
exit_label:

   countm1 is an unsigned integer.  It is equal to the loop count minus one,
   because the loop count itself can overflow.  */
------------</cut>-----------------
On Wed, 18 Jul 2012, burnus at gcc dot gnu.org wrote:

> That's not that simple, as one might not reach the end value due to the
> step. [...]
>
> The current code is (comment from trans-stmt.c's gfc_trans_do):
> [...]

If you do

> [evaluate loop bounds and step]
> empty = (step > 0 ? to < from : to > from);
> countm1 = (to - from) / step;
> dovar = from;
> if (empty) goto exit_label;
> for (;;)
>   {
>     body;
> cycle_label:
>     dovar += step

      exit = countm1 == 0;
      countm1--;

>     if (exit) goto exit_label;
>   }
> exit_label:

it would work for this case.
Created attachment 27823
Draft patch: Change comparison into bool assignment, decrement conditional jump

(In reply to comment #5)
> If you do
>   exit = countm1 == 0;
>   countm1--;
>   if (exit) goto exit_label;
> it would work for this case.

If I apply the attached patch, I do not see any performance difference on my AMD Athlon64 X2 4800+ with -Ofast -funroll-loops -march=native:

real  3m45.711s  3m45.589s  3m44.308s  |  3m45.363s  3m45.328s  3m44.220s
user  3m45.710s  3m45.582s  3m44.274s  |  3m45.282s  3m45.286s  3m44.218s
It helps to make us even consider the loop. We now run into

696: worklist: examine stmt: D.2574_254 = (real(kind=4)) i_5;
696: vect_is_simple_use: operand i_5
696: def_stmt: i_5 = PHI <1(77), i_324(80)>
696: Unsupported pattern.
696: not vectorized: unsupported use in stmt.
696: unexpected pattern.

that is, the following induction is not handled:

  phit = phib + phie(k) + (REAL(i)-0.50D0) &
       & *dphit

So it would still be worthwhile to pursue your patch if it does not have negative effects elsewhere. We should be able to fix the induction code to handle this case.

If you can help isolate the innermost two loops into a smaller testcase, that would be great, too.
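[Editorial note: as a starting point for such a reduction, something along the following lines keeps the essential ingredients, two nested loops and an induction variable used only through REAL(i). It is a guess at a minimal form with made-up names and bounds, and it has not been verified to reproduce the exact "unsupported use in stmt" failure above.]

  program reduced
    implicit none
    integer :: i, k
    real :: phib, dphit, phit, s
    real :: phie(10)
    phie  = 0.25
    phib  = 0.1
    dphit = 0.01
    s = 0.0
    do k = 1, 10
       do i = 1, 360, 3
          ! the induction i is only used via a kind-4 REAL conversion
          phit = phib + phie(k) + (real(i) - 0.50)*dphit
          s = s + cos(phit)
       end do
    end do
    print *, s
  end program reduced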
(In reply to comment #7) > so it would be still worthwhile to pursue your patch if it does not have > negative effects elsewhere. We should be able to fix the induction code > to handle this case. Regarding negative (or positive) impact with regards to performance: That's difficult to test :-( However, with the patch, f951 stops with the following ICE internal compiler error: in free_regset_pool, at sel-sched-ir.c:994 with gfortran.dg/pr42294.f and gfortran.dg/pr44691.f.
(In reply to comment #8)
> However, with the patch, f951 stops with the following ICE for
> gfortran.dg/pr42294.f and gfortran.dg/pr44691.f:
>
>   internal compiler error: in free_regset_pool, at sel-sched-ir.c:994

That's a pre-existing issue on current trunk, unrelated to the patch.
There are a lot more reasons why we do not vectorize this loop :(
(In reply to comment #6)
> Created attachment 27823
> Draft patch: Change comparison into bool assignment, decrement conditional jump

A similar but slightly different patch has been committed; cf. PR 52865, comment 13.
Hi guys,

I'm the developer of PROP_DESIGN. I originally posted on the Google GCC Graphite group. Thanks, Tobias, for creating this bug and identifying the root issue. I originally thought auto-parallelization would be of benefit. However, I recently started experimenting with the Intel Fortran compiler and have found some things that may help you out.

I have found that Intel Fortran IPO, auto-vectorization, and/or auto-parallelization are of no benefit to PROP_DESIGN. I also found, as Tobias mentioned here, that gfortran creates significantly slower executables than Intel Fortran. I have narrowed it down to just the basic optimizations; it does not have to do with anything else. If you compare gfortran -O3 against Intel Fortran /O3, you see a big difference. In one case I ran, the executable was about 38.65% faster with Intel Fortran /O3 than with gfortran -O3.

I have a measly AMD C-60 processor and use Windows 7 64-bit. I have tried many other gfortran-for-Windows builds in the past, but I'm currently using the latest version of TDM-GCC. I have also tried Linux but am not currently using it. I have not tried Intel Fortran on Linux.

I am not much of a programmer, so I can't say why gfortran -O3 is producing slower executables than Intel Fortran /O3. Perhaps you would know. I thought this information might help you out. If I can be of any help to you, let me know.

My website has the latest version of PROP_DESIGN. Polyhedron refuses to update the version I sent them years ago. It would probably be better if you used the latest version for testing your software.

Sincerely,

Anthony Falzone
http://propdesign.weebly.com/
My previous post needs a correction. Comparing gfortran -O3 to Intel Fortran O3, I see a 60% speed improvement in favor of the Intel Fortran compiler. There is a 40% improvement over past releases of PROP_DESIGN, which used gfortran -Ofast. There is not much difference between Intel Fortran O3 and Ofast, so I am using O3 to ensure accurate calculations.
Anthony, could you provide a reduced test showing the problem?
(In reply to Dominique d'Humieres from comment #14)
> Anthony, could you provide a reduced test showing the problem?

Hi Dominique,

About the most reduced I can think of is PROP_DESIGN_ANALYSIS. It contains the core calculations that are required to determine aircraft propeller performance. PROP_DESIGN_ANALYSIS_BENCHMARK just adds some looping options that, in my mind, could run in parallel. However, I don't know anything about parallel programming, and when I tried some Fortran compilers with auto-parallelization, none of them picked up on the loops that to me seem obviously parallel. So what I think and what is currently feasible with auto-parallelization are not the same.

In any event, I have noticed that just using O3 optimizations there is a substantial difference between Intel Fortran and gfortran. So I am just confirming what Tobias is saying here in this bug report.

If PROP_DESIGN_ANALYSIS_BENCHMARK is too complex a test case for you, the only thing I can think to do is take PROP_DESIGN_ANALYSIS and strip out most of the end parts where the various outputs are created. The program would then just take the inputs, run the minimum calculations with the least looping possible, and output pretty much nothing. It wouldn't be hard for me to do something like that if it would be of any benefit to you; I'm not sure.

Also, if there is anything programming-wise that you would like changed as far as Fortran syntax goes, I can try that too. My knowledge of Fortran is fairly basic. I tried to stick strictly to Fortran 77, since that is what I was trained in and have a lot of experience with. I don't know any other programming languages, or even other versions of Fortran such as 90/95.

Anthony
Still present on current trunk:

gfortran -Ofast -funroll-loops -march=native -mtune=native -fopt-info-vec mp_prop_design.f90
mp_prop_design.f90:1117:0: note: loop vectorized
mp_prop_design.f90:1117:0: note: loop vectorized
mp_prop_design.f90:1087:0: note: loop vectorized
mp_prop_design.f90:1060:0: note: loop vectorized
mp_prop_design.f90:1032:0: note: loop vectorized
mp_prop_design.f90:662:0: note: loop vectorized
mp_prop_design.f90:375:0: note: loop vectorized
mp_prop_design.f90:375:0: note: loop vectorized
mp_prop_design.f90:14:0: note: basic block vectorized
hi, does anyone know how to contact someone about deleting my GCC Bugzilla account? i can't find any contact info except overseers@gcc.gnu.org, and mail to that address bounces.
Under Preferences/Email Preferences, you can select "Disable All Mail", which should work and keep you from getting unwanted mail.
hi everyone,

I'm not sure if this is the right place to ask this or not, but it relates to the topic. I can't find the other thread about Graphite auto-parallelization that I made a long time ago.

I tried gfortran 10.1.0 via MSYS2. It seems to work very well on the latest version of PROP_DESIGN. MP_PROP_DESIGN had some extra loops for benchmarking; I found they made things harder for the optimizer, so I deleted that code and just use the 'real' version of the code it was based on, called PROP_DESIGN_MAPS. So that's the actual propeller design code with no additional looping for benchmarking purposes.

I've found no Fortran compiler that does the auto-parallelization the way I would like. The only compiler that implemented any of it at run time actually slowed the code way down instead of speeding it up.

I still have my original problem with gfortran: at runtime no actual parallelization occurs. The code runs exactly the same as if the options were not present. Oddly, though, the compiler does say it auto-parallelized many loops; not the loops that would really help, but at least it shows it's doing something. That's an improvement from when I started these threads.

The problem is that if I compile with the following:

gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -static -march=x86-64 -mtune=generic -mfpmath=sse -mieee-fp -pthread -ftree-parallelize-loops=2 -floop-parallelize-all -fopt-info-loop

it runs exactly the same way as if I compile with:

gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -static -march=x86-64 -mtune=generic -mfpmath=sse -mieee-fp

Again, gfortran does say it auto-parallelized some loops, so it's very odd. I have searched the net and can't find anything that has helped. I'm wondering if, for Linux users, the code actually does run in parallel; that would at least narrow the problem down some. I'm using Windows 10 and the code will only use one core. Compiling both ways, it shows 2 threads used for a while and then drops to 1 thread.

The good news since this was posted is that gfortran ran the code at the same speed as the PGI Community Edition compiler. Since they just stopped developing that, I switched back to gfortran. I no longer have Intel Fortran to test. That was the compiler that actually did run the code in parallel, but it ran twice as slow instead of twice as fast. That was a year or two ago; I don't know if it's any better now.

I'm wondering if there is some sort of issue with -pthread not being able to use more than one core on Windows 10.

You can download PROP_DESIGN at https://propdesign.jimdofree.com. Inside the download are all the *.f files. I also have c.bat files in there with the compiler options I used. The auto-parallelization options are not present, since they still don't seem to be working, at least on Windows 10.

The code now runs much faster than it used to, due to many bug fixes and improvements I've made over the years. However, you can make it run really slowly for testing purposes. In the settings file for the program, change the defaults like this:

1 ALLOW VORTEX INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)
2 ALLOW BLADE-TO-BLADE INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)

or like this:

1 ALLOW VORTEX INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)
1 ALLOW BLADE-TO-BLADE INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)

The first runs very slow, the second incredibly slow.
I just close the command window once I've seen whether the code is running in parallel or not. With the defaults set at 2 for each of those values, the code runs so fast you can't really get a sense of what's going on.

Thanks for any help,

Anthony
I looked at the output of

gfortran -Ofast -funroll-loops -march=native -mtune=native -fopt-info-vec mp_prop_design.f90 2>&1 | wc -l

in the Polyhedron test suite, and I can confirm it really shows much more vectorization than before. So I'm glad that part seems to be fixed.

Regarding auto-parallelization: I can confirm your observation under Linux; it doesn't do any more than on Windows. I think I will open a meta-PR and make this PR depend on it.
Another question: Is there anything left to be done with the vectorizer, or could we remove that dependency?
(In reply to Thomas Koenig from comment #21)
> Another question: Is there anything left to be done with the
> vectorizer, or could we remove that dependency?

thanks for looking into this again for me. i'm surprised it worked the same on Linux, but knowing that at least helps debug this issue some more. i'm not sure about the vectorizer question; maybe that was intended for someone else. the runtimes seem good as is, though.

i doubt the auto-parallelization will add much speed, but it's an interesting feature that i've always hoped would work. i've never got it to work, though. the only compiler that did actually implement something was Intel Fortran; it parallelized one trivial loop, but it slowed the code down instead of speeding it up. the output from gfortran shows more loops it wants to run in parallel. they aren't important ones, but something would be better than nothing. if it slowed the code down, i would just not use it.

there is something different in gfortran where it mentions a lot of 16-bit vectorization. i don't recall that from the past, but whatever it's doing seems fine from a speed perspective.

some compliments to the developers: the code compiles very fast compared to other compilers. i'm really glad it doesn't rely on Microsoft Visual Studio; that's a huge, time-consuming install, and I was very happy I could finally uninstall it. also, gfortran handles all my STOP statements properly. PGI Community Edition was adding a bunch of nonsense output any time a STOP command was issued, so it's nice to have the code work as intended again.
On Sun, 28 Jun 2020, prop_design at protonmail dot com wrote:

> i doubt the auto-parallelization will add much speed, but it's an
> interesting feature that i've always hoped would work. i've never got it
> to work, though. [...]

GCC adds runtime checks for a minimal number of iterations before dispatching to the parallelized code - I guess we simply never hit the threshold. This is configurable via --param parloops-min-per-thread; the default is 100. The default number of threads is determined the same way as for OpenMP, so you can probably tune that via OMP_NUM_THREADS.
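[Editorial note: a concrete way to try those knobs, purely as an illustration of the parameters mentioned above; the specific values are guesses and untested on PROP_DESIGN_MAPS. Lowering parloops-min-per-thread below its default of 100 makes the runtime check accept fewer iterations per thread, and OMP_NUM_THREADS is the environment variable suggested above.]

  gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -pthread -ftree-parallelize-loops=2 --param parloops-min-per-thread=25
  set OMP_NUM_THREADS=2
  PROP_DESIGN_MAPS.exe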
(In reply to rguenther@suse.de from comment #23)
> GCC adds runtime checks for a minimal number of iterations before
> dispatching to the parallelized code - I guess we simply never hit
> the threshold. This is configurable via --param parloops-min-per-thread;
> the default is 100. The default number of threads is determined the same
> way as for OpenMP, so you can probably tune that via OMP_NUM_THREADS.

thanks for that tip. i tried changing the parloops parameters but no luck. the only difference was the max thread use went from 2 to 3; core use was the same.

i added the following, and some variations of these:

--param parloops-min-per-thread=2 (the default was 100, like you said)
--param parloops-chunk-size=1 (the default was zero, so i removed this parameter later)
--param parloops-schedule=auto (tried all options except guided; the default is static)

i was able to check that they were set via:

--help=param -Q

some other things i tried were adding -mthreads and removing -static, but so far no luck. i also tried using -mthreads instead of -pthread.

i should make clear i'm testing PROP_DESIGN_MAPS, not MP_PROP_DESIGN. MP_PROP_DESIGN is ancient, and the added benchmarking loops were messing with the ability of the optimizer to auto-parallelize (in the past, at least).
(In reply to Anthony from comment #24)
> thanks for that tip. i tried changing the parloops parameters but no luck.
> the only difference was the max thread use went from 2 to 3; core use was
> the same. [...]

I did more testing, and the added options actually slow the code way down. However, it still only uses one core. From what i can tell, setting OMP_PLACES doesn't seem to be working. i saw a thread from years ago where someone had the same problem; it said OMP_PLACES might work on Linux but not on Windows. i don't really know, but i've exhausted all the possibilities at this point. the only thing i know for sure is i can't get it to use anything more than one core.
so after trying a bunch of things, i think the final problem may be this. when i try to set thread affinity with

set GOMP_CPU_AFFINITY="0 1"

i get the following feedback at run time:

libgomp: Affinity not supported on this configuration

i have to close the command prompt window to stop the program; it doesn't run properly if i try to set thread affinity. so this still makes me think it might work on Linux but not Windows 10, but i have no way to test that. the extra threads that auto-parallelization creates will only go to one core, on my machine at least.