[Note that MP_PROP_DESIGN is also discussed on the gcc-graphite mailing list, albeit more with regard to automatic parallelization.]

The Polyhedron benchmark (2011 version) is available at http://www.polyhedron.com/polyhedron_benchmark_suite0html, namely http://www.polyhedron.com/web_images/documents/pb11.zip
(The original program, which also contains a ready-to-go benchmark, is at http://propdesign.weebly.com/; note that you may have to rename some input *.txt files to *TXT.)

The program takes twice as long with GCC as with ifort. It is just 502 lines long (without comments) and contains no subroutines or functions. It mainly consists of loops and some math functions (sin, cos, pow, tan, atan, acos, exp).

[Results on CentOS 5.7, x86-64-gnu-linux, Intel Xeon X3430 @ 2.40GHz]

Using GCC 4.8.0 20120622 (experimental) [trunk revision 188871], I get:

$ gfortran -Ofast -funroll-loops -fwhole-program -march=native mp_prop_design.f90
$ time ./a.out > /dev/null
real    2m47.138s
user    2m46.808s
sys     0m0.236s

Using Intel's ifort on Intel(R) 64, Version 12.1 Build 20120212:

$ ifort -fast mp_prop_design.f90
$ time ./a.out > /dev/null
real    1m25.906s
user    1m25.598s
sys     0m0.244s

With Intel's libimf preloaded (LD_PRELOAD=.../libimf.so), GCC has:

real    2m0.524s
user    1m59.809s
sys     0m0.689s

The code features expressions like a**2.0D0, but those are converted by GCC to a*a. Using -mveclibabi=svml (and no preloading) gives the same timings as without (or slightly worse); it just calls vmldAtan2.

Vectorizer: I haven't profiled this part, but I want to note that ifort vectorizes more. GCC vectorizes:

662: LOOP VECTORIZED.
1032: LOOP VECTORIZED.
1060: LOOP VECTORIZED.

while ifort has:

mp_prop_design.f90(271): (col. 10) remark: LOOP WAS VECTORIZED.  (Loop "m1 = 2, 45" with conditional jump out of the loop)
mp_prop_design.f90(552): (col. 16) remark: LOOP WAS VECTORIZED.  (Loop with condition)
mp_prop_design.f90(576): (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.  (Loop with two IF blocks)
mp_prop_design.f90(639): (col. 16) remark: LOOP WAS VECTORIZED.  (Rather simple loop)
mp_prop_design.f90(662): (col. 2) remark: LOOP WAS VECTORIZED.  (Vectorized by GCC)
mp_prop_design.f90(677): (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.  (Line number points to the outermost of the three loops; there are also conditional jumps)
mp_prop_design.f90(818): (col. 16) remark: LOOP WAS VECTORIZED.  (Nested "if" blocks)
mp_prop_design.f90(1032): (col. 2) remark: LOOP WAS VECTORIZED.
mp_prop_design.f90(1060): (col. 2) remark: LOOP WAS VECTORIZED.  (The last two are handled by GCC)
On trunk we do vectorize the loop at 552, but I'm not sure that unconditionally calling vmldAtan2 is profitable. That is, trunk for me has (-Ofast -mveclibabi=svml):

552: LOOP VECTORIZED.
576: LOOP VECTORIZED.
662: LOOP VECTORIZED.
1032: LOOP VECTORIZED.
1060: LOOP VECTORIZED.

The loop at 639 is converted to two memset calls.

mp_prop_design.f90(677): (col. 16) remark: PARTIAL LOOP WAS VECTORIZED.  (Line number points to the outermost of the three loops; there are also conditional jumps)

seems to be the important one to tackle.

For the loop at 818 we fail to if-convert the nested IF

  IF ( j.EQ.1 ) THEN
     tempa(j) = ZERO
  ELSE
     arg1 = -vefz(j)
     arg2 = vefphi(j)
     IF ( (arg2.LT.ZERO) .OR. (arg2.GT.ZERO) ) THEN
        tempa(j) = ATAN(arg1/arg2) - theta(j)
     ELSE
        tempa(j) = -theta(j)
     ENDIF
  ENDIF

where we also fail to apply store motion of tempa(j). The if (j == 1) conditional code also makes the loop a candidate for peeling. That said, this loop is worth analyzing further as well.
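[Editorial note: the following is a hand-written sketch of what an if-converted, branchless form of that loop body could look like. It is not compiler output or code from PROP_DESIGN, and it assumes -Ofast-style semantics (non-trapping IEEE division), because MERGE evaluates both arms and so executes the division even when arg2 == ZERO.]

  program ifconv_sketch
    implicit none
    integer, parameter :: n = 8
    real(8), parameter :: zero = 0.0d0
    real(8) :: vefz(n), vefphi(n), theta(n), tempa(n)
    real(8) :: arg1, arg2
    integer :: j
    ! arbitrary test data, just so the sketch runs
    call random_number(vefz)
    call random_number(vefphi)
    call random_number(theta)
    do j = 1, n
       arg1 = -vefz(j)
       arg2 = vefphi(j)
       ! both arms are computed; the unused result is discarded
       tempa(j) = merge(atan(arg1/arg2) - theta(j), -theta(j), arg2 /= zero)
       ! the j == 1 special case, folded the same way instead of peeled
       tempa(j) = merge(zero, tempa(j), j == 1)
    end do
    print *, tempa
  end program ifconv_sketch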
All of the time is spent in the loop nest starting at lines 677, 683, 694, and 696; for all of them we claim they are in bad loop form.
The issue seems to be that the front end uses two induction variables, one signed and one unsigned, for

  DO i = 1 , 1 + NINT(2.0D0*PI*trns/dphit) , &
     & NINT(ainc/(dphit*(180.0D0/PI)))
  ...
  END DO

<bb 78>:
  # i_5 = PHI <[mp_prop_design.f90 : 697:0] 1(77), [mp_prop_design.f90 : 696:0] i_621(79)>
  # countm1.38_32 = PHI <[mp_prop_design.f90 : 696:0] countm1.38_466(77), [mp_prop_design.f90 : 696:0] countm1.38_622(79)>
  # prephitmp.386_3285 = PHI <pretmp.385_3284(77), D.2618_614(79)>
  # prephitmp.386_3287 = PHI <pretmp.385_3286(77), D.2620_620(79)>
  ...
  [mp_prop_design.f90 : 696:0] i_621 = i_5 + pretmp.378_3242;
  [mp_prop_design.f90 : 696:0] # DEBUG i => i_621
  [mp_prop_design.f90 : 696:0] if (countm1.38_32 == 0)
    goto <bb 80>;
  else
    goto <bb 79>;

<bb 79>:
  [mp_prop_design.f90 : 696:0] countm1.38_622 = countm1.38_32 + 4294967295;
  [mp_prop_design.f90 : 696:0] goto <bb 78>;

and the "decrement" of countm1 happens in the loop latch block. It would be better to have this similar to other loops I see:

  bool flag = end-value == i;
  i = i + 1;
  if (flag) goto loop_exit;
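[Editorial note: for reference, a minimal self-contained loop of the same shape, my own reduction with made-up bounds rather than code from mp_prop_design.f90. Every counted DO loop is lowered through the countm1 scheme quoted from gfc_trans_do in the next comment; a run-time, possibly non-unit step like the one above is what makes a simple "compare dovar against to" exit test harder.]

  program dual_iv
    implicit none
    integer :: i, n, step
    real(8) :: s
    n = 1000
    step = 3              ! run-time step; its sign is not known at compile time
    s = 0.0d0
    do i = 1, n, step
       s = s + real(i, 8)*0.5d0
    end do
    print *, s
  end program dual_iv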
(In reply to comment #3)
> It would be better to have this similar to other loops I see,
>
>   bool flag = end-value == i;
>   i = i + 1;
>   if (flag) goto loop_exit;

That's not that simple, as one might not reach the end value due to the step. If "step" is (plus or minus) unity and the loop variable is an integer (and not a real, added in Fortran 77 and later deleted from the standard), it is simple.

But if abs(step) != 1, or if the loop variable is not an integer, one either needs to calculate the number of trips beforehand or has to use ">" or "<" rather than "==". The problem with "<" / ">" is that one has to do another comparison, unless the sign of "step" is known:

  if (step > 0 ? dovar > to : dovar < to)
    goto exit_label;

I don't see whether that version is better than the current one. Suggestions or comments?

The current code is (comment from trans-stmt.c's gfc_trans_do):

------------<cut>-----------------
   We translate a do loop from:

   DO dovar = from, to, step
      body
   END DO

   to:

   [evaluate loop bounds and step]
   empty = (step > 0 ? to < from : to > from);
   countm1 = (to - from) / step;
   dovar = from;
   if (empty) goto exit_label;
   for (;;)
     {
       body;
cycle_label:
       dovar += step
       if (countm1 == 0) goto exit_label;
       countm1--;
     }
exit_label:

   countm1 is an unsigned integer.  It is equal to the loop count minus one,
   because the loop count itself can overflow.  */
------------</cut>-----------------
On Wed, 18 Jul 2012, burnus at gcc dot gnu.org wrote:

> That's not that simple, as one might not reach the end value due to the
> step. [...]
>
> The current code is (comment from trans-stmt.c's gfc_trans_do):
> [...]

If you do

> [evaluate loop bounds and step]
> empty = (step > 0 ? to < from : to > from);
> countm1 = (to - from) / step;
> dovar = from;
> if (empty) goto exit_label;
> for (;;)
>   {
>     body;
> cycle_label:
>     dovar += step

      exit = countm1 == 0;
      countm1--;

>     if (exit) goto exit_label;
>   }
> exit_label:

it would work for this case.
Created attachment 27823
Draft patch: Change comparison into bool assignment, decrement conditional jump

(In reply to comment #5)
> If you do
>   exit = countm1 == 0;
>   countm1--;
>   if (exit) goto exit_label;
> it would work for this case.

If I apply the attached patch, I do not see any performance difference on my AMD Athlon64 X2 4800+ with -Ofast -funroll-loops -march=native:

real  3m45.711s  3m45.589s  3m44.308s  |  3m45.363s  3m45.328s  3m44.220s
user  3m45.710s  3m45.582s  3m44.274s  |  3m45.282s  3m45.286s  3m44.218s
It helps to make us even consider the loop. We now run into

696: worklist: examine stmt: D.2574_254 = (real(kind=4)) i_5;
696: vect_is_simple_use: operand i_5
696: def_stmt: i_5 = PHI <1(77), i_324(80)>
696: Unsupported pattern.
696: not vectorized: unsupported use in stmt.
696: unexpected pattern.

that is, the following induction is not handled:

  phit = phib + phie(k) + (REAL(i)-0.50D0) &
       & *dphit

So it would still be worthwhile to pursue your patch if it does not have negative effects elsewhere. We should be able to fix the induction code to handle this case.

If you can help isolate the innermost two loops into a smaller testcase, that would be great, too.
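[Editorial note: as a starting point for such a reduction, something along the following lines keeps the essential ingredients, two nested loops and an induction variable used only through REAL(i). It is a guess at a minimal form with made-up names and bounds, and it has not been verified to reproduce the exact "unsupported use in stmt" failure above.]

  program reduced
    implicit none
    integer :: i, k
    real :: phib, dphit, phit, s
    real :: phie(10)
    phie  = 0.25
    phib  = 0.1
    dphit = 0.01
    s = 0.0
    do k = 1, 10
       do i = 1, 360, 3
          ! the induction i is only used via a kind-4 REAL conversion
          phit = phib + phie(k) + (real(i) - 0.50)*dphit
          s = s + cos(phit)
       end do
    end do
    print *, s
  end program reduced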
(In reply to comment #7) > so it would be still worthwhile to pursue your patch if it does not have > negative effects elsewhere. We should be able to fix the induction code > to handle this case. Regarding negative (or positive) impact with regards to performance: That's difficult to test :-( However, with the patch, f951 stops with the following ICE internal compiler error: in free_regset_pool, at sel-sched-ir.c:994 with gfortran.dg/pr42294.f and gfortran.dg/pr44691.f.
(In reply to comment #8)
> However, with the patch, f951 stops with the following ICE for
> gfortran.dg/pr42294.f and gfortran.dg/pr44691.f:
>
>   internal compiler error: in free_regset_pool, at sel-sched-ir.c:994

That's a pre-existing issue on current trunk, unrelated to the patch.
There are a lot more reasons why we do not vectorize this loop :(
(In reply to comment #6)
> Created attachment 27823
> Draft patch: Change comparison into bool assignment, decrement conditional jump

A similar but slightly different patch has been committed; cf. PR 52865, comment 13.
Hi guys,

I'm the developer of PROP_DESIGN. I originally posted on the Google GCC Graphite group. Thanks, Tobias, for creating this bug and identifying the root issue. I originally thought auto-parallelization would be of benefit. However, I recently started experimenting with the Intel Fortran compiler and have found some things that may help you out.

I have found that Intel Fortran IPO, auto-vectorization, and/or auto-parallelization are of no benefit to PROP_DESIGN. I also found, as Tobias mentioned here, that gfortran creates significantly slower executables than Intel Fortran. I have narrowed it down to just the basic optimizations; it does not have to do with anything else. If you compare gfortran -O3 against Intel Fortran /O3, you see a big difference. In one case I ran, the executable was about 38.65% faster with Intel Fortran /O3 than with gfortran -O3.

I have a measly AMD C-60 processor and use Windows 7 64-bit. I have tried many other gfortran-for-Windows builds in the past, but I'm currently using the latest version of TDM-GCC. I have also tried Linux but am not currently using it. I have not tried Intel Fortran on Linux.

I am not much of a programmer, so I can't say why gfortran -O3 is producing slower executables than Intel Fortran /O3. Perhaps you would know. I thought this information might help you out. If I can be of any help to you, let me know.

My website has the latest version of PROP_DESIGN. Polyhedron refuses to update the version I sent them years ago. It would probably be better if you used the latest version for testing your software.

Sincerely,

Anthony Falzone
http://propdesign.weebly.com/
My previous post needs a correction. Comparing gfortran -O3 to Intel Fortran O3, I see a 60% speed improvement in favor of the Intel Fortran compiler. There is a 40% improvement over past releases of PROP_DESIGN, which used gfortran -Ofast. There is not much difference between Intel Fortran O3 and Ofast, so I am using O3 to ensure accurate calculations.
Anthony, could you provide a reduced test showing the problem?
(In reply to Dominique d'Humieres from comment #14)
> Anthony, could you provide a reduced test showing the problem?

Hi Dominique,

About the most reduced I can think of is PROP_DESIGN_ANALYSIS. It contains the core calculations that are required to determine aircraft propeller performance. PROP_DESIGN_ANALYSIS_BENCHMARK just adds some looping options that, in my mind, could run in parallel. However, I don't know anything about parallel programming, and when I tried some Fortran compilers with auto-parallelization, none of them picked up on the loops that to me seem obviously parallel. So what I think and what is currently feasible with auto-parallelization are not the same.

In any event, I have noticed that just using O3 optimizations there is a substantial difference between Intel Fortran and gfortran. So I am just confirming what Tobias is saying here in this bug report.

If PROP_DESIGN_ANALYSIS_BENCHMARK is too complex a test case for you, the only thing I can think to do is take PROP_DESIGN_ANALYSIS and strip out most of the end parts where the various outputs are created. The program would then just take the inputs, run the minimum calculations with the least looping possible, and output pretty much nothing. It wouldn't be hard for me to do something like that if it would be of any benefit to you; I'm not sure.

Also, if there is anything programming-wise that you would like changed as far as Fortran syntax goes, I can try that too. My knowledge of Fortran is fairly basic. I tried to stick strictly to Fortran 77, since that is what I was trained in and have a lot of experience with. I don't know any other programming languages, or even other versions of Fortran such as 90/95.

Anthony
Still present on current trunk:

gfortran -Ofast -funroll-loops -march=native -mtune=native -fopt-info-vec mp_prop_design.f90
mp_prop_design.f90:1117:0: note: loop vectorized
mp_prop_design.f90:1117:0: note: loop vectorized
mp_prop_design.f90:1087:0: note: loop vectorized
mp_prop_design.f90:1060:0: note: loop vectorized
mp_prop_design.f90:1032:0: note: loop vectorized
mp_prop_design.f90:662:0: note: loop vectorized
mp_prop_design.f90:375:0: note: loop vectorized
mp_prop_design.f90:375:0: note: loop vectorized
mp_prop_design.f90:14:0: note: basic block vectorized
hi, does anyone know how to contact someone about deleting my GCC Bugzilla account? i can't find any contact info except overseers@gcc.gnu.org, and mail to that address bounces.
Under Preferences/Email Preferences, you can select "Disable All Mail", which should work and keep you from getting unwanted mail.
hi everyone,

I'm not sure if this is the right place to ask this or not, but it relates to the topic. I can't find the other thread about Graphite auto-parallelization that I made a long time ago.

I tried gfortran 10.1.0 via MSYS2. It seems to work very well on the latest version of PROP_DESIGN. MP_PROP_DESIGN had some extra loops for benchmarking; I found they made things harder for the optimizer, so I deleted that code and just use the 'real' version of the code it was based on, called PROP_DESIGN_MAPS. So that's the actual propeller design code with no additional looping for benchmarking purposes.

I've found no Fortran compiler that does the auto-parallelization the way I would like. The only compiler that implemented any of it at run time actually slowed the code way down instead of speeding it up.

I still have my original problem with gfortran: at runtime no actual parallelization occurs. The code runs exactly the same as if the options were not present. Oddly, though, the compiler does say it auto-parallelized many loops; not the loops that would really help, but at least it shows it's doing something. That's an improvement from when I started these threads.

The problem is that if I compile with the following:

gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -static -march=x86-64 -mtune=generic -mfpmath=sse -mieee-fp -pthread -ftree-parallelize-loops=2 -floop-parallelize-all -fopt-info-loop

it runs exactly the same way as if I compile with:

gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -static -march=x86-64 -mtune=generic -mfpmath=sse -mieee-fp

Again, gfortran does say it auto-parallelized some loops, so it's very odd. I have searched the net and can't find anything that has helped. I'm wondering if, for Linux users, the code actually does run in parallel; that would at least narrow the problem down some. I'm using Windows 10 and the code will only use one core. Compiling both ways, it shows 2 threads used for a while and then drops to 1 thread.

The good news since this was posted is that gfortran ran the code at the same speed as the PGI Community Edition compiler. Since they just stopped developing that, I switched back to gfortran. I no longer have Intel Fortran to test. That was the compiler that actually did run the code in parallel, but it ran twice as slow instead of twice as fast. That was a year or two ago; I don't know if it's any better now.

I'm wondering if there is some sort of issue with -pthread not being able to use more than one core on Windows 10.

You can download PROP_DESIGN at https://propdesign.jimdofree.com. Inside the download are all the *.f files. I also have c.bat files in there with the compiler options I used. The auto-parallelization options are not present, since they still don't seem to be working, at least on Windows 10.

The code now runs much faster than it used to, due to many bug fixes and improvements I've made over the years. However, you can make it run really slowly for testing purposes. In the settings file for the program, change the defaults like this:

1 ALLOW VORTEX INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)
2 ALLOW BLADE-TO-BLADE INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)

or like this:

1 ALLOW VORTEX INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)
1 ALLOW BLADE-TO-BLADE INTERACTIONS (1) FOR YES (2) FOR NO (INTEGER, NON-DIM, DEFAULT = 2)

The first runs very slow, the second incredibly slow.
I just close the command window once I've seen whether the code is running in parallel or not. With the defaults set at 2 for each of those values, the code runs so fast you can't really get a sense of what's going on.

Thanks for any help,

Anthony
I looked at the output of

gfortran -Ofast -funroll-loops -march=native -mtune=native -fopt-info-vec mp_prop_design.f90 2>&1 | wc -l

in the Polyhedron test suite, and I can confirm it really shows much more vectorization than before. So I'm glad that part seems to be fixed.

Regarding auto-parallelization: I can confirm your observation under Linux; it doesn't do any more than on Windows. I think I will open a meta-PR and make this PR depend on it.
Another question: Is there anything left to be done with the vectorizer, or could we remove that dependency?
(In reply to Thomas Koenig from comment #21)
> Another question: Is there anything left to be done with the
> vectorizer, or could we remove that dependency?

thanks for looking into this again for me. i'm surprised it worked the same on Linux, but knowing that at least helps debug this issue some more. i'm not sure about the vectorizer question; maybe that was intended for someone else. the runtimes seem good as is, though.

i doubt the auto-parallelization will add much speed, but it's an interesting feature that i've always hoped would work. i've never got it to work, though. the only compiler that did actually implement something was Intel Fortran; it parallelized one trivial loop, but it slowed the code down instead of speeding it up. the output from gfortran shows more loops it wants to run in parallel. they aren't important ones, but something would be better than nothing. if it slowed the code down, i would just not use it.

there is something different in gfortran where it mentions a lot of 16-bit vectorization. i don't recall that from the past, but whatever it's doing seems fine from a speed perspective.

some compliments to the developers: the code compiles very fast compared to other compilers. i'm really glad it doesn't rely on Microsoft Visual Studio; that's a huge, time-consuming install, and I was very happy I could finally uninstall it. also, gfortran handles all my STOP statements properly. PGI Community Edition was adding a bunch of nonsense output any time a STOP command was issued, so it's nice to have the code work as intended again.
On Sun, 28 Jun 2020, prop_design at protonmail dot com wrote:

> i doubt the auto-parallelization will add much speed, but it's an
> interesting feature that i've always hoped would work. i've never got it
> to work, though. [...]

GCC adds runtime checks for a minimal number of iterations before dispatching to the parallelized code - I guess we simply never hit the threshold. This is configurable via --param parloops-min-per-thread; the default is 100. The default number of threads is determined the same way as for OpenMP, so you can probably tune that via OMP_NUM_THREADS.
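[Editorial note: a concrete way to try those knobs, purely as an illustration of the parameters mentioned above; the specific values are guesses and untested on PROP_DESIGN_MAPS. Lowering parloops-min-per-thread below its default of 100 makes the runtime check accept fewer iterations per thread, and OMP_NUM_THREADS is the environment variable suggested above.]

  gfortran PROP_DESIGN_MAPS.f -o PROP_DESIGN_MAPS.exe -O3 -ffixed-form -pthread -ftree-parallelize-loops=2 --param parloops-min-per-thread=25
  set OMP_NUM_THREADS=2
  PROP_DESIGN_MAPS.exe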
(In reply to rguenther@suse.de from comment #23)
> GCC adds runtime checks for a minimal number of iterations before
> dispatching to the parallelized code - I guess we simply never hit
> the threshold. This is configurable via --param parloops-min-per-thread;
> the default is 100. The default number of threads is determined the same
> way as for OpenMP, so you can probably tune that via OMP_NUM_THREADS.

thanks for that tip. i tried changing the parloops parameters but no luck. the only difference was the max thread use went from 2 to 3; core use was the same.

i added the following, and some variations of these:

--param parloops-min-per-thread=2 (the default was 100, like you said)
--param parloops-chunk-size=1 (the default was zero, so i removed this parameter later)
--param parloops-schedule=auto (tried all options except guided; the default is static)

i was able to check that they were set via:

--help=param -Q

some other things i tried were adding -mthreads and removing -static, but so far no luck. i also tried using -mthreads instead of -pthread.

i should make clear i'm testing PROP_DESIGN_MAPS, not MP_PROP_DESIGN. MP_PROP_DESIGN is ancient, and the added benchmarking loops were messing with the ability of the optimizer to auto-parallelize (in the past, at least).
(In reply to Anthony from comment #24)
> thanks for that tip. i tried changing the parloops parameters but no luck.
> the only difference was the max thread use went from 2 to 3; core use was
> the same. [...]

I did more testing, and the added options actually slow the code way down. However, it still only uses one core. From what i can tell, setting OMP_PLACES doesn't seem to be working. i saw a thread from years ago where someone had the same problem; it said OMP_PLACES might work on Linux but not on Windows. i don't really know, but i've exhausted all the possibilities at this point. the only thing i know for sure is i can't get it to use anything more than one core.
so after trying a bunch of things, i think the final problem may be this. when i try to set thread affinity with

set GOMP_CPU_AFFINITY="0 1"

i get the following feedback at run time:

libgomp: Affinity not supported on this configuration

i have to close the command prompt window to stop the program; it doesn't run properly if i try to set thread affinity. so this still makes me think it might work on Linux but not Windows 10, but i have no way to test that. the extra threads that auto-parallelization creates will only go to one core, on my machine at least.