This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: -O3 and -ftree-vectorize



On 02/07/2014 10:22 AM, Jakub Jelinek wrote:
On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote:
I'm seeing vectorization but no output from
-ftree-vectorizer-verbose, and no dot product vectorization inside
omp parallel regions, with gcc, g++ or gfortran 4.9.  Primary targets
are cygwin64 and linux x86_64.
I've been unable to use -O3 vectorization with gcc, although it
works with gfortran and g++, so I use gcc -O2 -ftree-vectorize
together with additional optimization flags which don't break.
Can you file a GCC bugzilla PR with minimal testcases for this (or point us
at already filed bug reports)?
I have not been able to find a minimal test case for the problems with gcc -O3 (called from gfortran). When I run under a debugger, it appears that somewhere prior to the crash some gfortran code is overwritten with data by the gcc code, which is beyond my debugging skill. I can get full performance with -O2 plus a bunch of intermediate flags. As for the non-vectorization of the dot product in an omp parallel region, -fopt-info (which I didn't know about) reports vectorization, but there are no parallel SIMD instructions in the generated code for the omp_fn. I'll file a PR on that if it still reproduces in a minimal case.
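
For reference, the kind of loop in question can be sketched as below. This is only an illustration, not the code from the failing application: the function name dot_parallel and its signature are hypothetical, and the pragma form is one plausible way to write a dot product reduction in an omp parallel region.

```c
#include <stddef.h>

/* Hypothetical helper: dot product whose reduction loop sits inside an
   OpenMP parallel region. When built without -fopenmp the pragma is
   ignored and the loop runs serially with the same result. */
float dot_parallel(const float *a, const float *b, int n)
{
    float sum = 0.0f;

    /* The compiler outlines this region into a separate function
       (named with an "_omp_fn" suffix by gcc); that outlined function
       is where packed multiply/add instructions would be expected. */
    #pragma omp parallel for simd reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

Compiling with something like gcc -O2 -ftree-vectorize -fopenmp -fopt-info-vec -S and then reading the assembly of the outlined function is one way to compare what the report says with what was actually generated.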


I've made source code changes to take advantage of the new
vectorization with merge() and ? operators; while it's useful for
-march=core-avx2, it's sometimes a loss for -msse4.1.
gcc vectorization with #pragma omp parallel for simd is reasonably
effective in my tests only on 12 or more cores.
Likewise.
Those are cases with two levels of loops from the netlib "vector" benchmark, where only one level is vectorizable and parallelizable. With the vectorizable loop on the outside, the parallelization scales to a large number of cores. I don't expect it to outperform single-threaded optimized AVX vectorization until 8 or more cores are in use, but it needs more threads than expected even relative to SSE vectorization.
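
A minimal sketch of that loop arrangement follows. It is not taken from the netlib benchmark: the kernel row_recurrence, its arguments, and the column-major layout are assumptions chosen so that the inner loop carries a dependence while the outer loop is independent and contiguous in memory.

```c
/* Hypothetical kernel: each index i carries a recurrence along j, so
   only the i loop is independent. Storage is column-major
   (element (i, j) lives at a[j * rows + i]) so that consecutive i
   values are adjacent in memory. a and b must not overlap. */
void row_recurrence(int rows, int cols, float *a, const float *b)
{
    /* Vectorizable and parallelizable loop placed on the outside. */
    #pragma omp parallel for simd
    for (int i = 0; i < rows; ++i) {
        /* Sequential recurrence: step j depends on step j - 1. */
        for (int j = 1; j < cols; ++j)
            a[j * rows + i] = a[(j - 1) * rows + i] + b[j * rows + i];
    }
}
```

With this form each thread takes a block of i values and SIMD lanes cover neighbouring i within that block, which is why the thread count at which it pays off depends on how it compares with the single-thread vectorized version.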

#pragma omp simd reduction(max: ) is giving correct results but poor
performance in my tests.
Likewise.
I'll file a PR on this; I didn't know whether there would be interest. I have an Intel compiler issue that was "closed, will not be fixed", so simd reduction(max: ) isn't viable for icc in the near term.
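
The construct being discussed looks roughly like the sketch below. The function name max_simd is hypothetical, the code assumes n >= 1, and it is not the test case that will go into the PR.

```c
/* Hypothetical helper: maximum of x[0..n-1] using an OpenMP simd
   reduction. Assumes n >= 1. Without -fopenmp the pragma is ignored
   and the loop still returns the same value. */
float max_simd(const float *x, int n)
{
    float m = x[0];

    #pragma omp simd reduction(max:m)
    for (int i = 1; i < n; ++i)
        m = (x[i] > m) ? x[i] : m;   /* conditional form of max */

    return m;
}
```

Timing this against the same loop without the pragma, at the same optimization flags, is a simple way to show the performance gap while confirming that both return the same result.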
Thanks,

