This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
On 02/07/2014 10:22 AM, Jakub Jelinek wrote:
> On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote:
>> I'm seeing vectorization but no output from -ftree-vectorizer-verbose, and no dot product vectorization inside omp parallel regions, with gcc g++ or gfortran 4.9. Primary targets are cygwin64 and linux x86_64. I've been unable to use -O3 vectorization with gcc, although it works with gfortran and g++, so use gcc -O2 -ftree-vectorize together with additional optimization flags which don't break.
>
> Can you file a GCC bugzilla PR with minimal testcases for this (or point us at already filed bugreports)?

I have not yet managed to reduce the problems with gcc -O3 (called from gfortran) to a minimal test case. When I run under the debugger, it appears that somewhere prior to the crash some gfortran code is overwritten with data by the gcc code, which is beyond my debugging skill. I can get full performance with -O2 plus a bunch of intermediate flags. As to non-vectorization of dot product in an omp parallel region, -fopt-info (which I didn't know about) reports vectorization, but there are no parallel simd instructions in the generated code for the omp_fn. I'll file a PR on that if it still reproduces in a minimal case.
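
For illustration only, a dot product inside an omp parallel region could look like the C sketch below. This is not the code from the report; the function name `dot_in_parallel` and the choice of a combined `for simd` worksharing loop are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical example: dot product computed inside an omp parallel region.
 * The compiler outlines the region body into a separate function (the
 * "omp_fn" mentioned above), which is where vector instructions would be
 * expected to appear. Without -fopenmp the pragmas are ignored and the
 * loop runs serially. */
float dot_in_parallel(const float *a, const float *b, int n)
{
    assert(n >= 0);
    float sum = 0.0f;

    #pragma omp parallel
    {
        /* Share iterations across threads and ask for SIMD within each chunk. */
        #pragma omp for simd reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
    }
    return sum;
}
```

One way to compare the vectorizer report with the generated code would be to compile with something like `gcc -std=c99 -O2 -ftree-vectorize -fopenmp -fopt-info-vec -S` and inspect the assembly of the outlined function.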

>> I've made source code changes to take advantage of the new vectorization with merge() and ? operators; while it's useful for -march=core-avx2, it's sometimes a loss for -msse4.1. gcc vectorization with #pragma omp parallel for simd is reasonably effective in my tests only on 12 or more cores.
>
> Likewise.

Those are cases with 2 levels of loops from the netlib "vector" benchmark, where only one level is vectorizable and parallelizable. With the vectorizable loop on the outside, the parallelization scales to a large number of cores. I don't expect it to outperform optimized single-thread AVX vectorization until 8 or more cores are in use, but it needs more threads than expected even relative to SSE vectorization.
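
As a rough sketch of that loop shape (not taken from the netlib benchmark), the C function below has an inner loop that is a serial recurrence and an outer loop whose iterations are independent, so only the outer level is a candidate for `#pragma omp parallel for simd`. The name `outer_level_simd` and the particular recurrence are made up for the example.

```c
#include <assert.h>

/* Hypothetical two-level loop nest. The inner loop carries a dependence
 * on x from one j to the next, so it cannot be vectorized on its own.
 * The outer iterations are independent, so the outer loop is both
 * parallelizable and vectorizable. */
void outer_level_simd(const float *a, const float *b, const float *c,
                      float *out, int n, int m)
{
    assert(n >= 0 && m >= 0);

    #pragma omp parallel for simd
    for (int i = 0; i < n; i++) {
        float x = a[i];              /* private to each outer iteration */
        for (int j = 0; j < m; j++)
            x = x * b[j] + c[j];     /* serial recurrence in j */
        out[i] = x;
    }
}
```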

>> #pragma omp simd reduction(max: ) is giving correct results but poor performance in my tests.
>
> Likewise.

I'll file a PR on this; I didn't know whether there would be interest. I have an Intel compiler issue marked "closed, will not be fixed", so simd reduction(max: ) isn't viable for icc in the near term.

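
For reference, a minimal C sketch of the kind of loop that `#pragma omp simd reduction(max: )` applies to is shown below. It is an assumed example, not the test case from the report, and the function name `simd_max` is hypothetical.

```c
#include <assert.h>

/* Hypothetical example: maximum of an array using an OpenMP simd max
 * reduction. Each SIMD lane keeps its own running maximum, and the
 * lanes are combined with max after the loop. */
float simd_max(const float *x, int n)
{
    assert(n > 0);                   /* the maximum of an empty array is undefined */
    float m = x[0];

    #pragma omp simd reduction(max:m)
    for (int i = 1; i < n; i++)
        m = (x[i] > m) ? x[i] : m;   /* conditional written with the ?: operator */

    return m;
}
```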
Thanks,