This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/80844] OpenMP SIMD doesn't know how to efficiently zero a vector (its stores zeros and reloads)
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Tue, 23 May 2017 07:08:50 +0000
- Auto-submitted: auto-generated
- References: <bug-80844-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2017-05-23
                 CC|                              |jakub at gcc dot gnu.org
          Component|target                        |tree-optimization
           Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org
     Ever confirmed|0                             |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Uh. .optimized:
float sumfloat_omp(const float*) (const float * arr)
{
  unsigned long ivtmp.22;
  vector(8) float D__lsm0.19;
  const vector(8) float vect__23.18;
  const vector(8) float vect__4.16;
  float stmp_sum_19.12;
  vector(8) float vect__18.10;
  float D.2841[8];
  vector(8) float _10;
  void * _77;
  unsigned long _97;

  <bb 2> [1.00%]:
  arr_13 = arr_12(D);
  __builtin_memset (&D.2841, 0, 32);
  _10 = MEM[(float *)&D.2841];
  ivtmp.22_78 = (unsigned long) arr_13;
  _97 = ivtmp.22_78 + 4096;
...
  <bb 4> [1.00%]:
  MEM[(float *)&D.2841] = vect__23.18_58;
  vect__18.10_79 = MEM[(float *)&D.2841];
  stmp_sum_19.12_50 = [reduc_plus_expr] vect__18.10_79;
  return stmp_sum_19.12_50;
well, that explains it ;) Coming from
  <bb 7> [99.00%]:
  # i_33 = PHI <i_25(8), 0(6)>
  # ivtmp_35 = PHI <ivtmp_28(8), 1024(6)>
  _21 = GOMP_SIMD_LANE (simduid.0_14(D));
  _1 = (long unsigned int) i_33;
  _2 = _1 * 4;
  _3 = arr_13 + _2;
  _4 = *_3;
  _22 = D.2841[_21];
  _23 = _4 + _22;
  D.2841[_21] = _23;
  i_25 = i_33 + 1;
  ivtmp_28 = ivtmp_35 - 1;
  if (ivtmp_28 != 0)
    goto <bb 8>; [98.99%]
so we perform the reduction in memory. LIM then performs store motion on it,
but the memset isn't inlined early enough to rewrite the decl into SSA
(CCP from GOMP_SIMD_VF is missing). In DOM we have
__builtin_memset (&D.2841, 0, 32);
_10 = MEM[(float *)&D.2841];
so we do not fold that.
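
For readers without the testcase at hand, here is a rough C model of what the
dumps above describe. It is a sketch under assumptions, not the reporter's
code: the body of sumfloat_omp is guessed from the dump (1024 floats, a "+"
reduction), and sumfloat_lanes, N, VF and the "i % VF" lane choice are
hypothetical names and simplifications that stand in for D.2841 and
GOMP_SIMD_LANE.

```c
#include <string.h>

#define N  1024  /* trip count seen in the dump (ivtmp starts at 1024) */
#define VF 8     /* the dump uses vector(8) float */

/* Assumed shape of the reporter's function.  The pragma only takes effect
   when the compiler is run with -fopenmp or -fopenmp-simd. */
float sumfloat_omp(const float *arr)
{
    float sum = 0.0f;
#pragma omp simd reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += arr[i];
    return sum;
}

/* Hand-written model of the lowered loop: one accumulator per SIMD lane,
   kept in an array in memory like D.2841. */
float sumfloat_lanes(const float *arr)
{
    float lanes[VF];
    memset(lanes, 0, sizeof lanes);   /* __builtin_memset (&D.2841, 0, 32) */

    for (int i = 0; i < N; i++) {
        int lane = i % VF;            /* simplified stand-in for GOMP_SIMD_LANE */
        lanes[lane] += arr[i];        /* D.2841[_21] = _4 + D.2841[_21] */
    }

    float sum = 0.0f;
    for (int l = 0; l < VF; l++)      /* corresponds to the final reduc_plus_expr */
        sum += lanes[l];
    return sum;
}
```

The point of the model is that the per-lane array is zeroed through a library
call and then read back, which is the memset plus MEM[] load pair that is not
folded.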
If OMP SIMD always zeros the vector then it could also emit the form that is
maybe easier to optimize,
WITH_SIZE_EXPR<_3, D.2841> = {};
Of course gimple_fold_builtin_memset should simply be improved to fold a
memset whose size is now constant to = {}.
I'll have a look.
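
As a source-level illustration of the two forms discussed above (a sketch
only; struct lanes8 and both function names are made up for this example, and
it assumes the usual IEEE 754 representation where all-zero bytes mean 0.0f),
zeroing a constant-size array with memset and with an empty-style aggregate
assignment leaves the same bytes behind:

```c
#include <string.h>

#define VF 8

/* Zero the per-lane accumulator with a library call, as in the dump. */
void zero_with_memset(float lanes[VF])
{
    memset(lanes, 0, VF * sizeof(float));
}

/* Wrapper struct so the whole array can be assigned in one statement. */
struct lanes8 { float v[VF]; };

/* Zero it with an aggregate assignment, the C-level analogue of "= {}". */
void zero_with_init(struct lanes8 *p)
{
    *p = (struct lanes8){ { 0.0f } };
}
```

Because the size is a compile-time constant in both cases, the memset form
carries no information the assignment form lacks, which is why folding one
into the other is attractive.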