This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
- From: "sergey.shalnov at intel dot com" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 08 Dec 2017 12:10:29 +0000
- Subject: [Bug target/83008] [performance] Is it better to avoid extra instructions in data passing between loops?
- Auto-submitted: auto-generated
- References: <bug-83008-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83008
--- Comment #12 from sergey.shalnov at intel dot com ---
Richard,
Your last proposal changed the generated code a bit.
Currently it shows:
test_bugzilla1.c:6:5: note: Cost model analysis:.
Vector inside of loop cost: 62576
Vector prologue cost: 0
Vector epilogue cost: 0
Scalar iteration cost: 328
Scalar outside cost: 0
Vector outside cost: 0
prologue iterations: 0
epilogue iterations: 0
test_bugzilla1.c:6:5: note: cost model: the vector iteration cost = 62576
divided by the scalar iteration cost = 328 is greater or equal to the
vectorization factor = 4.
test_bugzilla1.c:6:5: note: not vectorized: vectorization not profitable.
test_bugzilla1.c:6:5: note: not vectorized: vector version will never be
profitable.
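To make the arithmetic in that message concrete: 62576 / 328 is roughly 190, far above the vectorization factor of 4, so the vector version is rejected. Below is a minimal sketch of that comparison. It only mirrors the wording of the dump and is not GCC's actual implementation; the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper mirroring the dump message:
   "the vector iteration cost divided by the scalar iteration cost
    is greater or equal to the vectorization factor".
   This is an illustration only, not code taken from GCC. */
static bool vector_never_profitable(unsigned vec_inside_cost,
                                    unsigned scalar_iter_cost,
                                    unsigned vf)
{
    /* Sanity checks on the assumed inputs. */
    assert(scalar_iter_cost > 0);
    assert(vf > 0);

    /* Cross-multiplied form of vec_inside_cost / scalar_iter_cost >= vf,
       which avoids rounding from integer division. */
    return (unsigned long long)vec_inside_cost
           >= (unsigned long long)scalar_iter_cost * vf;
}
```

Calling vector_never_profitable(62576, 328, 4) with the numbers from the dump returns true, which matches the "vector version will never be profitable" note.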
And it uses xmm registers plus vpbroadcastd to spill tmp[] to the stack:
...
1e7: 62 d2 7d 08 7c c9 vpbroadcastd %r9d,%xmm1
1ed: c4 c1 79 7e c9 vmovd %xmm1,%r9d
1f2: 62 f1 fd 08 7f 8c 24 vmovdqa64 %xmm1,-0x38(%rsp)
1f9: c8 ff ff ff
1fd: 62 f2 7d 08 7c d7 vpbroadcastd %edi,%xmm2
203: c5 f9 7e d7 vmovd %xmm2,%edi
207: 62 f1 fd 08 7f 94 24 vmovdqa64 %xmm2,-0x28(%rsp)
20e: d8 ff ff ff
212: 62 f2 7d 08 7c db vpbroadcastd %ebx,%xmm3
218: c5 f9 7e de vmovd %xmm3,%esi
21c: 62 f1 fd 08 7f 9c 24 vmovdqa64 %xmm3,-0x18(%rsp)
223: e8 ff ff ff
227: 01 fe add %edi,%esi
229: 45 01 c8 add %r9d,%r8d
22c: 41 01 f0 add %esi,%r8d
22f: 8b 5c 24 dc mov -0x24(%rsp),%ebx
233: 03 5c 24 ec add -0x14(%rsp),%ebx
237: 8b 6c 24 bc mov -0x44(%rsp),%ebp
23b: 03 6c 24 cc add -0x34(%rsp),%ebp
...
I think this is better from a performance perspective but, as I said
before, not using vector registers here is the best option if no loops
are vectorized.
With a static loop increment (the first test case), the first loop
is vectorized as before.
Sergey