Initializing a vector to zero leads to less efficient assemblies than manually assigning a vector to zero?
Hongtao Liu
crazylht@gmail.com
Mon Mar 9 05:54:12 GMT 2020
On Sat, Mar 7, 2020 at 5:20 AM Hong X <hongx@ibm.com> wrote:
>
> Hi all,
>
> I tried to compile the following two code snippets with "--std=c++14 -mavx2 -O3" options:
>
> double tmp_values[4] = {0};
>
> and
>
> double tmp_values[4];
>
> for (auto i = 0; i < 4; ++i) {
>     tmp_values[i] = 0.0;
> }
>
> The first code snippet leads to
>
> vmovaps XMMWORD PTR [rsp], xmm0
> vmovaps XMMWORD PTR [rsp+16], xmm0
>
> But the second leads to only
>
> vmovapd YMMWORD PTR [rsp], ymm0
>
> which looks more efficient than the first one. Am I missing something?
>
Assuming you're working on Skylake, the latency and throughput of
vmovaps/vmovapd are:

Instruction         | lat     | throughput  | uops | port  |
VMOVAPS (XMM, M128) | [≤4;≤7] | 0.50 / 0.50 | 1    | 1*p23 |
VMOVAPS (YMM, M256) | [≤5;≤8] | 0.50 / 0.50 | 1    | 1*p23 |

Refer to https://uops.info/table.html
Per instruction the throughput and uop count are the same, so a single
256-bit move replaces two 128-bit moves. So the latter seems better.
> For the full code, see this godbolt link: https://godbolt.org/z/jonf72 , and I paste the full input and output below:
>
> Input code
>
> #include <cstring>
>
> double loadu1(const void* ptr, int count) {
>
>     double tmp_values[4] = {0};
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> double loadu2(const void* ptr, int count) {
>
>     double tmp_values[4];
>
>     for (auto i = 0; i < 4; ++i) {
>         tmp_values[i] = 0.0;
>     }
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> Output assemblies:
>
> loadu1(void const*, int):
>         sub     rsp, 40
>         movsx   rdx, esi
>         vpxor   xmm0, xmm0, xmm0
>         mov     rsi, rdi
>         sal     rdx, 3
>         mov     rdi, rsp
>         vmovaps XMMWORD PTR [rsp], xmm0
>         vmovaps XMMWORD PTR [rsp+16], xmm0
>         call    memcpy
>         vmovsd  xmm0, QWORD PTR [rsp]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+24]
>         add     rsp, 40
>         ret
> loadu2(void const*, int):
>         push    rbp
>         movsx   rdx, esi
>         vxorpd  xmm0, xmm0, xmm0
>         mov     rsi, rdi
>         sal     rdx, 3
>         mov     rbp, rsp
>         and     rsp, -32
>         sub     rsp, 32
>         mov     rdi, rsp
>         vmovapd YMMWORD PTR [rsp], ymm0
>         vzeroupper
>         call    memcpy
>         vmovsd  xmm0, QWORD PTR [rsp]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+24]
>         leave
>         ret
>
> Thanks!
> Hong
>
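Whichever store width the compiler picks, the two quoted functions should
return the same value; only the instructions that zero the buffer differ.
A minimal sketch for checking that, assuming count stays within 0..4 (the
buffer holds four doubles). The helper name variants_agree is hypothetical
and not part of the original post:

```cpp
#include <cassert>
#include <cstring>

// Variant 1: aggregate zero-initialization, as in the quoted loadu1.
double loadu1(const void* ptr, int count) {
    double tmp_values[4] = {0};
    std::memcpy(tmp_values, ptr, count * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}

// Variant 2: explicit zeroing loop, as in the quoted loadu2.
double loadu2(const void* ptr, int count) {
    double tmp_values[4];
    for (auto i = 0; i < 4; ++i) {
        tmp_values[i] = 0.0;
    }
    std::memcpy(tmp_values, ptr, count * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}

// Hypothetical helper: true when both variants return the same sum.
// Assumption: src points to at least `count` doubles and 0 <= count <= 4,
// otherwise the memcpy would overrun the four-element buffer.
bool variants_agree(const double* src, int count) {
    assert(count >= 0 && count <= 4);
    return loadu1(src, count) == loadu2(src, count);
}
```

Calling variants_agree(src, count) for each count from 0 to 4 on a small
test array covers every case the four-element buffer allows.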
--
BR,
Hongtao