Initializing a vector to zero leads to less efficient assemblies than manually assigning a vector to zero?

Mon Mar 9 05:54:12 GMT 2020

On Sat, Mar 7, 2020 at 5:20 AM Hong X <hongx@ibm.com> wrote:
>
> Hi all,
>
> I tried to compile the following two code snippets with "--std=c++14 -mavx2 -O3" options:
>
>     double tmp_values[4] = {0};
>
> and
>
>     double tmp_values[4];
>
>     for (auto i = 0; i < 4; ++i) {
>         tmp_values[i] = 0.0;
>     }
>
> The first code snippet leads to
>
>     vmovaps XMMWORD PTR [rsp], xmm0
>     vmovaps XMMWORD PTR [rsp+16], xmm0
>
> But the second leads to only
>
>     vmovapd YMMWORD PTR [rsp], ymm0
>
> which is less efficient than the previous one. Am I missing something?
>
Assume you're working on Skylake. The latency and throughput of
vmovaps/vmovapd are:

                    | lat     | throughput  | uops | port  |
VMOVAPS (XMM, M128) | [≤4;≤7] | 0.50 / 0.50 | 1    | 1*p23 |
VMOVAPS (YMM, M256) | [≤5;≤8] | 0.50 / 0.50 | 1    | 1*p23 |

Refer to https://uops.info/table.html
So the latter (a single YMM move instead of two XMM moves) seems better.
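Note that the two forms are semantically equivalent: both leave all four
elements at 0.0 before the memcpy, so only the emitted stores differ. If you
want to compare them side by side, here is a minimal sketch based on your code.
The names sum_brace_init and sum_loop_init and the clamp on count are my own
additions for illustration, not part of your original functions.

```cpp
#include <cstddef>
#include <cstring>

// Clamp the element count to the buffer size (hypothetical helper; the
// original code trusts the caller to pass 0 <= count <= 4).
static std::size_t clamp_count(int count) {
    if (count < 0) return 0;
    if (count > 4) return 4;
    return static_cast<std::size_t>(count);
}

// Variant 1: zero the buffer with aggregate initialization.
double sum_brace_init(const void* ptr, int count) {
    double tmp_values[4] = {0};  // remaining elements are value-initialized to 0.0

    std::memcpy(tmp_values, ptr, clamp_count(count) * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}

// Variant 2: zero the buffer with an explicit loop.
double sum_loop_init(const void* ptr, int count) {
    double tmp_values[4];

    for (int i = 0; i < 4; ++i) {
        tmp_values[i] = 0.0;
    }

    std::memcpy(tmp_values, ptr, clamp_count(count) * sizeof(double));
    return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
}
```

Both functions should return the same value for the same input, so any
difference you measure comes from code generation only. Compiling this with
your options ("--std=c++14 -mavx2 -O3") and -S lets you check which stores
your GCC version emits for each variant; the exact output may differ between
versions.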
> For the full code, see this godbolt link: https://godbolt.org/z/jonf72 , and I paste the full input and output below:
>
> Input code
>
> #include <cstring>
>
> double loadu1(const void* ptr, int count) {
>
>     double tmp_values[4] = {0};
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> double loadu2(const void* ptr, int count) {
>
>     double tmp_values[4];
>
>     for (auto i = 0; i < 4; ++i) {
>         tmp_values[i] = 0.0;
>     }
>
>     std::memcpy(
>         tmp_values,
>         ptr,
>         count * sizeof(double));
>     return tmp_values[0] + tmp_values[1] + tmp_values[2] + tmp_values[3];
> }
>
>
> Output assemblies:
>
> loadu1(void const*, int):
>         sub     rsp, 40
>         movsx   rdx, esi
>         vpxor   xmm0, xmm0, xmm0
>         mov     rsi, rdi
>         sal     rdx, 3
>         mov     rdi, rsp
>         vmovaps XMMWORD PTR [rsp], xmm0
>         vmovaps XMMWORD PTR [rsp+16], xmm0
>         call    memcpy
>         vmovsd  xmm0, QWORD PTR [rsp]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+24]
>         add     rsp, 40
>         ret
> loadu2(void const*, int):
>         push    rbp
>         movsx   rdx, esi
>         vxorpd  xmm0, xmm0, xmm0
>         mov     rsi, rdi
>         sal     rdx, 3
>         mov     rbp, rsp
>         and     rsp, -32
>         sub     rsp, 32
>         mov     rdi, rsp
>         vmovapd YMMWORD PTR [rsp], ymm0
>         vzeroupper
>         call    memcpy
>         vmovsd  xmm0, QWORD PTR [rsp]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+8]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+16]
>         vaddsd  xmm0, xmm0, QWORD PTR [rsp+24]
>         leave
>         ret
>
> Thanks!
> Hong
>


-- 
BR,
Hongtao