[Bug c++/62080] New: Suboptimal code generation with eigen library

Sun Aug 10 10:20:00 GMT 2014

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62080

            Bug ID: 62080
           Summary: Suboptimal code generation with eigen library
           Product: gcc
           Version: 4.8.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: beschindler at gmail dot com

Created attachment 33281
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33281&action=edit
Source code used to get the provided assembly

I'm currently optimizing some code that uses the Eigen library and have run
into an interesting problem.
I have a function that I wrote in two different ways (the attributes are
there as optimization barriers; dimEigen is a member variable of the
containing class):

__attribute__((noinline, noclone))
void eigenClamp(Eigen::Vector4i& vec)
{
    vec = vec.array().min(dimEigen.array()).max(Eigen::Array4i::Zero());
}

__attribute__((noinline, noclone))
void eigenClamp2(Eigen::Vector4i& vec)
{
    vec = vec.array().min(dimEigen.array());
    vec = vec.array().max(Eigen::Array4i::Zero());
}

I'm compiling this on a Core i7 920 with the flags -O2 -fno-exceptions
-fno-rtti -std=c++11 -march=native

The first function generates this assembly, which looks great: 

movdqu    (%rsi), %xmm1
movdqu    (%rdi), %xmm0
pminsd    %xmm1, %xmm0
pxor    %xmm1, %xmm1
pmaxsd    %xmm1, %xmm0
movdqa    %xmm0, (%rsi)

The second version does this: 

movdqa    (%rsi), %xmm0
pminsd    (%rdi), %xmm0
movdqa    %xmm0, (%rsi) <-- 
pxor    %xmm0, %xmm0
movdqu    (%rsi), %xmm1 <-- 
pmaxsd    %xmm1, %xmm0
movdqa    %xmm0, (%rsi)

It seems that, because there are two statements in the original source code,
the result of the first expression is written to memory and then, two
instructions later, read back from memory. This makes the function almost 50%
slower in my measurements. As I find the latter code much easier to read than
the former, it would be great if the same assembly were generated for both.
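
To make the equivalence of the two forms concrete, here is a minimal sketch
that does not use Eigen at all. It assumes a plain std::array<int, 4> as a
stand-in for Eigen::Vector4i; the names Vec4i, clampFused, clampTwoStep and
checkEquivalent are hypothetical and only serve the illustration. The
two-step variant keeps the intermediate result in a local variable, so
nothing in the source requires it to pass through memory. Whether GCC 4.8.3
then emits the shorter sequence for the Eigen version is not something this
sketch demonstrates.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for Eigen::Vector4i: four ints.
using Vec4i = std::array<int, 4>;

// One-expression form: min against the bounds, then max against zero.
inline Vec4i clampFused(const Vec4i& v, const Vec4i& dim) {
    Vec4i out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = std::max(std::min(v[i], dim[i]), 0);
    return out;
}

// Two-statement form: the intermediate lives in a local, and the
// caller's vector is written exactly once at the end.
inline void clampTwoStep(Vec4i& vec, const Vec4i& dim) {
    Vec4i tmp = vec;
    for (std::size_t i = 0; i < tmp.size(); ++i)
        tmp[i] = std::min(tmp[i], dim[i]);   // first statement: upper bound
    for (std::size_t i = 0; i < tmp.size(); ++i)
        tmp[i] = std::max(tmp[i], 0);        // second statement: lower bound
    vec = tmp;
}

// Small self-check: both forms agree on a sample input.
inline void checkEquivalent() {
    const Vec4i dim{10, 20, 30, 40};
    Vec4i a{-5, 25, 7, 100};
    const Vec4i fused = clampFused(a, dim);
    clampTwoStep(a, dim);
    assert(fused == a);
}
```

Since both functions compute the same values element by element, an
optimizer is in principle free to produce the same instruction sequence for
either spelling.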

Also, I note that in the second version pminsd takes its second operand
directly from memory, while in the first version that operand is loaded into
a register first and pminsd then works on registers only. Thus, I'd love to
see this code:

movdqu    (%rsi), %xmm0
pminsd    (%rdi), %xmm0
pxor    %xmm1, %xmm1
pmaxsd    %xmm1, %xmm0
movdqa    %xmm0, (%rsi)

For reference, I'm attaching the complete source code and the generated
assembly.