[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

Fri Feb 1 10:45:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494

--- Comment #6 from Peter Cordes <peter at cordes dot ca> ---
Oops, these were SD not SS.  Getting sleepy >.<.  Still, my optimization
suggestion for doing both compares in one masked SUB of +-PBCx applies equally.

And I think my testing with VBLENDVPS should apply equally to VBLENDVPD.

Since this is `double`, if we're going branchless we should definitely be
vectorizing for a pair of doubles, like doing 

xij = X0(1,i) - X0(1,j)   and 
yij = X0(2,i) - X0(2,j)

together with a vmovupd, and a vector of PBCx, PBCy.

Even if we later need both x and y separately (if those FMAs in the asm are
multiplying components of one vector), we might still come out ahead from doing
the expensive input processing with PD, then it's only one `vunpckhpd` to get
the Y element ready, and that can run in parallel with any x * z stuff

Or if we can unroll by 3 SIMD vectors over contiguous memory, we can get
{X0,Y0} {Z0,X1} {Y1,Z1}.  We get twice the work for a cost of only 3 extra
unpacks, doing 2 i and j values at once.

----

If this was 3 floats, using a SIMD load would be tricky (maybe vmaskmovps if we
need to avoid going off the end), unless we again unroll by 3 = LCM(vec_len,
width)