[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression
peter at cordes dot ca
gcc-bugzilla@gcc.gnu.org
Fri Feb 1 10:45:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494
--- Comment #6 from Peter Cordes <peter at cordes dot ca> ---
Oops, these were SD not SS. Getting sleepy >.<. Still, my optimization
suggestion for doing both compares in one masked SUB of +-PBCx applies equally.
And I think my testing with VBLENDVPS should apply equally to VBLENDVPD.
Since this is `double`, if we're going branchless we should definitely be
vectorizing for a pair of doubles, like doing
xij = X0(1,i) - X0(1,j) and
yij = X0(2,i) - X0(2,j)
together with a vmovupd, and a vector of PBCx, PBCy.
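
A rough sketch of that idea with SSE2 intrinsics, not the actual mdbx code.
It assumes the scalar source wraps each difference by subtracting PBC when it
exceeds half the box and adding PBC when it is below minus half the box; the
half-box threshold, the helper name `wrap_xy`, and the argument layout are all
my assumptions. It uses an AND mask rather than VBLENDVPD to apply the
subtraction.

```c
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */

/* Hypothetical helper: wrap {xij, yij} for one (i, j) pair.
 * pi and pj each point at the contiguous x, y of one particle,
 * pbc points at {PBCx, PBCy}; both box lengths are assumed positive. */
static inline __m128d wrap_xy(const double *pi, const double *pj,
                              const double *pbc)
{
    const __m128d sign = _mm_set1_pd(-0.0);             /* sign bit only */
    __m128d box  = _mm_loadu_pd(pbc);                    /* {PBCx, PBCy} */
    __m128d half = _mm_mul_pd(box, _mm_set1_pd(0.5));    /* assumed threshold */
    __m128d d    = _mm_sub_pd(_mm_loadu_pd(pi),
                              _mm_loadu_pd(pj));         /* {xij, yij} */

    __m128d absd = _mm_andnot_pd(sign, d);               /* |d| */
    __m128d far  = _mm_cmpgt_pd(absd, half);             /* all-ones if |d| > half */
    __m128d sbox = _mm_or_pd(box, _mm_and_pd(sign, d));  /* +-PBC, sign of d */
    return _mm_sub_pd(d, _mm_and_pd(far, sbox));         /* one masked SUB */
}
```

One compare on |d| replaces the two scalar compares, and lanes that are
already in range subtract +0.0 and are left unchanged.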
Even if we later need both x and y separately (if those FMAs in the asm are
multiplying components of one vector), we might still come out ahead by doing
the expensive input processing with PD: then it takes only one `vunpckhpd` to
get the Y element ready, and that can run in parallel with any x * z stuff.
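
For illustration, getting the two scalars back out of a {x, y} vector might
look like the following. The helper names are hypothetical; with AVX enabled
the high-lane extraction should compile to a single `vunpckhpd`, while the low
lane is already usable as a scalar.

```c
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */

/* Hypothetical helper: the low lane needs no shuffle at all. */
static inline double low_lane(__m128d v)
{
    return _mm_cvtsd_f64(v);
}

/* Hypothetical helper: broadcast the high lane down, then read it. */
static inline double high_lane(__m128d v)
{
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));  /* {v[1], v[1]} */
}
```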
Or if we can unroll by 3 SIMD vectors over contiguous memory, we can get
{X0,Y0} {Z0,X1} {Y1,Z1}. We get twice the work for a cost of only 3 extra
unpacks, doing 2 i and j values at once.
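
One possible way to regroup those three vectors by component is sketched
below. It assumes x, y, z of two consecutive particles are contiguous in
memory, and it uses three `_mm_shuffle_pd` lane selections as a stand-in for
the three unpacks; the helper name and signature are hypothetical.

```c
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */

/* Hypothetical helper: p points at x0 y0 z0 x1 y1 z1. */
static inline void load_xyz_pair(const double *p,
                                 __m128d *x, __m128d *y, __m128d *z)
{
    __m128d a = _mm_loadu_pd(p);      /* {x0, y0} */
    __m128d b = _mm_loadu_pd(p + 2);  /* {z0, x1} */
    __m128d c = _mm_loadu_pd(p + 4);  /* {y1, z1} */

    *x = _mm_shuffle_pd(a, b, 2);     /* {a[0], b[1]} = {x0, x1} */
    *y = _mm_shuffle_pd(a, c, 1);     /* {a[1], c[0]} = {y0, y1} */
    *z = _mm_shuffle_pd(b, c, 2);     /* {b[0], c[1]} = {z0, z1} */
}
```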
----
If this were 3 floats, using a SIMD load would be tricky (maybe vmaskmovps if we
need to avoid going off the end), unless we again unroll by 3 vectors, i.e.
LCM(vec_len, width) elements.
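
To make the unroll arithmetic concrete, here is a small sketch with
hypothetical helper names: the unroll covers LCM(vec_len, width) elements,
which is LCM / vec_len whole vectors. With width = 3 that comes to 3 vectors
both for 2 doubles and for 4 floats per 128-bit register.

```c
/* Hypothetical helper: greatest common divisor by Euclid's algorithm. */
static unsigned gcd_u(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: number of whole vectors after which the
 * component pattern repeats. vec_len is elements per vector, width is
 * components per point; both must be nonzero. */
static unsigned vectors_per_unroll(unsigned vec_len, unsigned width)
{
    unsigned lcm = vec_len / gcd_u(vec_len, width) * width;
    return lcm / vec_len;
}
```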