This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[RFC 0/3] Stuff related to pr53533


... but not actually fixing it.

I was hoping that the first patch might give the vectorizer enough
info to solve the costing problem, but no such luck.  Nevertheless,
it would seem that not having the info present at all would be a
bit of a hindrance when actually tweaking the vectorizer later...

FYI, the "equivalence" of fabs/fmul in the integer SIMD costing is
based on the AMD K8 document I had handy.  I don't know that I've
ever seen proper latency figures and such for the Intel CPUs.

The second patch implements something that I mentioned in the PR,
that we ought to be decomposing expensive vector multiplies just
like we do for scalar multiplies.  I'm really quite surprised that
we'd not noticed that v*4 wasn't being implemented via shift...

The third patch implements something that Richi mentioned in the
PR, that we ought not be trying to shuffle the elements of a
const_vector around; do that beforehand.

---

Some additional notes on the testcase in the PR:

The computation is 3 iterations of a hash function:
	(x + 12345) * 914237 - 13.

The -13 folds into the subsequent +12345 well enough, and so the
simple expansion of this results in 7 operations.  And that's 
exactly what we get when unrolling and vectorizing.

However, for the non-vectorized version, combine smooshes together
~2.5 iterations of the hash function, using modulo arithmetic:
	h1 = (x + 12345) * 914237 - 13
	h2 = (h1 + 12345) * 914237 - 13
	h3 = (h2 + 12345) * 914237 - 13
	   = x*764146064584710053 + 10318335160567660
	   = x*0x101597a5 + 0x9deb476c (mod 2**32)

which is of course only 2 operations (combine actually misses one
and leaves an extra plus, for 3 operations).  This means that, even
leaving aside everything above, the vectorized code ends up working
much harder than the scalar code.

Manually adjust complete_hash_func with the above substitution, and
suddenly even the pre-sse4 vectorized version is faster than the
unvectorized version (with 100000 iterations and these patches):

scalar:	4.69 sec
sse2:	2.46 sec
sse4:	1.39 sec

So, it's not *really* the costing inside the vectorizer at all,
which raises the question of why we're not taking modulo arithmetic
into account earlier in the optimization pipeline.

Bootstrapped and tested on x86_64, but I'll leave some time for
comment before committing any of this.


r~


Richard Henderson (3):
  Add rtx costs for sse integer ops
  Use synth_mult for vector multiplies vs scalar constant
  Handle const_vector in mulv4si3 for pre-sse4.1.

 gcc/config/i386/i386-protos.h |    1 +
 gcc/config/i386/i386.c        |  126 +++++++++++-
 gcc/config/i386/predicates.md |    7 +
 gcc/config/i386/sse.md        |   72 ++------
 gcc/expmed.c                  |  438 +++++++++++++++++++++++------------------
 gcc/machmode.h                |    8 +-
 6 files changed, 386 insertions(+), 266 deletions(-)

-- 
1.7.7.6

