This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH][v3] GIMPLE store merging pass


On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
> The v3 of this patch addresses feedback I received on the version posted at [1].
> The merged store buffer is now represented as a char array that we splat values onto with
> native_encode_expr and native_interpret_expr. This allows us to merge anything that native_encode_expr
> accepts, including floating point values and short vectors. So this version extends the functionality
> of the previous one in that it handles floating point values as well.
> 
> The first phase of the algorithm that detects the contiguous stores is also slightly refactored according
> to feedback to read more fluently.
> 
> Richi, I experimented with merging up to MOVE_MAX bytes rather than word size but I got worse results on aarch64.
> MOVE_MAX there is 16 (because it has load/store register pair instructions) but the 128-bit immediates that we ended up
> synthesising were too complex. Perhaps the TImode immediate store RTL expansions could be improved, but for now
> I've left the maximum merge size to be BITS_PER_WORD.
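
For readers following the quoted description, here is a minimal sketch of the
"char array that values are splatted onto" idea. It is not the patch's code:
memcpy stands in for native_encode_expr/native_interpret_expr, and
MERGE_BUF_BYTES, splat_bytes and read_merged_word are hypothetical names
invented for this illustration. The buffer is assumed to be one 64-bit word
wide, matching the BITS_PER_WORD limit mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer width: one 64-bit word.  */
#define MERGE_BUF_BYTES 8

/* Copy the SIZE-byte in-memory representation of *VAL into BUF at byte
   OFFSET.  Returns 0 on success, -1 if the store would not fit.  Since
   only raw bytes are copied, integers and floats are handled alike.  */
static int
splat_bytes (unsigned char *buf, size_t offset, const void *val, size_t size)
{
  assert (buf != NULL && val != NULL);
  if (offset > MERGE_BUF_BYTES || size > MERGE_BUF_BYTES - offset)
    return -1;
  memcpy (buf + offset, val, size);
  return 0;
}

/* Reinterpret the whole buffer as a single word-sized value, i.e. the
   constant that one wide store would write.  */
static uint64_t
read_merged_word (const unsigned char *buf)
{
  uint64_t word;
  memcpy (&word, buf, sizeof word);
  return word;
}
```

Splatting two uint16_t values at offsets 0 and 2 and a float at offset 4, then
calling read_merged_word, yields the single 64-bit constant that would replace
the three narrow stores. Its numeric value depends on the host byte order,
which is one reason the real pass has to be careful about endianness.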

At least from playing with this kind of thing in the RTL PR22141 patch,
I remember that storing a 64-bit constant on x86_64 usually isn't a win over
storing two 32-bit constants (not just for speed-optimized blocks but also
for -Os).  For a 64-bit store, if the constant isn't a signed 32-bit or
unsigned 32-bit value, you need a movabsq into some temporary register, which
has something like 3 times worse latency than a normal store if I remember
well, and then you store it.  Perhaps it pays off if the constant can be CSEd
and used multiple times in adjacent code.
Various other targets have different costs for different constants,
so it would be nice if the pass considered that (computed the RTX costs of
those constants and used them in some heuristics).
What alias set is used for the accesses if different alias sets are
involved among the merged stores?
Also, alignment can matter, even on non-strict-alignment targets (and the
trade-off differs between speed-optimized code and -Os).
And do you have some SPEC2k and/or SPEC2k6 numbers, e.g. for
x86_64/i686/arm/aarch64/powerpc64le?
The RTL PR22141 changes weren't added mainly because they slowed down SPEC2k*
on powerpc.
Also, do you only handle constants, or also the case where there is partial
or complete copying from some other memory, which could be turned into
larger chunk loads + stores or __builtin_memcpy?
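
To make the x86_64 constant-cost point above concrete, here is a rough sketch
of the kind of check a heuristic could start from. It is only an
illustration, not code from the patch or from GCC: fits_signed_32,
fits_unsigned_32 and merged_constant_is_cheap are hypothetical names, a real
implementation would presumably ask the target's RTX cost hooks instead, and
the sketch assumes a little-endian layout where the first store holds the low
half.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if VAL, viewed as a 64-bit two's complement number, is the sign
   extension of a 32-bit value.  */
static bool
fits_signed_32 (uint64_t val)
{
  return val <= UINT64_C (0x7fffffff)
	 || val >= UINT64_C (0xffffffff80000000);
}

/* True if the upper 32 bits of VAL are all zero.  */
static bool
fits_unsigned_32 (uint64_t val)
{
  return val <= UINT64_C (0xffffffff);
}

/* Hypothetical heuristic: two adjacent 32-bit constant stores (LO at the
   lower address, HI at the higher one) are worth merging only if the
   combined 64-bit constant avoids the movabsq case described above.  */
static bool
merged_constant_is_cheap (uint32_t lo, uint32_t hi)
{
  uint64_t merged = ((uint64_t) hi << 32) | lo;
  /* Sanity check: the low half must survive the combination.  */
  assert ((uint32_t) merged == lo);
  return fits_signed_32 (merged) || fits_unsigned_32 (merged);
}
```

For example, merging lo = 0x12345678 with hi = 0 gives a constant that fits
in unsigned 32 bits, so it counts as cheap, while hi = 1 with the same lo
would need the full 64-bit immediate and is rejected.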

> I've disabled the pass for PDP-endian targets as the merging code proved to be quite fiddly to get right for different
> endiannesses and I didn't feel comfortable writing logic for BYTES_BIG_ENDIAN != WORDS_BIG_ENDIAN targets without serious
> testing capabilities. I hope that's ok (I note the bswap pass also doesn't try to do anything on such targets).

I think that is fine; it isn't the only pass that punts in this case.

	Jakub

