This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Use of vector instructions in memmove/memset expansion


> Hi, Jan!
> I was just preparing my version of the patch, but it seems a bit late
> now. Please see my comments on this and your previous letter below.
> 
> By the way, would it be possible to commit the other part of the patch
> (the middle-end part) - probably also in small pieces - and some other
> tuning after stage 1 closes?

I can't review the middle-end parts, so you need to ping some of the middle-end
maintainers. Usually patches submitted before the end of stage 1 tend to be acceptable.
> 
> 
> > The patches disabling CSE and forwprop on constants are apparently papering over the
> > problem that subregs of vector registers used in the epilogue make IRA think
> > that it can't put the value into an SSE register (resulting in the NO_REGS class), making
> > reload output a load of 0 in the inner loop.
> 
> The problem here isn't about subregs - there is just no way to emit a
> store when the destination is a 128-bit memory operand and the source is a 128-bit
> immediate. We should somehow find that previously we initialized a

Yes, I know.  But that is what instruction predicates are for.
When the instruction pattern refuses the immediate operand, forwprop/cse will
do nothing.
> 
> > I also plugged some code paths - the pain here is that the stringops have many
> > variants - different algorithms, different alignments, constant/variable
> > counts. These increase the testing matrix, and some of the code paths were wrong with
> > the new SSE code.
> 
> Yep, I also saw such failures, thanks for the fixes. Though, I see
> another problem here: the main reason for these failures is that when
> the size is small we can skip the main loop and thus reach the epilogue
> with an uninitialized loop iterator and/or promoted value. To make the
> algorithm absolutely correct, we should either perform the needed
> initializations at the very beginning (before the zero test) or use a
> byte loop in the epilogue. The second way could greatly hurt
> performance, so I think we should just initialize everything before
> the main loop on the assumption that the size is big enough and it'll
> be used in the main loop.
> Moreover, this algorithm wasn't initially intended for small sizes -
> memcpy/memset for small sizes should be expanded earlier, in
> move_by_pieces or set_by_pieces (that was in the middle-end part of the
> patch). So the assumption about the size should be correct.
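The ordering being argued for above can be sketched in plain C. This is a hypothetical illustration (the function name and the use of `memcpy` for the 8-byte stores are mine, not the RTL the i386 expander actually emits): the stored byte is promoted before the size test, so the epilogue can rely on the promoted value even when the main loop runs zero iterations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the proposed expansion order, not the expander's
   actual output.  */
static void sketch_memset(unsigned char *dst, unsigned char val, size_t n)
{
    /* Promote up front, on the assumption that n is usually large enough
       for the main loop to use it.  */
    uint64_t promoted = (uint64_t)val * 0x0101010101010101ULL;
    size_t i = 0;

    /* Main loop: 8 bytes at a time; skipped entirely when n < 8.  */
    for (; i + 8 <= n; i += 8)
        memcpy(dst + i, &promoted, 8);

    /* Epilogue: the promoted value is initialized even if the loop above
       never ran -- the invariant the failures above were violating.  */
    for (; i < n; i++)
        dst[i] = (unsigned char)promoted;
}
```

With the promotion hoisted above the size test, the `n < 8` case falls straight through to a correct epilogue instead of reading an uninitialized register.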

The problem here is memset/memcpy of small but variable size. This is very common,
and it is critical to handle it right; even GCC has this in its ggc allocation
code, for example.

Yes, there is still a bug with the promotion.  The purpose of the force_loopy_epilogue
logic you removed was precisely to get this right.  I plugged the bug for
vector mode moves but not for integer modes.  I will look into it tonight.
Perhaps in a first cut we could revert to the previous code for memset with a
variable byte and keep your faster epilogues for memcpy and for memsets of constants
where broadcasting is cheap.  I will have to think more.
> 
> 
> > We still may want to produce SSE moves for 64bit
> > operations in 32bit codegen, but this is an independent problem + the patch as it is seems
> > to produce a slight regression on crafty.
> 
> Actually, such 8-byte moves aren't critical for this part of the patch
> - here such moves could only be used in prologues/epilogues and
> don't affect performance much (assuming that the size isn't very small,
> so a small performance loss in the prologue/epilogue doesn't affect overall
> performance).
> But for memcpy/memset for small sizes, which are expanded in the
> middle-end part, that could be quite crucial. For example, for copying
> 24 bytes with unknown alignment on Atom, three 8-byte SSE moves
> could be much faster than six 4-byte moves via GPRs. So it's
> definitely good to have an opportunity to generate such moves.
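The 24-byte case above can be sketched in portable C (the function name is mine, and plain 8-byte loads/stores stand in for the SSE moves a 32-bit x86 target would need): three wide moves replace six narrow ones.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch, not the patch's generated code: copy 24 bytes as
   three 8-byte moves instead of six 4-byte GPR moves.  In 32-bit x86
   codegen the 8-byte moves would have to be SSE loads/stores.  */
static void copy24_three_moves(void *dst, const void *src)
{
    uint64_t a, b, c;
    /* memcpy expresses an unaligned 8-byte load/store without invoking
       undefined behavior; compilers lower it to single moves.  */
    memcpy(&a, (const char *)src,      8);
    memcpy(&b, (const char *)src + 8,  8);
    memcpy(&c, (const char *)src + 16, 8);
    memcpy((char *)dst,      &a, 8);
    memcpy((char *)dst + 8,  &b, 8);
    memcpy((char *)dst + 16, &c, 8);
}
```

The point of the argument is instruction count and port pressure on an in-order core like Atom, not correctness: both sequences copy the same 24 bytes.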

Yep, but we need to handle move_by_pieces/clear_by_pieces independently, since that is a middle-end
change and is not really related to the actual memset/memcpy expansion in the backend.
So please prepare a separate patch for that.
> > I think it would be better to use V8QI mode
> > for the promoted value (since that is what it really is), avoiding the need for changes in
> > expand_move and the loadq pattern.
> 
> Actually, we rarely operate in byte mode - usually we move/store at
> least in Pmode (when we use GPRs). So V4SI or V2DI also looks
> reasonable to me here. Also, when promoting a value from a GPR to an
> SSE register, we surely need the value in DImode/SImode, not in QImode.
> We could do everything in QI/V16QI/V8QI modes, but that could lead to
> the generation of converts in many places (like in the promotion to vector).

I meant V16QI, but anyway, it is not a byte but a vector of bytes, and that is what
we really have.  The code produced for V2DI or V16QI is actually the same, but V16QI better
describes the fact that the value is a byte broadcast to 16 copies. Also, that will unify
the 32bit/64bit codegen, which currently uses V4SI or V2DI for no obvious reason.
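The broadcast itself can be sketched in portable C (the helper name is mine; real code would keep the value in an SSE register rather than a buffer): a single byte replicated into 16 copies, which is exactly what the V16QI view names. Whether the 128 bits are then read as V16QI or V2DI, the contents are identical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the promoted value under discussion: one byte
   broadcast to 16 copies.  The multiply by 0x0101... replicates the byte
   across a 64-bit lane; two lanes make the 128-bit "V16QI" value.  */
static void broadcast_byte16(unsigned char val, unsigned char out[16])
{
    uint64_t lane = (uint64_t)val * 0x0101010101010101ULL;
    memcpy(out,     &lane, 8);
    memcpy(out + 8, &lane, 8);
}
```

Whatever mode label the backend attaches, the register holds these same 16 identical bytes; V16QI just states that directly instead of implying two arbitrary 64-bit integers.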

Honza

