This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Add value range support into memcpy/memset expansion


> Hi Jan,
> 
> > I also think the misaligned move trick can/should be performed by
> > move_by_pieces and we ought to consider sane use of SSE - the current
> > vector_loop with an unrolling factor of 4 seems a bit extreme.  At least
> > Bulldozer is happy with 2, and I would expect SSE moves to be especially
> > useful for moving blocks of known size, where they are not used at all.
> > 
> > Currently I disabled misaligned move prologues/epilogues for Michael's vector
> > loop path since they end up longer than the traditional code (which uses a
> > loop for the epilogue)
> Prologues could use this techniques even with vector_loop, as they actually
> don't have a loop.

Where the new prologues lose is the fact that we need to handle all sizes smaller
than SIZE_NEEDED.  With SIZE_NEEDED of 64 bytes that leads to a variant for 32..64,
16..32, 8..16, 4..8 and the tail.  It is quite a lot of code.
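For illustration, the ladder of size variants described above can be sketched in C
using overlapping moves from both ends of each bucket (this is a hand-written
sketch under my own assumptions, not the actual i386.c expander; copy_below_64 is
a hypothetical name):

```c
#include <stddef.h>
#include <string.h>

/* Loopless copy covering every size below a 64-byte SIZE_NEEDED: one
   branch per power-of-two bucket, each using two possibly-overlapping
   moves.  E.g. sizes 33..64 are handled by a 32-byte move from each end
   of the block; the overlap in the middle is harmless.  */
static void copy_below_64 (char *dst, const char *src, size_t n)
{
  if (n >= 32)       /* 32..64 */
    {
      memcpy (dst, src, 32);
      memcpy (dst + n - 32, src + n - 32, 32);
    }
  else if (n >= 16)  /* 16..31 */
    {
      memcpy (dst, src, 16);
      memcpy (dst + n - 16, src + n - 16, 16);
    }
  else if (n >= 8)   /* 8..15 */
    {
      memcpy (dst, src, 8);
      memcpy (dst + n - 8, src + n - 8, 8);
    }
  else if (n >= 4)   /* 4..7 */
    {
      memcpy (dst, src, 4);
      memcpy (dst + n - 4, src + n - 4, 4);
    }
  else if (n >= 2)   /* 2..3 */
    {
      memcpy (dst, src, 2);
      memcpy (dst + n - 2, src + n - 2, 2);
    }
  else if (n == 1)   /* single-byte tail */
    dst[0] = src[0];
}
```

Even in this compressed form it is five branches and ten moves of code per
call site, which is the size blow-up described above.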

When the block is known to be greater than 64 bytes this is also a win, but my
current patch does not fine-tune this yet.
Similarly, misaligned moves are a win when the size is known, no alignment is
performed, and the normal fixed-size epilogue needs more than one move, or when
the alignment is known but the offset is non-zero.
It will need a bit of tweaking to handle all the paths well - it is the usual
problem with the stringops: they get way too complex as the number of factors
increases.

That is why I think we may consider a vector loop with fewer than 4 unrollings.
The AMD optimization manual recommends two for Bulldozer... Is there a difference
between four and two for Atom?

To be honest I am not quite sure where the constant of 4 comes from.  I think I
introduced it a long time ago for K8, where it apparently got some extra % of
performance.

It is used for larger blocks only on PPro.  AMD chips prefer it for small
blocks, apparently because they prefer a loop-less sequence.

> As for epilogues - have you tried to use misaligned vector_moves (movdqu)?  It
> looks to me that we need approx. the same amount of instructions in vector-loop
> and in usual-loop epilogues, if we use vector-instructions in vector-loop
> epilogue.

Yes, the code is in place for this.  You can just remove the check for size_needed
being smaller than 32 and it will produce the movdqu sequence for you (I tested
it on the vector loop testcase in the testsuite).
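For reference, the effect of such a movdqu epilogue can be sketched with SSE2
intrinsics (a hedged illustration under my own assumptions, not the expander's
actual output; movdqu_tail_32 is a hypothetical helper):

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128 / _mm_storeu_si128 */
#include <stddef.h>

/* Finish a copy of n >= 32 bytes: two unaligned 16-byte moves (movdqu)
   covering the last 32 bytes, regardless of how far the main vector
   loop got.  Overlap with already-copied bytes is harmless because the
   same data is simply rewritten - no epilogue loop needed.  */
static void movdqu_tail_32 (char *dst, const char *src, size_t n)
{
  __m128i lo = _mm_loadu_si128 ((const __m128i *) (src + n - 32));
  __m128i hi = _mm_loadu_si128 ((const __m128i *) (src + n - 16));
  _mm_storeu_si128 ((__m128i *) (dst + n - 32), lo);
  _mm_storeu_si128 ((__m128i *) (dst + n - 16), hi);
}
```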

The code will also use SSE for the unrolled_loop prologue expansion, at least for
memcpy (for memset it does not have the broadcast value, so it should skip it).

> 
> > Comments are welcome.
> BTW, maybe we could generalize expand_small_movmem a bit and make a separate
> expanding strategy out of it?  It will expand a memmove with no loop (and
> probably no alignment prologue) - just with the biggest available moves.  Also,
> a cost model could be added here to make decisions on when we actually want to
> align the moves.  Here are a couple of examples of that:
> 
> memcpy (a, b, 32); // alignment is unknown
> will expand to
>   movdqu a, %xmm0
>   movdqu a+16, %xmm1
>   movdqu %xmm0, b
>   movdqu %xmm1, b+16
> memcpy (a, b, 32); // alignment is known and equals 64bit
> will expand to
> a)
>   movdqu a, %xmm0
>   movdqu a+16, %xmm1
>   movdqu %xmm0, b
>   movdqu %xmm1, b+16
> or b)
>   movq	  a,   %xmm0
>   movdqa  a+8, %xmm1
>   movq	  a+24,%xmm2
>   movq	  %xmm0, b
>   movdqa  %xmm1, b+8
>   movq	  %xmm2, b+24
> 
> We would compute the total cost of both variants and choose the best - for the
> computation we need just the costs of aligned and misaligned moves.
> 
> This strategy is actually pretty similar to move_by_pieces, but as it has much
> more target-specific information, it would be able to produce much more
> effective code.

I was actually thinking more along the lines of teaching move_by_pieces to do the
tricks.  It seems there is not that much x86-specific knowledge in here, and other
architectures would also benefit from it.  We can also enable it when the value
range is narrow enough.

I plan to look into it today or tomorrow - revisit your old patch to move_by_pieces
and see how much extra API I need for move_by_pieces to do what
expand_small_movmem does.
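The cost comparison suggested above could, under assumed per-move costs, look
something like this sketch for the quoted 32-byte examples (the structure, names
and numbers are all hypothetical; real values would come from the target cost
tables in i386.c):

```c
/* Choose between the two 32-byte expansions quoted earlier:
   (a) four movdqu moves (two unaligned loads + two unaligned stores), or
   (b) movq + movdqa + movq on each side when 64-bit alignment is known.
   Costs are abstract per-move units.  */
struct move_costs
{
  int vec_aligned;     /* movdqa */
  int vec_misaligned;  /* movdqu */
  int scalar;          /* movq   */
};

static char choose_32byte_variant (struct move_costs c)
{
  int cost_a = 4 * c.vec_misaligned;                /* 2 loads + 2 stores  */
  int cost_b = 2 * (2 * c.scalar + c.vec_aligned);  /* load side + store side */
  return cost_a <= cost_b ? 'a' : 'b';
}
```

With such a model the decision reduces to comparing two integers per candidate
expansion, which is cheap enough to do at expand time.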
> 
> And one more question - in a work on vector_loop for memset I tried to merge
> many of movmem and setmem routines (e.g. instead of expand_movmem_prologue and
> expand_setmem_prologue I made a single routine
> expand_movmem_or_setmem_prologue).  What do you think, is it a good idea?  It
> reduces code size in i386.c, but slightly complicates the code.  I'll send a
> patch shortly, as soon as the testing completes.

I would like to see it.  I am not too thrilled by the duplication.  My original
motivation for it was to keep the number of code paths through the expanders
under control.  We already have many of them (and it is easy to get wrong code),
as the different variants of prologues/epilogues and main loops are not exactly
the same, and thus the code is not as modular as I would like.  I am not sure
whether adding the differences between memset and memmove is going to add too
many extra cases to think of.  Maybe not - as with the misaligned prologues,
the change is actually quite straightforward.

I however do not handle well the case where the broadcasting of the constant
value should happen - currently I simply do it at the beginning, which is quite
cheap in integer code, but once we add SSE into play we will need to push it
down.
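The cheap integer-side broadcast mentioned here is the usual
multiply-by-0x01010101... trick; a minimal sketch (broadcast_byte is a
hypothetical helper name, not a function in the patch):

```c
#include <stdint.h>

/* Replicate the memset fill byte across a 64-bit word.  The SSE
   equivalent builds an XMM register instead (e.g. via punpcklbw/pshufd),
   and that is the step that would have to be sunk past the prologue.  */
static uint64_t broadcast_byte (uint8_t c)
{
  return (uint64_t) c * 0x0101010101010101ULL;
}
```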

Honza

