This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



RE: [PATCH][GCC][mid-end] Allow larger copies when not slow_unaligned_access and no padding.


On Tue, 31 Jul 2018, Tamar Christina wrote:

> Ping 😊
> 
> > -----Original Message-----
> > From: gcc-patches-owner@gcc.gnu.org <gcc-patches-owner@gcc.gnu.org>
> > On Behalf Of Tamar Christina
> > Sent: Tuesday, July 24, 2018 17:34
> > To: Richard Biener <rguenther@suse.de>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; law@redhat.com;
> > ian@airs.com; amodra@gmail.com; bergner@vnet.ibm.com
> > Subject: Re: [PATCH][GCC][mid-end] Allow larger copies when not
> > slow_unaligned_access and no padding.
> > 
> > Hi Richard,
> > 
> > Thanks for the review!
> > 
> > On 07/23/2018 18:46, Richard Biener wrote:
> > > On July 23, 2018 7:01:23 PM GMT+02:00, Tamar Christina
> > > <tamar.christina@arm.com> wrote:
> > > >Hi All,
> > > >
> > > >This allows copy_blkmode_to_reg to perform larger copies when it is
> > > >safe to do so, by calculating the bitsize per iteration: each
> > > >iteration does the largest copy that does not read more bits than
> > > >are left to copy.
> > > >
> > > >Strictly speaking, this copying is only done if:
> > > >
> > > >  1. the target supports fast unaligned access
> > > >  2. no padding is being used.
> > > >
> > > >This should avoid the issues of the first patch (PR85123) but still
> > > >enable the larger copies for targets where it is safe to do so.
> > > >
> > > >Original patch
> > > >https://gcc.gnu.org/ml/gcc-patches/2017-11/msg01088.html
> > > >Previous respin
> > > >https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00239.html
> > > >
> > > >
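> > > >For reference, a 3 byte aggregate of roughly this shape (a
> > > >hypothetical reconstruction, not necessarily the test case from the
> > > >patch) is the kind of value being returned here:
> > > >
> > > >  /* Hypothetical example: 3 bytes, 24 bits, no padding.  */
> > > >  struct S3 { char a, b, c; };
> > > >  struct S3 g3;
> > > >  struct S3 fun3 (void) { return g3; }
> > > >
> > > >With 24 bits left to copy, the widest integer mode that fits is 16
> > > >bits (32 would over-read), leaving 8 bits for a second iteration,
> > > >hence the ldrh/ldrb pair and the two bfi instructions below.
> > > >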
> > > >For the copy of a 3 byte structure, this produces:
> > > >
> > > >fun3:
> > > >	adrp	x1, .LANCHOR0
> > > >	add	x1, x1, :lo12:.LANCHOR0
> > > >	mov	x0, 0
> > > >	sub	sp, sp, #16
> > > >	ldrh	w2, [x1, 16]
> > > >	ldrb	w1, [x1, 18]
> > > >	add	sp, sp, 16
> > > >	bfi	x0, x2, 0, 16
> > > >	bfi	x0, x1, 16, 8
> > > >	ret
> > > >
> > > >whereas before it was producing
> > > >
> > > >fun3:
> > > >	adrp	x0, .LANCHOR0
> > > >	add	x2, x0, :lo12:.LANCHOR0
> > > >	sub	sp, sp, #16
> > > >	ldrh	w1, [x0, #:lo12:.LANCHOR0]
> > > >	ldrb	w0, [x2, 2]
> > > >	strh	w1, [sp, 8]
> > > >	strb	w0, [sp, 10]
> > > >	ldr	w0, [sp, 8]
> > > >	add	sp, sp, 16
> > > >	ret
> > > >
> > > >Cross compiled and regtested on
> > > >  aarch64_be-none-elf
> > > >  armeb-none-eabi
> > > >and found no issues.
> > > >
> > > >Bootstrapped and regtested
> > > > aarch64-none-linux-gnu
> > > > x86_64-pc-linux-gnu
> > > > powerpc64-unknown-linux-gnu
> > > > arm-none-linux-gnueabihf
> > > >
> > > >and found no issues.
> > > >
> > > >OK for trunk?
> > >
> > > How does this affect store-to-load forwarding when the source is
> > > initialized piecewise?  IMHO we should avoid larger loads but
> > > generate larger stores when possible.
> > >
> > > How do non-x86 architectures behave with respect to STLF?
> > >
> > 
> > I should have made it more explicit in my cover letter, but this only
> > covers reg to reg copies.  So store-to-load forwarding shouldn't
> > really come into play here, unless I'm missing something.
> > 
> > The example in my patch shows that the loads from mem are mostly
> > unaffected.
> > 
> > For x86 the change is also quite significant, e.g. for a 5 byte struct
> > load it used to generate
> > 
> > fun5:
> > 	movl	foo5(%rip), %eax
> > 	movl	%eax, %edi
> > 	movzbl	%al, %edx
> > 	movzbl	%ah, %eax
> > 	movb	%al, %dh
> > 	movzbl	foo5+2(%rip), %eax
> > 	shrl	$24, %edi
> > 	salq	$16, %rax
> > 	movq	%rax, %rsi
> > 	movzbl	%dil, %eax
> > 	salq	$24, %rax
> > 	movq	%rax, %rcx
> > 	movq	%rdx, %rax
> > 	movzbl	foo5+4(%rip), %edx
> > 	orq	%rsi, %rax
> > 	salq	$32, %rdx
> > 	orq	%rcx, %rax
> > 	orq	%rdx, %rax
> > 	ret
> > 
> > instead of
> > 
> > fun5:
> >         movzbl  foo5+4(%rip), %eax
> >         salq    $32, %rax
> >         movq    %rax, %rdx
> >         movl    foo5(%rip), %eax
> >         orq     %rdx, %rax
> >         ret
> > 
> > so the loads themselves are unaffected.
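> >
> > For reference, the 5 byte case corresponds to an aggregate of roughly
> > this shape (a hypothetical reconstruction, not necessarily the actual
> > test case):
> >
> >   /* Hypothetical example: 5 bytes, returned in a single register.  */
> >   struct S5 { char a[5]; };
> >   struct S5 foo5;
> >   struct S5 fun5 (void) { return foo5; }
> >
> > so the new sequence is just a 4 byte load plus a 1 byte load combined
> > into the return register.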

I see.  A few things:

   dst_words = XALLOCAVEC (rtx, n_regs);
+
+  slow_unaligned_access
+    = targetm.slow_unaligned_access (word_mode, TYPE_ALIGN (TREE_TYPE (src)));
+
   bitsize = MIN (TYPE_ALIGN (TREE_TYPE (src)), BITS_PER_WORD);

please avoid the extra vertical space.


+
+      /* Find the largest integer mode that can be used to copy all or as
+        many bits as possible of the struct

likewise.

+      FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
+       if (padding_correction == 0
+           && !slow_unaligned_access

These conditions are invariant so please hoist them out of the
loop.

+           && GET_MODE_BITSIZE (mode_iter.require ())
+                           <= ((bytes * BITS_PER_UNIT) - bitpos)
+           && GET_MODE_BITSIZE (mode_iter.require ()) <= BITS_PER_WORD)
+         bitsize = GET_MODE_BITSIZE (mode_iter.require ());
+       else
+         break;

The slow_unaligned_access target hook is passed a mode and an
alignment but you are assuming that a narrower mode behaves
the same with possibly different alignment.
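
To illustrate (an untested sketch, not a drop-in replacement): the hook
could be asked about each candidate mode itself, with the invariant
padding check hoisted out of the loop:

  if (padding_correction == 0)
    FOR_EACH_MODE_IN_CLASS (mode_iter, MODE_INT)
      {
        scalar_int_mode imode = mode_iter.require ();
        unsigned msize = GET_MODE_BITSIZE (imode);
        /* Query the hook for this mode, not just for word_mode.  */
        if (targetm.slow_unaligned_access (imode,
                                           TYPE_ALIGN (TREE_TYPE (src)))
            || msize > (bytes * BITS_PER_UNIT) - bitpos
            || msize > BITS_PER_WORD)
          break;
        bitsize = msize;
      }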

I'm also not sure the loop in the patch can never compute something
smaller than the original conservative bitsize based on TYPE_ALIGN
of src, so maybe you could instead use something like

  min_mode = get_mode_for_size (bitsize);
...

  if (padding_correction == 0
      && !STRICT_ALIGNMENT)
    FOR_EACH_MODE_FROM (mode_iter, min_mode)
      {
        unsigned msize = GET_MODE_BITSIZE (mode_iter.require ());
        if (msize <= ((bytes * BITS_PER_UNIT) - bitpos)
            && msize <= BITS_PER_WORD)
          bitsize = msize;
        else
          break;
      }
...

instead?
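
(get_mode_for_size above is shorthand for whatever helper returns the
narrowest integer mode of at least bitsize bits; one plausible spelling,
assuming the machmode.h API, would be

  scalar_int_mode min_mode = smallest_int_mode_for_size (bitsize);

but treat that as a sketch, not a requirement.)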

Thanks,
Richard.
