Bug 67366 - Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
Summary: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 6.0
Importance: P3 enhancement
Target Milestone: 6.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-08-26 21:42 UTC by Yann Collet
Modified: 2015-10-13 09:16 UTC
CC: 5 users

See Also:
Host:
Target: Armv7-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2015-08-27 00:00:00


Description Yann Collet 2015-08-26 21:42:26 UTC
Accessing unaligned memory locations used to be forbidden on ARM CPUs, but since ARMv6 (many years ago now) this operation has been supported.

However, GCC 4.5 - 4.6 - 4.7 - 4.8 seem to generate sub-optimal code on these targets.

In theory, it's illegal (undefined behaviour) to use a direct access such as:

u32 read32(const void* ptr) { return *(const u32*)ptr; }

if ptr is not properly aligned.

There are two work-arounds that I know of.
The first is to use the `packed` attribute, which is not portable (compiler-specific); a rough sketch is shown just below.
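For reference, a minimal sketch of that first work-around, assuming GCC/Clang attribute syntax (the type name unalign32 and the function name read32_packed are purely illustrative):

/* Non-portable sketch: a packed wrapper struct tells the compiler the
   field may sit at any alignment, so it emits an unaligned-safe load.  */
typedef struct { u32 v; } __attribute__((packed)) unalign32;

u32 read32_packed(const void* ptr) { return ((const unalign32*)ptr)->v; }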

The second, and better, one is to use memcpy():

u32 read32(const void* ptr) { u32 v; memcpy(&v, ptr, sizeof(v)); return v; }

This version is portable and safe.
It also works very well on multiple platforms, such as x86/x64, PPC, or ARM64, where it is reduced to an optimal assembly sequence (a single load instruction).
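For anyone wanting to reproduce, a self-contained version of the snippet (assuming u32 maps to uint32_t) is:

#include <string.h>
#include <stdint.h>

typedef uint32_t u32;

/* Portable unaligned 32-bit read: the compiler is expected to fold the
   memcpy into a single load on targets that allow unaligned access.  */
u32 read32(const void* ptr)
{
    u32 v;
    memcpy(&v, ptr, sizeof(v));
    return v;
}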

Unfortunately, GCC 4.5 - 4.6 - 4.7 - 4.8 generate suboptimal assembly for this function on ARMv6 or ARMv7:

read32(void const*):
	ldr	r0, [r0]	@ unaligned
	sub	sp, sp, #8
	str	r0, [sp, #4]	@ unaligned
	ldr	r0, [sp, #4]
	add	sp, sp, #8
	bx	lr

This is in stark contrast with clang, which generates much more efficient assembly:

read32(void const*):                           @ @read32(void const*)
	ldr	r0, [r0]
	bx	lr

(the assembly can be generated and displayed using a simple online tool: https://goo.gl/7FWDB8)

It's not that gcc is unaware of the cpu's unaligned memory access capability,
since it does use it: `ldr r0, [r0]`.
But it then loses a lot of time on useless operations on a discardable temporary variable,
storing the data onto the stack just to read it back again.


Inlining does not save the day. -O3 helps reduce the impact, but it remains large.

In a recent exercise comparing efficient vs. inefficient memory access on ARMv6 and ARMv7,
the measured difference was very large: up to 6x faster at -O2.
See :
http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html

It's far too large a difference to be ignored.
As a consequence, to preserve performance, source code must try a bunch of possibilities depending on target and compiler, if not compiler version.
In some circumstances (gcc targeting ARMv6, or gcc <= 4.5), it's even necessary to write illegal code (see the 1st version above) to reach optimal performance.
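As an illustration of what that dispatching tends to look like in practice (a sketch only; the macro tests and the exact compiler-version cut-offs vary from code base to code base):

/* Illustrative sketch, reusing the u32 typedef and <string.h> include
   from the snippet above.  */
static inline u32 read32_any(const void* p)
{
#if defined(__GNUC__) && !defined(__clang__) && defined(__ARM_ARCH_6__)
    /* gcc targeting ARMv6: a direct dereference (technically undefined
       behaviour) has been the only way to get the single-ldr sequence.  */
    return *(const u32*)p;
#else
    /* Portable path: memcpy, expected to fold to a single load.  */
    u32 v;
    memcpy(&v, p, sizeof(v));
    return v;
#endif
}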

This looks like a waste of energy, and a recipe for bugs, especially compared to clang, which generates clean code in all circumstances for all targets.


Considering the huge performance difference such an improvement could make, is that something the gcc team would like to look into?


Regards
Comment 1 Richard Biener 2015-08-27 07:39:21 UTC
I think this boils down to the fact that memcpy expansion is done too late and
that (with more recent GCC) the "inlining" done on the GIMPLE level is restricted
to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1 unconditionally.

I guess arm goes through the movmisalign optab for unaligned accesses and the
GIMPLE inlining could be enabled as well if movmisalign is supported.

Note that GCC 4.8 is no longer supported and enhancements will go to GCC 6 at
most.

With current trunk and -O2 -march=armv6 I get for

typedef unsigned int u32;
u32 read32(const void* ptr) { u32 v; __builtin_memcpy(&v, ptr, sizeof(v)); return v; }

read32:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr     r0, [r0]        @ unaligned
        sub     sp, sp, #8
        add     sp, sp, #8
        @ sp needed
        bx      lr

so apart from the stack slot not getting removed this has already improved,
but the issue I mention still happens as we expand from

read32 (const void * ptr)
{
  u32 v;
  u32 _4;

  <bb 2>:
  __builtin_memcpy (&v, ptr_2(D), 4);
  _4 = v;
  v ={v} {CLOBBER};
  return _4;
}

so partially confirmed, for the GIMPLE memory op "inlining" issue.
Comment 2 Richard Earnshaw 2015-08-27 09:36:11 UTC
(In reply to Richard Biener from comment #1)
> I think this boils down to the fact that memcpy expansion is done too late
> and
> that (with more recent GCC) the "inlining" done on the GIMPLE level is
> restricted
> to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> unconditionally.
> 

Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions work with misaligned addresses (ldm, for example).  The only way to handle misaligned copies is through the movmisalign API.
Comment 3 rguenther@suse.de 2015-08-27 10:21:07 UTC
On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> 
> --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #1)
> > I think this boils down to the fact that memcpy expansion is done too late
> > and
> > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > restricted
> > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > unconditionally.
> > 
> 
> Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> work with misaligned addresses (ldm, for example).  The only way to handle
> misaligned copies is through the movmisalign API.

Are the movmisalign handled ones reasonably efficient?  That is, more
efficient than memcpy/memmove?  Then we should experiment with

Index: gcc/gimple-fold.c
===================================================================
--- gcc/gimple-fold.c   (revision 227252)
+++ gcc/gimple-fold.c   (working copy)
@@ -708,7 +708,9 @@ gimple_fold_builtin_memory_op (gimple_st
                  /* If the destination pointer is not aligned we must be able
                     to emit an unaligned store.  */
                  && (dest_align >= GET_MODE_ALIGNMENT (TYPE_MODE (type))
-                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), dest_align)))
+                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), dest_align)
+                     || (optab_handler (movmisalign_optab, TYPE_MODE (type))
+                         != CODE_FOR_nothing)))
                {
                  tree srctype = type;
                  tree desttype = type;
@@ -720,7 +722,10 @@ gimple_fold_builtin_memory_op (gimple_st
                    srcmem = tem;
                  else if (src_align < GET_MODE_ALIGNMENT (TYPE_MODE (type))
                           && SLOW_UNALIGNED_ACCESS (TYPE_MODE (type),
-                                                    src_align))
+                                                    src_align)
+                          && (optab_handler (movmisalign_optab,
+                                             TYPE_MODE (type))
+                              == CODE_FOR_nothing))
                    srcmem = NULL_TREE;
                  if (srcmem)
                    {
Comment 4 Ramana Radhakrishnan 2015-08-27 10:42:44 UTC
(In reply to Richard Earnshaw from comment #2)
> (In reply to Richard Biener from comment #1)
> > I think this boils down to the fact that memcpy expansion is done too late
> > and
> > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > restricted
> > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > unconditionally.
> > 
> 
> Yep, we have to define STRICT_ALIGNMENT to 1 because not all load
> instructions work with misaligned addresses (ldm, for example).  The only
> way to handle misaligned copies is through the movmisalign API.

Hmmm, actually that's not completely true - for SImode and HImode objects there don't appear to be movmisalign interfaces. For things like memcpy this is done through the memcpy interface, where we end up generating copies using unaligned_loadsi and unaligned_storesi.

We may need to experiment a bit with the movmisalign interface too.


Ramana
Comment 5 Ramana Radhakrishnan 2015-08-27 10:47:03 UTC
(In reply to rguenther@suse.de from comment #3)
> On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #1)
> > > I think this boils down to the fact that memcpy expansion is done too late
> > > and
> > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > restricted
> > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > unconditionally.
> > > 
> > 
> > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > work with misaligned addresses (ldm, for example).  The only way to handle
> > misaligned copies is through the movmisalign API.
> 
> Are the movmisalign handled ones reasonably efficient?  That is, more
> efficient than memcpy/memmove?  Then we should experiment with

What do you mean by more efficient here? They'll end up calling down to the same unspec block if we implemented them - memcpy / memmove go through the backend block_move interface.

Will try the patch with a hacked up movmisalign pattern in the backend.

> 
> Index: gcc/gimple-fold.c
> ===================================================================
> --- gcc/gimple-fold.c   (revision 227252)
> +++ gcc/gimple-fold.c   (working copy)
> @@ -708,7 +708,9 @@ gimple_fold_builtin_memory_op (gimple_st
>                   /* If the destination pointer is not aligned we must be 
> able
>                      to emit an unaligned store.  */
>                   && (dest_align >= GET_MODE_ALIGNMENT (TYPE_MODE (type))
> -                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)))
> +                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)
> +                     || (optab_handler (movmisalign_optab, TYPE_MODE 
> (type))
> +                         != CODE_FOR_nothing)))
>                 {
>                   tree srctype = type;
>                   tree desttype = type;
> @@ -720,7 +722,10 @@ gimple_fold_builtin_memory_op (gimple_st
>                     srcmem = tem;
>                   else if (src_align < GET_MODE_ALIGNMENT (TYPE_MODE 
> (type))
>                            && SLOW_UNALIGNED_ACCESS (TYPE_MODE (type),
> -                                                    src_align))
> +                                                    src_align)
> +                          && (optab_handler (movmisalign_optab,
> +                                             TYPE_MODE (type))
> +                              == CODE_FOR_nothing))
>                     srcmem = NULL_TREE;
>                   if (srcmem)
>                     {
Comment 6 Ramana Radhakrishnan 2015-08-27 11:08:46 UTC
(In reply to rguenther@suse.de from comment #3)
> On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #1)
> > > I think this boils down to the fact that memcpy expansion is done too late
> > > and
> > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > restricted
> > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > unconditionally.
> > > 
> > 
> > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > work with misaligned addresses (ldm, for example).  The only way to handle
> > misaligned copies is through the movmisalign API.
> 
> Are the movmisalign handled ones reasonably efficient?  That is, more
> efficient than memcpy/memmove?  Then we should experiment with

Minor nit - missing include of optabs.h. Fixing that, and adding a movmisalignsi pattern in the backend that just generates an unaligned_loadsi / unaligned_storesi insn, generates the following for me for the above-mentioned testcase.


read32:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r0, [r0]	@ unaligned
	bx	lr




I'm on holiday from this evening so don't really want to push something today ... 





> 
> Index: gcc/gimple-fold.c
> ===================================================================
> --- gcc/gimple-fold.c   (revision 227252)
> +++ gcc/gimple-fold.c   (working copy)
> @@ -708,7 +708,9 @@ gimple_fold_builtin_memory_op (gimple_st
>                   /* If the destination pointer is not aligned we must be 
> able
>                      to emit an unaligned store.  */
>                   && (dest_align >= GET_MODE_ALIGNMENT (TYPE_MODE (type))
> -                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)))
> +                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)
> +                     || (optab_handler (movmisalign_optab, TYPE_MODE 
> (type))
> +                         != CODE_FOR_nothing)))
>                 {
>                   tree srctype = type;
>                   tree desttype = type;
> @@ -720,7 +722,10 @@ gimple_fold_builtin_memory_op (gimple_st
>                     srcmem = tem;
>                   else if (src_align < GET_MODE_ALIGNMENT (TYPE_MODE 
> (type))
>                            && SLOW_UNALIGNED_ACCESS (TYPE_MODE (type),
> -                                                    src_align))
> +                                                    src_align)
> +                          && (optab_handler (movmisalign_optab,
> +                                             TYPE_MODE (type))
> +                              == CODE_FOR_nothing))
>                     srcmem = NULL_TREE;
>                   if (srcmem)
>                     {
Comment 7 rguenther@suse.de 2015-08-27 11:12:57 UTC
On Thu, 27 Aug 2015, ramana at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> 
> --- Comment #6 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #3)
> > On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > > 
> > > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > > (In reply to Richard Biener from comment #1)
> > > > I think this boils down to the fact that memcpy expansion is done too late
> > > > and
> > > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > > restricted
> > > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > > unconditionally.
> > > > 
> > > 
> > > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > > work with misaligned addresses (ldm, for example).  The only way to handle
> > > misaligned copies is through the movmisalign API.
> > 
> > Are the movmisalign handled ones reasonably efficient?  That is, more
> > efficient than memcpy/memmove?  Then we should experiment with
> 
> minor nit - missing include of optabs.h - fixing that and adding a
> movmisalignsi pattern in the backend that just generates either an unaligned /
> storesi insn generates the following for me for the above mentioned testcase.
> 
> 
> read32:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         ldr     r0, [r0]        @ unaligned
>         bx      lr
> 
> 
> 
> 
> I'm on holiday from this evening so don't really want to push something today
> ... 

Sure ;)  When adding the GIMPLE folding I was just careful here as I
don't really have a STRICT_ALIGNMENT machine with movmisalign handling
available.  Thus full testing is appreciated (might turn up some
testcases that need adjustment).  There are more STRICT_ALIGN
guarded cases below in the function, eventually they can be modified
as well (at which point splitting out the alignment check to a separate
function makes sense).
Comment 8 Ramana Radhakrishnan 2015-08-27 11:16:52 UTC
(In reply to rguenther@suse.de from comment #7)
> On Thu, 27 Aug 2015, ramana at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #6 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> ---
> > (In reply to rguenther@suse.de from comment #3)
> > > On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> > > 
> > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > > > 
> > > > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > > > (In reply to Richard Biener from comment #1)
> > > > > I think this boils down to the fact that memcpy expansion is done too late
> > > > > and
> > > > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > > > restricted
> > > > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > > > unconditionally.
> > > > > 
> > > > 
> > > > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > > > work with misaligned addresses (ldm, for example).  The only way to handle
> > > > misaligned copies is through the movmisalign API.
> > > 
> > > Are the movmisalign handled ones reasonably efficient?  That is, more
> > > efficient than memcpy/memmove?  Then we should experiment with
> > 
> > minor nit - missing include of optabs.h - fixing that and adding a
> > movmisalignsi pattern in the backend that just generates either an unaligned /
> > storesi insn generates the following for me for the above mentioned testcase.
> > 
> > 
> > read32:
> >         @ args = 0, pretend = 0, frame = 0
> >         @ frame_needed = 0, uses_anonymous_args = 0
> >         @ link register save eliminated.
> >         ldr     r0, [r0]        @ unaligned
> >         bx      lr
> > 
> > 
> > 
> > 
> > I'm on holiday from this evening so don't really want to push something today
> > ... 
> 
> Sure ;)  When adding the GIMPLE folding I was just careful here as I
> don't really have a STRICT_ALIGNMENT machine with movmisalign handling
> available.  Thus full testing is appreciated (might turn up some
> testcases that need adjustment).  There are more STRICT_ALIGN
> guarded cases below in the function, eventually they can be modified
> as well (at which point splitting out the alignment check to a separate
> function makes sense).

This was the backend hack I was playing with. It needs extension to HImode values, cleaning up, regression testing, and a small amount of benchmarking to check we're doing the right thing.



diff --git a/gcc/config/arm/arm.md b/gcc/config/arm/arm.md
index 288bbb9..eaff494 100644
--- a/gcc/config/arm/arm.md
+++ b/gcc/config/arm/arm.md
@@ -11423,6 +11423,27 @@
   }"
 )
 
+(define_expand "movmisalignsi"
+  [(match_operand:SI 0 "general_operand")
+   (match_operand:SI 1 "general_operand")]
+  "unaligned_access"
+{
+  /* This pattern is not permitted to fail during expansion: if both arguments
+     are non-registers (e.g. memory := constant, which can be created by the
+     auto-vectorizer), force operand 1 into a register.  */
+  if (!s_register_operand (operands[0], SImode)
+      && !s_register_operand (operands[1], SImode))
+    operands[1] = force_reg (SImode, operands[1]);
+
+  if (MEM_P (operands[1]))
+    emit_insn (gen_unaligned_loadsi (operands[0], operands[1]));
+  else
+    emit_insn (gen_unaligned_storesi (operands[0], operands[1]));
+
+  DONE;
+})
+
+
 ;; Vector bits common to IWMMXT and Neon
 (include "vec-common.md")
 ;; Load the Intel Wireless Multimedia Extension patterns
Comment 9 Richard Earnshaw 2015-08-27 14:31:29 UTC
Does that really do the right thing?  That is, does force_reg understand a misaligned memory op?

Also, what if one memory operand is aligned, but the other not?  Don't we want to have the right combination of aligned/misaligned instructions?
Comment 10 rguenther@suse.de 2015-08-27 14:36:28 UTC
On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> 
> --- Comment #9 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> Does that really do the right thing?  That is, does force_reg understand a
> misaligned memory op?
> 
> Also, what if one memory operand is aligned, but the other not?  Don't we want
> to have the right combination of aligned/misaligned instructions?

I think you never get two MEMs, you at most get a constant on the
RHS of a store.
Comment 11 Ramana Radhakrishnan 2015-08-27 14:45:52 UTC
(In reply to rguenther@suse.de from comment #10)
> On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #9 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > Does that really do the right thing?  That is, does force_reg understand a
> > misaligned memory op?

In practice I think yes.

> > 
> > Also, what if one memory operand is aligned, but the other not?  Don't we want
> > to have the right combination of aligned/misaligned instructions?
> 
> I think you never get two MEMs, you at most get a constant on the
> RHS of a store.

Yep that's my understanding too. movmisalign<mode> has stricter rules compared to mov<mode> so we have to handle some of these carefully. The logic in here is essentially from neon.md : movmisalign<mode>, so some of it may not be relevant here in the scalar case, but it's probably better to be a bit defensive here.

I tried playing with the HImode case, but the pain point was that HImode movmisalign expects a load into an HImode register, and I hadn't got my head around that given other things I needed to finish up before leaving.
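(For context, the HImode analogue of the test case would be something along the lines of the sketch below; the wrinkle is which register mode the expander should load into.)

typedef unsigned short u16;
/* 16-bit (HImode) analogue of read32: a movmisalignhi expander would have
   to produce an unaligned halfword load for this.  */
u16 read16(const void* ptr) { u16 v; __builtin_memcpy(&v, ptr, sizeof(v)); return v; }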
Comment 12 Ramana Radhakrishnan 2015-09-09 15:28:48 UTC
(In reply to rguenther@suse.de from comment #3)
> On Thu, 27 Aug 2015, rearnsha at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
> > 
> > --- Comment #2 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #1)
> > > I think this boils down to the fact that memcpy expansion is done too late
> > > and
> > > that (with more recent GCC) the "inlining" done on the GIMPLE level is
> > > restricted
> > > to !SLOW_UNALIGNED_ACCESS but arm defines STRICT_ALIGNMENT to 1
> > > unconditionally.
> > > 
> > 
> > Yep, we have to define STRICT_ALIGNMENT to 1 because not all load instructions
> > work with misaligned addresses (ldm, for example).  The only way to handle
> > misaligned copies is through the movmisalign API.
> 
> Are the movmisalign handled ones reasonably efficient?  That is, more
> efficient than memcpy/memmove?  Then we should experiment with
> 
> Index: gcc/gimple-fold.c
> ===================================================================
> --- gcc/gimple-fold.c   (revision 227252)
> +++ gcc/gimple-fold.c   (working copy)
> @@ -708,7 +708,9 @@ gimple_fold_builtin_memory_op (gimple_st
>                   /* If the destination pointer is not aligned we must be 
> able
>                      to emit an unaligned store.  */
>                   && (dest_align >= GET_MODE_ALIGNMENT (TYPE_MODE (type))
> -                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)))
> +                     || !SLOW_UNALIGNED_ACCESS (TYPE_MODE (type), 
> dest_align)
> +                     || (optab_handler (movmisalign_optab, TYPE_MODE 
> (type))
> +                         != CODE_FOR_nothing)))
>                 {
>                   tree srctype = type;
>                   tree desttype = type;
> @@ -720,7 +722,10 @@ gimple_fold_builtin_memory_op (gimple_st
>                     srcmem = tem;
>                   else if (src_align < GET_MODE_ALIGNMENT (TYPE_MODE 
> (type))
>                            && SLOW_UNALIGNED_ACCESS (TYPE_MODE (type),
> -                                                    src_align))
> +                                                    src_align)
> +                          && (optab_handler (movmisalign_optab,
> +                                             TYPE_MODE (type))
> +                              == CODE_FOR_nothing))
>                     srcmem = NULL_TREE;
>                   if (srcmem)
>                     {

This, plus the backend changes to deal with unaligned HImode and SImode values, tested OK on armhf with only 2 extra failures in strlenopt-8.c. Prima facie they appear to be testisms, but it will be fun to handle this across all architecture levels for the arm target, as unaligned access depends on the architecture level.
Comment 13 Ramana Radhakrishnan 2015-10-09 10:54:03 UTC
Author: ramana
Date: Fri Oct  9 10:53:31 2015
New Revision: 228643

URL: https://gcc.gnu.org/viewcvs?rev=228643&root=gcc&view=rev
Log:
[Patch PR target/67366 1/2] [ARM] - Add movmisalignhi / si patterns


This adds movmisalignhi and movmisalignsi expanders when unaligned
access is allowed by the architecture. This allows the mid-end
to expand to misaligned loads and stores.

Compared code generated for the Linux kernel and
it changes code generation for a handful of files all for the better
basically by reducing the stack usage.

Tested by :

1. armhf bootstrap and regression test - no regressions.
2. arm-none-eabi cross build and regression test for

    {-marm/-march=armv7-a/-mfpu=vfpv3-d16/-mfloat-abi=softfp}
    {-mthumb/-march=armv8-a/-mfpu=crypto-neon-fp-armv8/-mfloat-abi=hard}
    {-marm/-mcpu=arm7tdmi/-mfloat-abi=soft}
    {-mthumb/-mcpu=arm7tdmi/-mfloat-abi=soft}

Will apply to trunk once 2/2 is approved.

regards
Ramana

2015-10-09  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

        PR target/67366
        * config/arm/arm.md (movmisalign<mode>): New.
        * config/arm/iterators.md (HSI): New.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.md
    trunk/gcc/config/arm/iterators.md
Comment 14 Ramana Radhakrishnan 2015-10-09 10:59:06 UTC
Fixed on trunk - Yann, please try it and let us know.
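For anyone wanting to check locally, a minimal sketch of how the fix can be verified (the file name and flags are one plausible combination, not the exact ones used in the testing above):

/* pr67366-check.c - build with, e.g.:
     arm-none-eabi-gcc -O2 -march=armv7-a -S pr67366-check.c
   and inspect the generated .s: with the fix, read32 should compile down to
   a single "ldr r0, [r0]" followed by "bx lr", with no stack traffic.  */
#include <string.h>
#include <stdint.h>

uint32_t read32(const void* ptr)
{
    uint32_t v;
    memcpy(&v, ptr, sizeof(v));
    return v;
}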
Comment 15 Ramana Radhakrishnan 2015-10-09 11:08:36 UTC
Author: ramana
Revision: 228644
Modified property: svn:log

Modified: svn:log at Fri Oct  9 11:08:05 2015
------------------------------------------------------------------------------
--- svn:log (original)
+++ svn:log Fri Oct  9 11:08:05 2015
@@ -1,45 +1,43 @@
-[AArch64] Handle literal pools for functions > 1 MiB in size.
-    
+[PATCH PR target/67366 2/2] [gimple-fold.c] Support movmisalign optabs in gimple-fold.c
 
-This patch fixes the issue in PR63304 where we have
-functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add
-instructions to address the literal pools under the use of a command line
-option. I would like to turn this on by default on trunk but keep this
-disabled by default for the release branches in order to get some
-serious testing for this feature while it bakes on trunk.
+This patch by Richard allows for movmisalign optabs to be supported
+in gimple-fold.c. This caused a bit of pain in the testsuite with strlenopt-8.c
+in conjunction with the ARM support for movmisalign_optabs as the test
+was coded up to do different things depending on whether the target
+supported misaligned access or not. However now with unaligned access
+being allowed for different levels of the architecture in the arm backend,
+the concept of the helper function non_strict_align mapping identically
+to the definition of STRICT_ALIGNMENT disappears.
 
-As a follow-up I would like to try and see if estimate_num_insns or
-something else can give us a heuristic to turn this on for "large" functions.
-After all the number of incidences of this are quite low in real life,
-so may be we should look to restrict this use as much as possible on the
-grounds that this code generation implies an extra integer register for
-addressing for every floating point and vector constant and I don't think
-that's great in code that already may have high register pressure.
+Adjusted thusly for ARM. The testsuite/lib changes were tested with an
+arm-none-eabi multilib that included architecture variants that did not
+support unaligned access and architecture variants that did.
 
-Tested on aarch64-none-elf with no regressions. A previous
-version was bootstrapped and regression tested.
+The testing matrix for this patch was:
 
-Applied to trunk.
+1. x86_64 bootstrap and regression test - no regressions.
+2. armhf bootstrap and regression test - no regressions.
+3. arm-none-eabi cross build and regression test for
 
-regards
-Ramana
+{-marm/-march=armv7-a/-mfpu=vfpv3-d16/-mfloat-abi=softfp}
+{-mthumb/-march=armv8-a/-mfpu=crypto-neon-fp-armv8/-mfloat-abi=hard}
+{-marm/-mcpu=arm7tdmi/-mfloat-abi=soft}
+{-mthumb/-mcpu=arm7tdmi/-mfloat-abi=soft}
 
-2015-09-14  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
+with no regressions.
 
-    	PR target/63304
-    	* config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle
-    	nopcrelative_literal_loads.
-    	(aarch64_classify_address): Likewise.
-    	(aarch64_constant_pool_reload_icode): Define.
-    	(aarch64_secondary_reload): Handle secondary reloads for
-    	literal pools.
-    	(aarch64_override_options): Handle nopcrelative_literal_loads.
-    	(aarch64_classify_symbol): Handle nopcrelative_literal_loads.
-    	* config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>):
-    	Define.
-    	(aarch64_reload_movcp<VALL:mode><P:mode>): Likewise.
-    	* config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option.
-    	* config/aarch64/predicates.md (aarch64_constant_pool_symref): New
-    	predicate.
-    	* doc/invoke.texi (mpc-relative-literal-loads): Document.
+Ok to apply ?
 
+2015-10-09  Richard Biener  <rguenth@suse.de>
+
+	PR target/67366
+	* gimple-fold.c (optabs-query.h): Include
+	(gimple_fold_builtin_memory_op): Allow unaligned stores
+	when movmisalign_optabs are available.
+
+2015-10-09  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
+
+	PR target/67366
+	* lib/target-supports.exp (check_effective_target_non_strict_align):
+	Adjust for arm*-*-*.
+	* gcc.target/arm/pr67366.c: New test.
Comment 16 Fredrik Hederstierna 2015-10-11 10:33:13 UTC
Could this fix also possibly improve:
Bug 67507 - "Code size increase with -Os from GCC 4.8.x to GCC 4.9.x for ARM thumb1", which also seems related to alignment on ARM targets?
Thanks, Fredrik
Comment 17 Ramana Radhakrishnan 2015-10-13 09:16:18 UTC
(In reply to Fredrik Hederstierna from comment #16)
> Could this fix also possibly improve:
> Bug 67507 - "Code size increase with -Os from GCC 4.8.x to GCC 4.9.x for ARM
> thumb1", which also seems related to alignment on ARM targets?
> Thanks, Fredrik

No I don't think so ...