Bug 48941 - [arm gcc] NEON: Stack pointer operations performed even tho stack is not accessed at all in function.
Summary: [arm gcc] NEON: Stack pointer operations performed even tho stack is not acce...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.5.0
: P3 normal
Target Milestone: 4.8.0
Assignee: Ramana Radhakrishnan
URL:
Keywords: missed-optimization
Depends on:
Blocks: 47562
  Show dependency treegraph
 
Reported: 2011-05-10 03:10 UTC by julien
Modified: 2012-07-05 17:01 UTC (History)
6 users (show)

See Also:
Host:
Target: arm-elf, arm-eabi
Build: x86_64-apple-darwin10
Known to work:
Known to fail:
Last reconfirmed: 2011-05-12 07:07:51


Attachments
Source showcasing the problem (496 bytes, text/plain)
2011-05-10 03:10 UTC, julien
Details
Proposed patch (5.15 KB, patch)
2011-05-12 08:46 UTC, rsandifo@gcc.gnu.org
Details | Diff
Rebased patch. (6.15 KB, patch)
2012-05-02 06:51 UTC, Ramana Radhakrishnan
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description julien 2011-05-10 03:10:27 UTC
Created attachment 24218 [details]
Source showcasing the problem

Using some float32x4x2_t temporaries, some stack space is allocated even though the temporaries are made registers and stack never gets accessed inside the function.

See attachment for C source and a corresponding assembly, produced with:

arm-elf-gcc-4.5 -O3 -march=armv7-a -mthumb -mfpu=neon -mfloat-abi=softfp -S -o - test.c | grep -vE '^[[:space:]]*[\.@].*$' (the grep is just there to remove lines of comments and directives)

The problem also happens in C++.

$ arm-elf-gcc-4.5 --version -v
Using built-in specs.
COLLECT_GCC=arm-elf-gcc-4.5
COLLECT_LTO_WRAPPER=/opt/local/libexec/gcc/arm-elf/4.5.0/lto-wrapper
arm-elf-gcc-4.5 (GCC) 4.5.0
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


Target: arm-elf
Configured with: /opt/local/var/macports/build/_opt_local_var_macports_sources_rsync.macports.org_release_ports_cross_arm-elf-gcc/work/gcc-4.5.0/configure --prefix=/opt/local --infodir=/opt/local/share/info --mandir=/opt/local/share/man --target=arm-elf --program-prefix=arm-elf- --program-suffix=-4.5 --without-included-gettext --enable-obsolete --with-newlib --disable-__cxa_atexit --enable-multilib --enable-biendian --disable-libgfortran --with-gxx-include-dir=/opt/local/arm-elf/include/c++/4.5.0/ --enable-languages=c,c++,objc --build=x86_64-apple-darwin10 --enable-fpu
Thread model: single
gcc version 4.5.0 (GCC) 
COLLECT_GCC_OPTIONS='-fversion' '-v'
 /opt/local/libexec/gcc/arm-elf/4.5.0/cc1 -quiet -v -D__USES_INITFINI__ help-dummy -quiet -dumpbase help-dummy -auxbase help-dummy -version -fversion -o /var/folders/Gn/GnNf6VbPEc4MTtfs4l39zU+++TI/-Tmp-//cc1shhZd.s
GNU C (GCC) version 4.5.0 (arm-elf)
	compiled by GNU C version 4.2.1 (Apple Inc. build 5666) (dot 3), GMP version 5.0.1, MPFR version 3.0.0-p8, MPC version 0.8.2
warning: MPFR header version 3.0.0-p8 differs from library version 3.0.1-p3.
warning: MPC header version 0.8.2 differs from library version 0.9.
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
COLLECT_GCC_OPTIONS='-fversion' '-v'
 /opt/local/lib/gcc/arm-elf/4.5.0/../../../../arm-elf/bin/as --version -o /var/folders/Gn/GnNf6VbPEc4MTtfs4l39zU+++TI/-Tmp-//ccuztVa3.o /var/folders/Gn/GnNf6VbPEc4MTtfs4l39zU+++TI/-Tmp-//cc1shhZd.s
GNU assembler (Linux/GNU Binutils) 2.20.51.0.9.20100526
Copyright 2010 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `arm-elf'.
COMPILER_PATH=/opt/local/libexec/gcc/arm-elf/4.5.0/:/opt/local/libexec/gcc/arm-elf/4.5.0/:/opt/local/libexec/gcc/arm-elf/:/opt/local/lib/gcc/arm-elf/4.5.0/:/opt/local/lib/gcc/arm-elf/:/opt/local/lib/gcc/arm-elf/4.5.0/../../../../arm-elf/bin/
LIBRARY_PATH=/opt/local/lib/gcc/arm-elf/4.5.0/:/opt/local/lib/gcc/arm-elf/4.5.0/../../../../arm-elf/lib/
COLLECT_GCC_OPTIONS='-fversion' '-v'
 /opt/local/libexec/gcc/arm-elf/4.5.0/collect2 -X --version /opt/local/lib/gcc/arm-elf/4.5.0/crti.o /opt/local/lib/gcc/arm-elf/4.5.0/crtbegin.o /opt/local/lib/gcc/arm-elf/4.5.0/../../../../arm-elf/lib/crt0.o -L/opt/local/lib/gcc/arm-elf/4.5.0 -L/opt/local/lib/gcc/arm-elf/4.5.0/../../../../arm-elf/lib /var/folders/Gn/GnNf6VbPEc4MTtfs4l39zU+++TI/-Tmp-//ccuztVa3.o --start-group -lgcc -lc --end-group /opt/local/lib/gcc/arm-elf/4.5.0/crtend.o /opt/local/lib/gcc/arm-elf/4.5.0/crtn.o
GNU ld (Linux/GNU Binutils) 2.20.51.0.9.20100526
Copyright 2010 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or (at your option) a later version.
This program has absolutely no warranty.
Comment 1 Ramana Radhakrishnan 2011-05-12 07:07:51 UTC
Occurs with trunk as well. 

This is what I see with "GCC: (GNU) 4.7.0 20110413 (experimental) [trunk revision 172363

	str	fp, [sp, #-4]!
	add	fp, sp, #0
	sub	sp, sp, #20
	vldmia	r0, {d20-d21}
	vmov	q11, q10  @ v4sf
	sub	sp, sp, #48
	add	r3, sp, #15
	bic	r3, r3, #15
	vzip.32	q10, q11
	vstr	d22, [r3, #16]
	vstr	d23, [r3, #24]
	vstmia	r3, {d20-d21}
	vldmia	r1, {d18-d19}
	vmov	q8, q9  @ v4sf
	vmov	d24, d21
	vzip.32	q9, q8
	vstr	d16, [r3, #16]
	vstr	d17, [r3, #24]
	vstmia	r3, {d18-d19}
	vmov	d26, d22
	vmov	d25, d16
	vmov	d23, d18
	vmul.f32	d16, d24, d25
	vmul.f32	d22, d26, d23
	vmov	d18, d19
	vmul.f32	d17, d20, d18
	vmls.f32	d16, d26, d18
	vmls.f32	d22, d20, d25
	vmls.f32	d17, d24, d23
	vuzp.32	d16, d22
	vmov	d18, d16
	vmov	d19, d17
	vmov	r0, r1, d18  @ v4sf
	vmov	r2, r3, d19
	add	sp, fp, #0
	ldmfd	sp!, {fp}
	bx	lr
Comment 2 rsandifo@gcc.gnu.org 2011-05-12 07:18:13 UTC
The problem seems to be that arm_neon.h implementation
of vzip* is returning the result by reference rather
than by value.
Comment 3 rsandifo@gcc.gnu.org 2011-05-12 08:46:54 UTC
Created attachment 24234 [details]
Proposed patch

The attached patch seems to fix the testcase and doesn't
regress neon.exp.  I'll test it fully next week.
We still generate more moves than necessary, but that's
a separate problem.
Comment 4 Richard Earnshaw 2011-05-16 08:13:04 UTC
(In reply to comment #3)
> Created attachment 24234 [details]
> Proposed patch
> 
> The attached patch seems to fix the testcase and doesn't
> regress neon.exp.  I'll test it fully next week.
> We still generate more moves than necessary, but that's
> a separate problem.

I think we should try to preserve the existing internal API, so that if someone manages to pick up an older version of arm_neon.h they won't get bizarre errors out of the compiler.
Comment 5 rsandifo@gcc.gnu.org 2011-06-02 13:40:28 UTC
(In reply to comment #4)
> (In reply to comment #3)
> > Created attachment 24234 [details]
> > Proposed patch
> > 
> > The attached patch seems to fix the testcase and doesn't
> > regress neon.exp.  I'll test it fully next week.
> > We still generate more moves than necessary, but that's
> > a separate problem.
> 
> I think we should try to preserve the existing internal API, so that if someone
> manages to pick up an older version of arm_neon.h they won't get bizarre errors
> out of the compiler.

That shouldn't be such a big issue though.  It's relatively
common for changes in GCC behaviour (such as extra front-end
strictness) to need fixincludes to be used on some older
headers.  If you manage to pull in the unfixed versions,
you'll get strange errors.  And this certainly wouldn't
be the only case in which GCC needs the right version of
its own headers to be used.

How strongly do you object?  I think the benefits are
worth any compatibility drawbacks in this case.
Comment 6 Richard Earnshaw 2011-11-24 11:00:46 UTC
(In reply to comment #5)
> How strongly do you object?  I think the benefits are
> worth any compatibility drawbacks in this case.

It would be a nice to have, but I'm not going to lose any sleep over it.
Comment 7 rsandifo@gcc.gnu.org 2012-01-18 11:54:27 UTC
No longer working on this.
Comment 8 Eric Batut 2012-01-27 14:11:34 UTC
Any chance of seeing the work on this restart ?

I found this bug while looking for something that would help (I raised bug 51980 for the same kind of issue, still seen on trunk), but the patch attached to this bug does not solve the issue for code that is rich with zip/uzp/trn intrinsics.

This is a major limitation of arm-gcc with respect to performance-critical Neon code in my opinion.
Comment 9 Ramana Radhakrishnan 2012-01-27 16:20:07 UTC
(In reply to comment #8)
> Any chance of seeing the work on this restart ?
> 
> I found this bug while looking for something that would help (I raised bug
> 51980 for the same kind of issue, still seen on trunk), but the patch attached
> to this bug does not solve the issue for code that is rich with zip/uzp/trn
> intrinsics.

I took a look at this for sometime when I was reviewing the patch submitted on trunk. The problem in this case appears to go away with -fno-split-wide-types but that in general is not a good idea. IIRC when RichardS and I talked about it we did talk about maybe getting lower-subreg to pay some attention to it. 

Neon intrinsics have been improving ( I'd like to think) over time but they are still not perfect unfortunately. I don't have time to look at this in the near term myself. 

> 
> This is a major limitation of arm-gcc with respect to performance-critical Neon
> code in my opinion.

As they say, patches are welcome :) 

Ramana
Comment 10 Ramana Radhakrishnan 2012-05-02 06:51:12 UTC
Created attachment 27282 [details]
Rebased patch.


This patch originally by Richard Sandiford fixed PR target/48941 but the problem was that lower-subreg was being a bit too aggressive with lowering subreg moves in this case causing unnecessary spills to the stack. As shown in my mail in the thread on lower-subreg.c the code generated is far better now with this patch in play . I've rebased to trunk, double checked that the generated files from the ml description match up and added a simple test that makes sure that the original testcase from the PR doesn't emit a vector store instruction ! That should be enough to catch all the cases that we worry about. There are still a few vmov's between Vector registers but I suspect that is because of the vcombine at the end for which RichardE might have something in flight. For the record, this patch is not directly applicable to earlier release branches as it depends on the new behaviour of lower-subreg.c which is why this bug is only targeted at 4.8.0
Comment 11 Ramana Radhakrishnan 2012-05-22 14:16:34 UTC
> There are still a few vmov's between Vector registers but I suspect that is
> because of the vcombine at the end for which RichardE might have something in
> flight. 

This is probably not true and needs some more investigation.
Comment 12 Ramana Radhakrishnan 2012-07-05 16:56:23 UTC
Author: ramana
Date: Thu Jul  5 16:56:15 2012
New Revision: 189295

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189295
Log:

Correct bug number to PR target/48941 

First part of the fix .

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
Comment 13 Ramana Radhakrishnan 2012-07-05 17:01:39 UTC
(In reply to comment #12)
fixed by 

http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189294