Bug 55073 - Wrong Neon code generation at -O2 caused by -fschedule-insns
Summary: Wrong Neon code generation at -O2 caused by -fschedule-insns
Status: REOPENED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.8.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2012-10-25 12:53 UTC by Eric Batut
Modified: 2012-11-30 18:50 UTC (History)
3 users (show)

See Also:
Host:
Target: arm*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-10-25 00:00:00


Attachments
Zipfile with repro case, build script, disassembly listings and register flow analysis (4.03 KB, application/x-zip-compressed)
2012-10-25 12:53 UTC, Eric Batut
Details
Second repro case with source code, build script, assembly files and binary files (38.65 KB, application/x-gzip)
2012-11-30 16:20 UTC, Eric Batut
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Batut 2012-10-25 12:53:55 UTC
Created attachment 28528 [details]
Zipfile with repro case, build script, disassembly listings and register flow analysis

Using gcc trunk at rev 192800, compiled with the Android NDK's build-gcc.sh script (arm-linux-androideabi target).

Compiling the attached repro case at -O2 yields incorrect results. Correct results are generated for -O2 -fno-schedule-insns.

The command line to build an incorrect program is :
arm-linux-androideabi-g++ -mandroid -march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mfpu=neon -fpic -marm -O2 -fno-strict-aliasing -Wall -o repro_ko repro.cpp

The command line to build a correct program is :
arm-linux-androideabi-g++ -mandroid -march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mfpu=neon -fpic -marm -O2 -fno-schedule-insns -fno-strict-aliasing -Wall -o repro_ok repro.cpp

I am aware that the test case is quite convoluted but this is because we use some kind of "universal" 128b vector type that autoconverts to and from other Neon types (not all ARM compilers have -flax-vector-conversions). Still, both program should output the same results.

The body of the failing function is pasted below (prolog and epilog omitted):
Correct code (-O2 -fno-schedule-insns):
	vmov	d19, d20  @ v8qi
	vmov	d21, d18  @ v8qi
	vmov	d20, d19  @ v8qi
	vzip.8	d19, d18
	vzip.8	d21, d20
	vswp	d18, d19
	vswp	d20, d21
	vmov	d21, d19  @ v8qi
	vmov	d19, d20  @ v8qi
	vzip.8	d21, d20
	vzip.8	d19, d18
	vswp	d20, d21
	vswp	d18, d19
	vmovl.s8	q10, d21
	vmovl.s8	q9, d19
	vsub.i16	q9, q9, q8
	vsub.i16	q8, q10, q8
	vadd.i16	q8, q9, q8
	vst1.64	{d16-d17}, [r0:128]

Incorrect code (-O2):
	vmov	d19, d20  @ v8qi
	vmov	d22, d18  @ v8qi
	vmov	d21, d20  @ v8qi
	vzip.8	d19, d18
	vzip.8	d22, d21
	vswp	d18, d19
	vmov	d20, d22  @ v8qi
	vmov	d21, d18  @ v8qi
	vzip.8	d22, d19
	vzip.8	d21, d20
	vmovl.s8	q9, d22
	vswp	d20, d21
	vsub.i16	q9, q9, q8
	vmovl.s8	q10, d21
	vsub.i16	q8, q10, q8
	vadd.i16	q8, q9, q8
	vst1.64	{d16-d17}, [r0:128]

I have attached a build.sh script that builds the two versions (OK and KO) of the output programs. These programs need to be run on any Android ARMV7 target. This probably happens with linux builds of gcc as well.

I did some register flow tracing to give formal expressions of what ends up in the return value (well, just before the vsub/vsub/vadd actually). This is in the attached bug_gcc.txt file (which should be read with hard tabs, tab length set to 30 or something in order for the formatting to work).

I don't know if this is related to bug 54300 (which by the way is still "unconfirmed" although I confirmed it occurring even with -fno-strict-aliasing, do I need to provide more info on this one?)
Comment 1 Richard Earnshaw 2012-11-29 17:51:49 UTC
Author: rearnsha
Date: Thu Nov 29 17:51:40 2012
New Revision: 193943

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193943
Log:
	PR target/55073
	* arm/neon.md (neon_vtrn<mode>_internal): Split into expand
	and insn patterns.  Re-order insn arguments to tie inputs to
	outputs.
	(neon_vzip<mode>_internal): Likewise.
	(neon_vuzp<mode>_internal): Likewise.

	* gcc.target/arm/pr55073.C: New test.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr55073.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/neon.md
    trunk/gcc/testsuite/ChangeLog
Comment 2 Richard Earnshaw 2012-11-29 18:04:52 UTC
Fixed.
Comment 3 Eric Batut 2012-11-30 09:52:19 UTC
Hello Richard

I updated my working copy of gcc to rev 193943, rebuilt the compiler, rebuilt the testcase I originally attached to this bug report, and I am still getting different results depending on whether the -fno-schedule-insns option is used or not. Furthermore, neither of the two sets of return values I get match the ones you use in your test case for the failure detection. On what HW and with which compile options did you test this and come to these values?

I'd be glad to run more tests if you need me to.

Shall I reopen this bug?

Best Regards,
Eric
Comment 4 Richard Earnshaw 2012-11-30 09:58:38 UTC
(In reply to comment #3)
> Hello Richard
> 
> I updated my working copy of gcc to rev 193943, rebuilt the compiler, rebuilt
> the testcase I originally attached to this bug report, and I am still getting
> different results depending on whether the -fno-schedule-insns option is used
> or not. Furthermore, neither of the two sets of return values I get match the
> ones you use in your test case for the failure detection. On what HW and with
> which compile options did you test this and come to these values?
> 
> I'd be glad to run more tests if you need me to.
> 
> Shall I reopen this bug?
> 
> Best Regards,
> Eric

Well it would help if you told me what output you were expecting, rather than leaving me to try and figure it out.  I ran the testcase supplied with various optimization levels on a cortex-A15 and got identical results in all cases once the patch had been applied.
Comment 5 Eric Batut 2012-11-30 10:14:00 UTC
Since this comes from several hours of stripping down a texture generation engine to the single function that provided different results, I must admit I have no idea what the correct return values are.

What worries me more is that I still get two different set of values on a Tegra3 (Cortex-A9) after rebuilding pr55073.C with the build.sh script in the attached zipfile (and replacing the if-abort by printfs) :

root@android:/data # ./repro_ko
./repro_ko
[0] = 00000002
[1] = 00020000
[2] = FFFBFFFB
[3] = FFFBFFFB
root@android:/data # ./repro_ok
./repro_ok
[0] = 00030003
[1] = 00030003
[2] = FFFAFFFA
[3] = FFFAFFFA

Were you directly targeting A15 when building the testcase? Can this enable/disable some optimization codepaths that would explain why we have different results ?
Comment 6 Eric Batut 2012-11-30 11:05:18 UTC
Building the test case at O1 (which I tend to trust slightly more than O2 in the present case) gives the same set of values than the previous "OK" case :

root@android:/data # ./repro_O1
./repro_O1
[0] = 00030003
[1] = 00030003
[2] = FFFAFFFA
[3] = FFFAFFFA

I hereby declare these values to be the reference values.
Comment 7 Eric Batut 2012-11-30 13:21:13 UTC
Richard,

I apologize, building at -O0 (and handrolling an assembly routine to do the same computation) proves me wrong : your values are the correct ones, and -O1 is also broken.

The reference values are indeed
[0] = FFFFFFFF
[1] = FFFFFFFF
[2] = FFFCFFFC
[3] = FFFCFFFC

And I still have no idea why  my build of your patch does not produce these results on my HW. Could you please attach a binary build of the repro case so that I can test it on my HW? In the meantime I'll keep looking.

Best Regards,
Eric
Comment 8 Richard Earnshaw 2012-11-30 14:00:21 UTC
(In reply to comment #7)
> Richard,
> 
> I apologize, building at -O0 (and handrolling an assembly routine to do the
> same computation) proves me wrong : your values are the correct ones, and -O1
> is also broken.
> 
> The reference values are indeed
> [0] = FFFFFFFF
> [1] = FFFFFFFF
> [2] = FFFCFFFC
> [3] = FFFCFFFC
> 
> And I still have no idea why  my build of your patch does not produce these
> results on my HW. Could you please attach a binary build of the repro case so
> that I can test it on my HW? In the meantime I'll keep looking.
> 
> Best Regards,
> Eric

It sounds pretty obvious, but can you double check that you are picking up the new compiler and not somehow getting an old one.

Try building to assembly and checking that the GCC version information is correct.
Comment 9 Eric Batut 2012-11-30 14:29:11 UTC
Richard, 

I double-checked (update + rebuild), the end of my assembly files correctly states :
.ident	"GCC: (GNU) 4.8.0 20121130 (experimental)"

Since -O1 is also broken on my end, I tried to isolate the option that would fix -O1. It turns out that "-O1" and "-O1 -fno-dse" give identical function bodies, only the epilog differs:
 - "-O1" gives
	vmovl.s8	q9, d19 <= d19 (wrong)
	vsub.i16	q9, q9, q8
	vmovl.s8	q10, d21 <= d21 (wrong)
	vsub.i16	q8, q10, q8
	vadd.i16	q8, q9, q8
	vst1.64	{d16-d17}, [r0:128]
 - "-O1 -fno-dse" gives
	vmovl.s8	q9, d18 <= d18 (correct) instead of d19 (wrong)
	vsub.i16	q9, q9, q8
	vmovl.s8	q10, d20 <= d20 (correct) instead of d21 (wrong)
	vsub.i16	q8, q10, q8
	vadd.i16	q8, q9, q8
	vst1.64	{d16-d17}, [r0:128]

The function body above the previous code snippets is the same for both builds. The only difference is the widening of d19 and d21 in the wrong case, and of d18 and d20 in the correct case.

The compiler I am using to build arm-linux-androideabi-gcc is an Apple build of gcc 4.2.1 :

~/android-ndk-r8b: gcc -v
Using built-in specs.
Target: i686-apple-darwin11
Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.1~22/src/configure --disable-checking --enable-werror --prefix=/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.1~22/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1
Thread model: posix
gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)

Do you think rebuilding arm-linux-androideabi-gcc on Linux to check if the generated code is the same is worth the time or is there no chance whatsoever that it can make a difference ?
Comment 10 Richard Earnshaw 2012-11-30 14:40:07 UTC
(In reply to comment #9)
> Do you think rebuilding arm-linux-androideabi-gcc on Linux to check if the
> generated code is the same is worth the time or is there no chance whatsoever
> that it can make a difference ?

Well if you're building a cross compiler, there is always the possibility that your host compiler has mis-compiled GCC.  It would be worth double-checking by building your compiler on a different host.
Comment 11 Richard Earnshaw 2012-11-30 14:55:25 UTC
Something else to check is that you are using the version of arm_neon.h that comes with gcc-4.8.  This file has to match the version of GCC it was designed for.
Comment 12 Eric Batut 2012-11-30 15:16:47 UTC
(In reply to comment #11)
> Something else to check is that you are using the version of arm_neon.h that
> comes with gcc-4.8.  This file has to match the version of GCC it was designed
> for.


The arm_neon.h file is properly copied to the right place bu the build script, and inserting a #error in there did cause my build to fail, so I think I have the right one.

I am setting up my Linux VM to rebuild arm-linux-androideabi-gcc to check if it behaves the same as the Mac-built version does.

Thanks a lot for your help in sorting this out.

Best Regards,
Eric
Comment 13 Eric Batut 2012-11-30 16:16:36 UTC
Richard,

After a clean checkout of gcc's trunk and of the Android NDK r8b package and tools, I rebuilt arm-linux-androideabi-gcc on a Ubuntu VM using gcc 4.5.1. I then rebuilt my testcase with "-O1" and "-O1 -fno-dse", and the same difference is there: d19 and d21 are used as sources for the two vmovl.s8 instead of d18 and d20.

I attach a new tarball with the (very slightly) modified source I am using, the two assembly files that are generated, and the two binary files (they should run on any Android device, no fancy stuff here). Could you please use your local build of gcc to generate the same assembly files so that we can compare the function bodies?

Best Regards,
Eric
Comment 14 Eric Batut 2012-11-30 16:20:10 UTC
Created attachment 28840 [details]
Second repro case with source code, build script, assembly files and binary files
Comment 15 Richard Earnshaw 2012-11-30 17:29:01 UTC
OK, there's definitely something fishy going on here.  I'm re-opening this for now until I can look into it properly.
Comment 16 Richard Earnshaw 2012-11-30 18:50:11 UTC
I need to do some more digging, but it looks like it is the cprop_hardreg pass that is really going wrong.  My suspicion is that the pass is seeing

  vswp  D18, D19 // D19 is unused afterwards
  use   D18

and trying to remove the vswp operation entirely, by substituting D19 in for D18, to give:

 use D19

When doing this it really needs to delete the insn, but it fails to do so.  The result is that the code is now incorrect.

Bernd, I think you were the last to touch this, can you shed any further light on the issue?