Bug 65932

Summary: [5 Regression] Linux-3.10.75 on arm926ej-s does not boot due to wrong code generation
Product: gcc Reporter: Christian Eggers <christian.eggers>
Component: rtl-optimizationAssignee: Not yet assigned to anyone <unassigned>
Status: RESOLVED FIXED    
Severity: critical CC: ceggers, jakub, pageexec, ramana, wilson
Priority: P2 Keywords: wrong-code
Version: 5.1.1   
Target Milestone: 5.4   
Host: Target: arm-arm926ejs-linux-gnueabi
Build: Known to work:
Known to fail: Last reconfirmed: 2015-04-30 00:00:00
Attachments: Preprocessed source
Code compiled with r212399
Code compiled with r212400
Experimental patch to force out-of-ssa to use promoted subregs for phi nodes.
A possibly better patch, to modify ARM port to stop changing signed HI/QI to unsigned.

Description Christian Eggers 2015-04-29 15:59:49 UTC
Created attachment 35424 [details]
Preprocessed source

gcc version: 5.1.0 / 5.1.1 (current git snapshot)

host : x86_64_suse_linux (13.1)

target: arm-arm926ejs-linux-gnueabi

Configured with: /home/eggers/Projekte/cds/repos/software/tools/toolchain/src/gcc-gcc-5-branch-1e71966/configure --prefix=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/arm-arm926ejs-linux-gnueabi/arm-arm926ejs-linux-gnueabi --with-pkgversion='Kathrein CDS 15278' --target=arm-arm926ejs-linux-gnueabi --disable-multilib --mandir=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/arm-arm926ejs-linux-gnueabi/arm-arm926ejs-linux-gnueabi/share/man --infodir=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/arm-arm926ejs-linux-gnueabi/arm-arm926ejs-linux-gnueabi/share/info --with-gmp=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/install --with-mpfr=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/install --with-mpc=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/install --with-isl=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/install --with-sysroot=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/arm-arm926ejs-linux-gnueabi/arm-arm926ejs-linux-gnueabi/sys-root --disable-nls --with-slibdir=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/arm-arm926ejs-linux-gnueabi/arm-arm926ejs-linux-gnueabi/sys-root/lib --with-gxx-include-dir=/home/eggers/Projekte/cds/repos/software/tools/toolchain/build/arm-arm926ejs-linux-gnueabi/arm-arm926ejs-linux-gnueabi/sys-root/usr/include/c++/gcc-5-branch-1e71966 --enable-shared --enable-__cxa_atexit --enable-c99 --enable-threads=posix --disable-install-libiberty --with-system-zlib --with-cpu=arm926ej-s --with-float=soft --with-mode=arm --enable-languages=c,c++ --enable-long-long

command line: arm-arm926ejs-linux-gnueabi-gcc -Wp,-MD,lib/.vsprintf.o.d  -nostdinc -isystem /home/eggers/Projekte/cds/repos/software/tools/toolchain/build/arm-arm926ejs-linux-gnueabi/arm-arm926ejs-linux-gnueabi/lib64/gcc/arm-arm926ejs-linux-gnueabi/5.1.0/include -I/home/eggers/Projekte/cds/repos/software/cts/common/platform_linux/external/kernel/linux/linux-3.10/arch/arm/include -Iarch/arm/include/generated  -I/home/eggers/Projekte/cds/repos/software/cts/common/platform_linux/external/kernel/linux/linux-3.10/include -Iinclude -I/home/eggers/Projekte/cds/repos/software/cts/common/platform_linux/external/kernel/linux/linux-3.10/arch/arm/include/uapi -Iarch/arm/include/generated/uapi -I/home/eggers/Projekte/cds/repos/software/cts/common/platform_linux/external/kernel/linux/linux-3.10/include/uapi -Iinclude/generated/uapi -include /home/eggers/Projekte/cds/repos/software/cts/common/platform_linux/external/kernel/linux/linux-3.10/include/linux/kconfig.h  -I/home/eggers/Projekte/cds/repos/software/cts/common/platform_linux/external/kernel/linux/linux-3.10/lib -Ilib -D__KERNEL__ -mlittle-endian   -I/home/eggers/Projekte/cds/repos/software/cts/common/platform_linux/external/kernel/linux/linux-3.10/arch/arm/mach-at91/include -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -Os -Wno-maybe-uninitialized -fno-dwarf2-cfi-asm -mabi=aapcs-linux -mno-thumb-interwork -funwind-tables -marm -D__LINUX_ARM_ARCH__=5 -march=armv5te -mtune=arm9tdmi -msoft-float -Uarm -Wframe-larger-than=1024 -fno-stack-protector -Wno-unused-but-set-variable -fomit-frame-pointer -fno-var-tracking-assignments -g -fno-inline-functions-called-once -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fconserve-stack -DCC_HAVE_ASM_GOTO    -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(vsprintf)"  -D"KBUILD_MODNAME=KBUILD_STR(vsprintf)" -c -o lib/vsprintf.o /home/eggers/Projekte/cds/repos/software/cts/common/platform_linux/external/kernel/linux/linux-3.10/lib/vsprintf.c

linux-3.10.75 compiled with gcc-4.9.2  --> running
same kernel compiled with gcc-5.1.0 / gcc-5.1.1 (git snapshot) --> NOT running

Observations

Function string() (part of vsnprintf()) in lib/vsprintf.c is not compiled correctly:

     f5c:       e1dd51f8        ldrsh   r5, [sp, #24]

[...] 
# r5 = 0xFFFF FFFF
     f70:       e2453001        sub     r3, r5, #1
# r3 = 0xFFFF FFFE
     f74:       e1a03803        lsl     r3, r3, #16
# r3 = 0xFFFE 0000
                while (len < spec.field_width--) {
     f78:       e1500005        cmp     r0, r5
# r0 = 8
# r5 = 0xFFFF FFFF  --> correct
     f7c:       e1a03823        lsr     r3, r3, #16
# r3 = 0x0000 FFFE
     f80:       aa000005        bge     f9c <string.isra.5+0x64>
# branch taken --> correct

[...]
     f9c:       e1a05003        mov     r5, r3
# r5 = r3 = 0x0000 FFFE

[...]
        while (len < spec.field_width--) {
     fdc:       e1500005        cmp     r0, r5
# r0 (len) = 8
# r5 (spec.field_width) is still 0x0000 FFFE  --> not ok
# spec.field_width should be 0xFFFF FFFE at this point

     fe0:       e1a03823        lsr     r3, r3, #16
     fe4:       aa000005        bge     1000 <string.isra.5+0xc8>
# branch NOT taken --> wrong
Comment 1 Richard Earnshaw 2015-04-30 10:10:46 UTC
The type of field_width is a signed short.  On that basis, the use of LSR rather than ASR looks highly suspicious.
Comment 2 Richard Earnshaw 2015-04-30 15:21:54 UTC
The problem seems to be creeping in during the initial expand pass.  I can't be sure, but my suspicions are leaning towards this being some artifact of the ipa-sra decomposition of the 'spec' argument.  code built with -fno-ipa-sra appears to be ok on a visual inspection.
Comment 3 Christian Eggers 2015-04-30 17:28:19 UTC
(In reply to Richard Earnshaw from comment #2)
> code built with -fno-ipa-sra appears to be ok on a visual inspection.

Unfortunately I'll be on vacation until Tuesday, 7th. Then I can try to build to whole kernel with this flag.
Comment 4 Jakub Jelinek 2015-04-30 19:34:13 UTC
If -fno-ipa-sra cures this, I wonder if it hasn't started with r221348.
Some issues related to that were fixed with r221795, but some clearly haven't - e.g. I believe profiledbootstrap is still broken on arm and today I'm reducing another runtime failure that regressed with r221348 on armv7hl and r221795 doesn't fix that.  -fno-ipa-sra cures that as well.
Comment 5 Christian Eggers 2015-05-05 08:51:14 UTC
(In reply to Richard Earnshaw from comment #2)
> code built with
> -fno-ipa-sra appears to be ok on a visual inspection.

linux-3.10.25 boots fine if globally compiled with additional -fno-ipa-sra

The affected linux system is a arm926ej-s. Is this bug also relevant for my Cortex-M boards?
Comment 6 Dr. Uwe Meyer-Gruhl 2015-06-05 11:32:00 UTC
Just a feedback for another target platform (arm-none-eabi): I also had problems with a kernel 4.0.4 on an iConnect (Kirkwood variant) when I used gcc 5.1.0.

Using KCFLAGS='-fno-ipa-sra' helped immediately. I did not look into the generated code, though, but I assume this is the same problem.
Comment 7 Aaro Koskinen 2015-06-05 18:07:04 UTC
The issue is also present when compiling kernels for OMAP1/2 platforms. Also looks like -Os is needed to trigger this. Kernels compiled with -O2 seem to work.
Comment 8 gccbugs 2015-06-08 02:41:14 UTC
Can confirm this on the 5-20150519 snapshot. Here's a self-contained snippet demonstrates this problem (from vsprintf.c):

https://gcc.gnu.org/bugzilla/attachment.cgi?id=35613
Comment 9 Mikhail Maltsev 2015-06-08 09:31:51 UTC
Created attachment 35714 [details]
Code compiled with r212399
Comment 10 Mikhail Maltsev 2015-06-08 09:35:09 UTC
Created attachment 35715 [details]
Code compiled with r212400

There is a change from arithmetic shift to logical shift.
Comment 11 Jim Wilson 2015-06-12 22:35:00 UTC
There is another testcase in 66271.

The LSR is emitted because the arm port has PROMOTE_MODE defined to make chars and shorts unsigned when extended to int.  So the problem seems to be why it wasn't re-sign-extended before the compare.  Maybe an optimization pass incorrectly deleted that sign-extend.
Comment 12 Jim Wilson 2015-06-13 00:58:20 UTC
Part of the problem here is that the arm port defines PROMOTE_MODE to force short local variables to unsigned, but arm_promote_function_mode doesn't change signedness.  So a signed short parameter and a signed short local have different representations in a register.

Another part of the problem is that the eipa_sra pass is creating phi nodes that mix parameters and local variables.  Copying a signed short local into a signed short parameter requires a conversion, even though they have the same type, because they get different in register promotion.

Another part of the problem is that the out-of-ssa pass when evaluating phi nodes doesn't go through expand_expr, and hence doesn't handle mode promotion.  The code is copying DECL_RTL directly into SA.partition_to_pseudo, and then using SA.partition_to_pseudo when emitting the copy instructions.  Thus these phi nodes won't emit any conversion instructions even though one is apparently required.

I can get correct code if I pass DECL_RTL through expand_expr before storing it into SA.partition_to_pseudo for VAR_DECLs.  However, if I then do the same thing for PARM_DECLS, then I get a movhi insn instead of a conversion, because both the source and dest are promoted subregs.  I think the problem here is that emit_partition_copy doesn't handle promoted subregs.  I think we need something like what store_expr_with_bounds does, when the dest is a promoted subreg, to do the copy to the smaller mode, and then convert to the wider mode.  But I have a few too many suppositions at this point and am not sure if this makes sense.

It might be simpler to change the arm port so that promote_mode and function_promote_mode match.
Comment 13 Jim Wilson 2015-06-13 01:07:51 UTC
Created attachment 35774 [details]
Experimental patch to force out-of-ssa to use promoted subregs for phi nodes.
Comment 14 Jim Wilson 2015-06-14 00:34:20 UTC
Created attachment 35775 [details]
A possibly better patch, to modify ARM port to stop changing signed HI/QI to unsigned.

This would require performance testing to see what the effect is.  This is a simpler patch than trying to change the out-of-ssa pass.
Comment 15 Christian Eggers 2015-06-16 15:03:55 UTC
(In reply to Jim Wilson from comment #14)
> Created attachment 35775 [details]
> A possibly better patch, to modify ARM port to stop changing signed HI/QI to
> unsigned.
> 
> This would require performance testing to see what the effect is.  This is a
> simpler patch than trying to change the out-of-ssa pass.

linux-3.10.75 boots successful when compiled by a gcc built with this patch. Shall I also test the other patch?
Comment 16 jim.wilson 2015-06-16 15:35:55 UTC
On Tue, Jun 16, 2015 at 8:03 AM, christian.eggers at kathrein dot de
<gcc-bugzilla@gcc.gnu.org> wrote:
> Shall I also test the other patch?

The out-of-ssa patch is unfinished, and won't work without more changes.

Jim
Comment 17 Jim Wilson 2015-06-23 21:27:52 UTC
I've been researching how the ARM PROMOTE_MODE reached its current state.  There are a number of issues here.

1) There is a comment that says zero extension is faster for char than signed extension.  This is only true before the [su]xtb instructions were added, i.e. !arm_arch6.  Before these instructions, zero extend can be done with an and immediate, but sign extension requires two shifts.  So there is no benefit to zero extension for current architectures.

2) The comment doesn't apply to shorts, which require two shifts for both zero and sign extension when !arm_arch6.  The signedness was changed for shorts originally because some processors did not support unaligned accesses, and had to loads shorts with two load byte instructions.  This required 3 insns for zero extend and 4 for sign extend, so there was an advantage to zero extension, but only if the processor did not support unaligned accesses.  Unfortunately, the support was flawed, and did
        UNSIGNEDP = TARGET_MMU_TRAPS != 0;      \
which forces it to unsigned for MMU_TRAPS and signed for the default case.  It should have been something like
        if (TARGET_MMU_TRAPS) UNSIGNEDP = 1; \
When the TARGET_MMU_TRAPS feature was removed, this code was changed to set UNSIGNEDP to 0 unconditionally, which compounded the error.  This code should have been removed when the TARGET_MMU_TRAPS feature was removed.

3) Originally PROMOTE_FUNCTION_ARGS was a boolean that indicated whether PROMOTE_MODE applied to function arguments or not.  When it was changed to a hook, it was made into a function.  So now we have two sets of similar code, one in PROMOTE_MODE and one in arm_promote_function_mode (TARGET_PROMOTE_FUNCTION_MODE).  The two sets of code should be identical, but the unsigned-ness changing was left out of the arm_promote_function_mode code.  The result is that we have two different in-register representations for signed char/short symbols, one form for parameters and one form for local variables.  This inconsistency is confusing the out-of-ssa pass.  Since changing arm_promote_function_mode is effectively an ABI change, it is safer to change PROMOTE_MODE to match.

4) There is a general principle that you should not change signedness in PROMOTE_MODE unless the hardware forces you to do so.  For instance, for the MIPS64 target, the hardware requires that all 32-bit values in 64-bit registers be sign-extended, and 32-bit instructions may trap if this is not true.  Thus the mips port forces all 32-bit values to be signed.  But the arm architecture has no similar constraint, and hence we should not be changing signedness here.  Changing signedness results in less efficient code, as can be seen in this testcase, where the compiler must emit both a zero-extend and a sign-extend in the loop.  The sign-extend is required to get correct results for a compare, and the zero-extend is required because PROMOTE_MODE requires it.
Comment 18 Jim Wilson 2015-06-23 22:00:42 UTC
Ultimately, I believe that this is an ARM backend bug.  PROMOTE_MODE and TARGET_PROMOTE_FUNCTION_MODE should not behave differently.  It would help if the PROMOTE_MODE macro was merged with the TARGET_PROMOTE_FUNCTION_MODE hook, to avoid accidents like this.

I've tested my patch to modify PROMOTE_MODE so that it no longer sets UNSIGNEDP for char and short.

For the SPEC CPU2000 benchmarks, individual benchmarks are within 1% which is within the noise range, and the full benchmark results have almost identical performance.

I get 3 additional failures in the gcc testsuite, for gcc.target/arm/wmul-[123].c.  These are testcases to verify generation of the smulbb and smlabb instruction.  However, they only work currently because of the extra sign-extends emitted by PROMOTE_MODE.  We currently emit an unsigned short load, a sign-extend, and a multiply.  The sign-extend gets merged into the multiply.  But with the patch, we emit a signed short load and a multiply, and hence can't form smulbb.  Unpatched, for wmul-1.c we get
	ldrh	r1, [r4, #2]!
	ldrh	r6, [r0, #2]!
	smlabb	r5, r1, r6, r5
	smlabb	r2, r1, r1, r2
and patched we get
	ldrsh	r1, [r4, #2]!
	ldrsh	r6, [r0, #2]!
	mla	r5, r1, r6, r5
	mla	r2, r1, r1, r2
wmul-2.c is similar.  There is a bigger difference with wmul-3.c.  Unpatched is
	ldrh	r1, [r5, #2]!
	ldrh	r4, [r0, #2]!
	smulbb	r4, r1, r4
	subs	r6, r6, r4
	smulbb	r1, r1, r1
	subs	r2, r2, r1
whereas patched is
	ldrsh	r1, [r4, #2]!
	ldrsh	r6, [r0, #2]!
	mls	r5, r1, r6, r5
	mls	r2, r1, r1, r2
The patched code is better or equivalent to the unpatched code in all cases, but these testcases no longer serve their purpose.  I can fix wmul-1.c by changing types to int and casting to signed short.  This doesn't work for wmul-2.c because the scalar sign-extend is moved out of the loop, and no longer available to merge with the multiple.  This also doesn't work for wmul-3.c, but only because it is cse'd differently.  I get unpatched
	smulbb	r4, r1, r4
	subs	r6, r6, r4
and patched
	sxth	r4, r4
	mls	r6, r1, r4, r6

At the moment the only option I have to make wmul-2.c and wmul-3.c work is to remove them
Comment 19 Richard Biener 2015-07-16 09:11:45 UTC
GCC 5.2 is being released, adjusting target milestone to 5.3.
Comment 20 Jim Wilson 2015-09-02 16:12:28 UTC
*** Bug 66271 has been marked as a duplicate of this bug. ***
Comment 21 Ramana Radhakrishnan 2015-09-09 09:06:16 UTC
(In reply to Jim Wilson from comment #18)
> Ultimately, I believe that this is an ARM backend bug.  PROMOTE_MODE and
> TARGET_PROMOTE_FUNCTION_MODE should not behave differently.  It would help
> if the PROMOTE_MODE macro was merged with the TARGET_PROMOTE_FUNCTION_MODE
> hook, to avoid accidents like this.
> 
> I've tested my patch to modify PROMOTE_MODE so that it no longer sets
> UNSIGNEDP for char and short.
> 
> For the SPEC CPU2000 benchmarks, individual benchmarks are within 1% which
> is within the noise range, and the full benchmark results have almost
> identical performance.

This is for armv7-a ? Thumb2 / ARM state ? 

> 
> I get 3 additional failures in the gcc testsuite, for
> gcc.target/arm/wmul-[123].c.  These are testcases to verify generation of
> the smulbb and smlabb instruction.  However, they only work currently
> because of the extra sign-extends emitted by PROMOTE_MODE.  We currently
> emit an unsigned short load, a sign-extend, and a multiply.  The sign-extend
> gets merged into the multiply.  But with the patch, we emit a signed short
> load and a multiply, and hence can't form smulbb.  Unpatched, for wmul-1.c
> we get
> 	ldrh	r1, [r4, #2]!
> 	ldrh	r6, [r0, #2]!
> 	smlabb	r5, r1, r6, r5
> 	smlabb	r2, r1, r1, r2
> and patched we get
> 	ldrsh	r1, [r4, #2]!
> 	ldrsh	r6, [r0, #2]!
> 	mla	r5, r1, r6, r5
> 	mla	r2, r1, r1, r2

There are 2 considerations here. smlabb, smulbb on some of the older cores are more efficient than the equivalent mla instructions there can be a 1 cycle issue delta and a 1 cycle result latency overhead. Further more in Thumb2 - ldrsh has a smaller immediate range than ldrh and thus might cause more inefficient register usage compared to the original case.

I think investigating performance on an older core across a range of benchmarks (not relying on just spec but looking at something like the popular standard embedded benchmark would also be interesting) as I don't think this exposes enough 16 bit multiplication operations would be interesting. 



> wmul-2.c is similar.  There is a bigger difference with wmul-3.c.  Unpatched
> is
> 	ldrh	r1, [r5, #2]!
> 	ldrh	r4, [r0, #2]!
> 	smulbb	r4, r1, r4
> 	subs	r6, r6, r4
> 	smulbb	r1, r1, r1
> 	subs	r2, r2, r1
> whereas patched is
> 	ldrsh	r1, [r4, #2]!
> 	ldrsh	r6, [r0, #2]!
> 	mls	r5, r1, r6, r5
> 	mls	r2, r1, r1, r2
> The patched code is better or equivalent to the unpatched code in all cases,
> but these testcases no longer serve their purpose.  I can fix wmul-1.c by
> changing types to int and casting to signed short.  This doesn't work for
> wmul-2.c because the scalar sign-extend is moved out of the loop, and no
> longer available to merge with the multiple.  This also doesn't work for
> wmul-3.c, but only because it is cse'd differently.  I get unpatched
> 	smulbb	r4, r1, r4
> 	subs	r6, r6, r4
> and patched
> 	sxth	r4, r4
> 	mls	r6, r1, r4, r6

For e.g. on the Cortex-A9, this is now turning a 4 cycle sequence into a 6 cycle sequence.

On older cores, this is worse again as there is no sxth instruction, we've just introduced a couple of additional shifts and that would cost in terms of code size.

I do think that before this is submitted back into mainline we need to check across a range of architecture levels the effects of this change.

1 It would be good to look at this for something like libavcodec or fft for some scalar arithmetic code especially to do with short integers.

2. Comparing code generation in terms of code size and / or examples to satisfy that the PROMOTE_MODE change is not sub-optimal would be good.

> 
> At the moment the only option I have to make wmul-2.c and wmul-3.c work is
> to remove them
Comment 22 ktkachov 2015-11-16 16:09:37 UTC
So in an attempt to make some progress on this, I've tried Jim's approach of changing PROMOTE_MODE to not convert the short modes to unsigned.

Building SPEC2006 INT and looking at the generated assembly actually doesn't look too bad, with the codesize being a tiny bit smaller overall with this patch.
Many load operations followed by extend are expressed as single ldrsh instructions.

Then there's the ldrh+smullbb vs ldrsh+mla issue. I'm looking at the wmul-1.c testcase. The cse1 pass is the one that eliminates the sign-extends from the arms of the plus-mult. Before cse1 we have:
(insn 41 40 43 3 (set (reg:SI 136 [ _12 ])
        (sign_extend:SI (mem:HI (reg:SI 135))))
(insn 44 43 47 3 (set (reg:SI 138 [ _16 ])
        (sign_extend:SI (mem:HI (reg:SI 144))))

(insn 47 44 49 3 (set (reg/v:SI 141 [ dotp ])
        (plus:SI (mult:SI (sign_extend:SI (subreg/s/u:HI (reg:SI 136 [ _12 ]) 0))
                (sign_extend:SI (subreg/s/u:HI (reg:SI 138 [ _16 ]) 0)))
            (reg/v:SI 141 [ dotp ]))))
(insn 49 47 51 3 (set (reg/v:SI 153 [ sqr ])
        (plus:SI (mult:SI (sign_extend:SI (subreg/s/u:HI (reg:SI 136 [ _12 ]) 0))
                (sign_extend:SI (subreg/s/u:HI (reg:SI 136 [ _12 ]) 0)))
            (reg/v:SI 153 [ sqr ]))))

and cse1 transforms this into:
(insn 41 40 43 3 (set (reg:SI 136 [ _12 ])
        (sign_extend:SI (mem:HI (reg:SI 135 [ ivtmp.10 ])))))
(insn 44 43 47 3 (set (reg:SI 138 [ _16 ])
        (sign_extend:SI (mem:HI (reg:SI 144 [ ivtmp.13 ])))))
(insn 47 44 49 3 (set (reg/v:SI 141 [ dotp ])
        (plus:SI (mult:SI (reg:SI 136 [ _12 ])
                (reg:SI 138 [ _16 ]))
            (reg/v:SI 141 [ dotp ]))))
(insn 49 47 51 3 (set (reg/v:SI 153 [ sqr ])
        (plus:SI (mult:SI (reg:SI 136 [ _12 ])
                (reg:SI 136 [ _12 ]))
            (reg/v:SI 153 [ sqr ]))))

Notice how the extends on the loads remain but the extends from the plus-mult have disappeared.
Before Jim's patch the same sequence is:
(insn 41 40 43 3 (set (reg:SI 136 [ _12 ])
        (zero_extend:SI (mem:HI (reg:SI 135 [ ivtmp.10 ])))))
(insn 44 43 47 3 (set (reg:SI 138 [ _16 ])
        (zero_extend:SI (mem:HI (reg:SI 144 [ ivtmp.13 ])))))
(insn 47 44 49 3 (set (reg/v:SI 141 [ dotp ])
        (plus:SI (mult:SI (sign_extend:SI (subreg/s/v:HI (reg:SI 136 [ _12 ]) 0))
                (sign_extend:SI (subreg/s/v:HI (reg:SI 138 [ _16 ]) 0)))
            (reg/v:SI 141 [ dotp ]))))
(insn 49 47 51 3 (set (reg/v:SI 153 [ sqr ])
        (plus:SI (mult:SI (sign_extend:SI (subreg/s/v:HI (reg:SI 136 [ _12 ]) 0))
                (sign_extend:SI (subreg/s/v:HI (reg:SI 136 [ _12 ]) 0)))
            (reg/v:SI 153 [ sqr ]))))

So the sign-extends inside the plus-mult don't match the zero_extends from the loads, so it can't cse the sign-extends away.
If the SImode mult-accumulate is indeed more expensive than extend-mult-accumulate on some cores, then cse and costs should have
gated that transformation? Worth looking into IMO. If we fix these widening-mult-accumulate issues will Jim's arm backend patch
 be a valid patch correctness-wise?
Comment 23 Richard Biener 2015-12-04 10:43:57 UTC
GCC 5.3 is being released, adjusting target milestone.
Comment 24 Ramana Radhakrishnan 2016-01-22 05:38:26 UTC
*** Bug 67295 has been marked as a duplicate of this bug. ***
Comment 25 ktkachov 2016-02-04 09:50:44 UTC
Author: ktkachov
Date: Thu Feb  4 09:50:12 2016
New Revision: 233130

URL: https://gcc.gnu.org/viewcvs?rev=233130&root=gcc&view=rev
Log:
[ARM] PR target/65932: stop changing signedness in PROMOTE_MODE

2016-02-04  Jim Wilson  <jim.wilson@linaro.org>

	PR target/65932
	PR target/67714
	* config/arm/arm.h (PROMOTE_MODE): Don't set UNSIGNEDP for QImode and
	HImode.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.h
Comment 26 ktkachov 2016-02-04 09:52:07 UTC
Author: ktkachov
Date: Thu Feb  4 09:51:35 2016
New Revision: 233131

URL: https://gcc.gnu.org/viewcvs?rev=233131&root=gcc&view=rev
Log:
[ARM][1/4] PR target/65932: Add testcase

	PR target/65932
	PR target/67714
	* gcc.c-torture/execute/pr67714.c: New test.

Added:
    trunk/gcc/testsuite/gcc.c-torture/execute/pr67714.c
Modified:
    trunk/gcc/testsuite/ChangeLog
Comment 27 ktkachov 2016-02-04 09:55:08 UTC
Author: ktkachov
Date: Thu Feb  4 09:54:37 2016
New Revision: 233132

URL: https://gcc.gnu.org/viewcvs?rev=233132&root=gcc&view=rev
Log:
[ARM][2/4] Fix operand costing logic for SMUL[TB][TB]

	PR target/65932
	PR target/67714
	* config/arm/arm.c (arm_new_rtx_costs, MULT case): Properly extract
	the operands of the SIGN_EXTENDs from a SMUL[TB][TB] rtx.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
Comment 28 ktkachov 2016-02-04 09:56:45 UTC
Author: ktkachov
Date: Thu Feb  4 09:56:13 2016
New Revision: 233133

URL: https://gcc.gnu.org/viewcvs?rev=233133&root=gcc&view=rev
Log:
[cse][3/4] Don't overwrite original rtx when folding source of set

	PR target/65932
	PR target/67714
	* cse.c (cse_insn): Pass NULL to fold_rtx when initially
	folding the source of a SET.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cse.c
Comment 29 ktkachov 2016-02-04 09:58:07 UTC
Author: ktkachov
Date: Thu Feb  4 09:57:36 2016
New Revision: 233134

URL: https://gcc.gnu.org/viewcvs?rev=233134&root=gcc&view=rev
Log:
[ARM][4/4] Adjust gcc.target/arm/wmul-[123].c tests

	PR target/65932
	PR target/67714
	* gcc.target/arm/wmul-3.c: Simplify test to generate just
	a single smulbb instruction.
	* gcc.target/amr/wmul-1.c: Add -mtune=cortex-a9 to dg-options.
	* gcc.target/amr/wmul-2.c: Likewise.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/arm/wmul-1.c
    trunk/gcc/testsuite/gcc.target/arm/wmul-2.c
    trunk/gcc/testsuite/gcc.target/arm/wmul-3.c
Comment 30 ktkachov 2016-02-04 10:21:52 UTC
Fixed for GCC 6.
I've asked for a backport to GCC 5 and if approved I'll do it after it had a bit of time to bake on trunk.
Comment 31 ktkachov 2016-02-16 15:11:58 UTC
Author: ktkachov
Date: Tue Feb 16 15:11:26 2016
New Revision: 233454

URL: https://gcc.gnu.org/viewcvs?rev=233454&root=gcc&view=rev
Log:
[ARM] PR target/65932: stop changing signedness in PROMOTE_MODE

	Backport from trunk:
	2016-02-04  Jim Wilson  <jim.wilson@linaro.org>

	PR target/65932
	PR target/67714
	* config/arm/arm.h (PROMOTE_MODE): Don't set UNSIGNEDP for QImode and
	HImode.

Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/config/arm/arm.h
Comment 32 ktkachov 2016-02-16 15:13:37 UTC
Author: ktkachov
Date: Tue Feb 16 15:13:05 2016
New Revision: 233455

URL: https://gcc.gnu.org/viewcvs?rev=233455&root=gcc&view=rev
Log:
[ARM] Adjust tests after fix for PR 65932

	PR target/65932
	* gcc.target/arm/wmul-1.c: Add -mtune=cortex-a9 to dg-options.
	xfail the scan-assembler test.
	* gcc.target/arm/wmul-2.c: Likewise.
	* gcc.target/arm/wmul-3.c: Simplify test to generate a single smulbb.

Modified:
    branches/gcc-5-branch/gcc/testsuite/ChangeLog
    branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/wmul-1.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/wmul-2.c
    branches/gcc-5-branch/gcc/testsuite/gcc.target/arm/wmul-3.c
Comment 33 ktkachov 2016-02-16 15:14:34 UTC
Fixed for GCC 5.4 and GCC 6. Phew.
Comment 34 Christophe Lyon 2016-03-07 08:18:58 UTC
Author: clyon
Date: Mon Mar  7 08:18:25 2016
New Revision: 234019

URL: https://gcc.gnu.org/viewcvs?rev=234019&root=gcc&view=rev
Log:
2016-03-07  Christophe Lyon  <christophe.lyon@linaro.org>

	Backport from mainline
	2016-02-04  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

	PR target/65932
	PR target/67714
	* config/arm/arm.c (arm_new_rtx_costs, MULT case): Properly extract
	the operands of the SIGN_EXTENDs from a SMUL[TB][TB] rtx.


Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/config/arm/arm.c