Bug 78041 - Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0
Summary: Wrong code on ARMv7 with -mthumb -mfpu=neon-fp16 -O0
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 7.0
: P3 normal
Target Milestone: 5.5
Assignee: Wilco
URL:
Keywords: ra, wrong-code
Depends on:
Blocks:
 
Reported: 2016-10-19 16:00 UTC by Bernd Edlinger
Modified: 2017-01-24 14:24 UTC (History)
0 users

See Also:
Host:
Target: arm
Build:
Known to work:
Known to fail: 4.9.4, 5.4.1, 6.2.1, 7.0
Last reconfirmed: 2016-10-21 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Bernd Edlinger 2016-10-19 16:00:44 UTC
This is a slightly modified test case:
gcc.c-torture/execute/pr34971.c

struct foo
{
  unsigned long long b:40;
} x;

extern void abort (void);

void test1(unsigned long long res)
{
  /* Build a rotate expression on a 40 bit argument.  */
  if ((x.b<<9) + (x.b>>31) != res)
    abort ();
}

int main()
{
  x.b = 0x0100000001;
  test1(0x0000000202);
  x.b = 0x0100000000;
  test1(0x0000000002);
  return 0;
}


when compiled with these options it fails:

gcc -mcpu=cortex-a9 -mthumb -mfpu=neon-fp16 -O0
Comment 1 Bernd Edlinger 2016-10-19 16:09:26 UTC
some background about this bug can be found here:

https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html
Comment 2 Wilco 2016-10-19 16:52:45 UTC
(In reply to Bernd Edlinger from comment #1)
> some background about this bug can be found here:
> 
> https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html

The <shift>di3_neon pattern does not constrain the input not to overlap with the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code does not handle partial overlaps. However it is feasible to fix that by swapping the order for the various cases.
Comment 3 Bernd Edlinger 2016-10-19 17:24:47 UTC
(In reply to Wilco from comment #2)
> (In reply to Bernd Edlinger from comment #1)
> > some background about this bug can be found here:
> > 
> > https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html
> 
> The <shift>di3_neon pattern does not constrain the input not to overlap with
> the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code
> does not handle partial overlaps. However it is feasible to fix that by
> swapping the order for the various cases.

from <shift>di3_neon:

            if (INTVAL (operands[2]) < 1)
              {
                emit_insn (gen_movdi (operands[0], operands[1]));
                DONE;
              }

Will the movdi pattern work with partial overlaps?
It does basically this:

      emit_move_insn (gen_lowpart (SImode, operands[0]),
                      gen_lowpart (SImode, operands[1]));
      emit_move_insn (gen_highpart (SImode, operands[0]),
                      gen_highpart (SImode, operands[1]));
Comment 4 Wilco 2016-10-19 17:44:17 UTC
(In reply to Bernd Edlinger from comment #3)
> (In reply to Wilco from comment #2)
> > (In reply to Bernd Edlinger from comment #1)
> > > some background about this bug can be found here:
> > > 
> > > https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01561.html
> > 
> > The <shift>di3_neon pattern does not constrain the input not to overlap with
> > the output for immediate shifts, and the arm_emit_coreregs_64bit_shift code
> > does not handle partial overlaps. However it is feasible to fix that by
> > swapping the order for the various cases.
> 
> from <shift>di3_neon:
> 
>             if (INTVAL (operands[2]) < 1)
>               {
>                 emit_insn (gen_movdi (operands[0], operands[1]));
>                 DONE;
>               }
> 
> Will the movdi pattern work with partial overlaps?
> It does basically this:
> 
>       emit_move_insn (gen_lowpart (SImode, operands[0]),
>                       gen_lowpart (SImode, operands[1]));
>       emit_move_insn (gen_highpart (SImode, operands[0]),
>                       gen_highpart (SImode, operands[1]));

I think it's OK - that code only triggers if a movdi has a physical register that is not a valid DI register which is not the case we're dealing with. movdi has a split that does check for partial overlap around line 5900 in arm.md:

  /* Handle a partial overlap.  */
  if (rtx_equal_p (operands[0], operands[3]))
     ...

However dealing with partial overlaps is complex so maybe the best option would be to add alternatives to <shift>di3_neon to either allow full overlap "r 0 X X X" or no overlap "&r r X  X X". The shift code works with full overlap.
Comment 5 Bernd Edlinger 2016-10-20 05:50:31 UTC
(In reply to Wilco from comment #4)
> However dealing with partial overlaps is complex so maybe the best option
> would be to add alternatives to <shift>di3_neon to either allow full overlap
> "r 0 X X X" or no overlap "&r r X  X X". The shift code works with full
> overlap.

That sounds like a good idea.

Then this condition in <shift>di3_neon could go away too:

            && (!reg_overlap_mentioned_p (operands[0], operands[1])
                || REGNO (operands[0]) == REGNO (operands[1])))
Comment 6 Richard Earnshaw 2016-10-20 13:05:28 UTC
(In reply to Bernd Edlinger from comment #5)
> (In reply to Wilco from comment #4)
> > However dealing with partial overlaps is complex so maybe the best option
> > would be to add alternatives to <shift>di3_neon to either allow full overlap
> > "r 0 X X X" or no overlap "&r r X  X X". The shift code works with full
> > overlap.
> 
> That sounds like a good idea.
> 
> Then this condition in <shift>di3_neon could go away too:
> 
>             && (!reg_overlap_mentioned_p (operands[0], operands[1])
>                 || REGNO (operands[0]) == REGNO (operands[1])))

Note that we don't want to restrict complete overlaps, only partial overlaps.  Restricting complete overlaps leads to significant increase in register pressure and a lot of redundant copying.
Comment 7 Bernd Edlinger 2016-10-20 14:16:27 UTC
(In reply to Richard Earnshaw from comment #6)
> (In reply to Bernd Edlinger from comment #5)
> > (In reply to Wilco from comment #4)
> > > However dealing with partial overlaps is complex so maybe the best option
> > > would be to add alternatives to <shift>di3_neon to either allow full overlap
> > > "r 0 X X X" or no overlap "&r r X  X X". The shift code works with full
> > > overlap.
> > 
> > That sounds like a good idea.
> > 
> > Then this condition in <shift>di3_neon could go away too:
> > 
> >             && (!reg_overlap_mentioned_p (operands[0], operands[1])
> >                 || REGNO (operands[0]) == REGNO (operands[1])))
> 
> Note that we don't want to restrict complete overlaps, only partial
> overlaps.  Restricting complete overlaps leads to significant increase in
> register pressure and a lot of redundant copying.

Yes.

That is Wilco's idea: instead of =r 0r X X X
use =r 0 X X X and =&r r X X X, that should ensure that
no partial overlap happens, just full overlap or nothing.

That's what arm_emit_coreregs_64bit_shift
and arm_ashldi3_1bit can handle.

Who will do it?
Comment 8 Wilco 2016-10-20 15:31:07 UTC
(In reply to Bernd Edlinger from comment #7)
> (In reply to Richard Earnshaw from comment #6)
> > (In reply to Bernd Edlinger from comment #5)
> > > (In reply to Wilco from comment #4)
> > > > However dealing with partial overlaps is complex so maybe the best option
> > > > would be to add alternatives to <shift>di3_neon to either allow full overlap
> > > > "r 0 X X X" or no overlap "&r r X  X X". The shift code works with full
> > > > overlap.
> > > 
> > > That sounds like a good idea.
> > > 
> > > Then this condition in <shift>di3_neon could go away too:
> > > 
> > >             && (!reg_overlap_mentioned_p (operands[0], operands[1])
> > >                 || REGNO (operands[0]) == REGNO (operands[1])))
> > 
> > Note that we don't want to restrict complete overlaps, only partial
> > overlaps.  Restricting complete overlaps leads to significant increase in
> > register pressure and a lot of redundant copying.
> 
> Yes.
> 
> That is Wilco's idea: instead of =r 0r X X X
> use =r 0 X X X and =&r r X X X, that should ensure that
> no partial overlap happens, just full overlap or nothing.
> 
> That's what arm_emit_coreregs_64bit_shift
> and arm_ashldi3_1bit can handle.
> 
> Who will do it?

I've got a patch that fixes it, it's being tested.

While looking at how DI mode operations get expanded, I noticed there is a CQ issue with your shift change. Shifts that are expanded early now use extra registers due to the DI mode write of zero. Given all other DI mode operations are expanded after reload, it may be better to do the same for shifts too.
Comment 9 Bernd Edlinger 2016-10-20 15:58:41 UTC
(In reply to Wilco from comment #8)
> 
> I've got a patch that fixes it, it's being tested.
> 
> While looking at how DI mode operations get expanded, I noticed there is a
> CQ issue with your shift change. Shifts that are expanded early now use
> extra registers due to the DI mode write of zero. Given all other DI mode
> operations are expanded after reload, it may be better to do the same for
> shifts too.

Interesting idea.  After reload there is no need anymore to zero the
DI mode register at all, so that could become completely unnecessary.
Comment 10 ktkachov 2016-10-21 08:09:52 UTC
Confirmed then. Wilco, if you're working on this can you please assign it to yourself?
Comment 11 Wilco 2016-10-21 08:31:00 UTC
(In reply to ktkachov from comment #10)
> Confirmed then. Wilco, if you're working on this can you please assign it to
> yourself?

Unfortunately the form doesn't allow me to do anything with the headers...
Comment 12 Bernd Edlinger 2016-10-21 08:35:34 UTC
(In reply to Wilco from comment #11)
> (In reply to ktkachov from comment #10)
> > Confirmed then. Wilco, if you're working on this can you please assign it to
> > yourself?
> 
> Unfortunately the form doesn't allow me to do anything with the headers...

I know that happens.
Comment 13 ktkachov 2016-10-21 08:37:41 UTC
.
Comment 15 Bernd Edlinger 2016-10-25 09:57:53 UTC
FYI: You could merge the two alternatives into one.

=?r,?&r
  0,  r
  i,  i

is equivalent to

=?&r
  0r
  i
Comment 16 Wilco 2016-10-25 10:26:00 UTC
Author: wilco
Date: Tue Oct 25 10:25:28 2016
New Revision: 241508

URL: https://gcc.gnu.org/viewcvs?rev=241508&root=gcc&view=rev
Log:
With -fpu=neon DI mode shifts are expanded after reload.  DI mode registers can 
either fully or partially overlap on both ARM and Thumb-2.  However the shift
expansion code can only deal with the full overlap case, and generates incorrect
code for partial overlaps.  The fix is to add new variants that support either
full overlap or no overlap.

    gcc/
	PR target/78041
	* config/arm/neon.md (ashldi3_neon): Add "r 0 i" and "&r r i" variants.
	Remove partial overlap check for shift by 1.
	(ashldi3_neon): Likewise.
    testsuite/
	* gcc.target/arm/pr78041.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr78041.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/neon.md
    trunk/gcc/testsuite/ChangeLog
Comment 17 Wilco 2016-10-25 10:34:18 UTC
(In reply to Bernd Edlinger from comment #15)
> FYI: You could merge the two alternatives into one.
> 
> =?r,?&r
>   0,  r
>   i,  i
> 
> is equivalent to
> 
> =?&r
>   0r
>   i

Yes, that seems possible. These patterns will need to be changed significantly to fix PR 77308, so this can be done as part of that work.
Comment 18 Wilco 2017-01-06 14:26:38 UTC
Author: wilco
Date: Fri Jan  6 14:26:06 2017
New Revision: 244161

URL: https://gcc.gnu.org/viewcvs?rev=244161&root=gcc&view=rev
Log:
With -fpu=neon DI mode shifts are expanded after reload.  DI mode registers can 
either fully or partially overlap on both ARM and Thumb-2.  However the shift
expansion code can only deal with the full overlap case, and generates incorrect
code for partial overlaps.  The fix is to add new variants that support either
full overlap or no overlap.

	Backport from mainline
	2016-10-25  Wilco Dijkstra  <wdijkstr@arm.com>

    gcc/
	PR target/78041
	* config/arm/neon.md (ashldi3_neon): Add "r 0 i" and "&r r i" variants.
	Remove partial overlap check for shift by 1.
	(ashldi3_neon): Likewise.
    testsuite/
	* gcc.target/arm/pr78041.c: New test.

Added:
    branches/gcc-6-branch/gcc/testsuite/gcc.target/arm/pr78041.c
Modified:
    branches/gcc-6-branch/gcc/ChangeLog
    branches/gcc-6-branch/gcc/config/arm/neon.md
    branches/gcc-6-branch/gcc/testsuite/ChangeLog
Comment 19 Wilco 2017-01-24 14:14:44 UTC
Author: wilco
Date: Tue Jan 24 14:14:12 2017
New Revision: 244872

URL: https://gcc.gnu.org/viewcvs?rev=244872&root=gcc&view=rev
Log:
With -fpu=neon DI mode shifts are expanded after reload.  DI mode registers can 
either fully or partially overlap on both ARM and Thumb-2.  However the shift
expansion code can only deal with the full overlap case, and generates incorrect
code for partial overlaps.  The fix is to add new variants that support either
full overlap or no overlap.

	Backport from mainline
    gcc/
	PR target/78041
	* config/arm/neon.md (ashldi3_neon): Add "r 0 i" and "&r r i" variants.
	Remove partial overlap check for shift by 1.
	(ashldi3_neon): Likewise.
    testsuite/
	* gcc.target/arm/pr78041.c: New test.

Modified:
    branches/gcc-5-branch/gcc/ChangeLog
    branches/gcc-5-branch/gcc/config/arm/neon.md
    branches/gcc-5-branch/gcc/testsuite/ChangeLog
Comment 20 Wilco 2017-01-24 14:20:49 UTC
Fixed in GCC5, GCC6 and GCC7.