Bug 82518 - gfortran.fortran-torture/execute/in-pack.f90 fails on armeb
Summary: gfortran.fortran-torture/execute/in-pack.f90 fails on armeb
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 6.4.1
: P2 normal
Target Milestone: 8.0
Assignee: ktkachov
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2017-10-11 12:47 UTC by Christophe Lyon
Modified: 2018-05-04 16:36 UTC (History)
5 users (show)

See Also:
Host:
Target: armeb
Build:
Known to work: 6.4.1, 7.3.1, 8.0
Known to fail:
Last reconfirmed: 2018-02-09 00:00:00


Attachments
Reduced testcase (598 bytes, text/plain)
2018-02-07 10:06 UTC, Christophe Lyon
Details
assembly for arm (little-endian) (9.14 KB, text/plain)
2018-02-07 10:06 UTC, Christophe Lyon
Details
assembly for armeb (big-endian) (9.33 KB, text/plain)
2018-02-07 10:07 UTC, Christophe Lyon
Details
execution traces for arm (113.87 KB, text/plain)
2018-02-07 10:11 UTC, Christophe Lyon
Details
execution traces for armeb (112.65 KB, text/plain)
2018-02-07 10:11 UTC, Christophe Lyon
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Christophe Lyon 2017-10-11 12:47:22 UTC
I have noticed that since r252917
FAIL:    gfortran.fortran-torture/execute/in-pack.f90 execution,  -O3 -g 
on armeb-none-linux-gnueabihf
--with-mode arm
--with-cpu cortex-a9
--with-fpu neon-fp16

O0, O1, O2*, -Os still pass.
Comment 1 Christophe Lyon 2017-10-11 12:48:37 UTC
r252917 was a fix for PR tree-optimization/82220
Comment 2 Richard Biener 2017-10-11 12:58:07 UTC
So it was broken before the rev. with -fno-vect-cost-model added I guess.
Comment 3 Jakub Jelinek 2018-01-10 16:27:07 UTC
Ping, can you bisect with -O3 -g fno-vect-cost-model?  I don't have access to any armeb box, so no idea what to even look at.
Comment 4 Christophe Lyon 2018-01-11 14:44:44 UTC
Indeed it looks like the testcase has been failing with -fno-vect-cost-model for a very long time.

Trying the find the 'good' starting point for a bisect.

(I'm using qemu, I have no such board either)
Comment 5 Christophe Lyon 2018-01-17 12:18:08 UTC
So far, my bisect has been unsuccessful, because my bisect script thinks the guilty commit is one of the exit-code-125 ones, and there are too many.

I should probably re-try manually.
Comment 6 Christophe Lyon 2018-01-25 21:09:11 UTC
My bisect script cannot find the commit that introduced the problem with -fno-vect-cost-model, because the build was broken for quite some time.
The regression seems to have been introduced between r197671 and r197815.
Comment 7 Aldy Hernandez 2018-01-31 10:09:53 UTC
Hi Nick!  Hi all!

Do we have a way of testing armeb, either through a simulator or through some aarch64 with magic flags?

If anyone has a hint on how to reproduce this, I'll gladly take a stab at reproducing, bisecting, debugging, etc.  Whatever it takes to inch this forward :).
Comment 8 Nick Clifton 2018-01-31 10:24:48 UTC
Hi Aldy,

> Do we have a way of testing armeb, either through a simulator or through
> some aarch64 with magic flags?

GDB has an ARM simulator which is OK unless you need to test some of the newer features like scalable vector instructions.  Just compile your code as normal and then build an armeb targeted version of gdb.

Cheers
  Nick
Comment 9 Christophe Lyon 2018-01-31 13:18:20 UTC
(In reply to Aldy Hernandez from comment #7)
> Hi Nick!  Hi all!
> 
> Do we have a way of testing armeb, either through a simulator or through
> some aarch64 with magic flags?
> 

Please note that the bug appears on armeb (ie. AArch32 big-endian), and not on aarch64.
Comment 10 Aldy Hernandez 2018-02-04 15:28:43 UTC
I'm having some trouble reproducing this bug.  I'm a little rusty on cross builds, so perhaps someone can lend a hand.

I have a set of combined sources which I'm using to build a toolchain like this:

~/src/combined/configure --host=x86_64-linux --build=x86_64-linux --target=armeb-eabi --with-mode=arm --with-cpu=cortex-a9 --with-fpu=neon-fp16 --enable-languages=fortran  --disable-libgomp --disable-libsanitizer --disable-werror

I then do:

cd gcc && make check-fortran RUNTESTFLAGS="execute.exp=in-pack.f90 --target_board=arm-sim"

                === gfortran tests ===

Schedule of variations:
    arm-sim

Running target arm-sim
Using /usr/share/dejagnu/baseboards/arm-sim.exp as board description file for target.
Using /usr/share/dejagnu/config/sim.exp as generic interface file for target.
Using /usr/share/dejagnu/baseboards/basic-sim.exp as board description file for target.
Using /home/cygnus/aldyh/src/combined/gcc/testsuite/config/default.exp as tool-and-target-specific interface file.
Running /home/cygnus/aldyh/src/combined/gcc/testsuite/gfortran.fortran-torture/execute/execute.exp ...
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution,  -O0 
FAIL: gfortran.fortran-torture/execute/in-pack.f90 execution,  -O1 
...

I get failures for everything, but it seems every execution test fails, even simple C tests.  Tests fail with no messages, just a simple execution failure, so I had to dig into the simulator.

I patched the gdb simulator to trace the instructions to see where we are dying:

diff --git a/sim/arm/wrapper.c b/sim/arm/wrapper.c
index bc1a043..4492b19 100644
--- a/sim/arm/wrapper.c
+++ b/sim/arm/wrapper.c
@@ -53,7 +53,7 @@ int stop_simulator;
 #include "dis-asm.h"

This presumably gives us some tracing in gcc/testsuite/*/*.log:

 /* TODO: Tracing should be converted to common tracing module.  */
-int trace = 0;
+int trace = 1;
 int disas = 0;
 int trace_funcs = 0;

The GCC testsuite log file now shows:

pc: 8c8c, instr: e3530014
pc: 8c90, instr: 1afffffb
pc: 8c94, instr: e5952000
pc: 8c98, instr: e3a03000
pc: 8c9c, instr: e5856014
pc: 8ca0, instr: e5853018
pc: 8ca4, instr: e1c520fc
pc: 4, instr: ea00089b

I took a peek at the executable being run with "/my-arm-build/objdudump -D the-executable.exe", and I see we are failing in initialise_monitor_handles().  This suggests we're failing during the start-up code:

    8c8c:       e3530014        cmp     r3, #20
    8c90:       1afffffb        bne     8c84 <initialise_monitor_handles+0x7c>
    8c94:       e5952000        ldr     r2, [r5]
    8c98:       e3a03000        mov     r3, #0
    8c9c:       e5856014        str     r6, [r5, #20]
    8ca0:       e5853018        str     r3, [r5, #24]
    8ca4:       e1c520fc        strd    r2, [r5, #12]

It seems that last store is corrupting things and making us jump to a PC of 4???

Before I bark up the wrong trees, I have some questions.

Am I running the simulator correctly?  Does it require a special flag for cortex-a9?  

Is the cortex-a9 CPU even handled by the simulator?

Should I run the dejagnu tests with -mcpu=xxxx or whatever, or is the --with-cpu=cortex-a9 configury flag enough?

Does the arm newlib/libgloss/whatever code have instructions that aren't handled by the GDB simulator?

I don't want to dig too deep into this, only to find out that our simulator, newlib, or whatever cannot handle cortex-a9 + neon-fp16.

For that matter, is this bug reproducible on a more generic Arm variant that *IS* supported by gdb?

Sorry for the barrage of questions, but this is a P1, and there doesn't seem to be an easy way to reproduce this.
Comment 11 Christophe Lyon 2018-02-05 09:06:24 UTC
My setup uses armeb-none-linux-gnueabihf (as opposed to armeb-eabi as you report). I have never tried armeb-eabi.

I am also using qemu as simulator (in user mode, not in system mode).

The failure in initialise_monitor_handles indicates a problem in the startup code, while initializing the semi-hosting interface.

Can you try qemu?
Comment 12 Aldy Hernandez 2018-02-05 11:56:47 UTC
(In reply to Christophe Lyon from comment #11)
> My setup uses armeb-none-linux-gnueabihf (as opposed to armeb-eabi as you
> report). I have never tried armeb-eabi.
> 
> I am also using qemu as simulator (in user mode, not in system mode).
> 
> The failure in initialise_monitor_handles indicates a problem in the startup
> code, while initializing the semi-hosting interface.
> 
> Can you try qemu?

Ughhh, off-list I've found that you build glibc/kernel and all that jazz.  This is going to take forever if we (ahem I) rebuild the world in order to reproduce.

In comment #6 you mention that the regression started between r197671 and r197815. 
 Could you reduce the fortran testcase to the absolute minimum, and then we can take a look at the assembly before r197671 and after r197815?

Perhaps start chopping off:

  i8 = (/(i,i=1,5)/)
  call isub8(i8(5:1:-1),5)
  ii8 = (/(5-i+1,i=1,5)/)
  if (any(ii8 /= i8)) call abort

along with the isub8 subroutine, and continue chopping things similarly upward until you get to the abort that fails.  Then see if you can chop non-dependent things from the top down until you get to the smallest block that has no problem before 197671 and a problem after 197815 (with -O3 -g -fno-vect-cost-model as suggested before).

Once you have a reduced testcase, perhaps we could discern something from the assembly or gimple dumps.

Again, please try to get the testcase as small as possible.  That will exponentially improve the chances of things getting fixed :).
Comment 13 Nick Clifton 2018-02-05 12:35:08 UTC
Hi Aldy,


> pc: 8ca4, instr: e1c520fc
> pc: 4, instr: ea00089b
> 
> I took a peek at the executable being run with "/my-arm-build/objdudump -D
> the-executable.exe", and I see we are failing in initialise_monitor_handles(). 
> This suggests we're failing during the start-up code:

>     8ca4:       e1c520fc        strd    r2, [r5, #12]
> 
> It seems that last store is corrupting things and making us jump to a PC of
> 4???

Address 4 is the "undefined instruction" vector.  If the simulator thinks
that the instruction is illegal/unknown then it will branch to address 4
and start executing from there.  (Or else it loads the value stored at 
address 4 and starts executing from that address.  I forget which).

So, this basically means that the simulator does not like that STRD 
instruction. :-(  Looking at the code in Handle_Store_Double() in 
sim/arm/armemu.c, I think that the reason is probably because the address
for the store is not double word aligned.  Which leads me to wonder,
what value is stored in r5 when the STRD instruction is being executed ?




> Am I running the simulator correctly?

Yes.

>  Does it require a special flag for
> cortex-a9?  

No.

> Is the cortex-a9 CPU even handled by the simulator?

Yes.

> Should I run the dejagnu tests with -mcpu=xxxx or whatever, or is the
> --with-cpu=cortex-a9 configury flag enough?

Be paranoid - add the option. :-)


> Does the arm newlib/libgloss/whatever code have instructions that aren't
> handled by the GDB simulator?

No.  Well not in the assembler parts of it.  The possible exception to this
are the memory manipulation functions (memcpy, strlen ,etc) in newlib/libc/
sys/arm/ which tend to be very tightly coded, and will often be updated to 
take advantage of new instructions as they are added to the ISA.

Of course the C parts of these libraries might use unsupported instructions
if gcc generates them.  But if you have configured gcc correctly (and I think
that you have) then this should not be an issue.

Cheers
  Nick
Comment 14 Aldy Hernandez 2018-02-05 13:29:48 UTC
(In reply to Nick Clifton from comment #13)
> Hi Aldy,
> 
> 
> > pc: 8ca4, instr: e1c520fc
> > pc: 4, instr: ea00089b
> > 
> > I took a peek at the executable being run with "/my-arm-build/objdudump -D
> > the-executable.exe", and I see we are failing in initialise_monitor_handles(). 
> > This suggests we're failing during the start-up code:
> 
> >     8ca4:       e1c520fc        strd    r2, [r5, #12]
> > 
> > It seems that last store is corrupting things and making us jump to a PC of
> > 4???
> 
> Address 4 is the "undefined instruction" vector.  If the simulator thinks
> that the instruction is illegal/unknown then it will branch to address 4
> and start executing from there.  (Or else it loads the value stored at 
> address 4 and starts executing from that address.  I forget which).
> 
> So, this basically means that the simulator does not like that STRD 
> instruction. :-(  Looking at the code in Handle_Store_Double() in 
> sim/arm/armemu.c, I think that the reason is probably because the address
> for the store is not double word aligned.  Which leads me to wonder,
> what value is stored in r5 when the STRD instruction is being executed ?

1: x/i $pc
=> 0x8c24 <initialise_monitor_handles+156>:     strd    r2, [r5, #12]
(gdb) info reg r5
r5             0x1b6e8  112360
(gdb) x/4x 0x1b6e8
0x1b6e8 <monitor_stdin>:        0x00000000      0x00000001      0x00000001
0xffffffff

...which is 64 bit aligned.

The above maps to the source: newlib/libc/sys/arm/syscalls.c

  for (i = 0; i < MAX_OPEN_FILES; i ++)
    openfiles[i].handle = -1;

  openfiles[0].handle = monitor_stdin;


> > Should I run the dejagnu tests with -mcpu=xxxx or whatever, or is the
> > --with-cpu=cortex-a9 configury flag enough?
> 
> Be paranoid - add the option. :-)

No difference :(.
Comment 15 Aldy Hernandez 2018-02-05 13:38:57 UTC
(In reply to Aldy Hernandez from comment #14)
> (In reply to Nick Clifton from comment #13)
> > Hi Aldy,
> > 
> > 
> > > pc: 8ca4, instr: e1c520fc
> > > pc: 4, instr: ea00089b
> > > 
> > > I took a peek at the executable being run with "/my-arm-build/objdudump -D
> > > the-executable.exe", and I see we are failing in initialise_monitor_handles(). 
> > > This suggests we're failing during the start-up code:
> > 
> > >     8ca4:       e1c520fc        strd    r2, [r5, #12]
> > > 
> > > It seems that last store is corrupting things and making us jump to a PC of
> > > 4???
> > 
> > Address 4 is the "undefined instruction" vector.  If the simulator thinks
> > that the instruction is illegal/unknown then it will branch to address 4
> > and start executing from there.  (Or else it loads the value stored at 
> > address 4 and starts executing from that address.  I forget which).
> > 
> > So, this basically means that the simulator does not like that STRD 
> > instruction. :-(  Looking at the code in Handle_Store_Double() in 
> > sim/arm/armemu.c, I think that the reason is probably because the address
> > for the store is not double word aligned.  Which leads me to wonder,
> > what value is stored in r5 when the STRD instruction is being executed ?
> 
> 1: x/i $pc
> => 0x8c24 <initialise_monitor_handles+156>:     strd    r2, [r5, #12]
> (gdb) info reg r5
> r5             0x1b6e8  112360
> (gdb) x/4x 0x1b6e8
> 0x1b6e8 <monitor_stdin>:        0x00000000      0x00000001      0x00000001
> 0xffffffff
> 
> ...which is 64 bit aligned.

And if you're curious what the 12 offset points to:

(gdb) x/4x $r5 + 12
0x1b6f4 <openfiles>:    0xffffffff      0x00000000      0x00000001      0x000000
00
(gdb) x/4x $r5 + 0x12
0x1b6fa <openfiles+6>:  0x00000000      0x00010000      0x0000ffff      0xffff00
00
Comment 16 Richard Earnshaw 2018-02-05 13:45:01 UTC
(In reply to Nick Clifton from comment #13)
> Hi Aldy,
> 
> 
> > pc: 8ca4, instr: e1c520fc
> > pc: 4, instr: ea00089b
> > 
> > I took a peek at the executable being run with "/my-arm-build/objdudump -D
> > the-executable.exe", and I see we are failing in initialise_monitor_handles(). 
> > This suggests we're failing during the start-up code:
> 
> >     8ca4:       e1c520fc        strd    r2, [r5, #12]
> > 
> > It seems that last store is corrupting things and making us jump to a PC of
> > 4???
> 
> Address 4 is the "undefined instruction" vector.  If the simulator thinks
> that the instruction is illegal/unknown then it will branch to address 4
> and start executing from there.  (Or else it loads the value stored at 
> address 4 and starts executing from that address.  I forget which).
> 
> So, this basically means that the simulator does not like that STRD 
> instruction. :-(  Looking at the code in Handle_Store_Double() in 
> sim/arm/armemu.c, I think that the reason is probably because the address
> for the store is not double word aligned.  Which leads me to wonder,
> what value is stored in r5 when the STRD instruction is being executed ?

You wouldn't take the undef vector for an alignment issue: that would take a data abort.

Sounds like your simulator is built for an older architecture, that doesn't have strd (ie it's pre-armv5te).
Comment 17 Christophe Lyon 2018-02-05 13:50:51 UTC
(In reply to Aldy Hernandez from comment #12)

> along with the isub8 subroutine, and continue chopping things similarly
> upward until you get to the abort that fails.  Then see if you can chop
> non-dependent things from the top down until you get to the smallest block
> that has no problem before 197671 and a problem after 197815 (with -O3 -g
> -fno-vect-cost-model as suggested before).
> 
Hmmm does -O3 overrides -fno-vect-cost-model?

When I running the testsuite with qemu/-fno-vect-cost-model (as target board), my logs show:
[...] /home/christophe.lyon/src/GCC/sources/gcc-fsf/r197671/gcc/testsuite/gfortran.fortran-torture/execute/in-pack.f90 -fno-vect-cost-model -fno-diagnostics-show-caret -w -O3 -g  [....]

so I may not have been testing what I thought :(
Comment 18 Aldy Hernandez 2018-02-05 14:47:26 UTC
(In reply to Christophe Lyon from comment #17)
> (In reply to Aldy Hernandez from comment #12)
> 
> > along with the isub8 subroutine, and continue chopping things similarly
> > upward until you get to the abort that fails.  Then see if you can chop
> > non-dependent things from the top down until you get to the smallest block
> > that has no problem before 197671 and a problem after 197815 (with -O3 -g
> > -fno-vect-cost-model as suggested before).
> > 
> Hmmm does -O3 overrides -fno-vect-cost-model?
> 
> When I running the testsuite with qemu/-fno-vect-cost-model (as target
> board), my logs show:
> [...]
> /home/christophe.lyon/src/GCC/sources/gcc-fsf/r197671/gcc/testsuite/gfortran.
> fortran-torture/execute/in-pack.f90 -fno-vect-cost-model
> -fno-diagnostics-show-caret -w -O3 -g  [....]
> 
> so I may not have been testing what I thought :(

-fno-vect-cost-model should take effect regardless of the -O3 specified.  That is, the order shouldn't matter.

BTW, I only mentioned -O3 -fno-vect-cost-model -g because that is what was suggested by Jakub in comment #3.
Comment 19 Aldy Hernandez 2018-02-05 17:17:22 UTC
(In reply to Richard Earnshaw from comment #16)
> (In reply to Nick Clifton from comment #13)
> > Hi Aldy,
> > 
> > 
> > > pc: 8ca4, instr: e1c520fc
> > > pc: 4, instr: ea00089b
> > > 
> > > I took a peek at the executable being run with "/my-arm-build/objdudump -D
> > > the-executable.exe", and I see we are failing in initialise_monitor_handles(). 
> > > This suggests we're failing during the start-up code:
> > 
> > >     8ca4:       e1c520fc        strd    r2, [r5, #12]
> > > 
> > > It seems that last store is corrupting things and making us jump to a PC of
> > > 4???
> > 
> > Address 4 is the "undefined instruction" vector.  If the simulator thinks
> > that the instruction is illegal/unknown then it will branch to address 4
> > and start executing from there.  (Or else it loads the value stored at 
> > address 4 and starts executing from that address.  I forget which).
> > 
> > So, this basically means that the simulator does not like that STRD 
> > instruction. :-(  Looking at the code in Handle_Store_Double() in 
> > sim/arm/armemu.c, I think that the reason is probably because the address
> > for the store is not double word aligned.  Which leads me to wonder,
> > what value is stored in r5 when the STRD instruction is being executed ?
> 
> You wouldn't take the undef vector for an alignment issue: that would take a
> data abort.
> 
> Sounds like your simulator is built for an older architecture, that doesn't
> have strd (ie it's pre-armv5te).

Actually, Nick is correct.  The simulator stops around here:

  /* The address must be aligned on a 8 byte boundary.  */
  if (addr & 0x7)
    {
#ifdef ABORTS
      ARMul_DATAABORT (addr);
#else
      ARMul_UndefInstr (state, instr);
#endif

(gdb) p/x addr
$26 = 0x1b6f4
(gdb) p/x addr & 7
$27 = 0x4

and ARMul_UndefInstr() is:

void
ARMul_UndefInstr (ARMul_State * state, ARMword instr ATTRIBUTE_UNUSED)
{
  ARMul_Abort (state, ARMul_UndefinedInstrV);
}

#define ARMUndefinedInstrV 4L
...
#define ARMul_UndefinedInstrV ARMUndefinedInstrV

The 4 is not exactly intuitive...

I still think someone with access to a working environment should reduce or debug this further :(.
Comment 20 Nick Clifton 2018-02-05 17:18:43 UTC
Hi Aldy,

>>> for the store is not double word aligned.  Which leads me to wonder,
>>> what value is stored in r5 when the STRD instruction is being executed ?
>>
>> 1: x/i $pc
>> => 0x8c24 <initialise_monitor_handles+156>:     strd    r2, [r5, #12]
>> (gdb) info reg r5
>> r5             0x1b6e8  112360

>> ...which is 64 bit aligned.

Hmm, curious.  OK - my next recommendation would be to add some printf()s
to the simulator to find out a) if Handle_Store_Double() really is being
called, or if the abort is happening somewhere else.  Plus, if it is being
called, then b) where inside that function the abort is happening.  Maybe
the store operations are triggering a memory access failure.

Cheers
  Nick
Comment 21 Nick Clifton 2018-02-05 17:20:47 UTC
Hi Aldy,

>>> instruction. :-(  Looking at the code in Handle_Store_Double() in 
>>> sim/arm/armemu.c, I think that the reason is probably because the address
>>> for the store is not double word aligned.  Which leads me to wonder,
>>> what value is stored in r5 when the STRD instruction is being executed ?


>> => 0x8c24 <initialise_monitor_handles+156>:     strd    r2, [r5, #12]
>> (gdb) info reg r5
>> r5             0x1b6e8  112360

>> ...which is 64 bit aligned.

But, as you have just discovered, (r5 + 12) is not 64-bit aligned...
Comment 22 Richard Earnshaw 2018-02-05 17:40:18 UTC
(In reply to Nick Clifton from comment #21)
> Hi Aldy,
> 
> >>> instruction. :-(  Looking at the code in Handle_Store_Double() in 
> >>> sim/arm/armemu.c, I think that the reason is probably because the address
> >>> for the store is not double word aligned.  Which leads me to wonder,
> >>> what value is stored in r5 when the STRD instruction is being executed ?
> 
> 
> >> => 0x8c24 <initialise_monitor_handles+156>:     strd    r2, [r5, #12]
> >> (gdb) info reg r5
> >> r5             0x1b6e8  112360
> 
> >> ...which is 64 bit aligned.
> 
> But, as you have just discovered, (r5 + 12) is not 64-bit aligned...

But from ARMv7 onwards it only has to be 4-byte aligned, which it is.  And this code was build for cortex-a9, which is ARMv7-a.
Comment 23 Nick Clifton 2018-02-05 17:50:14 UTC
Hi Guys,

>> But, as you have just discovered, (r5 + 12) is not 64-bit aligned...
> 
> But from ARMv7 onwards it only has to be 4-byte aligned, which it is.  And this
> code was build for cortex-a9, which is ARMv7-a.

Ok, so this is a simulator bug.

Aldy - these easiest thing for now would be to unilaterally relax the alignment
test in Handle_Store_Double and see if that allows you to get further with your
tests.

(But yes, I agree, a reduced testcase would be a much better help than all this
mucking about in the simulator).

Cheers
  Nick
Comment 24 Aldy Hernandez 2018-02-05 17:52:49 UTC
(In reply to Richard Earnshaw from comment #22)
> (In reply to Nick Clifton from comment #21)
> > Hi Aldy,
> > 
> > >>> instruction. :-(  Looking at the code in Handle_Store_Double() in 
> > >>> sim/arm/armemu.c, I think that the reason is probably because the address
> > >>> for the store is not double word aligned.  Which leads me to wonder,
> > >>> what value is stored in r5 when the STRD instruction is being executed ?
> > 
> > 
> > >> => 0x8c24 <initialise_monitor_handles+156>:     strd    r2, [r5, #12]
> > >> (gdb) info reg r5
> > >> r5             0x1b6e8  112360
> > 
> > >> ...which is 64 bit aligned.
> > 
> > But, as you have just discovered, (r5 + 12) is not 64-bit aligned...
> 
> But from ARMv7 onwards it only has to be 4-byte aligned, which it is.  And
> this code was build for cortex-a9, which is ARMv7-a.

(In reply to Richard Earnshaw from comment #22)
> (In reply to Nick Clifton from comment #21)
> > Hi Aldy,
> > 
> > >>> instruction. :-(  Looking at the code in Handle_Store_Double() in 
> > >>> sim/arm/armemu.c, I think that the reason is probably because the address
> > >>> for the store is not double word aligned.  Which leads me to wonder,
> > >>> what value is stored in r5 when the STRD instruction is being executed ?
> > 
> > 
> > >> => 0x8c24 <initialise_monitor_handles+156>:     strd    r2, [r5, #12]
> > >> (gdb) info reg r5
> > >> r5             0x1b6e8  112360
> > 
> > >> ...which is 64 bit aligned.
> > 
> > But, as you have just discovered, (r5 + 12) is not 64-bit aligned...
> 
> But from ARMv7 onwards it only has to be 4-byte aligned, which it is.  And
> this code was build for cortex-a9, which is ARMv7-a.

In that case, unless I'm missing something, the simulator looks wrong.

The unalignment occurs in initialise_monitor_files() here:

openfiles[0].handle = monitor_stdin;

(gdb) p &openfiles[0].handle
$14 = (int *) 0x1b6f4 <openfiles>
(gdb) p/x (unsigned int)$14 % 4
$15 = 0x0
(gdb) p/x (unsigned int)$14 % 8
$16 = 0x4

So openfiles[0].handle is aligned to 4 bytes, but not to 8.  Forthat matter, &openfiles is 4 byte aligned only.  And Richard says that is ok.

So, why is Handle_Store_Double() unilaterally barfing on non 64-bit alignment?

  /* The address must be aligned on a 8 byte boundary.  */
  if (addr & 0x7)
    {
#ifdef ABORTS
      ARMul_DATAABORT (addr);
#else
      ARMul_UndefInstr (state, instr);
#endif
      return;
    }
Comment 25 Aldy Hernandez 2018-02-05 18:04:47 UTC
> Aldy - these easiest thing for now would be to unilaterally relax the
> alignment
> test in Handle_Store_Double and see if that allows you to get further with
> your
> tests.


We're debugging past each other :).  I was already doing that, but it only mildly helps:

pc: 8358,  ldm  r3, {r0, r1, r2, r3}
pc: 835c,  stm  ip, {r0, r1, r2, r3}
pc: 8360,  sub  r3, fp, #40     ; 0x28
pc: 8364,  mov  r0, r3
pc: 8368,  blx  0x0000000000000698
 pc changed to 8a00
pc: 8a00, Thumb instr: f890|f000   emulate as: e5d0f000  ldrb   pc, [r0]        ; <UNPREDICTABLE>
 pc changed to a

I'm done mucking around with the simulator.  I'll file a GDB/sim PR for the alignment issue though.  Thanks.

> 
> (But yes, I agree, a reduced testcase would be a much better help than all
> this
> mucking about in the simulator).

Yes please.  Cristophe?
Comment 26 Christophe Lyon 2018-02-06 10:35:00 UTC
I've manually built or tried to build several revisions:
* 197671: build OK, test fails to run at -fno-vect-cost-model -O3 -g
* 197669: same (!)
* 197815: GCC fails to build
* 197816: same
* 197900: same

So.... although I still see the test failing, I don't understand why bisect said the guilty commit would be between 197671 and 197815, given that the test fails at 197669....
Comment 27 Aldy Hernandez 2018-02-06 10:40:02 UTC
(In reply to Christophe Lyon from comment #26)
> I've manually built or tried to build several revisions:
> * 197671: build OK, test fails to run at -fno-vect-cost-model -O3 -g
> * 197669: same (!)
> * 197815: GCC fails to build
> * 197816: same
> * 197900: same
> 
> So.... although I still see the test failing, I don't understand why bisect
> said the guilty commit would be between 197671 and 197815, given that the
> test fails at 197669....

What's the first revision before 197669 that shows the test succeeding?
Comment 28 Christophe Lyon 2018-02-06 14:09:26 UTC
It's possible that my bisect script got confused by the fact the GCC started ICEing at -O2 on this test at r197671.

Investigating....
Comment 29 Christophe Lyon 2018-02-06 16:53:57 UTC
I still haven't found a commit where the test passes with -fno-vect-cost-model (before -O3).

I went back to r193053 (Nov 1, 2012), where I was able to build GCC but the test fails.
With a revision 1 month earlier, the GCC fails to build.
I tried with earlier revision, but soon reached a point where "-mfloat-abi=hard and VFP" is not implemented.
Comment 30 Aldy Hernandez 2018-02-06 17:49:36 UTC
(In reply to Christophe Lyon from comment #29)
> I still haven't found a commit where the test passes with
> -fno-vect-cost-model (before -O3).
> 
> I went back to r193053 (Nov 1, 2012), where I was able to build GCC but the
> test fails.
> With a revision 1 month earlier, the GCC fails to build.
> I tried with earlier revision, but soon reached a point where
> "-mfloat-abi=hard and VFP" is not implemented.

Ok, bisecting isn't getting us anywhere.  I apologize for making you going through all this in vain.

Could you further reduce the testcase as I suggested in comment 12, and perhaps we can start looking at the assembly to see where things are going wrong?
Comment 31 Christophe Lyon 2018-02-07 10:06:11 UTC
Created attachment 43352 [details]
Reduced testcase

I commented out most calls, since abort() is called from csub4.
Comment 32 Christophe Lyon 2018-02-07 10:06:52 UTC
Created attachment 43353 [details]
assembly for arm (little-endian)
Comment 33 Christophe Lyon 2018-02-07 10:07:16 UTC
Created attachment 43354 [details]
assembly for armeb (big-endian)
Comment 34 Christophe Lyon 2018-02-07 10:11:04 UTC
Created attachment 43355 [details]
execution traces for arm

I have removed the logs/traces for -O1/-O2/-Os/etc... and kept only -O3 -g
Comment 35 Christophe Lyon 2018-02-07 10:11:42 UTC
Created attachment 43356 [details]
execution traces for armeb
Comment 36 Christophe Lyon 2018-02-07 10:12:25 UTC
The attachments were generated with trunk r257076
Comment 37 Aldy Hernandez 2018-02-07 10:25:32 UTC
(In reply to Christophe Lyon from comment #31)
> Created attachment 43352 [details]
> Reduced testcase
> 
> I commented out most calls, since abort() is called from csub4.

Can you also remove the csub8, isub4, and isub8 unused functions as well?

I see you've commented out this in csub4:

!!  if (any(bb /= b)) call abort

I assume this is irrelevant to the failure?

Can you also verify that after these changes you have a revision of GCC for which this reduced testcase succeeds (regardless of the vect cost model rabbit hole), and a revision of GCC for which this fails?

I'm trying to make sure all this removing of stuff didn't cause an inconditional abort.

Also, is this only reproducible with -g?

BTW, no need to include the assembly.  I should be able to generate it with a cross ./cc1.
Comment 38 Christophe Lyon 2018-02-07 12:49:57 UTC
(In reply to Aldy Hernandez from comment #37)
> (In reply to Christophe Lyon from comment #31)
> > Created attachment 43352 [details]
> > Reduced testcase
> > 
> > I commented out most calls, since abort() is called from csub4.
> 
> Can you also remove the csub8, isub4, and isub8 unused functions as well?
> 
> I see you've commented out this in csub4:
> 
> !!  if (any(bb /= b)) call abort
> 
> I assume this is irrelevant to the failure?
> 
> Can you also verify that after these changes you have a revision of GCC for
> which this reduced testcase succeeds (regardless of the vect cost model
> rabbit hole), and a revision of GCC for which this fails?
> 
> I'm trying to make sure all this removing of stuff didn't cause an
> inconditional abort.

I don't speak fortran, but I thought the program did:
main-> call csub4-> call abort if some condition
In my testing, removing all calls but csub4 from main is sufficient to make the program fail, and then it seems the first call to abort in csub4 is taken too.

What would it change to remove csub8/isub4/isub8? (except not generating dead code, which is irrelevant to the current problem)


> 
> Also, is this only reproducible with -g?
I don't know, it's added by the torture harness. I wouldn't expect this to change code generation, though.

> 
> BTW, no need to include the assembly.  I should be able to generate it with
> a cross ./cc1.
Comment 39 Christophe Lyon 2018-02-07 12:52:16 UTC
Maybe we can demote this from P1?
I'm sure armeb is getting a lot of attention, given other bug reports.
Comment 40 Aldy Hernandez 2018-02-09 12:57:07 UTC
*sigh*  Confirmed with:

$ sudo yum install qemu
$ git clone https://git.linaro.org/toolchain/abe.git
$ (cd abe; git checkout stable)
$ mkdir build-cortex-a9
$ cd build-cortex-a9
$ ../abe/configure
$../abe/abe.sh --target armeb-linux-gnueabihf --set languages=fortran --build gcc,glibc,binutils gcc=gcc.git~master --set cpu=cortex-a9

I avoided --build all because gdb fails to build with --set cpu=cortex-a9.

Then I ran tests like this:

../abe/abe.sh --target armeb-linux-gnueabihf --set runtestflags=execute.exp=in-pack.f90 --build gcc --check gcc  gcc=gcc.git~master --disable update

which actually passed, because there doesn't seem to be a way to pass to pass -mfpu= to either the abe.sh configury, or the "runtestflags=" line.

So... then you do:

$ find . -name gfortran.log
./builds/x86_64-unknown-linux-gnu/armeb-linux-gnueabihf/gcc.git~master-stage2/gcc/testsuite/gfortran/gfortran.log

and fish out the gcc command line to run manually on the command line with -mcpu=cortex-a9 -mfpu=neon-fp16 -O3 -o i-wanna-kill-myself.exe

And finally:

$ qemu-armeb -cpu any -R 0 -L /home/cygnus/aldyh/bld/arm-hell/build-cortex-a9/sysroots/armeb-linux-gnueabihf i-wanna-kill-myself.exe
Program aborted. Backtrace:
qemu: uncaught target signal 6 (Aborted) - core dumped

I've verified with -O2 or without -mfpu=neon-fp16 succeeds, so the test isn't just dying for all flags :).

Sigh (have I said that before?).  I will now see if I can reduce the test and see where we are dying.

p.s. gcc.git~master directory seems to have a pretty up to date upstream GCC (as of today). I'll double check dropping in an actual FSF gcc doesn't change the results.
Comment 41 Wilco 2018-02-15 17:05:31 UTC
I'm guessing it's this (64-bit loads reading 32-bit data):

        vldr    d16, .L39
        vldr    d17, .L39+8
.L5:
        vsub.i32        q10, q11, q8
        add     r3, r3, #1
        vsub.i32        q9, q8, q11


.L39:
        .word   1
        .word   2
        .word   3
        .word   4
Comment 42 Wilco 2018-02-15 20:35:43 UTC
Cut down example:

typedef struct { int x, y; } X;

void f (X *p, int n)
{
  for (int i = 0; i < n; i++)
  { p[i].x = i;
    p[i].y = i + 1;
  }
}
Comment 43 Richard Biener 2018-02-16 08:26:40 UTC
(In reply to Wilco from comment #42)
> Cut down example:
> 
> typedef struct { int x, y; } X;
> 
> void f (X *p, int n)
> {
>   for (int i = 0; i < n; i++)
>   { p[i].x = i;
>     p[i].y = i + 1;
>   }
> }

Can't reproduce your assembler with -O3 -mcpu=cortex-a9 -mfpu=neon-fp16 [-fno-vect-cost-model] [-mthumb]

Without -fno-vect-cost-model we don't vectorize anything.  With we only
SLP vectorize and that using V1SI vector types (huh).

For the loop case:

t.c:6:3: note: Build SLP for _3->x = i_15;
t.c:6:3: note: Build SLP for _3->y = _4;
t.c:6:3: note: vect_is_simple_use: operand i_15
t.c:6:3: note: def_stmt: i_15 = PHI <0(5), _4(6)>
t.c:6:3: note: type of def: induction
t.c:6:3: note: vect_is_simple_use: operand _4
t.c:6:3: note: def_stmt: _4 = i_15 + 1;
t.c:6:3: note: type of def: internal
t.c:6:3: note: Build SLP failed: different types

ok, known missed handling of SLP induction.

t.c:6:3: note: ==> examining statement: _3->x = i_15;
t.c:6:3: note: vect_is_simple_use: operand i_15
t.c:6:3: note: def_stmt: i_15 = PHI <0(5), _4(6)>
t.c:6:3: note: type of def: induction
t.c:6:3: note: no array mode for DI[2]
permutaion op not supported by target.

so we don't support intereaving either.  Not sure why it talks about DI[2]
instead of SI[2].
Comment 44 Jakub Jelinek 2018-02-16 12:59:40 UTC
Maybe -O3 -mcpu=cortex-a9 -mfpu=neon-fp16 -mfloat-abi=hard is needed.
With that I certainly see the #c42 loop vectorized.

On x86_64 we get in *.optimized:
  <bb 5> [local count: 567644349]:
  # vect_vec_iv_.4_33 = PHI <{ 0, 1, 2, 3, 4, 5, 6, 7 }(4), vect_vec_iv_.4_34(5)>
  # ivtmp.10_14 = PHI <ivtmp.10_85(4), ivtmp.10_23(5)>
  vect_vec_iv_.4_34 = vect_vec_iv_.4_33 + { 8, 8, 8, 8, 8, 8, 8, 8 };
  vect__4.5_36 = vect_vec_iv_.4_33 + { 1, 1, 1, 1, 1, 1, 1, 1 };
  vect_inter_high_39 = VEC_PERM_EXPR <vect_vec_iv_.4_33, vect__4.5_36, { 0, 8, 1, 9, 2, 10, 3, 11 }>;
  vect_inter_low_40 = VEC_PERM_EXPR <vect_vec_iv_.4_33, vect__4.5_36, { 4, 12, 5, 13, 6, 14, 7, 15 }>;
  _86 = (void *) ivtmp.10_14;
  MEM[base: _86, offset: 0B] = vect_inter_high_39;
  MEM[base: _86, offset: 32B] = vect_inter_low_40;
  ivtmp.10_23 = ivtmp.10_14 + 64;
  if (ivtmp.10_23 != _90)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]
which doesn't look optimal either, in this case I'd say better would be to have two IVs bumped by { 8, ... 8 } in each iteration, one starting with
{ 0, 1, 1, 2, 2, 3, 3, 4 } and another with
{ 4, 5, 5, 6, 6, 7, 7, 8 } or just one and add { 4, ... 4 }; to it for the second store and avoid both VEC_PERM_EXPRs in that case.

On armeb with the above options I see:
  <bb 5> [local count: 504572758]:
  # vect_vec_iv_.7_45 = PHI <{ 0, 1, 2, 3 }(4), vect_vec_iv_.7_46(5)>
  # ivtmp.31_128 = PHI <ivtmp.31_130(4), ivtmp.31_129(5)>
  vectp_p.9_49 = (int[8] *) ivtmp.31_128;
  vect_vec_iv_.7_46 = vect_vec_iv_.7_45 + { 4, 4, 4, 4 };
  vect__4.8_48 = vect_vec_iv_.7_45 + { 1, 1, 1, 1 };
  vect_array.11[0] = vect_vec_iv_.7_45;
  vect_array.11[1] = vect__4.8_48;
  MEM[(int *)vectp_p.9_49] = STORE_LANES (vect_array.11);
  ivtmp.31_129 = ivtmp.31_128 + 32;
  if (ivtmp.31_129 != _133)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]
which looks wrong to me (because vect_vec_iv_.7_45 and vect__4.8_48 really should be interleaved when stored into MEM[(int *)vectp_p.9_49]), but I really don't know what exactly the STORE_LANES does.
Comment 45 Jakub Jelinek 2018-02-16 13:09:20 UTC
Note the vectorized loop is pretty much the same on arm little-endian,
  # vect_vec_iv_.6_33 = PHI <{ 0, 1, 2, 3 }(4), vect_vec_iv_.6_34(5)>
  # ivtmp.12_14 = PHI <ivtmp.12_51(4), ivtmp.12_23(5)>
  vectp_p.8_37 = (int[8] *) ivtmp.12_14;
  vect_vec_iv_.6_34 = vect_vec_iv_.6_33 + { 4, 4, 4, 4 };
  vect__4.7_36 = vect_vec_iv_.6_33 + { 1, 1, 1, 1 };
  vect_array.10[0] = vect_vec_iv_.6_33;
  vect_array.10[1] = vect__4.7_36;
  MEM[(int *)vectp_p.8_37] = STORE_LANES (vect_array.10);
  ivtmp.12_23 = ivtmp.12_14 + 32;
  if (ivtmp.12_23 != _54)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]
for which we emit:
        vmov.i32        q12, #4  @ v4si
        vmov.i32        q9, #1  @ v4si
...
        vldr    d16, .L13
        vldr    d17, .L13+8
.L4:
        vmov    q10, q8  @ v4si
        vadd.i32        q11, q8, q9
        vadd.i32        q8, q8, q12
        vst2.32 {d20-d23}, [r3]!
        cmp     r3, r2
        bne     .L4

vst2.32 seems to be documented to do 32-bit interleaving, so if qN registers overlap d{2*N} and d{2*N+1} registers, I guess this does the right thing.
Comment 46 Jakub Jelinek 2018-02-16 13:22:48 UTC
Wonder if that:
  vect_array.11[0] = vect_vec_iv_.7_45;
  vect_array.11[1] = vect__4.8_48;
on armeb shouldn't have been [1] and [0] instead, otherwise we end up with:
(insn 35 37 38 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 0)
        (reg:V4SI 110 [ vect_vec_iv_.7 ])) "pr82518.c":8 939 {*neon_movv4si}
     (nil))
(insn 38 35 41 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 16)
        (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
            (reg:V4SI 171))) "pr82518.c":8 998 {*addv4si3_neon}
     (nil))
(insn 41 38 39 5 (set (reg:V4SI 110 [ vect_vec_iv_.7 ])
        (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
            (reg:V4SI 169))) 998 {*addv4si3_neon}
     (nil))
(insn 39 41 43 5 (set (mem:OI (post_inc:SI (reg:SI 152 [ ivtmp.31 ])) [2 MEM[(int *)vectp_p.9_49]+0 S32 A32])
        (unspec:OI [
                (reg:OI 155 [ vect_array.11 ])
                (unspec:V4SI [
                        (const_int 0 [0])
                    ] UNSPEC_VSTRUCTDUMMY)
            ] UNSPEC_VST2)) "pr82518.c":8 2396 {neon_vst2v4si}
     (expr_list:REG_INC (reg:SI 152 [ ivtmp.31 ])
        (nil)))
where pseudo 110 is the vect_vec_iv_.7_45 ({i, i + 1, i + 2, i + 3}) and
insn 38 adds {1, 1, 1, 1} to that.  It really depends on what exactly the
neon_vst2v4si instruction does on armeb.
        vmov.i32        q10, #4  @ v4si
        vmov.i32        q9, #1  @ v4si
...
        vldr    d16, .L19
        vldr    d17, .L19+8
.L4:
        vadd.i32        q11, q8, q9
        vst1.64 {d16-d17}, [sp:64]
        vadd.i32        q8, q8, q10
        vstr    d22, [sp, #16]
        vstr    d23, [sp, #24]
        vld1.64 {d22-d25}, [sp:64]
        vst2.32 {d22-d25}, [r3]!
If it works like on armel, except the elements of the vectors are byte-swapped, then it should be [1] and [0].
Comment 47 Wilco 2018-02-16 13:37:29 UTC
(In reply to Jakub Jelinek from comment #46)
> Wonder if that:
>   vect_array.11[0] = vect_vec_iv_.7_45;
>   vect_array.11[1] = vect__4.8_48;
> on armeb shouldn't have been [1] and [0] instead, otherwise we end up with:
> (insn 35 37 38 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 0)
>         (reg:V4SI 110 [ vect_vec_iv_.7 ])) "pr82518.c":8 939 {*neon_movv4si}
>      (nil))
> (insn 38 35 41 5 (set (subreg:V4SI (reg:OI 155 [ vect_array.11 ]) 16)
>         (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
>             (reg:V4SI 171))) "pr82518.c":8 998 {*addv4si3_neon}
>      (nil))
> (insn 41 38 39 5 (set (reg:V4SI 110 [ vect_vec_iv_.7 ])
>         (plus:V4SI (reg:V4SI 110 [ vect_vec_iv_.7 ])
>             (reg:V4SI 169))) 998 {*addv4si3_neon}
>      (nil))
> (insn 39 41 43 5 (set (mem:OI (post_inc:SI (reg:SI 152 [ ivtmp.31 ])) [2
> MEM[(int *)vectp_p.9_49]+0 S32 A32])
>         (unspec:OI [
>                 (reg:OI 155 [ vect_array.11 ])
>                 (unspec:V4SI [
>                         (const_int 0 [0])
>                     ] UNSPEC_VSTRUCTDUMMY)
>             ] UNSPEC_VST2)) "pr82518.c":8 2396 {neon_vst2v4si}
>      (expr_list:REG_INC (reg:SI 152 [ ivtmp.31 ])
>         (nil)))
> where pseudo 110 is the vect_vec_iv_.7_45 ({i, i + 1, i + 2, i + 3}) and
> insn 38 adds {1, 1, 1, 1} to that.  It really depends on what exactly the
> neon_vst2v4si instruction does on armeb.
>         vmov.i32        q10, #4  @ v4si
>         vmov.i32        q9, #1  @ v4si
> ...
>         vldr    d16, .L19
>         vldr    d17, .L19+8
> .L4:
>         vadd.i32        q11, q8, q9
>         vst1.64 {d16-d17}, [sp:64]
>         vadd.i32        q8, q8, q10
>         vstr    d22, [sp, #16]
>         vstr    d23, [sp, #24]
>         vld1.64 {d22-d25}, [sp:64]
>         vst2.32 {d22-d25}, [r3]!
> If it works like on armel, except the elements of the vectors are
> byte-swapped, then it should be [1] and [0].

The vst2 works on little endian, but in big-endian the lane numbering is complex since all data is still treated as 64-bit quantities. 

The stores and vld1.64 have no effect on data layout, so everything is still 64-bit data in 64-bit registers. The vst2.32 can only be used in big-endian if the data is lane-swapped first. AArch64 in big-endian does this:

.L26:
	mov	v2.16b, v0.16b
	add	v3.4s, v0.4s, v6.4s
	add	v0.4s, v0.4s, v7.4s
	tbl	v4.16b, {v2.16b}, v1.16b
	tbl	v5.16b, {v3.16b}, v1.16b
	st2	{v4.4s - v5.4s}, [x2], 32
	cmp	x2, x3
	bne	.L26
Comment 48 Jakub Jelinek 2018-02-16 14:01:59 UTC
Can someone familiar with ARM please take this over?  Either this is a backend bug, or a bug in the vectorizer part specific to only ARM/AArch64 (no other target has STORE_LANES stuff).  This is a P1 (for now), so it would be nice to get it resolved soon.
Comment 49 Wilco 2018-02-16 16:08:34 UTC
AArch64 does this:

(define_expand "vec_store_lanesoi<mode>"
  [(set (match_operand:OI 0 "aarch64_simd_struct_operand" "=Utv")
        (unspec:OI [(match_operand:OI 1 "register_operand" "w")
                    (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                   UNSPEC_ST2))]
  "TARGET_SIMD"
{
  if (BYTES_BIG_ENDIAN)
    {
      rtx tmp = gen_reg_rtx (OImode);
      rtx mask = aarch64_reverse_mask (<MODE>mode, <nunits>);
      emit_insn (gen_aarch64_rev_reglistoi (tmp, operands[1], mask));
      emit_insn (gen_aarch64_simd_st2<mode> (operands[0], tmp));
    }
  else
    emit_insn (gen_aarch64_simd_st2<mode> (operands[0], operands[1]));
  DONE;
})

ARM seems to be missing the swap:

(define_expand "vec_store_lanesoi<mode>"
  [(set (match_operand:OI 0 "neon_struct_operand")
        (unspec:OI [(match_operand:OI 1 "s_register_operand")
                    (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
                   UNSPEC_VST2))]
  "TARGET_NEON")

So clearly looks like a backend issue.
Comment 50 ktkachov 2018-02-20 12:12:01 UTC
(In reply to Wilco from comment #49)
> AArch64 does this:
> 
> (define_expand "vec_store_lanesoi<mode>"
>   [(set (match_operand:OI 0 "aarch64_simd_struct_operand" "=Utv")
>         (unspec:OI [(match_operand:OI 1 "register_operand" "w")
>                     (unspec:VQ [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
>                    UNSPEC_ST2))]
>   "TARGET_SIMD"
> {
>   if (BYTES_BIG_ENDIAN)
>     {
>       rtx tmp = gen_reg_rtx (OImode);
>       rtx mask = aarch64_reverse_mask (<MODE>mode, <nunits>);
>       emit_insn (gen_aarch64_rev_reglistoi (tmp, operands[1], mask));
>       emit_insn (gen_aarch64_simd_st2<mode> (operands[0], tmp));
>     }
>   else
>     emit_insn (gen_aarch64_simd_st2<mode> (operands[0], operands[1]));
>   DONE;
> })
> 
> ARM seems to be missing the swap:
> 
> (define_expand "vec_store_lanesoi<mode>"
>   [(set (match_operand:OI 0 "neon_struct_operand")
>         (unspec:OI [(match_operand:OI 1 "s_register_operand")
>                     (unspec:VQ2 [(const_int 0)] UNSPEC_VSTRUCTDUMMY)]
>                    UNSPEC_VST2))]
>   "TARGET_NEON")
> 
> So clearly looks like a backend issue.

Indeed, and arm is missing the equivalent logic, including the reverse_mask, rev_reglist etc.

For GCC 8 and the branches the least invasive fix would be to return false for BYTES_BIG_ENDIAN in arm_array_mode_supported_p. That will disable the use of the vec_load, vec_store lanes on big-endian. vectorisation on arm NEON is already severely restricted (look at all the patterns in neon.md gated on !BYTES_BIG_ENDIAN) and the vec_load/store_lanes has never worked correctly on that target as far as I can see, so switching it off properly is not a radical change.

At some point we'll want to take a holistic approach for NEON big-endian and fix up (and document!) the lane-ordering everywhere, but the priority at this stage is to fix the wrong-code in a not-too-invasive way.
Comment 51 ktkachov 2018-02-20 12:14:47 UTC
Testing a simple patch along the lines described in comment #50
Comment 52 ktkachov 2018-03-20 17:13:48 UTC
Author: ktkachov
Date: Tue Mar 20 17:13:16 2018
New Revision: 258687

URL: https://gcc.gnu.org/viewcvs?rev=258687&root=gcc&view=rev
Log:
This PR shows that we get the load/store_lanes logic wrong for arm big-endian.
It is tricky to get right. Aarch64 does it by adding the appropriate lane-swapping
operations during expansion.

I'd like to do the same on arm eventually, but we'd need to port and validate the VTBL-generating
code and add it to all the right places and I'm not comfortable enough doing it for GCC 8, but I am keen
in getting the wrong-code fixed.
As I say in the PR, vectorisation on armeb is already severely restricted (we disable many patterns on BYTES_BIG_ENDIAN)
and the load/store_lanes patterns really were not working properly at all, so disabling them is not
a radical approach.

The way to do that is to return false in ARRAY_MODE_SUPPORTED_P for BYTES_BIG_ENDIAN.

Bootstrapped and tested on arm-none-linux-gnueabihf.
Also tested on armeb-none-eabi.


     PR target/82518
     * config/arm/arm.c (arm_array_mode_supported_p): Return false for
     BYTES_BIG_ENDIAN.

     * lib/target-supports.exp (check_effective_target_vect_load_lanes):
     Disable for armeb targets.
     * gcc.target/arm/pr82518.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/arm/pr82518.c
Modified:
    trunk/gcc/config/arm/arm.c
    trunk/gcc/testsuite/lib/target-supports.exp
Comment 53 ktkachov 2018-03-20 17:17:08 UTC
Fixed for GCC 8. Will backport fix to branches later if no problems
Comment 54 Christophe Lyon 2018-03-21 09:26:35 UTC
Author: clyon
Date: Wed Mar 21 09:26:03 2018
New Revision: 258707

URL: https://gcc.gnu.org/viewcvs?rev=258707&root=gcc&view=rev
Log:
PR target/82518: Fix testcase.

2018-03-21  Christophe Lyon  <christophe.lyon@linaro.org>

	* gcc.target/arm/pr82518.c: Require arm_neon_hw.


Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/arm/pr82518.c
Comment 55 ktkachov 2018-03-21 09:36:55 UTC
Author: ktkachov
Date: Wed Mar 21 09:36:24 2018
New Revision: 258708

URL: https://gcc.gnu.org/viewcvs?rev=258708&root=gcc&view=rev
Log:
Commit missing Changelogs for PR target/82518 fix.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
Comment 56 ktkachov 2018-03-27 11:20:27 UTC
Author: ktkachov
Date: Tue Mar 27 11:19:55 2018
New Revision: 258874

URL: https://gcc.gnu.org/viewcvs?rev=258874&root=gcc&view=rev
Log:
[arm] PR target/82518: Return false in ARRAY_MODE_SUPPORTED_P for BYTES_BIG_ENDIAN

	Backport from mainline
	2018-03-20  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

	PR target/82518
	* config/arm/arm.c (arm_array_mode_supported_p): Return false for
	BYTES_BIG_ENDIAN.

	* lib/target-supports.exp (check_effective_target_vect_load_lanes):
	Disable for armeb targets.
	* gcc.target/arm/pr82518.c: New test.

Added:
    branches/gcc-7-branch/gcc/testsuite/gcc.target/arm/pr82518.c
Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/config/arm/arm.c
    branches/gcc-7-branch/gcc/testsuite/ChangeLog
    branches/gcc-7-branch/gcc/testsuite/lib/target-supports.exp
Comment 57 ktkachov 2018-03-27 13:07:54 UTC
Author: ktkachov
Date: Tue Mar 27 13:07:22 2018
New Revision: 258879

URL: https://gcc.gnu.org/viewcvs?rev=258879&root=gcc&view=rev
Log:
[arm] PR target/82518: Return false in ARRAY_MODE_SUPPORTED_P for BYTES_BIG_ENDIAN

	Backport from mainline
	2018-03-20  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

	PR target/82518
	* config/arm/arm.c (arm_array_mode_supported_p): Return false for
	BYTES_BIG_ENDIAN.

	* lib/target-supports.exp (check_effective_target_vect_load_lanes):
	Disable for armeb targets.
	* gcc.target/arm/pr82518.c: New test.

Added:
    branches/gcc-6-branch/gcc/testsuite/gcc.target/arm/pr82518.c
Modified:
    branches/gcc-6-branch/gcc/ChangeLog
    branches/gcc-6-branch/gcc/config/arm/arm.c
    branches/gcc-6-branch/gcc/testsuite/ChangeLog
    branches/gcc-6-branch/gcc/testsuite/lib/target-supports.exp
Comment 58 ktkachov 2018-03-27 13:30:04 UTC
Fixed on all active branches.
Comment 59 ktkachov 2018-04-27 08:56:34 UTC
Author: ktkachov
Date: Fri Apr 27 08:56:02 2018
New Revision: 259700

URL: https://gcc.gnu.org/viewcvs?rev=259700&root=gcc&view=rev
Log:
[arm] PR target/82518: Return false in ARRAY_MODE_SUPPORTED_P for BYTES_BIG_ENDIAN followup

	PR target/82518
	* lib/target-supports.exp (check_effective_target_vect_load_lanes):
	Use check_effective_target_arm_little_endian.


Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/lib/target-supports.exp
Comment 60 ktkachov 2018-05-04 16:36:04 UTC
Author: ktkachov
Date: Fri May  4 16:35:32 2018
New Revision: 259940

URL: https://gcc.gnu.org/viewcvs?rev=259940&root=gcc&view=rev
Log:
[arm] PR target/82518: Return false in ARRAY_MODE_SUPPORTED_P for BYTES_BIG_ENDIAN pt2

	Backport from trunk
	2018-04-27  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

	PR target/82518
	* lib/target-supports.exp (check_effective_target_vect_load_lanes):
	Use check_effective_target_arm_little_endian.

Modified:
    branches/gcc-8-branch/gcc/testsuite/ChangeLog
    branches/gcc-8-branch/gcc/testsuite/lib/target-supports.exp