63304 – Aarch64 pc-relative load offset out of range

Bug 63304 - Aarch64 pc-relative load offset out of range

Summary: Aarch64 pc-relative load offset out of range

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	5.0

Importance:	P5 normal
Target Milestone:	6.0
Assignee:	Ramana Radhakrishnan

URL:
Keywords:	assemble-failure

Depends on:
Blocks:

Reported:	2014-09-19 05:31 UTC by Venkataramanan
Modified:	2016-01-25 12:45 UTC (History)
CC List:	4 users (show)

See Also:
Host:
Target:	aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed:	2014-09-19 00:00:00

Attachments
Attached test case (542 bytes, application/x-xz) 2014-09-19 05:38 UTC, Venkataramanan	Details
31-lines, minimal testcase (229 bytes, text/plain) 2015-01-10 11:24 UTC, David Abdurachmanov	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Venkataramanan 2014-09-19 05:31:03 UTC

Constant literal table is kept at large offset, resulting in pc-relative load offset out of range.

aarch64-none-linux-gnu-gcc x.c 
/tmp/ccrOQLEb.s: Assembler messages:
/tmp/ccrOQLEb.s:10: Error: pc-relative load offset out of range

Comment 1 Venkataramanan 2014-09-19 05:38:34 UTC

Created attachment 33515 [details]
Attached test case

Comment 2 Venkataramanan 2014-09-19 05:44:01 UTC

Marcus, can you please assign it to me if it is confirmed.

Comment 3 Andrew Pinski 2014-09-19 07:56:42 UTC

Confirmed.

From .expand:
(insn 5 4 6 (set (reg:DF 75)
        (mem/u/c:DF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0  S8 A64])) /home/pinskia/Downloads/x.c:7 -1
     (nil))

Note next time please use macros to create your testcase, it is ok sometimes to use non preprocessed source

Comment 4 Richard Earnshaw 2014-09-19 10:19:15 UTC

I see this as a theoretical problem that's unlikely to ever manifest itself with real code (functions generating .5 million instructions would take insanely long to compile in real life).

Unless you can show some *real* code that exhibits this problem I propose we don't try to fix this; it's just a limit on the compiler.

We have bigger problems to worry about.

Comment 5 Venkataramanan 2014-09-19 14:09:53 UTC

We got inspired by this bug.
https://bugs.linaro.org/show_bug.cgi?id=400 
It happens at -O0 now.

Comment 6 Richard Earnshaw 2014-09-19 15:02:39 UTC

OK.  But I still wouldn't loose much sleep over it.  It wouldn't surprise me that if you turn the optimizer on it then blows up by using too much memory...

Comment 7 Venkataramanan 2014-09-19 17:58:43 UTC

Ok I am closing this as wont fix now. Checked with Mitthias who reported the problem to us and he is fine building it with -O2.

Comment 8 Andrew Pinski 2014-09-19 18:00:16 UTC

This code is still valid and really should be fixed; maybe not today or for GCC 5 or GCC 6 but it is still a bug.

Comment 9 Andrew Pinski 2014-10-12 00:42:34 UTC

Actually this is much worse than what is mentioned here.  Having the constant pool be part of the .text section really does not work if the alignment of the constants are smaller than 4 byte aligned.  The assembler rejects the debugging info as the end label of the text section is not 4 byte aligned.
I am working on fixing this bug and the bug dealing with the constant pool in the text section (which is wrong and is not shared between translational units really).

Comment 10 Andrew Pinski 2014-10-12 02:39:43 UTC

More point is with LTO you can get this even with optimization turned on.  If the text section is big enough.

Comment 11 David Abdurachmanov 2014-12-28 20:14:59 UTC

I believe, I hit this issue with OpenLoops package on AArch64, while it works fine on x86_64.

It shows up on 3 of Fortran files:
scons: *** [process_obj/pplljjj/loop_pplljjj_eexddxggg_1_qp.os] Error 1
scons: *** [process_obj/pplljjj/loop_pplljjj_eexuuxggg_1_qp.os] Error 1
scons: *** [process_obj/pplljjj/loop_pplljjj_eexbbxggg_1_qp.os] Error 1

I get "Error: pc-relative load offset out of range" 864 times.

Tested with GCC 4.9.1 and binutils 2.24.

I also looked into f9edeb70961d404caac5a849b0783c53228ddf62 / 213078, with it applied on top it gets worse, i.e. more files are affected. From 864 it increases to 1392 errors.

This particular library (pplljjj) is 158MB on x86_64. Most of it (153MB) belongs to .text section.

The code must be compiled with -O0 otherwise machines have hard time handling it:

gfortran: internal compiler error: Killed (program f951)
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://gcc.gnu.org/bugs.html> for instructions.

@Andrew,
I am happy to test your patches if necessary.

Comment 12 David Abdurachmanov 2015-01-05 10:27:18 UTC

I decided to re-enable -Os for OpenLoops. Then use powerful hardware with 32-physical-cores (x86_64) and 0.5TB of RAM to see if I could get lucky. Fired up QEMU user mode with Fedora for AArch64 chroot for compiling.

It did resolve some of failures, but not everything. I have seen processes going as far as 25+GB of RSS and taking hours to finish. This is the reason package is compiled with -O0.

Comment 13 David Abdurachmanov 2015-01-10 11:24:31 UTC

Created attachment 34416 [details]
31-lines, minimal testcase

I am adding 31-lines minimal testcase. Should be good enough for GCC testsuite.

$ gcc -O0 -c pr63304-testcase.c
/tmp/ccKdBqsL.s: Assembler messages:
/tmp/ccKdBqsL.s:58: Error: pc-relative load offset out of range
/tmp/ccKdBqsL.s:62: Error: pc-relative load offset out of range

Yesterday I looked into RTL output and assembly of some failures for OpenLoops. The function was 400+K lines in assembly. On x86_64 it was something 180+K lines of assembly and ~1.3MB for function body size.

I can confirm that offset is above 1MB mark and it was trying to load __float128/long double to q1 from constant pool for passing to `addtf3`.

Comment 14 David Abdurachmanov 2015-07-19 11:42:46 UTC

I hit another two cases of this.

1. g2root tool, which converts GEANT geometry to ROOT geometry. It create a single function, which contains lots of descriptions of material, shapes, etc. all describing some 3D objects and physical properties. These can be very huge.

Also could think of MILL computing as possible example. They use C++ as they assembler (at least for simulation). Thus you have something like:

void somefunc(void) {
  add(b0, b1);
  add(b2, b3);
  ...
}

IIRC, by compiling it they generate a simulator to run this program on specific CPU.

2. mcfm 6.3 package, to be precise: mcfm-6.3/src/WW/triangle11new.f

triangle11new.s: Assembler messages:
triangle11new.s:587: Error: pc-relative load offset out of range
triangle11new.s:592: Error: pc-relative load offset out of range

..
   587         ldr     x0, .LC2
   588         fmov    d0, x2
   589         fmov    d1, x0
   590         fdiv    d0, d0, d1
   591         fmov    x2, d0
   592         ldr     x0, .LC2
..

346645 .LC2:
346646         .word   0
346647         .word   1073741824
346648         .align  3
..

Distance between ldr instruction and .LC2 is 346058 assembly lines.

Here is the source file: https://github.com/cms-externals/MCFM/blob/master/src/WW/triangle11new.f (just 1355 sloc).

It's those huge computations, which are killing it. Similar issue as in OpenLoops package.

Comment 15 Andrew Pinski 2015-07-20 09:42:23 UTC

Not working on this any time soon.  But someone from ARM really should look into fixing this as it blocks standard C/C++ code from HPC and distros.

Comment 16 Jiong Wang 2015-07-20 10:42:54 UTC

Have done a quick look at this, basic ideas to fix this:

  * generate a special pattern which initialize literal pool start address.
  * implement TARGET_MACHINE_DEPENDENT_REORG to calculate whehter the
    pc-relative literal load is within range.
  * output final insruction sequences which initializing literal pool start
    address based on the result from reorg pass analysis. Use movk/z, adrp + add,
    single adr for different distance.

Comment 17 Wilco 2015-07-20 12:58:41 UTC

Well there seem to be 2 ways to address this:

* If a function is huge, emit literals as const data. This enables the use of anchors and sharing of literals across all functions in a compilation unit.

* Reserve a register in the adr/ldr literal patterns and add a 2-instruction sequence using adrp when out of range. Ideally the register should only be reserved if a function is huge.

Comment 18 Ramana Radhakrishnan 2015-07-22 16:06:17 UTC

I'm taking a look into this.

Comment 19 Ramana Radhakrishnan 2015-07-27 14:34:23 UTC

(In reply to Ramana Radhakrishnan from comment #18)
> I'm taking a look into this.

RFC here - https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02258.html

Comment 20 Ramana Radhakrishnan 2015-07-29 10:51:31 UTC

(In reply to Ramana Radhakrishnan from comment #19)
> (In reply to Ramana Radhakrishnan from comment #18)
> > I'm taking a look into this.
> 
> RFC here - https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02258.html

David - can you see if this works for you ?

regards
Ramana

Comment 21 David Abdurachmanov 2015-07-29 11:19:04 UTC

I am on vacations now, but I already marked this on my TODO list. Once I find a free time slot I will give it a spin. I will try to report in a few days.

BTW, I will also show up at GNU Tools Cauldron 2015.

Comment 22 Ramana Radhakrishnan 2015-07-30 08:02:50 UTC

(In reply to David Abdurachmanov from comment #21)
> I am on vacations now, but I already marked this on my TODO list. Once I
> find a free time slot I will give it a spin. I will try to report in a few
> days.
> 
> BTW, I will also show up at GNU Tools Cauldron 2015.

I must also add that this patch is a pre-requisite for the appropriate iterators to be available.

https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02074.html

and this may need some rebasing after Alan's latest patches to handle fp16.

Comment 23 David Abdurachmanov 2015-08-07 11:29:49 UTC

GCC trunk
  r226676
  or 15af172f2a0ea281969e3105da9f9bb100097c7d from git://gcc.gnu.org/git/gcc.git
  Date:   Thu Aug 6 14:26:18 2015 +0000)

Rebased and applied:

  https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02074.html
  https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02258.html

binutils trunk
  f0ce0d3a331129309a46a6a9ac85fce35acae72b from git://sourceware.org/git/binutils-gdb.git
  Date:   Thu Jul 23 16:01:01 2015 +0100)

Rebases and applied:
  https://sourceware.org/ml/binutils/2015-07/msg00137.html

$ gcc -dumpversion
6.0.0
$ ld --version
GNU ld (GNU Binutils) 2.25.51.20150806

I __successfully__ compiled OpenLoops 1.1.1 and MCFM 6.3 packages.

Comment 24 Ramana Radhakrishnan 2015-09-11 09:44:57 UTC

Author: ramana
Date: Fri Sep 11 09:44:26 2015
New Revision: 227679

URL: https://gcc.gnu.org/viewcvs?rev=227679&root=gcc&view=rev
Log:
Remove separate movtf pattern - Use an iterator for all FP modes.
    
movtf is unnecessary as a separate expander. Move this to be with
the standard scalar floating point expanders.

Achieved by adding a new iterator and then using the same.

Tested cross aarch64-none-elf and no regressions.

Rebased version from https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00767.html


2015-09-10  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

	PR target/63304
        * config/aarch64/aarch.md (mov<mode>:GPF_F16): Use GPF_TF_F16.
        (movtf): Delete.
        * config/aarch64/iterators.md (GPF_TF_F16): New.
        (GPF_F16): Delete.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.md
    trunk/gcc/config/aarch64/iterators.md

Comment 25 Ramana Radhakrishnan 2015-09-14 13:17:31 UTC

Author: ramana
Date: Mon Sep 14 13:16:59 2015
New Revision: 227748

URL: https://gcc.gnu.org/viewcvs?rev=227748&root=gcc&view=rev
Log:
[AArch64] Handle literal pools for functions > 1 MiB in size.
    

This patch fixes the issue in PR63304 where we have
functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add
instructions to address the literal pools under the use of a command line
option. I would like to turn this on by default on trunk but keep this
disabled by default for the release branches in order to get some
serious testing for this feature while it bakes on trunk.

As a follow-up I would like to try and see if estimate_num_insns or
something else can give us a heuristic to turn this on for "large" functions.
After all the number of incidences of this are quite low in real life,
so may be we should look to restrict this use as much as possible on the
grounds that this code generation implies an extra integer register for
addressing for every floating point and vector constant and I don't think
that's great in code that already may have high register pressure.

Tested on aarch64-none-elf with no regressions. A previous
version was bootstrapped and regression tested.

Applied to trunk.

regards
Ramana

2015-09-14  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

    	PR target/63304
    	* config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle
    	nopcrelative_literal_loads.
    	(aarch64_classify_address): Likewise.
    	(aarch64_constant_pool_reload_icode): Define.
    	(aarch64_secondary_reload): Handle secondary reloads for
    	literal pools.
    	(aarch64_override_options): Handle nopcrelative_literal_loads.
    	(aarch64_classify_symbol): Handle nopcrelative_literal_loads.
    	* config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>):
    	Define.
    	(aarch64_reload_movcp<VALL:mode><P:mode>): Likewise.
    	* config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option.
    	* config/aarch64/predicates.md (aarch64_constant_pool_symref): New
    	predicate.
    	* doc/invoke.texi (mpc-relative-literal-loads): Document.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.c
    trunk/gcc/config/aarch64/aarch64.md
    trunk/gcc/config/aarch64/aarch64.opt
    trunk/gcc/config/aarch64/iterators.md
    trunk/gcc/config/aarch64/predicates.md
    trunk/gcc/doc/invoke.texi

Comment 26 Ramana Radhakrishnan 2015-10-09 10:58:38 UTC

Author: ramana
Date: Fri Oct  9 10:58:06 2015
New Revision: 228644

URL: https://gcc.gnu.org/viewcvs?rev=228644&root=gcc&view=rev
Log:
[AArch64] Handle literal pools for functions > 1 MiB in size.
    

This patch fixes the issue in PR63304 where we have
functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add
instructions to address the literal pools under the use of a command line
option. I would like to turn this on by default on trunk but keep this
disabled by default for the release branches in order to get some
serious testing for this feature while it bakes on trunk.

As a follow-up I would like to try and see if estimate_num_insns or
something else can give us a heuristic to turn this on for "large" functions.
After all the number of incidences of this are quite low in real life,
so may be we should look to restrict this use as much as possible on the
grounds that this code generation implies an extra integer register for
addressing for every floating point and vector constant and I don't think
that's great in code that already may have high register pressure.

Tested on aarch64-none-elf with no regressions. A previous
version was bootstrapped and regression tested.

Applied to trunk.

regards
Ramana

2015-09-14  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

    	PR target/63304
    	* config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle
    	nopcrelative_literal_loads.
    	(aarch64_classify_address): Likewise.
    	(aarch64_constant_pool_reload_icode): Define.
    	(aarch64_secondary_reload): Handle secondary reloads for
    	literal pools.
    	(aarch64_override_options): Handle nopcrelative_literal_loads.
    	(aarch64_classify_symbol): Handle nopcrelative_literal_loads.
    	* config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>):
    	Define.
    	(aarch64_reload_movcp<VALL:mode><P:mode>): Likewise.
    	* config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option.
    	* config/aarch64/predicates.md (aarch64_constant_pool_symref): New
    	predicate.
    	* doc/invoke.texi (mpc-relative-literal-loads): Document.


Added:
    trunk/gcc/testsuite/gcc.target/arm/pr67366.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-fold.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/lib/target-supports.exp

Comment 27 Ramana Radhakrishnan 2015-10-09 11:02:17 UTC

(In reply to Ramana Radhakrishnan from comment #26)
> Author: ramana
> Date: Fri Oct  9 10:58:06 2015
> New Revision: 228644
> 
> URL: https://gcc.gnu.org/viewcvs?rev=228644&root=gcc&view=rev
> Log:
> [AArch64] Handle literal pools for functions > 1 MiB in size.
>     
> 
> This patch fixes the issue in PR63304 where we have
> functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add
> instructions to address the literal pools under the use of a command line
> option. I would like to turn this on by default on trunk but keep this
> disabled by default for the release branches in order to get some
> serious testing for this feature while it bakes on trunk.
> 
> As a follow-up I would like to try and see if estimate_num_insns or
> something else can give us a heuristic to turn this on for "large" functions.
> After all the number of incidences of this are quite low in real life,
> so may be we should look to restrict this use as much as possible on the
> grounds that this code generation implies an extra integer register for
> addressing for every floating point and vector constant and I don't think
> that's great in code that already may have high register pressure.
> 
> Tested on aarch64-none-elf with no regressions. A previous
> version was bootstrapped and regression tested.
> 
> Applied to trunk.
> 
> regards
> Ramana
> 
> 2015-09-14  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
> 
>     	PR target/63304
>     	* config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle
>     	nopcrelative_literal_loads.
>     	(aarch64_classify_address): Likewise.
>     	(aarch64_constant_pool_reload_icode): Define.
>     	(aarch64_secondary_reload): Handle secondary reloads for
>     	literal pools.
>     	(aarch64_override_options): Handle nopcrelative_literal_loads.
>     	(aarch64_classify_symbol): Handle nopcrelative_literal_loads.
>     	* config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>):
>     	Define.
>     	(aarch64_reload_movcp<VALL:mode><P:mode>): Likewise.
>     	* config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option.
>     	* config/aarch64/predicates.md (aarch64_constant_pool_symref): New
>     	predicate.
>     	* doc/invoke.texi (mpc-relative-literal-loads): Document.
> 
> 
> Added:
>     trunk/gcc/testsuite/gcc.target/arm/pr67366.c
> Modified:
>     trunk/gcc/ChangeLog
>     trunk/gcc/gimple-fold.c
>     trunk/gcc/testsuite/ChangeLog
>     trunk/gcc/testsuite/lib/target-supports.exp

This commit really was for PR67366.. Copied wrong text into the commit message.

Comment 28 Ramana Radhakrishnan 2015-10-09 11:08:36 UTC

Author: ramana
Revision: 228644
Modified property: svn:log

Modified: svn:log at Fri Oct  9 11:08:05 2015
------------------------------------------------------------------------------
--- svn:log (original)
+++ svn:log Fri Oct  9 11:08:05 2015
@@ -1,45 +1,43 @@
-[AArch64] Handle literal pools for functions > 1 MiB in size.
-    
+[PATCH PR target/67366 2/2] [gimple-fold.c] Support movmisalign optabs in gimple-fold.c
 
-This patch fixes the issue in PR63304 where we have
-functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add
-instructions to address the literal pools under the use of a command line
-option. I would like to turn this on by default on trunk but keep this
-disabled by default for the release branches in order to get some
-serious testing for this feature while it bakes on trunk.
+This patch by Richard allows for movmisalign optabs to be supported
+in gimple-fold.c. This caused a bit of pain in the testsuite with strlenopt-8.c
+in conjunction with the ARM support for movmisalign_optabs as the test
+was coded up to do different things depending on whether the target
+supported misaligned access or not. However now with unaligned access
+being allowed for different levels of the architecture in the arm backend,
+the concept of the helper function non_strict_align mapping identically
+to the definition of STRICT_ALIGNMENT disappears.
 
-As a follow-up I would like to try and see if estimate_num_insns or
-something else can give us a heuristic to turn this on for "large" functions.
-After all the number of incidences of this are quite low in real life,
-so may be we should look to restrict this use as much as possible on the
-grounds that this code generation implies an extra integer register for
-addressing for every floating point and vector constant and I don't think
-that's great in code that already may have high register pressure.
+Adjusted thusly for ARM. The testsuite/lib changes were tested with an
+arm-none-eabi multilib that included architecture variants that did not
+support unaligned access and architecture variants that did.
 
-Tested on aarch64-none-elf with no regressions. A previous
-version was bootstrapped and regression tested.
+The testing matrix for this patch was:
 
-Applied to trunk.
+1. x86_64 bootstrap and regression test - no regressions.
+2. armhf bootstrap and regression test - no regressions.
+3. arm-none-eabi cross build and regression test for
 
-regards
-Ramana
+{-marm/-march=armv7-a/-mfpu=vfpv3-d16/-mfloat-abi=softfp}
+{-mthumb/-march=armv8-a/-mfpu=crypto-neon-fp-armv8/-mfloat-abi=hard}
+{-marm/-mcpu=arm7tdmi/-mfloat-abi=soft}
+{-mthumb/-mcpu=arm7tdmi/-mfloat-abi=soft}
 
-2015-09-14  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
+with no regressions.
 
-    	PR target/63304
-    	* config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle
-    	nopcrelative_literal_loads.
-    	(aarch64_classify_address): Likewise.
-    	(aarch64_constant_pool_reload_icode): Define.
-    	(aarch64_secondary_reload): Handle secondary reloads for
-    	literal pools.
-    	(aarch64_override_options): Handle nopcrelative_literal_loads.
-    	(aarch64_classify_symbol): Handle nopcrelative_literal_loads.
-    	* config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>):
-    	Define.
-    	(aarch64_reload_movcp<VALL:mode><P:mode>): Likewise.
-    	* config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option.
-    	* config/aarch64/predicates.md (aarch64_constant_pool_symref): New
-    	predicate.
-    	* doc/invoke.texi (mpc-relative-literal-loads): Document.
+Ok to apply ?
 
+2015-10-09  Richard Biener  <rguenth@suse.de>
+
+	PR target/67366
+	* gimple-fold.c (optabs-query.h): Include
+	(gimple_fold_builtin_memory_op): Allow unaligned stores
+	when movmisalign_optabs are available.
+
+2015-10-09  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>
+
+	PR target/67366
+	* lib/target-supports.exp (check_effective_target_non_strict_align):
+	Adjust for arm*-*-*.
+	* gcc.target/arm/pr67366.c: New test.

Comment 29 Ramana Radhakrishnan 2015-10-22 04:27:24 UTC

Author: ramana
Date: Thu Oct 22 04:26:50 2015
New Revision: 229160

URL: https://gcc.gnu.org/viewcvs?rev=229160&root=gcc&view=rev
Log:
[Patch AArch64 63304] Fix issue with global state.


Jiong pointed out privately that there was a thinko
in the way in which the global state was being
set and reset. I don't like adding such
global state but ....


2015-10-22  Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

        PR target/63304
        * config/aarch64/aarch64.c (aarch64_nopcrelative_literal_loads): New.
        (aarch64_expand_mov_immediate): Use aarch64_nopcrelative_literal_loads.
        (aarch64_classify_address): Likewise.
        (aarch64_secondary_reload): Likewise.
        (aarch64_override_options_after_change_1): Adjust.
        * config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>):
        Use aarch64_nopcrelative_literal_loads.
        (aarch64_reload_movcp<VALL:mode><P:mode>): Likewise.
        * config/aarch64/aarch64-protos.h (aarch64_nopcrelative_literal_loads): 
	Declare

2015-10-22  Jiong Wang  <jiong.wang@arm.com>
            Ramana Radhakrishnan  <ramana.radhakrishnan@arm.com>

        PR target/63304
        * gcc.target/aarch64/pr63304_1.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/aarch64/pr63304_1.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64-protos.h
    trunk/gcc/config/aarch64/aarch64.c
    trunk/gcc/config/aarch64/aarch64.md
    trunk/gcc/testsuite/ChangeLog

Comment 30 Evandro 2015-11-06 19:29:23 UTC

The performance impact of always referring to constants as if they were far away is significant on targets which do not fuse ADRP and LDR together.  What's the status of the solution that evaluates the function size?  Should this be optionally enabled only?  Would it be the case to come up with a medium code model? :-P  Could the assembler be left to address this issue by relaxing such loads? :-P  

Thank you.

Comment 31 Ramana Radhakrishnan 2015-11-06 19:52:37 UTC

(In reply to Evandro from comment #30)
> The performance impact of always referring to constants as if they were far
> away is significant on targets which do not fuse ADRP and LDR together. 

What happens if you split them up and schedule them appropriately ? I didn't see any significant impact in my benchmarking on implementations that did not implement such fusion. Where people want performance in these cases they can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there already.

> What's the status of the solution that evaluates the function size? 

I am not working on that follow-up as I didn't see the real need for it in the benchmarking results I was looking at. You are welcome to investigate.

> Should
> this be optionally enabled only?  

It is enabled by default for -mcmodel=small and -mcmodel=large. 

And no because it has been done after quite a lot of complaints from the general user community that people are unable to build large software bases with the compiler.

> Could the assembler be left to address this issue by
> relaxing such loads? :-P  

No...

Comment 32 Evandro 2015-11-06 20:14:19 UTC

(In reply to Ramana Radhakrishnan from comment #31)
> (In reply to Evandro from comment #30)
> > The performance impact of always referring to constants as if they were far
> > away is significant on targets which do not fuse ADRP and LDR together. 
> 
> What happens if you split them up and schedule them appropriately ? I didn't
> see any significant impact in my benchmarking on implementations that did
> not implement such fusion. Where people want performance in these cases they
> can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there
> already.

Because of side effects of the Haiffa scheduler, the loads now pile up, and the ADRPs may affect the load issue rate rather badly if not fused.  At leas on our processor.  

Which brings another point, shouldn't there be just one ADRP per BB or, ideally, per function?  Or am I missing something?

> > What's the status of the solution that evaluates the function size? 
> 
> I am not working on that follow-up as I didn't see the real need for it in
> the benchmarking results I was looking at. You are welcome to investigate.

OK

Comment 33 Wilco 2015-11-06 20:41:33 UTC

(In reply to Evandro from comment #32)
> (In reply to Ramana Radhakrishnan from comment #31)
> > (In reply to Evandro from comment #30)
> > > The performance impact of always referring to constants as if they were far
> > > away is significant on targets which do not fuse ADRP and LDR together. 
> > 
> > What happens if you split them up and schedule them appropriately ? I didn't
> > see any significant impact in my benchmarking on implementations that did
> > not implement such fusion. Where people want performance in these cases they
> > can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there
> > already.
> 
> Because of side effects of the Haiffa scheduler, the loads now pile up, and
> the ADRPs may affect the load issue rate rather badly if not fused.  At leas
> on our processor.  

ADRP latency to load-address should be zero on any OoO core - ADRP is basically a move-immediate, so can execute early and hide any latency.

> Which brings another point, shouldn't there be just one ADRP per BB or,
> ideally, per function?  Or am I missing something?

That's not possible in this case as the section is mergeable. An alternative implementation using anchors may be feasible, but GCC is extremely bad at using anchors efficiently - functions using several global variables also end up with a large number of ADRPs when you'd expect a single ADRP.

Comment 34 Evandro 2015-11-06 20:58:45 UTC

(In reply to Wilco from comment #33)
> (In reply to Evandro from comment #32)
> ADRP latency to load-address should be zero on any OoO core - ADRP is
> basically a move-immediate, so can execute early and hide any latency.

In an ideal world, yes.  In the actual world, they compete for limited resources that could be used by other insns.

> > Which brings another point, shouldn't there be just one ADRP per BB or,
> > ideally, per function?  Or am I missing something?
> 
> That's not possible in this case as the section is mergeable. An alternative
> implementation using anchors may be feasible, but GCC is extremely bad at
> using anchors efficiently - functions using several global variables also
> end up with a large number of ADRPs when you'd expect a single ADRP.

I see.  

I'll investigate placing the constant after the function, as before, if the estimated function size allows for it.  I think that eliminating the ADRPs could potentially be more beneficial to code size than merging constants in a common literal pool (v. http://bit.ly/1Ptc8nh).

Thank you.

Comment 35 Ramana Radhakrishnan 2015-11-06 22:23:38 UTC

(In reply to Evandro from comment #32)
> (In reply to Ramana Radhakrishnan from comment #31)
> > (In reply to Evandro from comment #30)
> > > The performance impact of always referring to constants as if they were far
> > > away is significant on targets which do not fuse ADRP and LDR together. 
> > 
> > What happens if you split them up and schedule them appropriately ? I didn't
> > see any significant impact in my benchmarking on implementations that did
> > not implement such fusion. Where people want performance in these cases they
> > can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there
> > already.
> 
> Because of side effects of the Haiffa scheduler, the loads now pile up, and
> the ADRPs may affect the load issue rate rather badly if not fused.  At leas
> on our processor.  

In straight line code I can imagine this happening - In loopy code I would have expected the constants to be hoisted - atleast that's what I remember seeing in my analysis. You have seen -mcprelative-literal-loads haven't you ? 

> 
> Which brings another point, shouldn't there be just one ADRP per BB or,
> ideally, per function?  Or am I missing something?

That isn't really how this thing works. Well the constants are shared across the program now as you say down thread. 

> 
> I'll investigate placing the constant after the function, as before, if the
> estimated function size allows for it.  I think that eliminating the ADRPs
> could potentially be more beneficial to code size than merging constants in
> a common literal pool (v. http://bit.ly/1Ptc8nh).

I was actually surprised by the amount of constant sharing that was happening in what I looked at. 


Thanks,
Ramana

Comment 36 Evandro 2015-11-06 22:49:42 UTC

(In reply to Ramana Radhakrishnan from comment #35)
> (In reply to Evandro from comment #32)
> > Because of side effects of the Haiffa scheduler, the loads now pile up, and
> > the ADRPs may affect the load issue rate rather badly if not fused.  At leas
> > on our processor.  
> 
> In straight line code I can imagine this happening - In loopy code I would
> have expected the constants to be hoisted - atleast that's what I remember
> seeing in my analysis. You have seen -mcprelative-literal-loads haven't you
> ? 

The cases that I have in mind involve SL code in functions which are called form a loop.  Since they are external, only LTO would address such cases.  And, since we do not control how they are built, we have to handle them as they come.

As long as there's an opening to investigate the benefits and drawbacks of reverting to the legacy way considering the function size, I think that it's interesting to find out the results.

Thank you.

Comment 37 Evandro 2015-11-13 22:40:13 UTC

Here's what I had in mind: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01787.html

Feedback is welcome.

Comment 38 PeteVine 2016-01-04 01:11:01 UTC

Hi all,

I'm trying to narrow down a similar issue on armv7 and would welcome any suggestions where to start:
 
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69082

Thx

Comment 39 Christophe Lyon 2016-01-21 10:35:11 UTC

We have backported r227748, 229160 and 229161 to our linaro-gcc-5 branch, and we got a bug report from the kernel team.

Indeed, when the kernel is configured with CONFIG_ARM64_ERRATUM_843419, the support for relocations R_AARCH64_ADR_PREL_PG_HI21 and R_AARCH64_ADR_PREL_PG_HI21_NC is removed from the kernel to avoid loading objects with possibly offending sequences. In this case the kernel is built with
-mcmodel=large.

With these backports, these relocations are generated again by default.

Adding -mpc-relative-literal-loads to the kernel Makefile in arch/arm64/Makefile does fix the build, but since this option is not supported by GCC if it does not contain these backports (and the compiler aborts with an error), it's not obvious how to modify the Makefile to decide when to use this option.

So, although Ramana said he would backport these to the FSF gcc-5 branch, maybe it's safer not to.

Or change the default?

Or... change GCC's -mfix-cortex-a53-843419 to change the default for -mpc-relative-literal-loads and add -mfix-cortex-a53-843419 to the kernel Makefile.

Comment 40 cbaylis 2016-01-21 15:52:15 UTC

The kernel does support a mechanism to add a command line option for only compilers which support it.

Documentation/kbuild/makefiles.txt:
    cc-option
        cc-option is used to check if $(CC) supports a given option, and if
        not supported to use an optional second option.

I think you could therefore add this line to arch/arm64/Makefile
KBUILD_CFLAGS   += $(call cc-option,-mpc-relative-literal-loads)

Comment 41 Christophe Lyon 2016-01-21 16:58:39 UTC

Indeed, having:

ifeq ($(CONFIG_ARM64_ERRATUM_843419), y)
KBUILD_CFLAGS_MODULE    += -mcmodel=large
KBUILD_CFLAGS_MODULE    += $(call cc-option, -mpc-relative-literal-loads)
endif

in arch/arm64/Makefile does the trick.

Comment 42 cbaylis 2016-01-21 19:32:57 UTC

While we're suggesting fixes to the kernel, wouldn't it be better if CONFIG_ARM64_ERRATUM_843419 forced the kernel to be built with the linker workarounds if the kernel is configured

KBUILD_LDFLAGS += --fix-cortex-a53-843419
    or maybe
KBUILD_LDFLAGS += $(call ld-option, --fix-cortex-a53-843419)

Comment 43 Christophe Lyon 2016-01-21 19:55:02 UTC

Indeed, that seems safer.
However, reading http://www.spinics.net/lists/arm-kernel/msg445346.html there is a comment saying:
----------------------
Note that the kernel itself must be linked with a version of ld
which fixes potentially affected ADRP instructions through the
use of veneers
-----------------------

But I don't know how this is enforced.

Comment 44 Ramana Radhakrishnan 2016-01-22 05:05:56 UTC

(In reply to Christophe Lyon from comment #39)
> We have backported r227748, 229160 and 229161 to our linaro-gcc-5 branch,
> and we got a bug report from the kernel team.

Sorry about the breakage. 

> 
> Indeed, when the kernel is configured with CONFIG_ARM64_ERRATUM_843419, the
> support for relocations R_AARCH64_ADR_PREL_PG_HI21 and
> R_AARCH64_ADR_PREL_PG_HI21_NC is removed from the kernel to avoid loading
> objects with possibly offending sequences. In this case the kernel is built
> with
> -mcmodel=large.
> 
> With these backports, these relocations are generated again by default.

The reason for the code generation to be in this form , is to store literal pools in rodata and not have any data in the code segment at all. Now when these need to be addressed that far away, addressing them with adrp / load is reasonable as you are thinking of a text + rodata segment to be > 4GB before this fails.

In an ideal world this would be implemented by the 4 insn mov / movk sequence to construct the literal address, however those relocations are not likely to be supported by the kernel loader (or rather I haven't checked). 


> 
> Adding -mpc-relative-literal-loads to the kernel Makefile in
> arch/arm64/Makefile does fix the build, but since this option is not
> supported by GCC if it does not contain these backports (and the compiler
> aborts with an error), it's not obvious how to modify the Makefile to decide
> when to use this option.

(In reply to Christophe Lyon from comment #43)
> Indeed, that seems safer.
> However, reading http://www.spinics.net/lists/arm-kernel/msg445346.html
> there is a comment saying:
> ----------------------
> Note that the kernel itself must be linked with a version of ld
> which fixes potentially affected ADRP instructions through the
> use of veneers
> -----------------------
> 
> But I don't know how this is enforced.


With kernel modules, there's no way the linker fix can be used as kernel modules are simply put just object files. When I did the work, I had forgotten about the erratum and the repercussions on kernel modules. Thus -mpc-relative-literal-loads is quite a reasonable workaround to the problem.


> 
> Or... change GCC's -mfix-cortex-a53-843419 to change the default for
> -mpc-relative-literal-loads and add -mfix-cortex-a53-843419 to the kernel
> Makefile.


That's probably not that bad an idea as a short term workaround for GCC 6.

Comment 45 Ramana Radhakrishnan 2016-01-22 05:15:16 UTC

I no longer have plans to fix this in the GCC 5 branch.

Comment 46 ard.biesheuvel 2016-01-25 07:15:57 UTC

One issue that this causes, which I did not see mentioned anywhere in the thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section alignment. In EDK2 Tianocore (UEFI reference implementation), we deliberately use -mcmodel=large to get around this requirement, since code size is a big deal when executing from NOR flash, and the architecture of EDK2 (lots and lots of small separately linked binaries) makes
the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section alignment unless there are objects like a vector table that require more)

Comment 47 Richard Earnshaw 2016-01-25 10:08:46 UTC

(In reply to ard.biesheuvel from comment #46)
> One issue that this causes, which I did not see mentioned anywhere in the
> thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section
> alignment. In EDK2 Tianocore (UEFI reference implementation), we
> deliberately use -mcmodel=large to get around this requirement, since code
> size is a big deal when executing from NOR flash, and the architecture of
> EDK2 (lots and lots of small separately linked binaries) makes
> the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section
> alignment unless there are objects like a vector table that require more)

Huh?  It imposes a 4k *SEGMENT* alignment.  It doesn't impose a section alignment.

Comment 48 ard.biesheuvel 2016-01-25 10:34:52 UTC

(In reply to Richard Earnshaw from comment #47)
> (In reply to ard.biesheuvel from comment #46)
> > One issue that this causes, which I did not see mentioned anywhere in the
> > thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section
> > alignment. In EDK2 Tianocore (UEFI reference implementation), we
> > deliberately use -mcmodel=large to get around this requirement, since code
> > size is a big deal when executing from NOR flash, and the architecture of
> > EDK2 (lots and lots of small separately linked binaries) makes
> > the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section
> > alignment unless there are objects like a vector table that require more)
> 
> Huh?  It imposes a 4k *SEGMENT* alignment.  It doesn't impose a section
> alignment.

Indeed, apologies for mixing up the lingo.

But my point is that -mcmodel=large did not use to impose such a minimum alignment, and with this change, it now does. Would it perhaps make sense to default enable this feature only for -mcmodel=small (which already uses adrp/add anyway)

Comment 49 Ramana Radhakrishnan 2016-01-25 11:00:28 UTC

(In reply to ard.biesheuvel from comment #48)
> (In reply to Richard Earnshaw from comment #47)
> > (In reply to ard.biesheuvel from comment #46)
> > > One issue that this causes, which I did not see mentioned anywhere in the
> > > thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section
> > > alignment. In EDK2 Tianocore (UEFI reference implementation), we
> > > deliberately use -mcmodel=large to get around this requirement, since code
> > > size is a big deal when executing from NOR flash, and the architecture of
> > > EDK2 (lots and lots of small separately linked binaries) makes
> > > the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section
> > > alignment unless there are objects like a vector table that require more)
> > 
> > Huh?  It imposes a 4k *SEGMENT* alignment.  It doesn't impose a section
> > alignment.
> 
> Indeed, apologies for mixing up the lingo.
> 
> But my point is that -mcmodel=large did not use to impose such a minimum
> alignment, and with this change, it now does. Would it perhaps make sense to
> default enable this feature only for -mcmodel=small (which already uses
> adrp/add anyway)


-mcmodel=large allows for images > 1MiB to be compiled, this change enables functions > 1MiB in size to exist in images compiled and linked with -mcmodel=large.

As of now, if you really want to use -mcmodel=large for working around this, you already have -mpc-relative-literal-loads to allow this.

Comment 50 ard.biesheuvel 2016-01-25 12:45:45 UTC

(In reply to Ramana Radhakrishnan from comment #49)
> (In reply to ard.biesheuvel from comment #48)
> > (In reply to Richard Earnshaw from comment #47)
> > > (In reply to ard.biesheuvel from comment #46)
> > > > One issue that this causes, which I did not see mentioned anywhere in the
> > > > thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section
> > > > alignment. In EDK2 Tianocore (UEFI reference implementation), we
> > > > deliberately use -mcmodel=large to get around this requirement, since code
> > > > size is a big deal when executing from NOR flash, and the architecture of
> > > > EDK2 (lots and lots of small separately linked binaries) makes
> > > > the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section
> > > > alignment unless there are objects like a vector table that require more)
> > > 
> > > Huh?  It imposes a 4k *SEGMENT* alignment.  It doesn't impose a section
> > > alignment.
> > 
> > Indeed, apologies for mixing up the lingo.
> > 
> > But my point is that -mcmodel=large did not use to impose such a minimum
> > alignment, and with this change, it now does. Would it perhaps make sense to
> > default enable this feature only for -mcmodel=small (which already uses
> > adrp/add anyway)
> 
> 
> -mcmodel=large allows for images > 1MiB to be compiled, this change enables
> functions > 1MiB in size to exist in images compiled and linked with
> -mcmodel=large.
> 

AIUI the small model allows images up to 4 GB (since that is the range of adrp+lo12), and the large model allows any size, by emitting cross section references as literals containing absolute addresses (rather than movz/movk sequences). This relies on the literals themselves being in range, which is usually the case, of course, if they are emitted into the same section, modulo the uses cases that caused this bug in the first place. (BTW my GCC man page erroneously lists the tiny model as having a 1 GB range but this should obviously be 1 MB)

So by enabling this feature by default, you are now imposing a 4 GB image limit on the large model as well, since the literals are moved to a different section (which, given our choice for the large model, is not guaranteed to be within 4 GB in the final image), and referenced via instructions that only have a 4 GB range.

> As of now, if you really want to use -mcmodel=large for working around this,
> you already have -mpc-relative-literal-loads to allow this.

I am arguing that enabling this feature essentially breaks the large model, so it should not be enabled by default, and perhaps be made mutually exclusive.