Constant literal table is kept at large offset, resulting in pc-relative load offset out of range. aarch64-none-linux-gnu-gcc x.c /tmp/ccrOQLEb.s: Assembler messages: /tmp/ccrOQLEb.s:10: Error: pc-relative load offset out of range
Created attachment 33515 [details] Attached test case
Marcus, can you please assign it to me if it is confirmed.
Confirmed. From .expand: (insn 5 4 6 (set (reg:DF 75) (mem/u/c:DF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S8 A64])) /home/pinskia/Downloads/x.c:7 -1 (nil)) Note next time please use macros to create your testcase, it is ok sometimes to use non preprocessed source
I see this as a theoretical problem that's unlikely to ever manifest itself with real code (functions generating .5 million instructions would take insanely long to compile in real life). Unless you can show some *real* code that exhibits this problem I propose we don't try to fix this; it's just a limit on the compiler. We have bigger problems to worry about.
We got inspired by this bug. https://bugs.linaro.org/show_bug.cgi?id=400 It happens at -O0 now.
OK. But I still wouldn't loose much sleep over it. It wouldn't surprise me that if you turn the optimizer on it then blows up by using too much memory...
Ok I am closing this as wont fix now. Checked with Mitthias who reported the problem to us and he is fine building it with -O2.
This code is still valid and really should be fixed; maybe not today or for GCC 5 or GCC 6 but it is still a bug.
Actually this is much worse than what is mentioned here. Having the constant pool be part of the .text section really does not work if the alignment of the constants are smaller than 4 byte aligned. The assembler rejects the debugging info as the end label of the text section is not 4 byte aligned. I am working on fixing this bug and the bug dealing with the constant pool in the text section (which is wrong and is not shared between translational units really).
More point is with LTO you can get this even with optimization turned on. If the text section is big enough.
I believe, I hit this issue with OpenLoops package on AArch64, while it works fine on x86_64. It shows up on 3 of Fortran files: scons: *** [process_obj/pplljjj/loop_pplljjj_eexddxggg_1_qp.os] Error 1 scons: *** [process_obj/pplljjj/loop_pplljjj_eexuuxggg_1_qp.os] Error 1 scons: *** [process_obj/pplljjj/loop_pplljjj_eexbbxggg_1_qp.os] Error 1 I get "Error: pc-relative load offset out of range" 864 times. Tested with GCC 4.9.1 and binutils 2.24. I also looked into f9edeb70961d404caac5a849b0783c53228ddf62 / 213078, with it applied on top it gets worse, i.e. more files are affected. From 864 it increases to 1392 errors. This particular library (pplljjj) is 158MB on x86_64. Most of it (153MB) belongs to .text section. The code must be compiled with -O0 otherwise machines have hard time handling it: gfortran: internal compiler error: Killed (program f951) Please submit a full bug report, with preprocessed source if appropriate. See <http://gcc.gnu.org/bugs.html> for instructions. @Andrew, I am happy to test your patches if necessary.
I decided to re-enable -Os for OpenLoops. Then use powerful hardware with 32-physical-cores (x86_64) and 0.5TB of RAM to see if I could get lucky. Fired up QEMU user mode with Fedora for AArch64 chroot for compiling. It did resolve some of failures, but not everything. I have seen processes going as far as 25+GB of RSS and taking hours to finish. This is the reason package is compiled with -O0.
Created attachment 34416 [details] 31-lines, minimal testcase I am adding 31-lines minimal testcase. Should be good enough for GCC testsuite. $ gcc -O0 -c pr63304-testcase.c /tmp/ccKdBqsL.s: Assembler messages: /tmp/ccKdBqsL.s:58: Error: pc-relative load offset out of range /tmp/ccKdBqsL.s:62: Error: pc-relative load offset out of range Yesterday I looked into RTL output and assembly of some failures for OpenLoops. The function was 400+K lines in assembly. On x86_64 it was something 180+K lines of assembly and ~1.3MB for function body size. I can confirm that offset is above 1MB mark and it was trying to load __float128/long double to q1 from constant pool for passing to `addtf3`.
I hit another two cases of this. 1. g2root tool, which converts GEANT geometry to ROOT geometry. It create a single function, which contains lots of descriptions of material, shapes, etc. all describing some 3D objects and physical properties. These can be very huge. Also could think of MILL computing as possible example. They use C++ as they assembler (at least for simulation). Thus you have something like: void somefunc(void) { add(b0, b1); add(b2, b3); ... } IIRC, by compiling it they generate a simulator to run this program on specific CPU. 2. mcfm 6.3 package, to be precise: mcfm-6.3/src/WW/triangle11new.f triangle11new.s: Assembler messages: triangle11new.s:587: Error: pc-relative load offset out of range triangle11new.s:592: Error: pc-relative load offset out of range .. 587 ldr x0, .LC2 588 fmov d0, x2 589 fmov d1, x0 590 fdiv d0, d0, d1 591 fmov x2, d0 592 ldr x0, .LC2 .. 346645 .LC2: 346646 .word 0 346647 .word 1073741824 346648 .align 3 .. Distance between ldr instruction and .LC2 is 346058 assembly lines. Here is the source file: https://github.com/cms-externals/MCFM/blob/master/src/WW/triangle11new.f (just 1355 sloc). It's those huge computations, which are killing it. Similar issue as in OpenLoops package.
Not working on this any time soon. But someone from ARM really should look into fixing this as it blocks standard C/C++ code from HPC and distros.
Have done a quick look at this, basic ideas to fix this: * generate a special pattern which initialize literal pool start address. * implement TARGET_MACHINE_DEPENDENT_REORG to calculate whehter the pc-relative literal load is within range. * output final insruction sequences which initializing literal pool start address based on the result from reorg pass analysis. Use movk/z, adrp + add, single adr for different distance.
Well there seem to be 2 ways to address this: * If a function is huge, emit literals as const data. This enables the use of anchors and sharing of literals across all functions in a compilation unit. * Reserve a register in the adr/ldr literal patterns and add a 2-instruction sequence using adrp when out of range. Ideally the register should only be reserved if a function is huge.
I'm taking a look into this.
(In reply to Ramana Radhakrishnan from comment #18) > I'm taking a look into this. RFC here - https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02258.html
(In reply to Ramana Radhakrishnan from comment #19) > (In reply to Ramana Radhakrishnan from comment #18) > > I'm taking a look into this. > > RFC here - https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02258.html David - can you see if this works for you ? regards Ramana
I am on vacations now, but I already marked this on my TODO list. Once I find a free time slot I will give it a spin. I will try to report in a few days. BTW, I will also show up at GNU Tools Cauldron 2015.
(In reply to David Abdurachmanov from comment #21) > I am on vacations now, but I already marked this on my TODO list. Once I > find a free time slot I will give it a spin. I will try to report in a few > days. > > BTW, I will also show up at GNU Tools Cauldron 2015. I must also add that this patch is a pre-requisite for the appropriate iterators to be available. https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02074.html and this may need some rebasing after Alan's latest patches to handle fp16.
GCC trunk r226676 or 15af172f2a0ea281969e3105da9f9bb100097c7d from git://gcc.gnu.org/git/gcc.git Date: Thu Aug 6 14:26:18 2015 +0000) Rebased and applied: https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02074.html https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02258.html binutils trunk f0ce0d3a331129309a46a6a9ac85fce35acae72b from git://sourceware.org/git/binutils-gdb.git Date: Thu Jul 23 16:01:01 2015 +0100) Rebases and applied: https://sourceware.org/ml/binutils/2015-07/msg00137.html $ gcc -dumpversion 6.0.0 $ ld --version GNU ld (GNU Binutils) 2.25.51.20150806 I __successfully__ compiled OpenLoops 1.1.1 and MCFM 6.3 packages.
Author: ramana Date: Fri Sep 11 09:44:26 2015 New Revision: 227679 URL: https://gcc.gnu.org/viewcvs?rev=227679&root=gcc&view=rev Log: Remove separate movtf pattern - Use an iterator for all FP modes. movtf is unnecessary as a separate expander. Move this to be with the standard scalar floating point expanders. Achieved by adding a new iterator and then using the same. Tested cross aarch64-none-elf and no regressions. Rebased version from https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00767.html 2015-09-10 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> PR target/63304 * config/aarch64/aarch.md (mov<mode>:GPF_F16): Use GPF_TF_F16. (movtf): Delete. * config/aarch64/iterators.md (GPF_TF_F16): New. (GPF_F16): Delete. Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64.md trunk/gcc/config/aarch64/iterators.md
Author: ramana Date: Mon Sep 14 13:16:59 2015 New Revision: 227748 URL: https://gcc.gnu.org/viewcvs?rev=227748&root=gcc&view=rev Log: [AArch64] Handle literal pools for functions > 1 MiB in size. This patch fixes the issue in PR63304 where we have functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add instructions to address the literal pools under the use of a command line option. I would like to turn this on by default on trunk but keep this disabled by default for the release branches in order to get some serious testing for this feature while it bakes on trunk. As a follow-up I would like to try and see if estimate_num_insns or something else can give us a heuristic to turn this on for "large" functions. After all the number of incidences of this are quite low in real life, so may be we should look to restrict this use as much as possible on the grounds that this code generation implies an extra integer register for addressing for every floating point and vector constant and I don't think that's great in code that already may have high register pressure. Tested on aarch64-none-elf with no regressions. A previous version was bootstrapped and regression tested. Applied to trunk. regards Ramana 2015-09-14 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> PR target/63304 * config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle nopcrelative_literal_loads. (aarch64_classify_address): Likewise. (aarch64_constant_pool_reload_icode): Define. (aarch64_secondary_reload): Handle secondary reloads for literal pools. (aarch64_override_options): Handle nopcrelative_literal_loads. (aarch64_classify_symbol): Handle nopcrelative_literal_loads. * config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>): Define. (aarch64_reload_movcp<VALL:mode><P:mode>): Likewise. * config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option. * config/aarch64/predicates.md (aarch64_constant_pool_symref): New predicate. * doc/invoke.texi (mpc-relative-literal-loads): Document. Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64.c trunk/gcc/config/aarch64/aarch64.md trunk/gcc/config/aarch64/aarch64.opt trunk/gcc/config/aarch64/iterators.md trunk/gcc/config/aarch64/predicates.md trunk/gcc/doc/invoke.texi
Author: ramana Date: Fri Oct 9 10:58:06 2015 New Revision: 228644 URL: https://gcc.gnu.org/viewcvs?rev=228644&root=gcc&view=rev Log: [AArch64] Handle literal pools for functions > 1 MiB in size. This patch fixes the issue in PR63304 where we have functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add instructions to address the literal pools under the use of a command line option. I would like to turn this on by default on trunk but keep this disabled by default for the release branches in order to get some serious testing for this feature while it bakes on trunk. As a follow-up I would like to try and see if estimate_num_insns or something else can give us a heuristic to turn this on for "large" functions. After all the number of incidences of this are quite low in real life, so may be we should look to restrict this use as much as possible on the grounds that this code generation implies an extra integer register for addressing for every floating point and vector constant and I don't think that's great in code that already may have high register pressure. Tested on aarch64-none-elf with no regressions. A previous version was bootstrapped and regression tested. Applied to trunk. regards Ramana 2015-09-14 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> PR target/63304 * config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle nopcrelative_literal_loads. (aarch64_classify_address): Likewise. (aarch64_constant_pool_reload_icode): Define. (aarch64_secondary_reload): Handle secondary reloads for literal pools. (aarch64_override_options): Handle nopcrelative_literal_loads. (aarch64_classify_symbol): Handle nopcrelative_literal_loads. * config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>): Define. (aarch64_reload_movcp<VALL:mode><P:mode>): Likewise. * config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option. * config/aarch64/predicates.md (aarch64_constant_pool_symref): New predicate. * doc/invoke.texi (mpc-relative-literal-loads): Document. Added: trunk/gcc/testsuite/gcc.target/arm/pr67366.c Modified: trunk/gcc/ChangeLog trunk/gcc/gimple-fold.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/lib/target-supports.exp
(In reply to Ramana Radhakrishnan from comment #26) > Author: ramana > Date: Fri Oct 9 10:58:06 2015 > New Revision: 228644 > > URL: https://gcc.gnu.org/viewcvs?rev=228644&root=gcc&view=rev > Log: > [AArch64] Handle literal pools for functions > 1 MiB in size. > > > This patch fixes the issue in PR63304 where we have > functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add > instructions to address the literal pools under the use of a command line > option. I would like to turn this on by default on trunk but keep this > disabled by default for the release branches in order to get some > serious testing for this feature while it bakes on trunk. > > As a follow-up I would like to try and see if estimate_num_insns or > something else can give us a heuristic to turn this on for "large" functions. > After all the number of incidences of this are quite low in real life, > so may be we should look to restrict this use as much as possible on the > grounds that this code generation implies an extra integer register for > addressing for every floating point and vector constant and I don't think > that's great in code that already may have high register pressure. > > Tested on aarch64-none-elf with no regressions. A previous > version was bootstrapped and regression tested. > > Applied to trunk. > > regards > Ramana > > 2015-09-14 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> > > PR target/63304 > * config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle > nopcrelative_literal_loads. > (aarch64_classify_address): Likewise. > (aarch64_constant_pool_reload_icode): Define. > (aarch64_secondary_reload): Handle secondary reloads for > literal pools. > (aarch64_override_options): Handle nopcrelative_literal_loads. > (aarch64_classify_symbol): Handle nopcrelative_literal_loads. > * config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>): > Define. > (aarch64_reload_movcp<VALL:mode><P:mode>): Likewise. > * config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option. > * config/aarch64/predicates.md (aarch64_constant_pool_symref): New > predicate. > * doc/invoke.texi (mpc-relative-literal-loads): Document. > > > Added: > trunk/gcc/testsuite/gcc.target/arm/pr67366.c > Modified: > trunk/gcc/ChangeLog > trunk/gcc/gimple-fold.c > trunk/gcc/testsuite/ChangeLog > trunk/gcc/testsuite/lib/target-supports.exp This commit really was for PR67366.. Copied wrong text into the commit message.
Author: ramana Revision: 228644 Modified property: svn:log Modified: svn:log at Fri Oct 9 11:08:05 2015 ------------------------------------------------------------------------------ --- svn:log (original) +++ svn:log Fri Oct 9 11:08:05 2015 @@ -1,45 +1,43 @@ -[AArch64] Handle literal pools for functions > 1 MiB in size. - +[PATCH PR target/67366 2/2] [gimple-fold.c] Support movmisalign optabs in gimple-fold.c -This patch fixes the issue in PR63304 where we have -functions that are > 1MiB. The idea is to use adrp / ldr or adrp / add -instructions to address the literal pools under the use of a command line -option. I would like to turn this on by default on trunk but keep this -disabled by default for the release branches in order to get some -serious testing for this feature while it bakes on trunk. +This patch by Richard allows for movmisalign optabs to be supported +in gimple-fold.c. This caused a bit of pain in the testsuite with strlenopt-8.c +in conjunction with the ARM support for movmisalign_optabs as the test +was coded up to do different things depending on whether the target +supported misaligned access or not. However now with unaligned access +being allowed for different levels of the architecture in the arm backend, +the concept of the helper function non_strict_align mapping identically +to the definition of STRICT_ALIGNMENT disappears. -As a follow-up I would like to try and see if estimate_num_insns or -something else can give us a heuristic to turn this on for "large" functions. -After all the number of incidences of this are quite low in real life, -so may be we should look to restrict this use as much as possible on the -grounds that this code generation implies an extra integer register for -addressing for every floating point and vector constant and I don't think -that's great in code that already may have high register pressure. +Adjusted thusly for ARM. The testsuite/lib changes were tested with an +arm-none-eabi multilib that included architecture variants that did not +support unaligned access and architecture variants that did. -Tested on aarch64-none-elf with no regressions. A previous -version was bootstrapped and regression tested. +The testing matrix for this patch was: -Applied to trunk. +1. x86_64 bootstrap and regression test - no regressions. +2. armhf bootstrap and regression test - no regressions. +3. arm-none-eabi cross build and regression test for -regards -Ramana +{-marm/-march=armv7-a/-mfpu=vfpv3-d16/-mfloat-abi=softfp} +{-mthumb/-march=armv8-a/-mfpu=crypto-neon-fp-armv8/-mfloat-abi=hard} +{-marm/-mcpu=arm7tdmi/-mfloat-abi=soft} +{-mthumb/-mcpu=arm7tdmi/-mfloat-abi=soft} -2015-09-14 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> +with no regressions. - PR target/63304 - * config/aarch64/aarch64.c (aarch64_expand_mov_immediate): Handle - nopcrelative_literal_loads. - (aarch64_classify_address): Likewise. - (aarch64_constant_pool_reload_icode): Define. - (aarch64_secondary_reload): Handle secondary reloads for - literal pools. - (aarch64_override_options): Handle nopcrelative_literal_loads. - (aarch64_classify_symbol): Handle nopcrelative_literal_loads. - * config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>): - Define. - (aarch64_reload_movcp<VALL:mode><P:mode>): Likewise. - * config/aarch64/aarch64.opt (mpc-relative-literal-loads): New option. - * config/aarch64/predicates.md (aarch64_constant_pool_symref): New - predicate. - * doc/invoke.texi (mpc-relative-literal-loads): Document. +Ok to apply ? +2015-10-09 Richard Biener <rguenth@suse.de> + + PR target/67366 + * gimple-fold.c (optabs-query.h): Include + (gimple_fold_builtin_memory_op): Allow unaligned stores + when movmisalign_optabs are available. + +2015-10-09 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> + + PR target/67366 + * lib/target-supports.exp (check_effective_target_non_strict_align): + Adjust for arm*-*-*. + * gcc.target/arm/pr67366.c: New test.
Author: ramana Date: Thu Oct 22 04:26:50 2015 New Revision: 229160 URL: https://gcc.gnu.org/viewcvs?rev=229160&root=gcc&view=rev Log: [Patch AArch64 63304] Fix issue with global state. Jiong pointed out privately that there was a thinko in the way in which the global state was being set and reset. I don't like adding such global state but .... 2015-10-22 Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> PR target/63304 * config/aarch64/aarch64.c (aarch64_nopcrelative_literal_loads): New. (aarch64_expand_mov_immediate): Use aarch64_nopcrelative_literal_loads. (aarch64_classify_address): Likewise. (aarch64_secondary_reload): Likewise. (aarch64_override_options_after_change_1): Adjust. * config/aarch64/aarch64.md (aarch64_reload_movcp<GPF_TF:mode><P:mode>): Use aarch64_nopcrelative_literal_loads. (aarch64_reload_movcp<VALL:mode><P:mode>): Likewise. * config/aarch64/aarch64-protos.h (aarch64_nopcrelative_literal_loads): Declare 2015-10-22 Jiong Wang <jiong.wang@arm.com> Ramana Radhakrishnan <ramana.radhakrishnan@arm.com> PR target/63304 * gcc.target/aarch64/pr63304_1.c: New test. Added: trunk/gcc/testsuite/gcc.target/aarch64/pr63304_1.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64-protos.h trunk/gcc/config/aarch64/aarch64.c trunk/gcc/config/aarch64/aarch64.md trunk/gcc/testsuite/ChangeLog
The performance impact of always referring to constants as if they were far away is significant on targets which do not fuse ADRP and LDR together. What's the status of the solution that evaluates the function size? Should this be optionally enabled only? Would it be the case to come up with a medium code model? :-P Could the assembler be left to address this issue by relaxing such loads? :-P Thank you.
(In reply to Evandro from comment #30) > The performance impact of always referring to constants as if they were far > away is significant on targets which do not fuse ADRP and LDR together. What happens if you split them up and schedule them appropriately ? I didn't see any significant impact in my benchmarking on implementations that did not implement such fusion. Where people want performance in these cases they can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there already. > What's the status of the solution that evaluates the function size? I am not working on that follow-up as I didn't see the real need for it in the benchmarking results I was looking at. You are welcome to investigate. > Should > this be optionally enabled only? It is enabled by default for -mcmodel=small and -mcmodel=large. And no because it has been done after quite a lot of complaints from the general user community that people are unable to build large software bases with the compiler. > Could the assembler be left to address this issue by > relaxing such loads? :-P No...
(In reply to Ramana Radhakrishnan from comment #31) > (In reply to Evandro from comment #30) > > The performance impact of always referring to constants as if they were far > > away is significant on targets which do not fuse ADRP and LDR together. > > What happens if you split them up and schedule them appropriately ? I didn't > see any significant impact in my benchmarking on implementations that did > not implement such fusion. Where people want performance in these cases they > can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there > already. Because of side effects of the Haiffa scheduler, the loads now pile up, and the ADRPs may affect the load issue rate rather badly if not fused. At leas on our processor. Which brings another point, shouldn't there be just one ADRP per BB or, ideally, per function? Or am I missing something? > > What's the status of the solution that evaluates the function size? > > I am not working on that follow-up as I didn't see the real need for it in > the benchmarking results I was looking at. You are welcome to investigate. OK
(In reply to Evandro from comment #32) > (In reply to Ramana Radhakrishnan from comment #31) > > (In reply to Evandro from comment #30) > > > The performance impact of always referring to constants as if they were far > > > away is significant on targets which do not fuse ADRP and LDR together. > > > > What happens if you split them up and schedule them appropriately ? I didn't > > see any significant impact in my benchmarking on implementations that did > > not implement such fusion. Where people want performance in these cases they > > can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there > > already. > > Because of side effects of the Haiffa scheduler, the loads now pile up, and > the ADRPs may affect the load issue rate rather badly if not fused. At leas > on our processor. ADRP latency to load-address should be zero on any OoO core - ADRP is basically a move-immediate, so can execute early and hide any latency. > Which brings another point, shouldn't there be just one ADRP per BB or, > ideally, per function? Or am I missing something? That's not possible in this case as the section is mergeable. An alternative implementation using anchors may be feasible, but GCC is extremely bad at using anchors efficiently - functions using several global variables also end up with a large number of ADRPs when you'd expect a single ADRP.
(In reply to Wilco from comment #33) > (In reply to Evandro from comment #32) > ADRP latency to load-address should be zero on any OoO core - ADRP is > basically a move-immediate, so can execute early and hide any latency. In an ideal world, yes. In the actual world, they compete for limited resources that could be used by other insns. > > Which brings another point, shouldn't there be just one ADRP per BB or, > > ideally, per function? Or am I missing something? > > That's not possible in this case as the section is mergeable. An alternative > implementation using anchors may be feasible, but GCC is extremely bad at > using anchors efficiently - functions using several global variables also > end up with a large number of ADRPs when you'd expect a single ADRP. I see. I'll investigate placing the constant after the function, as before, if the estimated function size allows for it. I think that eliminating the ADRPs could potentially be more beneficial to code size than merging constants in a common literal pool (v. http://bit.ly/1Ptc8nh). Thank you.
(In reply to Evandro from comment #32) > (In reply to Ramana Radhakrishnan from comment #31) > > (In reply to Evandro from comment #30) > > > The performance impact of always referring to constants as if they were far > > > away is significant on targets which do not fuse ADRP and LDR together. > > > > What happens if you split them up and schedule them appropriately ? I didn't > > see any significant impact in my benchmarking on implementations that did > > not implement such fusion. Where people want performance in these cases they > > can well use -mpc-relative-literal-loads or -mcmodel=tiny - it's in there > > already. > > Because of side effects of the Haiffa scheduler, the loads now pile up, and > the ADRPs may affect the load issue rate rather badly if not fused. At leas > on our processor. In straight line code I can imagine this happening - In loopy code I would have expected the constants to be hoisted - atleast that's what I remember seeing in my analysis. You have seen -mcprelative-literal-loads haven't you ? > > Which brings another point, shouldn't there be just one ADRP per BB or, > ideally, per function? Or am I missing something? That isn't really how this thing works. Well the constants are shared across the program now as you say down thread. > > I'll investigate placing the constant after the function, as before, if the > estimated function size allows for it. I think that eliminating the ADRPs > could potentially be more beneficial to code size than merging constants in > a common literal pool (v. http://bit.ly/1Ptc8nh). I was actually surprised by the amount of constant sharing that was happening in what I looked at. Thanks, Ramana
(In reply to Ramana Radhakrishnan from comment #35) > (In reply to Evandro from comment #32) > > Because of side effects of the Haiffa scheduler, the loads now pile up, and > > the ADRPs may affect the load issue rate rather badly if not fused. At leas > > on our processor. > > In straight line code I can imagine this happening - In loopy code I would > have expected the constants to be hoisted - atleast that's what I remember > seeing in my analysis. You have seen -mcprelative-literal-loads haven't you > ? The cases that I have in mind involve SL code in functions which are called form a loop. Since they are external, only LTO would address such cases. And, since we do not control how they are built, we have to handle them as they come. As long as there's an opening to investigate the benefits and drawbacks of reverting to the legacy way considering the function size, I think that it's interesting to find out the results. Thank you.
Here's what I had in mind: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01787.html Feedback is welcome.
Hi all, I'm trying to narrow down a similar issue on armv7 and would welcome any suggestions where to start: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69082 Thx
We have backported r227748, 229160 and 229161 to our linaro-gcc-5 branch, and we got a bug report from the kernel team. Indeed, when the kernel is configured with CONFIG_ARM64_ERRATUM_843419, the support for relocations R_AARCH64_ADR_PREL_PG_HI21 and R_AARCH64_ADR_PREL_PG_HI21_NC is removed from the kernel to avoid loading objects with possibly offending sequences. In this case the kernel is built with -mcmodel=large. With these backports, these relocations are generated again by default. Adding -mpc-relative-literal-loads to the kernel Makefile in arch/arm64/Makefile does fix the build, but since this option is not supported by GCC if it does not contain these backports (and the compiler aborts with an error), it's not obvious how to modify the Makefile to decide when to use this option. So, although Ramana said he would backport these to the FSF gcc-5 branch, maybe it's safer not to. Or change the default? Or... change GCC's -mfix-cortex-a53-843419 to change the default for -mpc-relative-literal-loads and add -mfix-cortex-a53-843419 to the kernel Makefile.
The kernel does support a mechanism to add a command line option for only compilers which support it. Documentation/kbuild/makefiles.txt: cc-option cc-option is used to check if $(CC) supports a given option, and if not supported to use an optional second option. I think you could therefore add this line to arch/arm64/Makefile KBUILD_CFLAGS += $(call cc-option,-mpc-relative-literal-loads)
Indeed, having: ifeq ($(CONFIG_ARM64_ERRATUM_843419), y) KBUILD_CFLAGS_MODULE += -mcmodel=large KBUILD_CFLAGS_MODULE += $(call cc-option, -mpc-relative-literal-loads) endif in arch/arm64/Makefile does the trick.
While we're suggesting fixes to the kernel, wouldn't it be better if CONFIG_ARM64_ERRATUM_843419 forced the kernel to be built with the linker workarounds if the kernel is configured KBUILD_LDFLAGS += --fix-cortex-a53-843419 or maybe KBUILD_LDFLAGS += $(call ld-option, --fix-cortex-a53-843419)
Indeed, that seems safer. However, reading http://www.spinics.net/lists/arm-kernel/msg445346.html there is a comment saying: ---------------------- Note that the kernel itself must be linked with a version of ld which fixes potentially affected ADRP instructions through the use of veneers ----------------------- But I don't know how this is enforced.
(In reply to Christophe Lyon from comment #39) > We have backported r227748, 229160 and 229161 to our linaro-gcc-5 branch, > and we got a bug report from the kernel team. Sorry about the breakage. > > Indeed, when the kernel is configured with CONFIG_ARM64_ERRATUM_843419, the > support for relocations R_AARCH64_ADR_PREL_PG_HI21 and > R_AARCH64_ADR_PREL_PG_HI21_NC is removed from the kernel to avoid loading > objects with possibly offending sequences. In this case the kernel is built > with > -mcmodel=large. > > With these backports, these relocations are generated again by default. The reason for the code generation to be in this form , is to store literal pools in rodata and not have any data in the code segment at all. Now when these need to be addressed that far away, addressing them with adrp / load is reasonable as you are thinking of a text + rodata segment to be > 4GB before this fails. In an ideal world this would be implemented by the 4 insn mov / movk sequence to construct the literal address, however those relocations are not likely to be supported by the kernel loader (or rather I haven't checked). > > Adding -mpc-relative-literal-loads to the kernel Makefile in > arch/arm64/Makefile does fix the build, but since this option is not > supported by GCC if it does not contain these backports (and the compiler > aborts with an error), it's not obvious how to modify the Makefile to decide > when to use this option. (In reply to Christophe Lyon from comment #43) > Indeed, that seems safer. > However, reading http://www.spinics.net/lists/arm-kernel/msg445346.html > there is a comment saying: > ---------------------- > Note that the kernel itself must be linked with a version of ld > which fixes potentially affected ADRP instructions through the > use of veneers > ----------------------- > > But I don't know how this is enforced. With kernel modules, there's no way the linker fix can be used as kernel modules are simply put just object files. When I did the work, I had forgotten about the erratum and the repercussions on kernel modules. Thus -mpc-relative-literal-loads is quite a reasonable workaround to the problem. > > Or... change GCC's -mfix-cortex-a53-843419 to change the default for > -mpc-relative-literal-loads and add -mfix-cortex-a53-843419 to the kernel > Makefile. That's probably not that bad an idea as a short term workaround for GCC 6.
I no longer have plans to fix this in the GCC 5 branch.
One issue that this causes, which I did not see mentioned anywhere in the thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section alignment. In EDK2 Tianocore (UEFI reference implementation), we deliberately use -mcmodel=large to get around this requirement, since code size is a big deal when executing from NOR flash, and the architecture of EDK2 (lots and lots of small separately linked binaries) makes the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section alignment unless there are objects like a vector table that require more)
(In reply to ard.biesheuvel from comment #46) > One issue that this causes, which I did not see mentioned anywhere in the > thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section > alignment. In EDK2 Tianocore (UEFI reference implementation), we > deliberately use -mcmodel=large to get around this requirement, since code > size is a big deal when executing from NOR flash, and the architecture of > EDK2 (lots and lots of small separately linked binaries) makes > the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section > alignment unless there are objects like a vector table that require more) Huh? It imposes a 4k *SEGMENT* alignment. It doesn't impose a section alignment.
(In reply to Richard Earnshaw from comment #47) > (In reply to ard.biesheuvel from comment #46) > > One issue that this causes, which I did not see mentioned anywhere in the > > thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section > > alignment. In EDK2 Tianocore (UEFI reference implementation), we > > deliberately use -mcmodel=large to get around this requirement, since code > > size is a big deal when executing from NOR flash, and the architecture of > > EDK2 (lots and lots of small separately linked binaries) makes > > the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section > > alignment unless there are objects like a vector table that require more) > > Huh? It imposes a 4k *SEGMENT* alignment. It doesn't impose a section > alignment. Indeed, apologies for mixing up the lingo. But my point is that -mcmodel=large did not use to impose such a minimum alignment, and with this change, it now does. Would it perhaps make sense to default enable this feature only for -mcmodel=small (which already uses adrp/add anyway)
(In reply to ard.biesheuvel from comment #48) > (In reply to Richard Earnshaw from comment #47) > > (In reply to ard.biesheuvel from comment #46) > > > One issue that this causes, which I did not see mentioned anywhere in the > > > thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section > > > alignment. In EDK2 Tianocore (UEFI reference implementation), we > > > deliberately use -mcmodel=large to get around this requirement, since code > > > size is a big deal when executing from NOR flash, and the architecture of > > > EDK2 (lots and lots of small separately linked binaries) makes > > > the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section > > > alignment unless there are objects like a vector table that require more) > > > > Huh? It imposes a 4k *SEGMENT* alignment. It doesn't impose a section > > alignment. > > Indeed, apologies for mixing up the lingo. > > But my point is that -mcmodel=large did not use to impose such a minimum > alignment, and with this change, it now does. Would it perhaps make sense to > default enable this feature only for -mcmodel=small (which already uses > adrp/add anyway) -mcmodel=large allows for images > 1MiB to be compiled, this change enables functions > 1MiB in size to exist in images compiled and linked with -mcmodel=large. As of now, if you really want to use -mcmodel=large for working around this, you already have -mpc-relative-literal-loads to allow this.
(In reply to Ramana Radhakrishnan from comment #49) > (In reply to ard.biesheuvel from comment #48) > > (In reply to Richard Earnshaw from comment #47) > > > (In reply to ard.biesheuvel from comment #46) > > > > One issue that this causes, which I did not see mentioned anywhere in the > > > > thread, is that the use of adrp/add and adrp/ldr imposes a 4 KB section > > > > alignment. In EDK2 Tianocore (UEFI reference implementation), we > > > > deliberately use -mcmodel=large to get around this requirement, since code > > > > size is a big deal when executing from NOR flash, and the architecture of > > > > EDK2 (lots and lots of small separately linked binaries) makes > > > > the overhead of 4 KB section alignment prohibitive. (It uses 32 byte section > > > > alignment unless there are objects like a vector table that require more) > > > > > > Huh? It imposes a 4k *SEGMENT* alignment. It doesn't impose a section > > > alignment. > > > > Indeed, apologies for mixing up the lingo. > > > > But my point is that -mcmodel=large did not use to impose such a minimum > > alignment, and with this change, it now does. Would it perhaps make sense to > > default enable this feature only for -mcmodel=small (which already uses > > adrp/add anyway) > > > -mcmodel=large allows for images > 1MiB to be compiled, this change enables > functions > 1MiB in size to exist in images compiled and linked with > -mcmodel=large. > AIUI the small model allows images up to 4 GB (since that is the range of adrp+lo12), and the large model allows any size, by emitting cross section references as literals containing absolute addresses (rather than movz/movk sequences). This relies on the literals themselves being in range, which is usually the case, of course, if they are emitted into the same section, modulo the uses cases that caused this bug in the first place. (BTW my GCC man page erroneously lists the tiny model as having a 1 GB range but this should obviously be 1 MB) So by enabling this feature by default, you are now imposing a 4 GB image limit on the large model as well, since the literals are moved to a different section (which, given our choice for the large model, is not guaranteed to be within 4 GB in the final image), and referenced via instructions that only have a 4 GB range. > As of now, if you really want to use -mcmodel=large for working around this, > you already have -mpc-relative-literal-loads to allow this. I am arguing that enabling this feature essentially breaks the large model, so it should not be enabled by default, and perhaps be made mutually exclusive.