Created attachment 48584 [details]
proposed patch

I would like to propose an implementation of the medium code model for aarch64. A prototype is attached; it passes bootstrap and the regression tests.

mcmodel=medium is a code model that is missing on aarch64 but supported on x86. In this code model, small data is relocated with small-code-model sequences while large data is relocated with large-code-model sequences. The official description of the medium code model is on page 34 of the x86_64 ABI document: https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf

The key difference between x86 and aarch64 is that x86 can use lea+movabs instructions to implement a dynamically relocatable large code model. Currently the large code model on aarch64 relocates symbols with ldr instructions, which can only be statically linked, whereas the small code model uses adrp+ldr instructions, which can be dynamically linked. Therefore the medium code model cannot be implemented by simply setting a threshold; a dynamically relocatable large code model is needed first before a functional medium code model is possible.

I met this problem when compiling CESM, a climate forecasting application that is widely used in the HPC field. In some configurations, when manipulating large arrays, a large code model with dynamic relocation is needed. The following case is abstracted from CESM for this scenario.

program main
  common/baz/a,b,c
  real a,b,c
  b = 1.0
  call foo()
  print*, b
end

subroutine foo()
  common/baz/a,b,c
  real a,b,c

  integer, parameter :: nx = 1024
  integer, parameter :: ny = 1024
  integer, parameter :: nz = 1024
  integer, parameter :: nf = 1
  real :: bar(nf,nx*ny*nz)
  real :: bar1(nf,nx*ny*nz)
  bar = 0.0
  bar1 = 0.0
  b = bar(1,1024*1024*100)
  b = bar1(1,1)

  return
end

Compiling with -mcmodel=small -fPIC gives the following errors due to the access to the bar1 array:

test.f90:(.text+0x28): relocation truncated to fit: R_AARCH64_ADR_PREL_PG_HI21 against `.bss'
test.f90:(.text+0x6c): relocation truncated to fit: R_AARCH64_ADR_PREL_PG_HI21 against `.bss'

Compiling with -mcmodel=large -fPIC gives an "unimplemented" error:

f951: sorry, unimplemented: code model ‘large’ with ‘-fPIC’

As discussed above, to tackle this problem we first have to solve the position-independence problem of the large code model. My solution is to use the R_AARCH64_MOVW_PREL_Gx group relocations together with instructions that compute the current PC value.

Before the change (mcmodel=small):

adrp x0, bar1.2782
add  x0, x0, :lo12:bar1.2782

After the change (proposed mcmodel=medium):

movz x0, :prel_g3:bar1.2782
movk x0, :prel_g2_nc:bar1.2782
movk x0, :prel_g1_nc:bar1.2782
movk x0, :prel_g0_nc:bar1.2782
adr  x1, .
sub  x1, x1, 0x4
add  x0, x0, x1

The first four mov instructions compute the offset between bar1 and the last movk instruction as a full 64-bit value, which fulfils the large-code-model requirement of a 64-bit relocation. The adr+sub instructions compute the PC address of that last movk instruction. By adding the offset to this PC address, bar1 can be located dynamically.

Because this sequence is expensive, a threshold is used to classify the size of the data to be relocated, as on x86. The default value of the threshold is 65536, which is the maximum relocation capability of the small code model.

This implementation also needs a corresponding linker change in binutils so that all four mov instructions compute the PC offset relative to the same point, the last movk instruction.

The good side of this implementation is that it can prototype a medium code model using existing relocation types. It also has drawbacks.
To start with, the four mov instructions and the adr instruction must stay together in exactly this order. No other instruction may be scheduled in between the sequence, otherwise the computed symbol address would be wrong. This might impede instruction scheduling optimizations.

Secondly, the linker needs corresponding changes so that every mov instruction computes the same PC offset. For example, in my implementation the first movz instruction needs 12 added to the result of ":prel_g3:bar1.2782" to make up for the PC offset to the last movk.

I haven't figured out a suitable solution for these problems yet. You are most welcome to leave your suggestions regarding these issues.
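To make the linker adjustment described above concrete, here is a small, self-contained C model of how the four :prel_gN: fields could be computed so that they all measure the offset from the last movk. It is only an illustration of the arithmetic, not existing binutils code; the function names and the fixed 12/8/4/0 byte adjustments are assumptions based on the sequence shown above.

#include <assert.h>
#include <stdint.h>

/* Value a :prel_gN: field would need so the offset is measured from the
   last movk instead of from the instruction carrying the relocation.
   'bytes_to_last_movk' is the +12/+8/+4/+0 adjustment mentioned above. */
static uint16_t prel_field(uint64_t sym, uint64_t insn_addr,
                           uint64_t bytes_to_last_movk, unsigned group)
{
    uint64_t offset = sym - (insn_addr + bytes_to_last_movk);
    return (uint16_t)(offset >> (16 * group));   /* G0..G3: 16-bit slices */
}

/* Rebuild the address exactly the way the emitted sequence does. */
static uint64_t materialise(uint64_t sym, uint64_t movz_addr)
{
    uint64_t x0 = 0;
    x0 |= (uint64_t)prel_field(sym, movz_addr +  0, 12, 3) << 48; /* movz */
    x0 |= (uint64_t)prel_field(sym, movz_addr +  4,  8, 2) << 32; /* movk */
    x0 |= (uint64_t)prel_field(sym, movz_addr +  8,  4, 1) << 16; /* movk */
    x0 |= (uint64_t)prel_field(sym, movz_addr + 12,  0, 0);       /* movk */
    uint64_t x1 = (movz_addr + 16) - 4;   /* adr x1, . ; sub x1, x1, #0x4 */
    return x0 + x1;                       /* add x0, x0, x1               */
}

int main(void)
{
    /* Arbitrary addresses, just to check the round trip. */
    assert(materialise(0x123456789abcULL, 0x400000ULL) == 0x123456789abcULL);
    return 0;
}

Because every field is defined relative to the same anchor instruction, the four 16-bit slices reassemble into one 64-bit offset, which is what lets the scheme reach arbitrarily distant symbols.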
Created attachment 48585 [details] patch for binutils
(In reply to Bu Le from comment #0)

Is the main usage scenario huge arrays? If so, these could easily be allocated via malloc at startup rather than using bss. It means an extra indirection in some cases (to load the pointer), but it should be much more efficient than using a large code model with all the overheads.
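For illustration, this is roughly what the suggested transformation looks like when written out by hand in C (the original scenario is Fortran common blocks; the array name and size below are made up, and a compiler doing this automatically would apply the same idea):

#include <stdlib.h>

#define NELEMS (1024UL * 1024 * 1024)

/* Before: a 4GB object in .bss forces a larger code model.
   static float bar1[NELEMS];                                        */

/* After: only an 8-byte pointer lives in static storage, which the
   small code model can reach; accessing the array costs one extra
   load to fetch the pointer.                                        */
static float *bar1;

__attribute__((constructor))
static void bar1_init(void)
{
    bar1 = calloc(NELEMS, sizeof *bar1);   /* zero-filled, like .bss */
    if (!bar1)
        abort();
}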
(In reply to Wilco from comment #2)
> Is the main usage scenario huge arrays? If so, these could easily be
> allocated via malloc at startup rather than using bss. It means an extra
> indirection in some cases (to load the pointer), but it should be much more
> efficient than using a large code model with all the overheads.

Thanks for the reply.

The large array is just used to construct the test case; it is not a necessary condition for this scenario. The common scenario is that a symbol is too far away for the small code model to reach, which could also result from a large number of small arrays, structures, etc. Meanwhile, the large code model can reach the symbol but cannot be position independent, which is the problem.

Besides, the CESM code is quite complicated to restructure around malloc, which is also not an acceptable option for my customer.

Does that address your concern?
(In reply to Bu Le from comment #3)
> The common scenario is that a symbol is too far away for the small code
> model to reach, which could also result from a large number of small
> arrays, structures, etc. Meanwhile, the large code model can reach the
> symbol but cannot be position independent, which is the problem.

Well the question is whether we're talking about more than 4GB of code or more than 4GB of data. With >4GB code you're indeed stuck with the large model. With data it is feasible to automatically use malloc for arrays when larger than a certain size, so there is no need to change the application at all. Something like that could be the default in the small model so that you don't have any extra overhead unless you have huge arrays. Making the threshold configurable means you can tune it for a specific application.
(In reply to Bu Le from comment #0)

Also it would be much more efficient to have a relocation like this if you wanted a 48-bit PC-relative offset:

adrp x0, bar1.2782
add  x0, x0, :lo12:bar1.2782
movk x0, :high32_47:bar1.2782
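For context on why extra high bits are needed, below is a small C model of what the existing small-model pair (adrp + add :lo12:) computes. The helper names are illustrative, but the page arithmetic is the architecturally defined adrp behaviour, and the 21-bit signed page immediate is what limits its reach to roughly +/-4GB. The proposed :high32_47: field is deliberately not modelled here, since its exact definition is still open.

#include <assert.h>
#include <stdint.h>

static uint64_t page(uint64_t a) { return a & ~0xFFFULL; }

/* S = symbol address, P = address of the adrp instruction. */
static uint64_t small_model_addr(uint64_t S, uint64_t P)
{
    int64_t page_delta = (int64_t)(page(S) - page(P));
    /* adrp encodes the page delta as a signed 21-bit page count,
       i.e. roughly +/-4GB of reach.                               */
    assert(page_delta >= -(1LL << 32) && page_delta < (1LL << 32));
    uint64_t x0 = page(P) + page_delta;   /* adrp x0, sym          */
    return x0 + (S & 0xFFF);              /* add x0, x0, :lo12:sym */
}

int main(void)
{
    assert(small_model_addr(0x7f001234ULL, 0x400100ULL) == 0x7f001234ULL);
    return 0;
}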
(In reply to Wilco from comment #4)
> With data it is feasible to automatically use malloc for arrays when larger
> than a certain size, so there is no need to change the application at all.
> Something like that could be the default in the small model so that you
> don't have any extra overhead unless you have huge arrays. Making the
> threshold configurable means you can tune it for a specific application.

Is this automatic malloc already available on some target? I haven't found an example that works that way. Would you mind providing one?
(In reply to Wilco from comment #5)
> Also it would be much more efficient to have a relocation like this if you
> wanted a 48-bit PC-relative offset:
>
> adrp x0, bar1.2782
> add  x0, x0, :lo12:bar1.2782
> movk x0, :high32_47:bar1.2782

I am afraid that putting the PC-relative offset into x0 is not correct, because x0 is supposed to hold the final address of bar1 rather than a PC offset. Therefore an extra register is needed to hold the offset temporarily. We then need to add the PC address of the movk to that offset to calculate bits 32:47 of the final address of bar1, and finally add this part to x0 to compute the entire 48-bit final address. So the code would be the following sequence:

adrp x0, bar1.2782
add  x0, x0, :lo12:bar1.2782   // x0 now holds bits 0:31 of the final address
movk x4, :prel_g2:bar1.2782
adr  x1, .
sub  x1, x1, 0x4
add  x4, x4, x1                // x4 now holds bits 32:47 of the final address
add  x0, x4, x0

(By the way, the high32_47 relocation you suggested is prel_g2 in the official AArch64 ABI release.)

So actually, if we just want a 48-bit PC-relative relocation, your idea and mine both need 6-7 instructions to reach the symbol, so in terms of efficiency they would be similar. In terms of engineering, your idea saves the trouble of modifying the linker to calculate the offset for the three movks, but we would still need a new relocation type for ADRP, because the existing one checks the address for overflow and gives the "relocation truncated to fit" error. Therefore both ideas need work in binutils, which also makes them equivalent.
(In reply to Bu Le from comment #6)
> Is this automatic malloc already available on some target? I haven't found
> an example that works that way. Would you mind providing one?

Fortran already has -fstack-arrays to decide between allocating arrays on the heap or on the stack.
(In reply to Bu Le from comment #7)
> I am afraid that putting the PC-relative offset into x0 is not correct,
> because x0 is supposed to hold the final address of bar1 rather than a PC
> offset. Therefore an extra register is needed to hold the offset temporarily.

You're right, we need an extra add, so it's like this:

adrp x0, bar1.2782
movk x1, :high32_47:bar1.2782
add  x0, x0, x1
add  x0, x0, :lo12:bar1.2782

> (By the way, the high32_47 relocation you suggested is prel_g2 in the
> official AArch64 ABI release.)

It needs a new relocation because of the ADRP. ADR could be used so the existing R_<CLS>_MOVW_PREL_G0-3 work, but then you need 5 instructions.

> In terms of engineering, your idea saves the trouble of modifying the linker
> to calculate the offset for the three movks, but we would still need a new
> relocation type for ADRP, because the existing one checks the address for
> overflow and gives the "relocation truncated to fit" error. Therefore both
> ideas need work in binutils, which also makes them equivalent.

There is relocation 276 (R_<CLS>_ADR_PREL_PG_HI21_NC).
> Fortran already has -fstack-arrays to decide between allocating arrays on
> the heap or on the stack.

I tried the flag with my example. -fstack-arrays does not seem to be able to move arrays out of .bss onto the heap; the problem is still there.

Anyway, my point is that the size of a single data object doesn't affect the fact that the medium code model is missing on aarch64 and that aarch64 lacks a PIC-capable large code model.
> You're right, we need an extra add, so it's like this:
>
> adrp x0, bar1.2782
> movk x1, :high32_47:bar1.2782
> add  x0, x0, x1
> add  x0, x0, :lo12:bar1.2782

So you suggest a new relocation type "high32_47" to calculate the offset between the ADRP and bar1. Am I right?

> There is relocation 276 (R_<CLS>_ADR_PREL_PG_HI21_NC).

Yes, though we still need to change the compiler so that, for the medium code model, ADRP uses the R_<CLS>_ADR_PREL_PG_HI21_NC relocation.
(In reply to Bu Le from comment #10)
> I tried the flag with my example. -fstack-arrays does not seem to be able
> to move arrays out of .bss onto the heap; the problem is still there.

It is an existing feature that chooses between malloc and stack. It would need modification to do the same for large data/bss objects.

> Anyway, my point is that the size of a single data object doesn't affect
> the fact that the medium code model is missing on aarch64 and that aarch64
> lacks a PIC-capable large code model.

What is missing is efficient support for >4GB of data, right? How that is implemented is a different question - my point is that it does not require a new code model. It would be much better if it just worked without users even needing to think about code models.

Also, what is the purpose of a large fpic model? Are there any applications that use shared libraries larger than 4GB?
(In reply to Bu Le from comment #11)
> So you suggest a new relocation type "high32_47" to calculate the offset
> between the ADRP and bar1. Am I right?

Yes. It needs to have an offset to the adrp instruction so it can compute the correct ADRP offset and then extract bits 32-47.
> What is missing is efficient support for >4GB of data, right? How that is
> implemented is a different question - my point is that it does not require a
> new code model. It would be much better if it just worked without users even
> needing to think about code models.
>
> Also, what is the purpose of a large fpic model? Are there any applications
> that use shared libraries larger than 4GB?

Yes, I understand, and I am grateful for your suggestion. I have to say it is not a critical problem; after all, most applications work fine with the current code models.

But there are some cases, like CESM in certain configurations, or my test case, which cannot be compiled by the current GCC compiler on aarch64. Unfortunately, applications larger than 4GB are quite normal in the HPC field. Meanwhile, x86 and LLVM on aarch64 can compile them, with a medium or large-PIC code model. That is why I am proposing this. By adding this feature we take a step forward for the aarch64 GCC compiler, making it more powerful and robust.

Does that address your concern?

As for the implementation you suggested, I believe it is a promising plan, and I would like to try to implement it first. It might take weeks of development; I will see what I can get and keep you updated on the progress. Thanks again for the suggestion.
(In reply to Bu Le from comment #14)
> But there are some cases, like CESM in certain configurations, or my test
> case, which cannot be compiled by the current GCC compiler on aarch64.
> Unfortunately, applications larger than 4GB are quite normal in the HPC
> field. Meanwhile, x86 and LLVM on aarch64 can compile them, with a medium or
> large-PIC code model. That is why I am proposing this.

Yes but such a feature needs to be defined in an ABI and well specified. This is why I'm trying to get the underlying requirements first. Note that while LLVM allows -fpic in the large model, it doesn't correctly implement it. The large model shouldn't ever be needed by actual applications.

> As for the implementation you suggested, I believe it is a promising plan,
> and I would like to try to implement it first. It might take weeks of
> development; I will see what I can get and keep you updated on the progress.

As discussed, there are many different ways of supporting the requirement of >4GB of data, so I wouldn't start on the implementation before there is a good specification. GCC and LLVM would need to implement it in the same way after all.
Note there is an early writeup of the current code models here: https://github.com/ARM-software/abi-aa/pull/57/files (I've added the issues with the current large model in review comments).
Here is the current medium code model proposal: https://github.com/ARM-software/abi-aa/pull/107/files