Spurious register spill with volatile function argument

Sun Mar 27 05:58:00 GMT 2016

Seems I had misused volatile. I removed ‘volatile’ from the second parameter of test_0 and that eliminated the spill through the stack.

I added volatile because I was trying to stop the compiler optimising away the call to test_0 (as it has no side effects), but volatile turned out to be unnecessary, and a misuse: it is intended to indicate that storage may change outside the control of the compiler. It is an interesting case, however, as register arguments don’t have storage.

GCC, Clang folk, any ideas on why there is a stack spill for a volatile register argument passed in esi? Does volatile force the argument to have storage allocated on the stack? Is this a corner case in the C standard? In the x86_64 calling convention this argument only has a register, so technically it can’t change outside the control of the C “virtual machine”, and volatile has a vague meaning here. This seems to be a case of interpreting the C standard in such a way as to make sure that a volatile argument “can be changed” outside the control of the C “virtual machine” by explicitly giving it a storage location on the stack.

I think volatile scalar arguments are a special case, and that the volatile type qualifier shouldn’t widen the scope beyond the register unless the argument actually *needs* storage to spill. This is not a volatile stack-scoped variable unless the C standard interprets ABI register parameters as actually having ‘storage’, so this is questionable… Maybe I should have gotten a warning… or the volatile type qualifier on a scalar register argument should have been ignored…

volatile for scalar function arguments seems to mean: “make this volatile and subject to change outside of the compiler” rather than being a qualifier for its storage (which is a register).
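
On the original goal of keeping the call to test_0 alive without a volatile parameter: one possible approach, sketched below, is to leave the parameter plain and store the result to a volatile object that has real storage. The names mod_plain, sink and run_control are made up for illustration and are not part of the test program; noinline is the same GCC/Clang attribute the test program already uses. I have not inspected the assembly for this variant, and a compiler may still specialise the callee for constant arguments, so treat it as a sketch rather than a guarantee.

```c
#include <assert.h>

/* Hypothetical sink: a volatile object with static storage, so every
   store to it is an observable access the compiler has to perform. */
static volatile unsigned int sink;

/* Same shape as test_0, but the divisor is a plain (non-volatile) parameter. */
static unsigned int __attribute__ ((noinline)) mod_plain(unsigned int k, unsigned int p)
{
     return k % p;
}

/* Hypothetical replacement for the "control" call in main(). */
static unsigned int run_control(void)
{
     sink = mod_plain(1u, 8191u);   /* result is used, so the call is not dead */
     unsigned int r = sink;         /* volatile load of the stored result */
     assert(r == 1u);               /* 1 % 8191 == 1 */
     return r;
}
```

Calling run_control() from main() in place of the bare test_0(1, 8191) call keeps volatile on an object that really does have storage, instead of on a register argument.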

# gcc
test_0:
     mov     DWORD PTR [rsp-4], esi
     mov     ecx, DWORD PTR [rsp-4]
     mov     eax, edi
     cdq
     idiv    ecx
     mov     eax, edx
     ret

# clang
test_0:
	mov	dword ptr [rsp - 4], esi
	xor	edx, edx
	mov	eax, edi
	div	dword ptr [rsp - 4]
	mov	eax, edx
	ret

/* Test program compiled on x86_64 with: cc -O3 -fomit-frame-pointer -masm=intel -S test.c -o test.S  */

#include <stdio.h>
#include <limits.h>

static const int p = 8191;
static const int s = 13;

int __attribute__ ((noinline)) test_0(unsigned int k, volatile int p)
{
     return k % p;
}

int __attribute__ ((noinline)) test_1(unsigned int k)
{
     return k % p;
}

int __attribute__ ((noinline)) test_2(unsigned int k)
{
     int i = (k&p) + (k>>s);
     i = (i&p) + (i>>s);
     if (i>=p) i -= p;
     return i;
}

int main()
{
     test_0(1, 8191); /* control */
     for (int i = INT_MIN; i < INT_MAX; i++) {
             int r1 = test_1(i), r2 = test_2(i);
             if (r1 != r2) printf("%d %d %d\n", i, r1, r2);
     }
}
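
For anyone reading test_2 cold: the folding works because 2^13 = 8192 ≡ 1 (mod 8191), so k = hi*2^13 + lo ≡ hi + lo (mod 8191). Below is an all-unsigned variant of the same idea with the intermediate bounds written out. The names mod8191_fold and fold_matches are hypothetical helpers for this sketch, not part of the test program above.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce k modulo the Mersenne prime 2^13 - 1 = 8191 using masks and shifts. */
static uint32_t mod8191_fold(uint32_t k)
{
     const uint32_t m = 8191u;          /* 2^13 - 1 */
     const unsigned s = 13;

     uint32_t i = (k & m) + (k >> s);   /* at most 8191 + 524287 = 532478 */
     i = (i & m) + (i >> s);            /* at most 8191 + 64 = 8255 < 2*8191 */
     if (i >= m) i -= m;                /* one conditional subtract suffices */
     assert(i < m);
     return i;
}

/* Hypothetical helper: returns 1 if the fold agrees with % on [lo, hi]. */
static int fold_matches(uint32_t lo, uint32_t hi)
{
     for (uint32_t k = lo; ; k++) {
             if (mod8191_fold(k) != k % 8191u) return 0;
             if (k == hi) break;        /* avoids wrap-around when hi == UINT32_MAX */
     }
     return 1;
}
```

Two folds are enough for 32-bit inputs because the first leaves at most 20 significant bits and the second leaves a value below 2*8191.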

> On 27 Mar 2016, at 2:32 PM, Andrew Waterman <andrew@sifive.com> wrote:
> 
> It would be good to figure out how to get rid of the spurious register spills.
> 
> The strength reduction optimization isn't always profitable on Rocket,
> as it increases instruction count and code size.  The divider has an
> early out and for small numbers is quite fast.
> 
> On Fri, Mar 25, 2016 at 5:43 PM, Michael Clark <michaeljclark@mac.com> wrote:
>> That said, I have no idea how many cycles an integer divide takes on the Rocket, so the optimisation may not be a win.
>> 
>> Trying to read MuDiv in multiplier.scala, and will at some point run some timings in the cycle-accurate simulator.
>> 
>> In either case, the spurious stack moves emitted by GCC are curious...
>> 
>>> On 26 Mar 2016, at 9:42 AM, Michael Clark <michaeljclark@mac.com> wrote:
>>> 
>>> Hi All,
>>> 
>>> I have found an interesting case where an optimisation is not being applied by GCC on RISC-V. And also some strange assembly output from GCC on RISC-V.
>>> 
>>> Both GCC and Clang appear to optimise division by a constant Mersenne prime on x86_64; however, GCC on RISC-V is not applying this optimisation.
>>> 
>>> See test program and assembly output for these platforms:
>>> 
>>> * GCC -O3 on RISC-V
>>> * GCC -O3 on x86_64
>>> * LLVM/Clang -O3 on x86_64
>>> 
>>> Another strange observation is that GCC on RISC-V is moving a1 to a5 via a stack store followed by a stack load. Odd? GCC 5 also seems to be doing odd stuff with stack ‘moves’ on x86_64, moving esi to ecx via the stack (I think recent x86 micro-architectures treat the tip of the stack like an extended register file, so this may only have a small penalty on x86).
>>> 
>>> See GCC on RISC-V is emitting this:
>>> 
>>> test_0:
>>>      add     sp,sp,-16
>>>      sw      a1,12(sp)
>>>      lw      a5,12(sp)
>>>      add     sp,sp,16
>>>      remuw   a0,a0,a5
>>>      jr      ra
>>> 
>>> instead of this:
>>> 
>>> test_0:
>>>      remuw   a0,a0,a1
>>>      jr      ra
>>> 
>>> Compiler devs, please read the test program and assembly output. I have not yet tested LLVM/Clang on RISC-V… I will do that next… I have not had time to dig into the compiler code yet...
>>> 
>>> Regards,
>>> Michael.
>>> 
>>> 
>>> /* Test program */
>>> 
>>> #include <stdio.h>
>>> #include <limits.h>
>>> 
>>> static const int p = 8191;
>>> static const int s = 13;
>>> 
>>> int __attribute__ ((noinline)) test_0(unsigned int k, volatile int p)
>>> {
>>>      return k % p;
>>> }
>>> 
>>> int __attribute__ ((noinline)) test_1(unsigned int k)
>>> {
>>>      return k % p;
>>> }
>>> 
>>> int __attribute__ ((noinline)) test_2(unsigned int k)
>>> {
>>>      int i = (k&p) + (k>>s);
>>>      i = (i&p) + (i>>s);
>>>      if (i>=p) i -= p;
>>>      return i;
>>> }
>>> 
>>> int main()
>>> {
>>>      test_0(1, 8191); /* control */
>>>      for (int i = INT_MIN; i < INT_MAX; i++) {
>>>              int r1 = test_1(i), r2 = test_2(i);
>>>              if (r1 != r2) printf("%d %d %d\n", i, r1, r2);
>>>      }
>>> }
>>> 
>>> 
>>> 
>>> /* RISC-V GCC */
>>> 
>>> $ riscv64-unknown-elf-gcc --version
>>> riscv64-unknown-elf-gcc (GCC) 5.2.0
>>> 
>>> test_0:
>>>      add     sp,sp,-16
>>>      sw      a1,12(sp)
>>>      lw      a5,12(sp)
>>>      add     sp,sp,16
>>>      remuw   a0,a0,a5
>>>      jr      ra
>>> test_1:
>>>      li      a5,8192
>>>      addw    a5,a5,-1
>>>      remuw   a0,a0,a5
>>>      ret
>>> test_2:
>>>      li      a3,8192
>>>      addw    a2,a3,-1
>>>      and     a4,a0,a2
>>>      srlw    a0,a0,13
>>>      addw    a5,a4,a0
>>>      and     a0,a5,a2
>>>      sraw    a5,a5,13
>>>      addw    a0,a0,a5
>>>      addw    a3,a3,-2
>>>      ble     a0,a3,.L5
>>>      subw    a0,a0,a2
>>> .L5:
>>>      ret
>>> 
>>> 
>>> /* Linux x86_64 GCC */
>>> 
>>> $ gcc --version
>>> gcc (Debian 5.2.1-23) 5.2.1 20151028
>>> 
>>> test_0:
>>>      mov     DWORD PTR [rsp-4], esi
>>>      mov     ecx, DWORD PTR [rsp-4]
>>>      mov     eax, edi
>>>      cdq
>>>      idiv    ecx
>>>      mov     eax, edx
>>>      ret
>>> test_1:
>>>      mov     eax, edi
>>>      mov     rcx, rax
>>>      mov     rdx, rax
>>>      sal     rcx, 6
>>>      sal     rdx, 19
>>>      add     rdx, rcx
>>>      add     rax, rdx
>>>      mov     edx, edi
>>>      shr     rax, 32
>>>      sub     edx, eax
>>>      shr     edx
>>>      add     eax, edx
>>>      shr     eax, 12
>>>      mov     edx, eax
>>>      sal     edx, 13
>>>      sub     edx, eax
>>>      sub     edi, edx
>>>      mov     eax, edi
>>>      ret
>>> test_2:
>>>      mov     eax, edi
>>>      shr     edi, 13
>>>      and     eax, 8191
>>>      add     eax, edi
>>>      mov     edx, eax
>>>      sar     eax, 13
>>>      and     edx, 8191
>>>      add     eax, edx
>>>      lea     edx, [rax-8191]
>>>      cmp     eax, 8191
>>>      cmovge  eax, edx
>>>      ret
>>> 
>>> 
>>> /* Darwin x86_64 LLVM Clang */
>>> 
>>> $ cc --version
>>> Apple LLVM version 7.3.0 (clang-703.0.29)
>>> 
>>> _test_0:
>>>       mov     dword ptr [rsp - 4], esi
>>>       xor     edx, edx
>>>       mov     eax, edi
>>>       div     dword ptr [rsp - 4]
>>>       mov     eax, edx
>>>       ret
>>> _test_1:
>>>       mov     eax, edi
>>>       imul    rax, rax, 524353
>>>       shr     rax, 32
>>>       mov     ecx, edi
>>>       sub     ecx, eax
>>>       shr     ecx
>>>       add     ecx, eax
>>>       shr     ecx, 12
>>>       imul    eax, ecx, 8191
>>>       sub     edi, eax
>>>       mov     eax, edi
>>>       ret
>>> _test_2:
>>>       mov     eax, edi
>>>       and     eax, 8191
>>>       mov     ecx, edi
>>>       shr     ecx, 13
>>>       add     eax, ecx
>>>       add     ecx, edi
>>>       and     ecx, 8191
>>>       shr     eax, 13
>>>       lea     edx, [rcx + rax]
>>>       cmp     edx, 8190
>>>       lea     eax, [rcx + rax - 8191]
>>>       cmovbe  eax, edx
>>>       ret
>>> 
>>> --
>>> You received this message because you are subscribed to the Google Groups "RISC-V SW Dev" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to sw-dev+unsubscribe@groups.riscv.org.
>>> To post to this group, send email to sw-dev@groups.riscv.org.
>>> Visit this group at https://groups.google.com/a/groups.riscv.org/group/sw-dev/.
>>> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/sw-dev/2600D96D-94BC-4259-9D39-DE4993859281%40mac.com.