This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Asm volatile causing performance regressions on ARM


Hi all,

We recently ran into a performance and code size regression on ARM targets after moving from GCC 4.7 to GCC 4.8 (the regression is also present in 4.9).

The following code snippet uses Linux-style compiler barriers to protect memory writes:

  #define barrier() __asm__ __volatile__ ("": : :"memory")
  #define write(v,a) { barrier(); *(volatile unsigned *)(a) = (v); }

  #define v1 0x00100000
  #define v2 0xaabbccdd

  void test(unsigned base) {
    write(v1, base + 0x100);
    write(v2, base + 0x200);
    write(v1, base + 0x300);
    write(v2, base + 0x400);
  }
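
For anyone who wants to check the macros on a desktop host before looking at ARM output, here is a minimal sketch of the same test. The names mmio_write, test_host and run_host_check are hypothetical and not part of the original snippet. The macro is renamed only to avoid clashing with POSIX write(), and the base address is a uintptr_t so that it also works on 64-bit hosts. It assumes a GCC-compatible compiler for the inline asm.

```c
#include <assert.h>
#include <stdint.h>

/* Same barrier as in the original snippet: an empty volatile asm with a
   "memory" clobber (GCC/Clang extension). */
#define barrier() __asm__ __volatile__ ("" : : : "memory")

/* Hypothetical rename of write(); do/while(0) makes the macro behave
   like a single statement. */
#define mmio_write(v, a) \
  do { barrier(); *(volatile unsigned *)(a) = (v); } while (0)

#define V1 0x00100000u
#define V2 0xaabbccddu

/* Same store sequence as test(), with a pointer-sized base address. */
void test_host(uintptr_t base) {
  mmio_write(V1, base + 0x100);
  mmio_write(V2, base + 0x200);
  mmio_write(V1, base + 0x300);
  mmio_write(V2, base + 0x400);
}

/* Runs test_host() against an ordinary buffer and returns how many of
   the four stores landed with the expected value. */
int run_host_check(void) {
  static unsigned buf[0x500 / sizeof(unsigned)];
  int ok = 0;

  test_host((uintptr_t)buf);
  ok += buf[0x100 / sizeof(unsigned)] == V1;
  ok += buf[0x200 / sizeof(unsigned)] == V2;
  ok += buf[0x300 / sizeof(unsigned)] == V1;
  ok += buf[0x400 / sizeof(unsigned)] == V2;
  assert(ok == 4);  /* all four stores should be visible */
  return ok;
}
```

This only confirms that the stores happen; the regression itself shows up in the generated assembly, not in the program's behavior.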

Code generated by GCC 4.7 under -Os (all good):

   mov r2, #7340032
   str r2, [r0, #3604]
   ldr r3, .L2
   str r3, [r0, #3612]
   str r2, [r0, #3632]
   str r3, [r0, #3640]

(Note that the compiler decided to load v2 from the constant pool.)

The code generated by GCC 4.8/4.9 under -Os is larger (eight instructions instead of six) because v1 and v2 are reloaded before every store:

   mov r3, #7340032
   str r3, [r0, #3604]
   ldr r3, .L2
   str r3, [r0, #3612]
   mov r3, #7340032
   str r3, [r0, #3632]
   ldr r3, .L2
   str r3, [r0, #3640]

v1 and v2 are constant literals and can't be changed by the user, so I would expect the compiler to combine the loads.

After some investigation, we discovered that this behavior is caused by a big hammer in gcc/cse.c:
   /* A volatile ASM or an UNSPEC_VOLATILE invalidates everything.  */
   if (NONJUMP_INSN_P (insn)
       && volatile_insn_p (PATTERN (insn)))
     flush_hash_table ();
This code (introduced in http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193802) flushes the entire CSE hash table whenever it sees a volatile inline asm, so no previously computed value is reused across the barrier.
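
To make the effect concrete, here is a toy model of the store sequence in the test case. It is only a sketch and not GCC's actual CSE code: the available[] array and count_materializations() are hypothetical stand-ins for the hash table, and register allocation is ignored entirely. It counts how often a constant has to be materialized with and without a flush at every barrier.

```c
#include <assert.h>
#include <string.h>

enum { NUM_CONSTS = 2 };  /* v1 and v2 from the test case */

/* Toy stand-in for the CSE table: available[i] != 0 means constant i is
   already known to be in a register. Not GCC's real data structure. */
static int available[NUM_CONSTS];

/* Returns how many times a constant must be materialized for the store
   sequence v1, v2, v1, v2, where each store is preceded by a barrier. */
int count_materializations(int flush_on_volatile_asm) {
  static const int stores[] = { 0, 1, 0, 1 };
  int loads = 0;

  memset(available, 0, sizeof available);
  for (size_t i = 0; i < sizeof stores / sizeof stores[0]; i++) {
    /* The barrier comes first; optionally forget everything. */
    if (flush_on_volatile_asm)
      memset(available, 0, sizeof available);

    /* Materialize the constant only if it is not already known. */
    if (!available[stores[i]]) {
      loads++;
      available[stores[i]] = 1;
    }
  }
  return loads;
}
```

In this model count_materializations(0) gives 2 and count_materializations(1) gives 4, which matches the number of mov/ldr instructions in the 4.7 and 4.8 listings respectively.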

Is this compiler behavior reasonable? AFAIK the GCC documentation only says that __volatile__ prevents the compiler from removing the asm; it does not mention that it suppresses optimization of all surrounding expressions.

If this behavior is not intended, what would be the best way to fix the performance regression? I could teach GCC not to remove constant RTXs in flush_hash_table(), but this is probably very naive and won't cover some corner cases.
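
As a rough illustration of that idea (not a patch against cse.c), the selective flush could look something like the sketch below. The struct entry type, the entry kinds and flush_keep_constants() are all hypothetical. The real hash table holds RTXs with many more properties, and whether keeping constants is safe in every case is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of table entries. */
enum entry_kind { ENTRY_CONSTANT, ENTRY_MEMORY };

struct entry {
  enum entry_kind kind;
  int valid;  /* non-zero while the recorded value may still be reused */
};

/* Selective flush: invalidate everything except constants. */
void flush_keep_constants(struct entry *table, size_t n) {
  for (size_t i = 0; i < n; i++)
    if (table[i].kind != ENTRY_CONSTANT)
      table[i].valid = 0;
}

/* Builds a small table, applies the selective flush, and returns the
   number of entries that are still valid afterwards. */
size_t surviving_after_flush(void) {
  struct entry table[] = {
    { ENTRY_CONSTANT, 1 },  /* v1 */
    { ENTRY_CONSTANT, 1 },  /* v2 */
    { ENTRY_MEMORY,   1 },  /* a value previously loaded from memory */
  };
  size_t n = sizeof table / sizeof table[0];
  size_t alive = 0;

  flush_keep_constants(table, n);
  for (size_t i = 0; i < n; i++)
    alive += table[i].valid != 0;

  assert(table[2].valid == 0);  /* the memory-derived value must be dropped */
  return alive;
}
```

Here the two constant entries survive the barrier while the memory-derived one is dropped, so surviving_after_flush() returns 2.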

-Y

