Asm volatile causing performance regressions on ARM

Yury Gribov y.gribov@samsung.com
Thu Feb 27 14:35:00 GMT 2014


Hi all,

We have recently run into a performance/code size regression on ARM
targets after the transition from GCC 4.7 to GCC 4.8 (the regression is
also present in 4.9).

The following code snippet uses Linux-style compiler barriers to protect 
memory writes:

   #define barrier() __asm__ __volatile__ ("": : :"memory")
   #define write(v,a) { barrier(); *(volatile unsigned *)(a) = (v); }

   #define v1 0x00100000
   #define v2 0xaabbccdd

   void test(unsigned base) {
     write(v1, base + 0x100);
     write(v2, base + 0x200);
     write(v1, base + 0x300);
     write(v2, base + 0x400);
   }

Code generated by GCC 4.7 under -Os (all good):

    mov r2, #7340032
    str r2, [r0, #3604]
    ldr r3, .L2
    str r3, [r0, #3612]
    str r2, [r0, #3632]
    str r3, [r0, #3640]

(note that the compiler decided to load v2 from the constant pool).

The code generated by GCC 4.8/4.9 under -Os is larger because v1 and v2
are reloaded before every store:

    mov r3, #7340032
    str r3, [r0, #3604]
    ldr r3, .L2
    str r3, [r0, #3612]
    mov r3, #7340032
    str r3, [r0, #3632]
    ldr r3, .L2
    str r3, [r0, #3640]

v1 and v2 are constant literals and can't be changed by the user, so I
would expect the compiler to combine the loads.
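
For anyone who wants to reproduce this, here is a minimal sketch of a
self-contained version of the test case. It is not the original source:
the macro is renamed to write_reg (to avoid clashing with the POSIX
write()), base is passed as uintptr_t so the file also builds on 64-bit
hosts, and check_stores() is a hypothetical helper that only verifies
that the four stores land where expected.

```c
#include <stdint.h>
#include <string.h>

/* Linux-style compiler barrier, same as in the snippet above. */
#define barrier() __asm__ __volatile__ ("" : : : "memory")

/* Renamed from write() for this sketch; do/while makes it statement-safe. */
#define write_reg(v, a) \
  do { barrier(); *(volatile unsigned *)(a) = (v); } while (0)

#define V1 0x00100000u
#define V2 0xaabbccddu

/* Same shape as test() above: four barrier-protected stores. */
void test_stores(uintptr_t base) {
  write_reg(V1, base + 0x100);
  write_reg(V2, base + 0x200);
  write_reg(V1, base + 0x300);
  write_reg(V2, base + 0x400);
}

/* Hypothetical helper: run the stores against an ordinary buffer and
   return 1 if every value arrived at its offset, 0 otherwise. */
int check_stores(void) {
  static unsigned buf[0x500 / sizeof(unsigned)];
  memset(buf, 0, sizeof buf);
  test_stores((uintptr_t)buf);
  return buf[0x100 / sizeof(unsigned)] == V1
      && buf[0x200 / sizeof(unsigned)] == V2
      && buf[0x300 / sizeof(unsigned)] == V1
      && buf[0x400 / sizeof(unsigned)] == V2;
}
```

Compiling test_stores() with -Os -S on an ARM-targeted GCC 4.7 and 4.8
should let you compare how many times each constant is materialized; the
exact registers and offsets will differ from the listings above.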

After some investigation, we discovered that this behavior is caused by
a big hammer in gcc/cse.c:

    /* A volatile ASM or an UNSPEC_VOLATILE invalidates everything.  */
    if (NONJUMP_INSN_P (insn)
        && volatile_insn_p (PATTERN (insn)))
      flush_hash_table ();

This code (introduced in
http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193802) flushes
the whole CSE hash table as soon as it sees a volatile inline asm, so
CSE effectively starts over after every barrier.

Is this compiler behavior reasonable? AFAIK the GCC documentation only
says that __volatile__ prevents the compiler from removing the asm; it
does not mention that it suppresses optimization of all surrounding
expressions.

If this behavior is not intended, what would be the best way to fix the
performance? I could teach GCC not to remove constant RTXs in
flush_hash_table(), but this is probably very naive and won't cover
some corner cases.
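
To make the idea concrete, here is a toy model of the difference between
flushing everything and keeping constants across a barrier. This is NOT
GCC code and does not use any GCC internals: the op list, count_loads()
and the keep_constants flag are all made up for illustration, and the
"table" only tracks which constants are currently available in some
register.

```c
#include <stddef.h>

/* Toy model only: each op is either a barrier or a store of a constant. */
enum op_kind { OP_BARRIER, OP_STORE_CONST };
struct op { enum op_kind kind; unsigned value; };

#define TABLE_SIZE 8

/* Count how many times a constant has to be materialized.
   keep_constants == 0 models "flush everything at a barrier";
   keep_constants != 0 models "constants survive a barrier". */
unsigned count_loads(const struct op *ops, size_t n, int keep_constants) {
  unsigned table[TABLE_SIZE];   /* constants currently available */
  size_t used = 0;
  unsigned loads = 0;

  for (size_t i = 0; i < n; i++) {
    if (ops[i].kind == OP_BARRIER) {
      if (!keep_constants)
        used = 0;               /* forget every known value */
      continue;
    }
    int found = 0;
    for (size_t j = 0; j < used; j++)
      if (table[j] == ops[i].value)
        found = 1;
    if (!found) {
      loads++;                  /* constant must be loaded again */
      if (used < TABLE_SIZE)
        table[used++] = ops[i].value;
    }
  }
  return loads;
}

/* The four barrier+store pairs from test(), expressed in the toy model. */
unsigned count_loads_for_test_case(int keep_constants) {
  static const struct op ops[] = {
    { OP_BARRIER, 0 }, { OP_STORE_CONST, 0x00100000u },
    { OP_BARRIER, 0 }, { OP_STORE_CONST, 0xaabbccddu },
    { OP_BARRIER, 0 }, { OP_STORE_CONST, 0x00100000u },
    { OP_BARRIER, 0 }, { OP_STORE_CONST, 0xaabbccddu },
  };
  return count_loads(ops, sizeof ops / sizeof ops[0], keep_constants);
}
```

In this model, flushing everything gives four loads and keeping
constants gives two, which matches the 4.8 and 4.7 listings above. The
model ignores register pressure and everything else the real pass has to
consider, so it says nothing about the corner cases.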

-Y


