This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: Asm volatile causing performance regressions on ARM

From: David Brown <david at westcontrol dot com>
To: Richard Biener <richard dot guenther at gmail dot com>
Cc: Georg-Johann Lay <avr at gjlay dot de>, Yury Gribov <y dot gribov at samsung dot com>, Eric Botcazou <ebotcazou at adacore dot com>, GCC Development <gcc at gcc dot gnu dot org>, <hp at gcc dot gnu dot org>, Viacheslav Garbuzov <v dot garbuzov at samsung dot com>, Yuri Gribov <tetra2005 at gmail dot com>, <rsandifo at linux dot vnet dot ibm dot com>
Date: Mon, 3 Mar 2014 16:01:09 +0100
Subject: Re: Asm volatile causing performance regressions on ARM

On 03/03/14 14:54, Richard Biener wrote:
> On Mon, Mar 3, 2014 at 1:53 PM, David Brown <david@westcontrol.com> wrote:
>> On 03/03/14 11:49, Richard Biener wrote:
>>> On Mon, Mar 3, 2014 at 11:41 AM, David Brown <david@westcontrol.com> wrote:
>>>> On 28/02/14 13:19, Richard Sandiford wrote:
>>>>> Georg-Johann Lay <avr@gjlay.de> writes:
>>>>>> Notice that in code1, func might contain such asm-pairs to implement
>>>>>> atomic operations, but moving costly_func across func does *not*
>>>>>> affect the interrupt respond times in such a disastrous way.
>>>>>>
>>>>>> Thus you must be *very* careful w.r.t. optimizing against asm volatile
>>>>>> + memory clobber.  It's too easy to miss some side effects of *real*
>>>>>> code.
>>>>>
>>>>> I understand the example, but I don't think volatile asms guarantee
>>>>> what you want here.
>>>>>
>>>>>> Optimizing code to scrap and pointing to some GCC internal reasoning or some
>>>>>> standard's wording does not help with real code.
>>>>>
>>>>> But how else can a compiler work?  It doesn't just regurgitate canned code,
>>>>> so it can't rely on human intuition as to what "makes sense".  We have to
>>>>> have specific rules and guarantees and say that anything outside those
>>>>> rules and guarantees is undefined.
>>>>>
>>>>> It sounds like you want an asm with an extra-strong ordering guarantee.
>>>>> I think that would need to be an extension, since it would need to consider
>>>>> cases where the asm is used in a function.  (Shades of carries_dependence
>>>>> or whatever in the huge atomic thread.)  I think anything where:
>>>>>
>>>>>   void foo (void) { X; }
>>>>>   void bar (void) { Y1; foo (); Y2; }
>>>>>
>>>>> has different semantics from:
>>>>>
>>>>>   void bar (void) { Y1; X; Y2; }
>>>>>
>>>>> is very dangerous.  And assuming that any function call could enable
>>>>> or disable interrupts, and therefore that nothing can be moved across
>>>>> a non-const function call, would limit things a bit too much.
>>>>>
>>>>> Thanks,
>>>>> Richard
>>>>>
>>>>>
>>>>
>>>> I think the problem stems from "volatile" being a barrier to /data flow/
>>>> changes,
>>>
>>> What kind of /data flow/ changes?  It certainly isn't that currently,
>>> only two volatiles always conflict but not a volatile and a non-volatile mem:
>>>
>>> static int
>>> true_dependence_1 (const_rtx mem, enum machine_mode mem_mode, rtx mem_addr,
>>>                    const_rtx x, rtx x_addr, bool mem_canonicalized)
>>> {
>>> ...
>>>   if (MEM_VOLATILE_P (x) && MEM_VOLATILE_P (mem))
>>>     return 1;
>>>
>>> bool
>>> refs_may_alias_p_1 (ao_ref *ref1, ao_ref *ref2, bool tbaa_p)
>>> {
>>> ...
>>>   /* Two volatile accesses always conflict.  */
>>>   if (ref1->volatile_p
>>>       && ref2->volatile_p)
>>>     return true;
>>>
>>>> but what is needed in this case is a barrier to /control flow/
>>>> changes.  To my knowledge, C does not provide any way of doing this, nor
>>>> are there existing gcc extensions to guarantee the ordering.  But it
>>>> certainly is the case that control flow ordering like this is important
>>>> - it can be critical in embedded systems (such as in the example here by
>>>> Georg-Johann), but it can also be important for non-embedded systems
>>>> (such as to minimise the time spent while holding a lock).
>>>
>>> Can you elaborate on this?  I have a hard time thinking of a
>>> control flow transform that affects volatiles.
>>>
>>> Richard.
>>>
>>
>> I am perhaps not expressing myself very clearly here (and I don't know
>> the internals of gcc well enough to use the source to help).
>>
>> Normal (i.e., not "asm") volatile accesses force an order on those
>> volatile data accesses - if the source code says a volatile read of "x"
>> then a volatile read of "y", then the compiler has to issue those reads
>> in that order.  It can't re-order them, or hoist them out of a loop, or
>> do any other re-ordering optimisations.  Clobbers, inputs and outputs in
>> inline assembly give a similar ordering on the data flow.  But none of
>> this affects the /control/ flow.  So the __attribute__((const))
>> "costly_func" described by Georg-Johann can be moved freely by the
>> compiler amongst these volatile /data/ accesses.
>>
>> The C abstract machine does not have any concept of timings, only of
>> observable accesses (volatile accesses, calls to external code, and
>> entry/exit from main()).  So it does not distinguish between the sequences:
>>
>>         volX = 1;
>>         y = costly_func(z);
>>         volX = 2;
>>
>> and
>>
>>         y = costly_func(z);
>>         volX = 1;
>>         volX = 2;
>>
>> and
>>         volX = 1;
>>         volX = 2;
>>         y = costly_func(z);
>>
>> (This assumes that costly_func is __attribute__((const)), and y and z
>> are non-volatile.)
>>
>> For some real-world usage, however, these sequences are very different.
>>  In "big" systems, it is unlikely to change correctness.  If "volX" were
>> part of a locking mechanism, for example, then each version of this code
>> would be correct - but they might differ in the length of time that the
>> locks were held, and that could seriously affect performance.  In
>> embedded systems, low performance could mean failure.  The problem is
>> exacerbated by small cpus that need library functions for seemingly
>> simple operations - gcc might happily move a division operation around
>> without realising the massive time cost on an 8-bit processor.
>>
>> In particular, I have seen code like this:
>>
>> extern volatile int v1, v2;
>> extern volatile bool interruptEnable;
>> int c;
>> void foo(int a) {
>>         int b = a / c;
>>         interruptEnable = 0;
>>         v1 = b;
>>         v2 = b;
>>         interruptEnable = 1;
>> }
>>
>> get transformed to move the division in between the interrupt disable
>> and writing to "v1".  This is a valid transform from C's viewpoint.
>> Putting an "asm volatile("" ::: "memory");" before disabling the
>> interrupts usually helps, but AFAIK it is not guaranteed by gcc.  Making
>> "b" volatile /will/ help, but means extra memory and instructions -
>> something you often want to avoid in embedded systems.
> 
> That's not what I call "control flow" but it's rather data dependences
> again (or value dependences if you like to distinguish it from things
> in memory).
> 

Maybe I am using the term "control flow" in a different way - and as I
don't know if I am correct, or you are correct, or if we are both wrong
or both correct, I shall stop using it.  Hopefully that will help.

The problem here is that there are no helpful data dependencies, but we
want to force an execution order dependency.  No data dependency forces
the calculation of "b" before "interruptEnable = 0", so the compiler is
free to move the "a/c" calculation to after that line.  The only
constraint is the data dependency forcing the "a/c" calculation before
"v1 = b".

> Indeed nothing specifies the point where a/c is executed apart from
> that it will be computed before its value is consumed.

Correct - that is the data flow.

What we would like is to force the execution of "a/c" to occur before
the execution of "interruptEnable = 0".

We could do that (in this case) by making an artificial volatile data
dependency, and thus forcing the order - simply making "b" volatile.
Even better, perhaps, would be to write this:

void foo(int a) {
        int b = a / c;
        volatile int volB = b;
        interruptEnable = 0;
        v1 = b;
        v2 = b;
        interruptEnable = 1;
}

That write to "volB" would force the calculation of "a/c" before
"interruptEnable = 0", at the cost of extra stack space and an extra
write - but at least it would not add extra volatile reads for b.

> 
> You can't get both, "strict ordering" and "no penalty due to using
> volatile".  But in the above case the scheduler description for the
> target should ensure that a/c is moved as far away from its consumer
> as possible - but wait - probably that gets disabled by volatiles
> being a scheduling barrier ... ;)  (and at expansion time TER likely
> "moves" a/c directly before v1 = b).
> 

It's these moves of the execution that I (and other embedded developers)
want to avoid in such cases.  And we /want/ to have it with no penalty -
or at least minimal penalty.

In this particular case, I think it would be possible to write:

void foo(int a) {
        int b = a / c;
        asm volatile( "" :: "r" (b) );
        interruptEnable = 0;
        v1 = b;
        v2 = b;
        interruptEnable = 1;
}

I /believe/ that tells gcc that "b" will be used but not changed by the
volatile asm line, thus forcing an extra data dependency and getting the
calculation order that we need.
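
As a self-contained sketch of the same idea, the empty asm can be wrapped
in a small helper.  The name "force_evaluation" is made up for
illustration, and the volatiles are defined locally here rather than
declared extern so that the fragment compiles on its own.  This relies on
the GCC inline-asm extension, and the GCC manual notes that even volatile
asm statements can be moved relative to other code, so treat it as an
assumption to check against the generated code rather than a guarantee:

```c
#include <stdbool.h>

/* Stand-ins for the hardware-related variables in the example above.
   On a real target these would be extern declarations. */
volatile int v1, v2;
volatile bool interruptEnable = 1;
int c = 1;

/* Hypothetical helper (not a GCC builtin): an empty asm statement that
   claims to read `value` from a register.  The compiler therefore has to
   have computed `value` before this point; no instruction is emitted. */
static inline void force_evaluation(int value)
{
    __asm__ volatile("" : : "r"(value));
}

void foo(int a)
{
    int b = a / c;          /* the potentially costly division */
    force_evaluation(b);    /* artificial data dependency on b */
    interruptEnable = 0;
    v1 = b;
    v2 = b;
    interruptEnable = 1;
}
```

Compared with the "volB" version, this should avoid the extra stack slot
and store, since "b" only has to be present in a register at that point.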

However, I don't know if it is always possible to make such data
dependencies - having a way to force execution order would be a useful
feature.
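
One case where a plain "r" input does not obviously fit is a result that
is too large for a single register.  A possible variant, again only a
sketch, uses an "m" input so that the asm claims to read the object from
memory.  The struct, the function names and the variables below are all
invented for the illustration, and the "m" constraint costs a store of
the object to the stack:

```c
/* Hypothetical aggregate result that does not fit in one register. */
struct sample {
    int quotient;
    int remainder;
};

/* Stand-ins for volatile outputs and an interrupt-enable flag. */
volatile int out_q, out_r;
volatile int irq_enable = 1;

/* Hypothetical stand-in for an expensive computation with no side effects. */
static struct sample costly_divide(int a, int c)
{
    struct sample s = { a / c, a % c };
    return s;
}

void publish(int a, int c)
{
    struct sample s = costly_divide(a, c);

    /* "m" input: the empty asm claims to read `s` from memory, so `s`
       has to be fully computed and stored before this statement. */
    __asm__ volatile("" : : "m"(s));

    irq_enable = 0;
    out_q = s.quotient;
    out_r = s.remainder;
    irq_enable = 1;
}
```

Whether this keeps the computation outside the interrupt-disabled region
on a given target is something I would check in the generated assembly.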

David


> Richard.
>
