This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [RFC PATCH, i386]: Use "lock orl $0, -4(%esp)" in mfence_nosse
- From: Uros Bizjak <ubizjak at gmail dot com>
- To: Jakub Jelinek <jakub at redhat dot com>
- Cc: "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>, peter at cordes dot ca
- Date: Fri, 17 Feb 2017 17:59:30 +0100
- Subject: Re: [RFC PATCH, i386]: Use "lock orl $0, -4(%esp)" in mfence_nosse
- References: <CAFULd4ZB8jehEJZBDmn10HGqQvOho9MJ9wDZVorRmbZMduJxDA@mail.gmail.com> <20170217163022.GK1849@tucnak>
On Fri, Feb 17, 2017 at 5:30 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Sun, May 29, 2016 at 11:10:15PM +0200, Uros Bizjak wrote:
>> As explained in PR71245, comment #3 [1], it is better to use an offset
>> of -4 from %esp to implement a non-SSE memory fence instruction:
>>
>> -q-
>>
>> I guess it costs a code byte for a disp8 in the addressing mode, but
>> it avoids adding a lot of latency to a critical path involving a
>> spill/reload to (%esp), in functions where there is something at
>> (%esp).
>>
>> If it's an object larger than 4B, the lock orl could even cause a
>> store-forwarding stall when the object is reloaded. (e.g. a double or
>> a vector).
>>
>> Ideally we could do the lock orl on some padding between two locals,
>> or on something in memory that wasn't going to be loaded soon, to
>> avoid touching more stack memory (which might be in the next page
>> down). But we still want to do it on a cache line that's hot, so
>> going way up above our own stack frame isn't good either.
>
> Unfortunately this makes valgrind unhappy about that:
> https://bugzilla.redhat.com/show_bug.cgi?id=1423434
> I assume it will complain now on anything pre-SSE2 that contains the memory
> barrier in 32-bit code.
> Perhaps we should decrement and increment %esp around it or something
> similar (or push/pop)? Of course, that would mean we need to take care
> of async unwind info.
Or, we can simply revert the patch? Not that the barrier performance
of non-SSE 32-bit targets matters...
Uros.
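
For reference, a minimal sketch of the fence under discussion, written as
GCC extended inline asm in C. This is not the actual mfence_nosse pattern
from i386.md; the names fence_nosse and fence_demo are made up for
illustration, and the x86-64 and generic branches exist only so the snippet
builds on other hosts. The OR with 0 leaves the stack word unchanged; only
the lock prefix matters. The access at -4(%esp) lies below the stack
pointer, which is presumably what valgrind objects to in 32-bit code.

```c
#include <assert.h>

/* Hypothetical helper: full barrier without SSE2's mfence. */
static inline void fence_nosse(void)
{
#if defined(__i386__)
    /* Locked no-op RMW on the word just below the stack pointer, so it
       does not touch a live spill slot at (%esp). */
    __asm__ __volatile__("lock orl $0, -4(%%esp)" ::: "memory", "cc");
#elif defined(__x86_64__)
    /* Same idea for 64-bit hosts; the word is left unmodified. */
    __asm__ __volatile__("lock orl $0, -4(%%rsp)" ::: "memory", "cc");
#else
    /* Fallback for non-x86 hosts: compiler-provided full barrier. */
    __sync_synchronize();
#endif
}

/* Store, fence, reload: the fence must not alter any stack data. */
int fence_demo(void)
{
    volatile int flag = 42;
    fence_nosse();
    assert(flag == 42);
    return flag;
}
```

Using 0(%esp) instead would still be a valid barrier, but, as the quoted
comment notes, it can lengthen a spill/reload dependency chain through
whatever lives at the top of the stack.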