Bug 91053 - __builtin___clear_cache can fail
Summary: __builtin___clear_cache can fail
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: libgcc (show other bugs)
Version: unknown
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-02 07:50 UTC by Orion Hodson
Modified: 2019-07-05 15:34 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Orion Hodson 2019-07-02 07:50:47 UTC
This issue arises from using libgcc in the Android Runtime's Just-In-Time compiler. The Android Runtime uses __buitin__clear_cache() for JIT cache maintenance and we’ve become aware that this builtin can silently fail. This leaves the CPU caches in an unknown state potentially leading to a crash from the execution of unintended instruction sequences.

The specific case where we've observed this failure is for devices with ARMv7 Linux kernels. The libgcc clear_cache builtin calls the cacheflush() system call which bottoms out in v7_coherent_user_range():

  https://github.com/torvalds/linux/blob/master/arch/arm/mm/cache-v7.S#L253

The important detail in that code is that the blocks of code within USER() macros will call the fault handler, labelled 9001, if the cache operation causes a fault. The return value from v7_coherent_user_range is then -EFAULT. When this happens after code updates in the JIT cache, crashes can be expected. For example, if we didn't manage to invalidate all of the instruction cache range, and thus executes a mix-and-match of old and new instructions.

This issue is quite hard to reproduce and typically only occurs under memory pressure.

The documentation for __builtin___clear_cache() does not comment on the possibilities of failures:

  https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

but this is documented behaviour of the cacheflush() system call:

  http://man7.org/linux/man-pages/man2/cacheflush.2.html

Potential fixes could include:

  1) no change to __builtin__clear_cache, but update the documentation to indicate that failure is possible on systems where cache flushing is a privileged operation. On these systems callers of the builtin should clear errno before the call and check it afterwards.

  2) change __builtin___clear_cache() to return an error code.

  3) a fix for Linux kernels so cacheflush() cannot fail. [Outside the scope of this bug.]

Our workaround for now is to special case on ARMv7 to use the cacheflush() system call directly and check for errors (http://r.android.com/989545).
Comment 1 Andrew Pinski 2019-07-02 08:13:46 UTC
So was reading the code.  The only time a fault will happen when a page is not mapped in.  This happens if the page was paged out (this is why it happens under heavily pressure).  This is not about being privileged at all.
Comment 2 Orion Hodson 2019-07-05 15:34:15 UTC
For sure the goal wasn't to suggest that this was due to a privileged operation.