Bug 96327 - Inefficient increment through pointer to volatile on x86
Summary: Inefficient increment through pointer to volatile on x86
Status: RESOLVED DUPLICATE of bug 3506
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: unknown
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-07-27 01:47 UTC by Paul McKenney
Modified: 2020-07-30 19:23 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Paul McKenney 2020-07-27 01:47:08 UTC
Although the code generation for increment (++, --) through a pointer to volatile has improved greatly over the past 15 years, there is a case in which the address calculation is needlessly done separately instead of by the x86 increment instruction itself.  Here is some example code:

struct task {
    int other;
    int rcu_count;
};

struct task *current;

void rcu_read_lock()
{
    (*(volatile int*)&current->rcu_count)++;
}

As can be seen in godbolt.org (https://godbolt.org/z/fGze8E), the address calculation is split by GCC. The shorter code sequence generated by clang/LLVM is preferable.

Fixing this would allow the Linux kernel to use safer code sequences for certain fastpaths, in this example, rcu_read_lock() and rcu_read_unlock() for kernels built with CONFIG_PREEMPT=y.
Comment 1 Paul McKenney 2020-07-27 01:56:18 UTC
This manifests on GCC trunk (see the godbolt.org URL), but was first noted in gcc version 7.5.0.  This is specific to x86, but might apply to any other architecture that provides increment-memory instructions.  This behavior does not seem to be affected by GCC options.

This can be reproduced by placing the sample code in a file "rrl.c" and running:

cc -o rrl rrl.c

This completes successfully with no error or warnings.

Running "cc -o rrl rrl.c --save-temps" generates the following file:

# 1 "rrl.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 31 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<command-line>" 2
# 1 "rrl.c"
struct task {
    int other;
    int rcu_count;
};

struct task *current;

void rcu_read_lock()
{
    (*(volatile int*)&current->rcu_count)++;
}
Comment 2 Andrew Pinski 2020-07-27 02:05:30 UTC
This has been reported many times before.  volatile does not mean it will be done atomically or even without load/stores.
Comment 3 Andrew Pinski 2020-07-27 02:06:28 UTC
Dup of bug 3506.

*** This bug has been marked as a duplicate of bug 3506 ***
Comment 4 Paul McKenney 2020-07-27 03:26:13 UTC
Bug 3506 has since been fixed, at least for the example shown in this bug report, as you can see if you look at the godbolt, which shows that both compilers generate a single addl instruction, which is exactly what the submitter of 3506 requested.

This bug is different, instead asking that the calculation of the address of the volatile object not be split into multiple instructions.
Comment 5 Marc Glisse 2020-07-30 19:03:02 UTC
I don't think bug 3506 has been fixed (its status seems wrong to me). But don't worry, there are several other duplicates that still have status NEW (bug 50677 for instance).
This is a sensible enhancement request, I think some gcc backends already do a similar optimization, it simply isn't a priority, because volatile almost means "don't optimize this".
At least the difference between the gcc and clang codes matches those other PRs. Not sure why you are talking of address computations.
Comment 6 Paul McKenney 2020-07-30 19:23:53 UTC
(In reply to Marc Glisse from comment #5)
> I don't think bug 3506 has been fixed (its status seems wrong to me). But
> don't worry, there are several other duplicates that still have status NEW
> (bug 50677 for instance).
> This is a sensible enhancement request, I think some gcc backends already do
> a similar optimization, it simply isn't a priority, because volatile almost
> means "don't optimize this".
> At least the difference between the gcc and clang codes matches those other
> PRs. Not sure why you are talking of address computations.

Probably because I was confused by the addressing mode and by wishful thinking, and yes, you are quite correct.

Anyway, if you look at https://godbolt.org/z/fGze8E, you can see that Clang/LLVM is using a to-memory addl rather than loading, adding, and storing.