This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug target/46091] missed optimization: x86 bt/btc/bts instructions
- From: "ubizjak at gmail dot com" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Mon, 14 Aug 2017 17:25:07 +0000
- Subject: [Bug target/46091] missed optimization: x86 bt/btc/bts instructions
- Auto-submitted: auto-generated
- References: <bug-46091-4@http.gcc.gnu.org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46091
--- Comment #10 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Avi Kivity from comment #9)
> I believe the comment is wrong. Here's what the manual says:
>
> "This instruction can be used with a LOCK prefix to allow the instruction to
> be executed atomically."
>
> Implying that without the LOCK prefix, it is not atomic. XCHG is the only
> instruction that asserts LOCK implicitly.
>
> Agner lists BTC reciprocal throughput as 1 for imm, mem case and 5 for reg,
> mem. The latter is slow, but perhaps still worthwhile as a replacement for
> the code in the first comment (but probably not when addressing a single
> word).
BTC/BTR/BTS with a memory operand (RMW) are indeed slower, but so are other
logic instructions with a memory destination. The following testcase:
--cut here--
extern unsigned long long a;
void
test (void)
{
  a &= ~(1ull << 55);
}
--cut here--
should generate an RMW BTR instruction.
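
For illustration, here is a minimal sketch of the three source-level patterns
that correspond to BTR, BTS and BTC on a 64-bit value. The function names are
hypothetical and not part of the bug report, and nothing here claims that a
given GCC version emits these instructions for them. Bit 55 is presumably
chosen in the testcase because the mask ~(1ull << 55) does not fit in a
sign-extended 32-bit immediate, so a plain AND with a memory destination
cannot encode it directly.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Clear bit 55: the operation a BTR with immediate 55 performs. */
uint64_t clear_bit55(uint64_t x)
{
  return x & ~(1ull << 55);
}

/* Set bit 55: the operation a BTS with immediate 55 performs. */
uint64_t set_bit55(uint64_t x)
{
  return x | (1ull << 55);
}

/* Toggle bit 55: the operation a BTC with immediate 55 performs. */
uint64_t toggle_bit55(uint64_t x)
{
  return x ^ (1ull << 55);
}
```

Compiling such a file with -O2 -S and inspecting the assembly is one way to
check what a particular compiler version actually generates.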
I'll look into this some more. However, these insns should be rare, so do
not expect any noticeable application speed-up ...
> Note there is also the BT instruction (with reciprocal throughput of 0.5!)
Yes, we already emit this.
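
For reference, the kind of source pattern a BT instruction can implement is a
test of a single, possibly variable, bit. A minimal sketch follows; the
function name is hypothetical, and which exact patterns GCC matches is not
stated in this thread.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if bit n of x is set, 0 otherwise.
   The index is masked to 0..63 so the shift is always well defined.  */
int test_bit(uint64_t x, unsigned n)
{
  return (int) ((x >> (n & 63)) & 1);
}
```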