This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: [isocpp-parallel] Proposal for new memory_order_consume definition
- From: Linus Torvalds <torvalds@linux-foundation.org>
- To: Michael Matz <matz@suse.de>
- Cc: Markus Trippelsdorf <markus@trippelsdorf.de>, Paul McKenney <paulmck@linux.vnet.ibm.com>, linux-arch@vger.kernel.org, gcc@gcc.gnu.org, parallel@lists.isocpp.org, llvm-dev@lists.llvm.org, Will Deacon <will.deacon@arm.com>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, David Howells <dhowells@redhat.com>, Peter Zijlstra <peterz@infradead.org>, Ramana Radhakrishnan <Ramana.Radhakrishnan@arm.com>, Luc Maranget <luc.maranget@inria.fr>, Andrew Morton <akpm@linux-foundation.org>, Jade Alglave <j.alglave@ucl.ac.uk>, Ingo Molnar <mingo@kernel.org>
- Date: Mon, 29 Feb 2016 09:57:13 -0800
- Subject: Re: [isocpp-parallel] Proposal for new memory_order_consume definition
On Mon, Feb 29, 2016 at 9:37 AM, Michael Matz <matz@suse.de> wrote:
> The important part is with induction variables controlling loops:
> short i; for (i = start; i < end; i++)
> unsigned short u; for (u = start; u < end; u++)
> For the former you're allowed to assume that the loop will terminate, and
> that its iteration count is easily computable. For the latter you get
> modulo arithmetic and (if start/end are of larger type than u, say 'int')
> it might not even terminate at all. That has direct consequences for the
> vectorizability of such loops (or the profitability of such a transformation)
> and hence quite important performance implications in practice.
Stop bullshitting me.
It would generally force the compiler to add a few extra checks when
you do vectorize (or, more generally, do any kind of loop unrolling),
and yes, it would make things slightly more painful. You might, for
example, need to add code to handle the wraparound and have a more
complex non-unrolled head/tail version for that case.
In theory you could do a whole "restart the unrolled loop around the
index wraparound" if you actually cared about the performance of such
a case - but since nobody would ever care about that, it's more likely
that you'd just do it with a non-unrolled fallback (which would likely
be identical to the tail fixup).
It would be painful, yes.
But it wouldn't be fundamentally hard, nor would it fundamentally hurt actual performance.
It would be _inconvenient_ for compiler writers, and the bad ones
would argue vehemently against it.
.. and that's how a "go fast" mode would be implemented by a compiler
writer: initially as a compiler option, for those HPC people. Then you
have a use case and implementation example, and can go to the
standards body and say "look, we have people who use this already, it
breaks almost no code, and it makes our compiler able to generate much
faster code".
But that's not what happened. Instead, the standard was written to be
good for compiler writers, not actual users.
Of course, in real life HPC performance is often more about doing the
cache blocking etc, and I've seen people move to more parameterized
languages rather than C to get the best performance. Generate the code
from a much higher-level description, be able to do a much better job,
and leave C to do the low-level work.
But no. Instead the C compiler people still argue for bad features
that were a misdesign and a wart on the language.
At the very least it should have been left as a "go unsafe, go fast"
option, and standardize *that*, instead of screwing everybody else over.
The HPC people end up often using those anyway, because it turns out
that they'll happily get rid of proper rounding etc if it buys them a
couple of percent on their workload. Things like "I really want you
to generate multiply-accumulate instructions because I don't mind
having intermediates with higher precision" etc.