[PATCH] RISC-V: Enable overlap-by-pieces in case of fast unaliged access

Palmer Dabbelt palmer@dabbelt.com
Mon Aug 16 16:15:09 GMT 2021


On Mon, 16 Aug 2021 03:02:42 PDT (-0700), Kito Cheng wrote:
> HI Christoph:
>
> Could you submit v3 patch which is v1 with overlap_op_by_pieces field,
> testcase from v2 and add a few more comments to describe the field?
>
> And add an -mtune=ultra-size to make it able to test without change
> other behavior?
>
> Hi Palmer:
>
> Are you OK with that?

I'm still not convinced on the performance: like Andrew and I pointed 
out, this is a difficult case for pipelines of this flavor to handle.  
Nobody here knows anything about this pipeline deeply enough to say 
anything difinitive, though, so this is really just a guess.

As I'm not convinced this is an obvious performance win I'm not going to 
merge it without a benchmark.  If you're convinced and want to merge it 
that's fine, I don't really care about the performance fo the C906 and 
if someone complains we can always just revert it later.

> On Sat, Aug 14, 2021 at 1:54 AM Christoph Müllner via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Ping.
>>
>> On Thu, Aug 5, 2021 at 11:11 AM Christoph Müllner <cmuellner@gcc.gnu.org> wrote:
>> >
>> > Ping.
>> >
>> > On Thu, Jul 29, 2021 at 9:36 PM Christoph Müllner <cmuellner@gcc.gnu.org> wrote:
>> > >
>> > > On Thu, Jul 29, 2021 at 8:54 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>> > > >
>> > > > On Tue, 27 Jul 2021 02:32:12 PDT (-0700), cmuellner@gcc.gnu.org wrote:
>> > > > > Ok, so if I understand correctly Palmer and Andrew prefer
>> > > > > overlap_op_by_pieces to be controlled
>> > > > > by its own field in the riscv_tune_param struct and not by the field
>> > > > > slow_unaligned_access in this struct
>> > > > > (i.e. slow_unaligned_access==false is not enough to imply
>> > > > > overlap_op_by_pieces==true).
>> > > >
>> > > > I guess, but I'm not really worried about this at that level of detail
>> > > > right now.  It's not like the tune structures form any sort of external
>> > > > interface we have to keep stable, we can do whatever we want with those
>> > > > fields so I'd just aim for encoding the desired behavior as simply as
>> > > > possible rather than trying to build something extensible.
>> > > >
>> > > > There are really two questions we need to answer: is this code actually
>> > > > faster for the C906, and is this what the average users wants under -Os.
>> > >
>> > > I never mentioned -Os.
>> > > My main goal is code compiled for -O2, -O3 or even -Ofast.
>> > > And I want to execute code as fast as possible.
>> > >
>> > > Loading hot data from cache is faster when being done by a single
>> > > load-word instruction than 4 load-byte instructions.
>> > > Less instructions implies less pressure for the instruction cache.
>> > > Less instructions implies less work for a CPU pipeline.
>> > > Architectures, which don't have a penalty for unaligned accesses
>> > > therefore observe a performance benefit.
>> > >
>> > > What I understand from Andrew's email is that it is not that simple
>> > > and implementation might have a penalty for overlapping accesses
>> > > that is high enough to avoid them. I don't have the details for C906,
>> > > so I can't say if that's the case.
>> > >
>> > > > That first one is pretty easy: just running those simple code sequences
>> > > > under a sweep of page offsets should be sufficient to determine if this
>> > > > is always faster (in which case it's an easy yes), if it's always slower
>> > > > (an easy no), or if there's some slow cases like page/cache line
>> > > > crossing (in which case we'd need to think a bit).
>> > > >
>> > > > The second one is a bit tricker.  In the past we'd said these sort of
>> > > > "actively misalign accesses to generate smaller code" sort of thing
>> > > > isn't suitable for -Os (as most machines still have very slow unaligned
>> > > > accesses) but is suitable for -Oz (don't remember if that ever ended up
>> > > > in GCC, though).  That still seems like a reasonable decision, but if it
>> > > > turns out that implementations with fast unaligned accesses become the
>> > > > norm then it'd probably be worth revisiting it.  Not sure exactly how to
>> > > > determine that tipping point, but I think we're a long way away from it
>> > > > right now.
>> > > >
>> > > > IMO it's really just premature to try and design an encoding of the
>> > > > tuning paramaters until we have an idea of what they are, as we'll just
>> > > > end up devolving down the path of trying to encode all possible hardware
>> > > > and that's generally a huge waste of time.  Since there's no ABI here we
>> > > > can refactor this however we want as new tunings show up.
>> > >
>> > > I guess you mean that there needs to be a clear benefit for a supported
>> > > machine in GCC. Either obviously (see below), by measurement results,
>> > > or by decision
>> > > of the machine's maintainer (especially if the decision is a trade-off).
>> > >
>> > > >
>> > > > > I don't have access to pipeline details that give proof that there are cases
>> > > > > where this patch causes a performance penalty.
>> > > > >
>> > > > > So, I leave this here as a summary for someone who has enough information and
>> > > > > interest to move this forward:
>> > > > > * the original patch should be sufficient, but does not have tests:
>> > > > >   https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575791.html
>> > > > > * the tests can be taken from this patch:
>> > > > >   https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575864.html
>> > > > >   Note, that there is a duplicated "sw" in builtins-overlap-6.c, which
>> > > > > should be a "sd".
>> > > > >
>> > > > > Thanks for the feedback!
>> > > >
>> > > > Cool.  Looks like the C906 is starting to show up in the real world, so
>> > > > we should be able to find someone who has access to one and cares enough
>> > > > to at least run some simple benchamrks of these code sequences.  IMO
>> > > > that's a pretty low interest bar, so I don't see any harm in waiting --
>> > > > when the hardware is common then I'm sure someone will care enough to
>> > > > give this a shot, and until then it's not really impacting anyone either
>> > > > way.
>> > > >
>> > > > The -Os thing is a bigger discussion, and while I'm happy to have it I
>> > > > don't really think we're even close to these being common enough yet.  I
>> > > > saw your memmove patch and think the same rationale might apply there,
>> > > > but I haven't looked closely and won't have time to for a bit as I've
>> > > > got to get around to the other projects.
>> > >
>> > > The cpymemsi patch is also targeting -O2 or higher for fast code execution.
>> > > And it is one of the cases where there is an obvious performance benefit
>> > > for all machines that have slow_unaligned_access==false.
>> > >
>> > > At the moment the cpymemsi expansion for RISC-V is implemented as if
>> > > there is no machine with slow_unaligned_access==false.
>> > > And in fact there is a machine already in GCC mainline with this property: C906.
>> > >
>> > > Machines that can do fast unaligned accesses should not be wasting their
>> > > cycles with load-store-pairs of bytes, if they can do load-store pairs of words.
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > >
>> > > > > On Tue, Jul 27, 2021 at 3:48 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>> > > > >>
>> > > > >> On Mon, 26 Jul 2021 03:05:21 PDT (-0700), Andrew Waterman wrote:
>> > > > >> > On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>> > > > >> >>
>> > > > >> >> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
>> > > > >> >> > Could you add a testcase? Otherwise LGTM.
>> > > > >> >> >
>> > > > >> >> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
>> > > > >> >> > void foo(char *dst){
>> > > > >> >> >    __builtin_memset(dst, 0, 15);
>> > > > >> >> > }
>> > > > >> >>
>> > > > >> >> I'd like to see:
>> > > > >> >>
>> > > > >> >> * Test results.  This is only on for one target right now, so relying on
>> > > > >> >>   it to just work on others isn't a good idea.
>> > > > >> >> * Something to demonstrate this doesn't break -mstrict-align.
>> > > > >> >> * Some sort of performance analysis.  Most machines that support
>> > > > >> >>   unaligned access do so with some performance degredation,
>> > > > >> >
>> > > > >> > Also, some machines that gracefully support misaligned accesses under
>> > > > >> > most circumstances nevertheless experience a perf degradation when the
>> > > > >> > load depends on two stores that overlap partially but not fully.  This
>> > > > >> > transformation will obviously trigger such behavior from time to time.
>> > > > >>
>> > > > >> Ya, I thought I wrote a response to this but I guess it's just in a
>> > > > >> buffer somewhere.  The code sequences this is generating are really the
>> > > > >> worst case for unaligned stores: one of them is always guaranteed to be
>> > > > >> misaligned, and it partially overlaps with a store one cycle away.
>> > > > >>
>> > > > >> We're really only saving a handful of instructions at best here, so
>> > > > >> there's not much room for error when it comes to these sorts of things.
>> > > > >> Even if this difficult case is handled fast we're only talking about
>> > > > >> saving 2 cycles, which is pretty borderline as the single-issue in-order
>> > > > >> machines I've worked with that do support misaligned accesses tend to
>> > > > >> take at least a few cycles of performance hit on misaligned accesses.
>> > > > >> Even if misaligned accesses are single cycle, some back of the envelope
>> > > > >> calculations says that a pipeline flush when crossing a cache line would
>> > > > >> definitely push this negative and one per page crossing would be in the
>> > > > >> margins (depending on how we assume the original accesses are aligned).
>> > > > >>
>> > > > >> This is way too subtle of a thing to analyze without at least some
>> > > > >> knowledge of how this pipeline works, whether that's from a benchmark or
>> > > > >> just a pipeline description.
>> > > > >>
>> > > > >> > Note, I'm not objecting to this patch; I'm only responding to Palmer's
>> > > > >> > comment.  It makes sense to enable this kind of optimization for
>> > > > >> > -mtune=native etc., just not for standard software distributions.
>> > > > >>
>> > > > >> IMO there are really two cases here: -mtune=c906 and -Os (-mtune=native
>> > > > >> is sort of a red herring, it's just syntax).  For -mtune=c906 I'm happy
>> > > > >> enabling this as long as it's actually correct and improves performance,
>> > > > >> but that'll need at least some hardware-oriented analysis -- whether
>> > > > >> it's a benchmark or just a confirmation that these sequences are
>> > > > >> actually expected to run fast.
>> > > > >>
>> > > > >> -Os is a different case, though.  Last time this came up we decided that
>> > > > >> -Os shouldn't purposefully misalign accesses, even when that saves code
>> > > > >> size, as that's too likely to hit pathological performance cases.  I
>> > > > >> don't know if the weights have changed here: the C906 is currently by
>> > > > >> far the cheapest chip (which likely means it's going to be the most
>> > > > >> popular), but it's unclear as to whether it's even RISC-V compliant and
>> > > > >> I don't know of many people who've actually gotten one.  IMO it's too
>> > > > >> early to flip -Os over to this behavior, we at least need to know that
>> > > > >> this chip is going to be sufficiently RISC-V-ey that standard software
>> > > > >> will even run on it much less be popular.
>> > > > >>
>> > > > >> >
>> > > > >> >
>> > > > >> >>   it's unclear
>> > > > >> >>   that this conversion will actually manifst a performance improvement.
>> > > > >> >>   I don't have a C906 and don't know what workloads people care about
>> > > > >> >>   running on one, but I'm certainly not convinced this is a win --
>> > > > >> >>   what's listed here looks to be the best case, and that's only saving
>> > > > >> >>   two instructions to generate a pretty pathological sequence
>> > > > >> >>   (misaligned access that conflicts with a prior store).
>> > > > >>
>> > > > >> Ah, I guess there it was ;)
>> > > > >>
>> > > > >> >>
>> > > > >> >> Jojo: do you have any description of the C906 pipeline?  Specifically in
>> > > > >> >> this case it'd be good to know how it handles unaligned accesses.
>> > > > >> >>
>> > > > >> >> >
>> > > > >> >> > On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
>> > > > >> >> > <gcc-patches@gcc.gnu.org> wrote:
>> > > > >> >> >>
>> > > > >> >> >> This patch enables the overlap-by-pieces feature of the by-pieces
>> > > > >> >> >> infrastructure for inlining builtins in case the target has set
>> > > > >> >> >> riscv_slow_unaligned_access_p to false.
>> > > > >> >> >>
>> > > > >> >> >> To demonstrate the effect for targets with fast unaligned access,
>> > > > >> >> >> the following code sequences are generated for a 15-byte memset-zero.
>> > > > >> >> >>
>> > > > >> >> >> Without overlap_op_by_pieces we get:
>> > > > >> >> >>   8e:   00053023                sd      zero,0(a0)
>> > > > >> >> >>   92:   00052423                sw      zero,8(a0)
>> > > > >> >> >>   96:   00051623                sh      zero,12(a0)
>> > > > >> >> >>   9a:   00050723                sb      zero,14(a0)
>> > > > >> >> >>
>> > > > >> >> >> With overlap_op_by_pieces we get:
>> > > > >> >> >>   7e:   00053023                sd      zero,0(a0)
>> > > > >> >> >>   82:   000533a3                sd      zero,7(a0)
>> > > > >> >> >>
>> > > > >> >> >> gcc/ChangeLog:
>> > > > >> >> >>
>> > > > >> >> >>         * config/riscv/riscv.c (riscv_overlap_op_by_pieces): New function.
>> > > > >> >> >>         (TARGET_OVERLAP_OP_BY_PIECES_P): Connect to
>> > > > >> >> >>         riscv_overlap_op_by_pieces.
>> > > > >> >> >>
>> > > > >> >> >> Signed-off-by: Christoph Muellner <cmuellner@gcc.gnu.org>
>> > > > >> >> >> ---
>> > > > >> >> >>  gcc/config/riscv/riscv.c | 11 +++++++++++
>> > > > >> >> >>  1 file changed, 11 insertions(+)
>> > > > >> >> >>
>> > > > >> >> >> diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
>> > > > >> >> >> index 576960bb37c..98c76ba657a 100644
>> > > > >> >> >> --- a/gcc/config/riscv/riscv.c
>> > > > >> >> >> +++ b/gcc/config/riscv/riscv.c
>> > > > >> >> >> @@ -5201,6 +5201,14 @@ riscv_slow_unaligned_access (machine_mode, unsigned int)
>> > > > >> >> >>    return riscv_slow_unaligned_access_p;
>> > > > >> >> >>  }
>> > > > >> >> >>
>> > > > >> >> >> +/* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */
>> > > > >> >> >> +
>> > > > >> >> >> +static bool
>> > > > >> >> >> +riscv_overlap_op_by_pieces (void)
>> > > > >> >> >> +{
>> > > > >> >> >> +  return !riscv_slow_unaligned_access_p;
>> > > > >> >> >> +}
>> > > > >> >> >> +
>> > > > >> >> >>  /* Implement TARGET_CAN_CHANGE_MODE_CLASS.  */
>> > > > >> >> >>
>> > > > >> >> >>  static bool
>> > > > >> >> >> @@ -5525,6 +5533,9 @@ riscv_asan_shadow_offset (void)
>> > > > >> >> >>  #undef TARGET_SLOW_UNALIGNED_ACCESS
>> > > > >> >> >>  #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
>> > > > >> >> >>
>> > > > >> >> >> +#undef TARGET_OVERLAP_OP_BY_PIECES_P
>> > > > >> >> >> +#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
>> > > > >> >> >> +
>> > > > >> >> >>  #undef TARGET_SECONDARY_MEMORY_NEEDED
>> > > > >> >> >>  #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed
>> > > > >> >> >>
>> > > > >> >> >> --
>> > > > >> >> >> 2.31.1
>> > > > >> >> >>


More information about the Gcc-patches mailing list