ARM NEON Intrinsics guide?

Jeffrey Walton
Wed May 11 09:39:00 GMT 2016

> As has been mentioned on this thread already,
> is a list of the intrinsics and how they map down to NEON instructions,
> though it's
> more of a reference rather than a user guide.
> If you can isolate a standalone example where GCC NEON intrinsics perform
> poorly, can you
> please file a bug report with the testcase.

I hope to get something together shortly.

Here's one of the pain points:

   int64x2_t c = vcombine_s64(vget_high_s64(a),vget_low_s64(b));

I'm testing alternatives at the moment... It looks like lane
extraction and insertion produce better code under GCC. That seems to
limit GCC's desire to spill out into R registers.
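
For reference, a minimal sketch of what the lane-based alternative might
look like next to the vcombine_s64 form above. The function names are made
up for illustration, and the #else branch is only a scalar stand-in so the
snippet builds on a non-ARM host; it is not part of arm_neon.h.
vextq_s64(a, b, 1) is included because it selects the same two lanes.
Which form GCC compiles best is not claimed here.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#else
/* Scalar stand-ins so the sketch builds off-target. They only mimic the
   lane semantics of the real intrinsics and are NOT part of arm_neon.h. */
typedef struct { int64_t v[1]; } int64x1_t;
typedef struct { int64_t v[2]; } int64x2_t;
static inline int64x2_t vld1q_s64(const int64_t *p) { int64x2_t r = {{p[0], p[1]}}; return r; }
static inline int64x1_t vget_low_s64(int64x2_t a) { int64x1_t r = {{a.v[0]}}; return r; }
static inline int64x1_t vget_high_s64(int64x2_t a) { int64x1_t r = {{a.v[1]}}; return r; }
static inline int64x2_t vcombine_s64(int64x1_t lo, int64x1_t hi) { int64x2_t r = {{lo.v[0], hi.v[0]}}; return r; }
static inline int64_t vgetq_lane_s64(int64x2_t a, int lane) { return a.v[lane]; }
static inline int64x2_t vsetq_lane_s64(int64_t x, int64x2_t a, int lane) { a.v[lane] = x; return a; }
static inline int64x2_t vdupq_n_s64(int64_t x) { int64x2_t r = {{x, x}}; return r; }
static inline int64x2_t vextq_s64(int64x2_t a, int64x2_t b, int n) {
    int64x2_t r = {{n ? a.v[1] : a.v[0], n ? b.v[0] : a.v[1]}};
    return r;
}
#endif

/* The form from the message: the result is { a[1], b[0] }. */
static inline int64x2_t combine_high_low(int64x2_t a, int64x2_t b)
{
    return vcombine_s64(vget_high_s64(a), vget_low_s64(b));
}

/* Lane extraction and insertion: broadcast a[1], then put b[0] in lane 1. */
static inline int64x2_t combine_by_lanes(int64x2_t a, int64x2_t b)
{
    int64x2_t c = vdupq_n_s64(vgetq_lane_s64(a, 1));
    return vsetq_lane_s64(vgetq_lane_s64(b, 0), c, 1);
}

/* VEXT with index 1 takes the top lane of a and the bottom lane of b. */
static inline int64x2_t combine_by_ext(int64x2_t a, int64x2_t b)
{
    return vextq_s64(a, b, 1);
}
```

All three return the same vector, so they can be swapped in and out while
comparing the generated assembly (for example with -S).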

> As an aside, I notice your command-line options are sub-optimal.
> If you're targeting a Cortex-A7 you want to use -mfpu=neon-vfpv4 rather
> than just -mfpu=neon.
> This will give you access to the vfma instructions.
> Whereas if you're targeting ARMv8-A on a Cortex-A53 you'll want to use
> -mfpu=neon-fp-armv8
> to enable the ARMv8 floating-point and NEON instructions.

Thanks, this is the sort of thing I was looking for: higher level prescriptions.
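
One way to confirm that a given -mfpu choice took effect is to look at the
predefined feature macros. The sketch below assumes a compiler that
implements the ACLE macros (__ARM_NEON, __ARM_FEATURE_FMA, __ARM_ARCH);
older toolchains may not define all of them, and the helper names are made
up for illustration.

```c
#include <stdio.h>

/* 1 when the compiler reports NEON for this translation unit, else 0. */
static int have_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* 1 when fused multiply-add (the vfma family) is reported as available. */
static int have_fma(void)
{
#if defined(__ARM_FEATURE_FMA)
    return 1;
#else
    return 0;
#endif
}

/* Architecture level reported by the compiler, or 0 when not an ARM target. */
static int arm_arch(void)
{
#if defined(__ARM_ARCH)
    return __ARM_ARCH;
#else
    return 0;
#endif
}

/* Print what the current command-line options actually enabled. */
void report_arm_features(void)
{
    printf("NEON: %d  FMA: %d  ARM arch: %d\n",
           have_neon(), have_fma(), arm_arch());
}
```

Building the same file once with -mfpu=neon and once with -mfpu=neon-vfpv4
and comparing the FMA field should show whether the vfma instructions are
enabled.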

I'm also looking for something on creating new vectors on the fly from
scattered data. vcombine_s64 is a pain point under this data set, and
the suggestions here don't apply:
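
To make the scattered-data case concrete, here is a minimal sketch of two
ways to build an int64x2_t from two unrelated addresses: the vcombine_s64
form, and a variant that loads directly into lanes. The function names are
hypothetical, the #else branch is only a scalar stand-in for non-ARM hosts
(not part of arm_neon.h), and neither form is claimed to be the faster one.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#else
/* Scalar stand-ins so the sketch builds off-target. They only mimic the
   lane semantics of the real intrinsics and are NOT part of arm_neon.h. */
typedef struct { int64_t v[1]; } int64x1_t;
typedef struct { int64_t v[2]; } int64x2_t;
static inline int64x1_t vld1_s64(const int64_t *p) { int64x1_t r = {{p[0]}}; return r; }
static inline int64x2_t vcombine_s64(int64x1_t lo, int64x1_t hi) { int64x2_t r = {{lo.v[0], hi.v[0]}}; return r; }
static inline int64x2_t vld1q_dup_s64(const int64_t *p) { int64x2_t r = {{p[0], p[0]}}; return r; }
static inline int64x2_t vld1q_lane_s64(const int64_t *p, int64x2_t a, int lane) { a.v[lane] = p[0]; return a; }
static inline int64_t vgetq_lane_s64(int64x2_t a, int lane) { return a.v[lane]; }
#endif

/* Two 64-bit loads joined with vcombine_s64: result is { *p0, *p1 }. */
static inline int64x2_t gather2_combine(const int64_t *p0, const int64_t *p1)
{
    return vcombine_s64(vld1_s64(p0), vld1_s64(p1));
}

/* Lane loads: fill both lanes from p0, then overwrite lane 1 from p1. */
static inline int64x2_t gather2_lanes(const int64_t *p0, const int64_t *p1)
{
    int64x2_t v = vld1q_dup_s64(p0);
    return vld1q_lane_s64(p1, v, 1);
}
```

Both return the same vector, so either can be dropped into a test case when
comparing what GCC generates.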

