This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug target/43364] Suboptimal code for the use of ARM NEON intrinsic "vset_lane_f32"



------- Comment #3 from siarhei dot siamashka at gmail dot com  2010-06-15 20:14 -------
The whole point of submitting this PR was to find an efficient way to use NEON
instructions on arbitrary scalar floating point values, in order to work around
the inherent slowness of Cortex-A8 VFP Lite (and perhaps make this transparent
by wrapping it in a C++ class with operator overloading).

Using 'vdup_n_f32' to load a single floating point value seems better than
'vset_lane_f32' here, because we don't have to deal with the uninitialized part
of the register. But 'vdup_n_f32' suffers from similar performance issues (the
VLD1 instruction is not used directly), so redundant instructions are emitted
when the value is loaded from memory. Ideally, something like this would be
used instead of 'vdup_n_f32' in that case:

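/* Load one float from memory and replicate it into both lanes of a D register. */
static inline float32x2_t vdup_n_f32_mem(const float *p)
{
    float32x2_t result;
    /* %P0 prints the D register name; the "memory" clobber makes the
       compiler complete earlier stores before this load. */
    asm ("vld1.f32 {%P0[]}, [%1, :32]" : "=w" (result) : "r" (p) : "memory");
    return result;
}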

I wonder whether it is possible to check at compile time whether the operand
comes from memory or from a register, with something similar to the
'__builtin_constant_p' builtin function? Or to use the multiple-alternatives
feature of inline assembly constraints to emit either VMOV or VLD1? Anything
else?
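
For reference, this is roughly the kind of compile-time dispatch that
'__builtin_constant_p' already allows for constants. It is only a sketch in
portable GNU C: the 'f32x2' struct is a hypothetical stand-in for float32x2_t
so that it builds on any target, and 'dup_const', 'dup_runtime', 'DUP_F32' and
'demo' are made-up names. It does not tell whether an operand lives in memory,
which is exactly the missing piece asked about above.

```c
#include <assert.h>

/* Hypothetical stand-in for float32x2_t; from_const records the chosen path. */
typedef struct { float lane[2]; int from_const; } f32x2;

/* Path taken when the argument is a compile-time constant. */
static inline f32x2 dup_const(float v)
{
    f32x2 r = { { v, v }, 1 };
    return r;
}

/* Path taken for runtime values; a real version would emit VDUP or VLD1. */
static inline f32x2 dup_runtime(float v)
{
    f32x2 r = { { v, v }, 0 };
    return r;
}

/* Select the path at compile time with the GCC/Clang builtin. */
#define DUP_F32(x) \
    (__builtin_constant_p(x) ? dup_const(x) : dup_runtime(x))

/* Small usage example exercising both paths. */
static void demo(void)
{
    volatile float src = 2.0f;  /* volatile: value is unknown to the compiler */
    float v = src;

    f32x2 a = DUP_F32(1.5f);    /* literal, so the constant path is chosen */
    f32x2 b = DUP_F32(v);       /* runtime value, so the generic path is chosen */

    assert(a.from_const == 1 && a.lane[0] == 1.5f && a.lane[1] == 1.5f);
    assert(b.from_const == 0 && b.lane[0] == 2.0f && b.lane[1] == 2.0f);
}
```

An equivalent test for "operand is in memory" would need a third branch in
such a macro, selecting the VLD1 variant.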


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43364

