Bug 47895 - usage of __attribute__ ((__target__ ("xyz"))) with builtins
Status: UNCONFIRMED
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.6.0
Importance: P3 normal
Target Milestone: ---
Assigned To: Not yet assigned to anyone
URL:
Depends on:
Blocks:
Reported: 2011-02-25 13:32 UTC by vincenzo Innocente
Modified: 2011-02-26 09:55 UTC (History)
CC List: 0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description vincenzo Innocente 2011-02-25 13:32:51 UTC
I would like to generate code for multiple targets from the same source when using builtins.
I think this issue has been discussed before, for instance in
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39840


I have code, as in the example below, that compiles only with -mavx.
In that case the compiler uses AVX instructions for all functions, including the one targeted at sse3,
whereas I would like to obtain an object file that I can run on multiple platforms.
This problem occurs only when builtins are used: for standard C code, instructions are emitted correctly according to each function's target, provided that the minimal -m option is used.

Is there any preprocessor flag to "activate" all intrinsics and builtins in x86intrin.h?

-----------------------------
example

#include <x86intrin.h>

float __attribute__ ((__target__ ("sse3"))) sum3(float const * __restrict__ x, float const * __restrict__ y, float const * __restrict__ z) {
  __m128 sum = _mm_setzero_ps();
  for (int i=0; i!=1024; i+=4)
    sum += _mm_add_ps(_mm_loadu_ps(z+i),
                      _mm_mul_ps(_mm_loadu_ps(x+i),_mm_loadu_ps(y+i)) );
  sum = _mm_hadd_ps(sum,sum);
  sum = _mm_hadd_ps(sum,sum);
  float ret;
  _mm_store_ss(&ret,sum);
  return ret;
}

float __attribute__ ((__target__ ("avx"))) sumv(float const * __restrict__ x, float const * __restrict__ y, float const * __restrict__ z) {
  __m256 sum = _mm256_setzero_ps();
  for (int i=0; i!=1024; i+=8)
    sum += _mm256_add_ps(_mm256_loadu_ps(z+i),
                         _mm256_mul_ps(_mm256_loadu_ps(x+i),_mm256_loadu_ps(y+i)) );
  sum = _mm256_hadd_ps(sum,sum);
  sum = _mm256_hadd_ps(sum,sum);
  sum = _mm256_hadd_ps(sum,sum);
  float ret[8];
  _mm256_store_ps(ret,sum);
  return ret[0];
}
Comment 1 Richard Biener 2011-02-25 14:44:27 UTC
A much easier and more portable approach is to split your source into multiple
compilation units and compile each with the appropriate flags.
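
One possible shape for this, as a minimal sketch and not something prescribed in this report: keep a single kernel source and build it once per instruction set, passing the function name on the command line. The file name sum_kernel.c, the KERNEL_NAME macro, the function names and the n parameter are all hypothetical. The kernel is written in plain C here, so the -m flag alone decides which instructions may be emitted.

```c
/* sum_kernel.c (hypothetical file name).
 * Built once per instruction set, for example:
 *   gcc -O2 -msse3 -DKERNEL_NAME=sum_sse3 -c sum_kernel.c -o sum_sse3.o
 *   gcc -O2 -mavx  -DKERNEL_NAME=sum_avx  -c sum_kernel.c -o sum_avx.o
 */
#include <stddef.h>

#ifndef KERNEL_NAME
#define KERNEL_NAME sum_generic  /* fallback name when none is given via -D */
#endif

/* Returns the sum over i of z[i] + x[i] * y[i] for n elements. */
float KERNEL_NAME(float const * __restrict__ x,
                  float const * __restrict__ y,
                  float const * __restrict__ z,
                  size_t n)
{
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i)
    sum += z[i] + x[i] * y[i];
  return sum;
}
```

Each object file then exports one differently named function, and the caller picks between them. The build system still has to know about the two compile commands.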
Comment 2 vincenzo Innocente 2011-02-26 09:55:03 UTC
I find that the multiple-file solution shifts the problem to the build system, which is not necessarily easier in every project, and makes maintenance more difficult because more files need to be tracked for each algorithm.
I would much prefer a solution that is fully confined to the source code, without involving the configuration and build system.

In any case, at the moment there is a clear imbalance between plain C code, for which a single compilation unit with multiple functions for different "targets" does work, and code exploiting builtins, for which __attribute__ ((__target__ ("xyz"))) is ineffective.
I consider this behavior a defect.
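
The plain C half of that comparison can be sketched as follows, assuming GCC on x86. The names dot_sse3, dot_avx and dot, and the TARGET and HAVE_AVX macros, are hypothetical. Note also that __builtin_cpu_supports is a GCC builtin that only appeared in 4.8, later than the 4.6.0 this report is filed against, so the runtime dispatch is an illustration and not something available at the time. On other compilers or architectures the macros fall back to no attribute and the baseline path.

```c
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define TARGET(isa) __attribute__ ((__target__ (isa)))
#define HAVE_AVX()  __builtin_cpu_supports("avx")  /* GCC >= 4.8 */
#else
#define TARGET(isa)            /* assumption: no per-function targets here */
#define HAVE_AVX()  0
#endif

/* Plain C body; the compiler may use at most SSE3 for this function. */
TARGET("sse3")
static float dot_sse3(float const *x, float const *y, int n) {
  float s = 0.0f;
  for (int i = 0; i != n; ++i)
    s += x[i] * y[i];
  return s;
}

/* Same plain C body; the compiler may use AVX for this function. */
TARGET("avx")
static float dot_avx(float const *x, float const *y, int n) {
  float s = 0.0f;
  for (int i = 0; i != n; ++i)
    s += x[i] * y[i];
  return s;
}

/* Built without -mavx, so the dispatcher itself runs on any x86 CPU. */
float dot(float const *x, float const *y, int n) {
  return HAVE_AVX() ? dot_avx(x, y, n) : dot_sse3(x, y, n);
}
```

Both variants live in one compilation unit and need no special command-line flags, which is the property being asked for when the bodies use builtins from x86intrin.h instead of plain C.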