Target Specific Optimization
The target specific optimization has several internal stages. These stages can be delivered in different GCC releases. The first two stages are geared towards people who need to build high performance libraries that must span several different underlying architectures, while the third stage is meant to be usable by the majority of programmers, since it will not involve source code modifications to use. While the focus of this work is to allow ix86 programmers to code for various AMD and Intel platforms, other GCC backends will be able to use target specific optimization by adding the appropriate machine dependent parts.
The stages are:
- Compile a single function with specific machine options.
- Compile a single function multiple times with multiple different options.
- Compile functions with multiple different options automatically.
Stage1: Compile single function with specific options
Stage1: Objective of compiling a single function with specific options
- The objective of being able to compile a single function with specific options is to allow the user to control how an individual function in a compilation unit is compiled without having to move the function to a separate source file and specify different options in a Makefile.
This stage is targeted towards power users who are willing to modify their code to achieve the benefits.
- This stage is also needed to provide the infrastructure for stage2 and stage3, which will need the ability to compile a function with specific options. It is expected that relatively few users will make the code modifications needed to use this enhancement. The bulk of the work is adding the ability to change the options used to compile a single function, which makes this stage a convenient stepping stone to stage2.
Stage1: Details of compiling a single function with specific options
Most users are not willing to build their applications multiple times with the appropriate -march=xxx or -mtune=xxx options to achieve the best performance, but instead use generic options like -O2 to build their application. When you have multiple different platforms that implement the same basic instruction set, but have different additional instructions or timing characteristics, you can leave a lot of performance on the table.
This stage is to allow users who write performance critical libraries to code up an individual routine to use special features for a particular platform (for example, using the SSE3 instructions on newer AMD and Intel machines, or using SSE4.1/SSE5 instructions in future processors). A really motivated user can do this today by using different files and different compilation options. This stage would make these special functions easier to code.
A secondary benefit would be for memory limited environments, like embedded environments, where you can compile non-critical functions with -Os instead of -O3 to reduce the size of the code, but not impact the performance of critical routines.
- In this stage, it is a non-goal to provide any automatic means of selecting the appropriate function. It is assumed that the application will select the appropriate function to call.
- Recently within AMD, we discovered one major disadvantage of compiling whole files with special target options: static constructors declared inside a module will get called even though the machine might not support the instructions they were compiled to use. Being more selective about which functions are compiled with special options can help avoid the problem.
- By using these options, it will force the developer to test his/her software on different platforms that have different instructions or instruction timings because different code paths will be used by the compiler. Since this stage is targeted to power users that need to wring the most performance out of their software, it is assumed that they will already be testing their code for different environments.
- One issue is what happens to the preprocessor macros that are set to denote each feature set or specific machine. Ideally, these should be tracked by the preprocessor.
A related problem is the definition of intrinsics that users call to access individual instructions. In the x86 backend, intrinsics are only defined when the compiler starts up, based on the switches passed to the compiler (i.e. if -msse5 is used, the __builtin_ia32_pmacsww intrinsic is defined, but if -msse5 isn't used, the definition won't be entered into the table). For target specific optimization, the x86 port will have to be modified to define all of the intrinsics, and then allow or disallow each intrinsic based on the compilation options used to compile the function. In addition, the common intrinsic files, such as bmmintrin.h, depend on the appropriate macros and low level intrinsics being defined in order to define the intrinsics that are shared among several x86 compilers. One possible solution is a special switch that enables processor specific function support: it would define all of the possible feature test bits while the intrinsics and macros are being defined, and then reset the switches to their default values when compiling code.
- The inliner should be taught to not automatically inline a function that is compiled for a specific machine into a function that is compiled generically (possibly having some sort of escape valve to allow this). A target specific function calling another target specific function for the same options should be inlinable, as well as a target specific function calling a generic function.
I am not sure we need to add the full capability to set all -O, -f, -W, and -m options. One problem with adding these options is that certain optimizations depend on other optimization options; for example, PRE depends on the -O2 optimizations. It is not in the scope of this project to add such support, but once the basic ability is added to compile some functions with different options, this functionality can be added by other people.
In general, we should not add attributes that change the basic ABI of the machine, so there should be no analogs of the -m96-bit-double, -malign-double, -m128-long-double, -mintel-syntax, -mpc, -m32, -m64 x86 switches.
Stage1: Syntax for target specific optimization
There are two alternative methods to specify target specific optimization. Using attributes on a per-function basis is the traditional GCC method for controlling functions, and it works inside of macros. Using pragmas mirrors the way other compilers support such options, and is likely to be more comfortable for the potential audience. Using #pragma also allows preprocessor macros to be adjusted, so that you can include files like bmmintrin.h if you have SSE5 options enabled. If the wider community prefers one method over the other, we can restrict the work proposal to just that method.
Stage1: attribute syntax
I propose we add the following attributes to the compiler:
__attribute__ ((cold)) -- This attribute already exists, and I would propose switching optimization to *-Os* for the function.
__attribute__ ((hot)) -- This attribute already exists, and it may be useful to bump up the optimization level to *-O3* in this function.
__attribute__ ((sse)) -- Equivalent to the ix86 -msse command line option.
__attribute__ ((sse2)) -- Equivalent to the ix86 -msse2 command line option (also implies -msse).
__attribute__ ((sse3)) -- Equivalent to the ix86 -msse3 command line option (also implies -msse2, -msse).
__attribute__ ((ssse3)) -- Equivalent to the ix86 -mssse3 command line option (also implies -msse3, -msse2, -msse).
__attribute__ ((sse4_1)) -- Equivalent to the ix86 -msse4.1 command line option (also implies -mssse3, -msse3, -msse2, -msse).
__attribute__ ((sse4_2)) -- Equivalent to the ix86 -msse4.2 command line option (also implies -msse4.1, -mssse3, -msse3, -msse2, -msse).
__attribute__ ((sse4)) -- Equivalent to the ix86 -msse4 command line option (same as the sse4_2 attribute).
__attribute__ ((sse4a)) -- Equivalent to the ix86 -msse4a command line option (also implies -msse3, -msse2, -msse, -m3dnow).
__attribute__ ((sse5)) -- Equivalent to the ix86 -msse5 command line option (also implies -msse4a, -msse3, -msse2, -msse).
__attribute__ ((cx16)) -- Equivalent to the -mcx16 command line option.
__attribute__ ((popcnt)) -- Equivalent to the -mpopcnt command line option.
__attribute__ ((sahf)) -- Equivalent to the -msahf command line option.
__attribute__ ((recip)) -- Equivalent to the -mrecip command line option.
__attribute__ ((fused_madd)) -- Equivalent to the -mfused-madd command line option.
__attribute__ ((no_sse)) -- Equivalent to the ix86 -mno-sse command line option.
__attribute__ ((no_sse2)) -- Equivalent to the ix86 -mno-sse2 command line option.
__attribute__ ((no_sse3)) -- Equivalent to the ix86 -mno-sse3 command line option.
__attribute__ ((no_ssse3)) -- Equivalent to the ix86 -mno-ssse3 command line option.
__attribute__ ((no_sse4a)) -- Equivalent to the ix86 -mno-sse4a command line option.
__attribute__ ((no_sse4_1)) -- Equivalent to the ix86 -mno-sse4.1 command line option.
__attribute__ ((no_sse4_2)) -- Equivalent to the ix86 -mno-sse4.2 command line option.
__attribute__ ((no_sse4)) -- Equivalent to the ix86 -mno-sse4 command line option.
__attribute__ ((no_sse5)) -- Equivalent to the ix86 -mno-sse5 command line option.
__attribute__ ((no_cx16)) -- Equivalent to the -mno-cx16 command line option.
__attribute__ ((no_popcnt)) -- Equivalent to the -mno-popcnt command line option.
__attribute__ ((no_sahf)) -- Equivalent to the -mno-sahf command line option.
__attribute__ ((no_recip)) -- Equivalent to the -mno-recip command line option.
__attribute__ ((no_fused_madd)) -- Equivalent to the -mno-fused-madd command line option.
Stage1: pragma syntax
All of these pragmas would turn on the equivalent *attribute* support for the succeeding functions until they are reset. If a function has an attribute declared, it would override the *#pragma*:
#pragma GCC cold
#pragma GCC hot
#pragma GCC sse
#pragma GCC sse2
#pragma GCC sse3
#pragma GCC ssse3
#pragma GCC sse4_1
#pragma GCC sse4_2
#pragma GCC sse4
#pragma GCC sse4a
#pragma GCC sse5
#pragma GCC cx16
#pragma GCC popcnt
#pragma GCC sahf
#pragma GCC recip
#pragma GCC fused_madd
#pragma GCC no_sse
#pragma GCC no_sse2
#pragma GCC no_sse3
#pragma GCC no_ssse3
#pragma GCC no_sse4a
#pragma GCC no_sse4_1
#pragma GCC no_sse4_2
#pragma GCC no_sse4
#pragma GCC no_sse5
#pragma GCC no_cx16
#pragma GCC no_popcnt
#pragma GCC no_sahf
#pragma GCC no_recip
#pragma GCC no_fused_madd
The following pragmas will allow header files to turn on/off the options in a global fashion:
#pragma GCC push-options -- Push the current options onto a separate stack so that changes can be undone after the section of code that needs to be compiled with target specific options.
#pragma GCC pop-options -- Pop the current options from the stack created with push-options to restore the previous command line options.
#pragma GCC initial-options -- Reset the current options to those specified via the command line switches.
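The push, pop, and initial semantics described above can be modeled with a small stack of option sets. The sketch below is only an illustration of that behavior: the `opt_state` structure, the flag names, and the fixed stack depth are assumptions made for the example, not part of the proposal or of GCC.

```c
#include <assert.h>

/* Hypothetical feature flags; the names are illustrative only.  */
enum { OPT_SSE4_1 = 1 << 0, OPT_SSE5 = 1 << 1 };

#define OPT_STACK_MAX 16

struct opt_state {
  unsigned initial;               /* options given on the command line */
  unsigned current;               /* options in effect right now */
  unsigned stack[OPT_STACK_MAX];  /* saved option sets */
  int depth;                      /* number of saved sets */
};

/* Start with the command line options in effect and an empty stack.  */
static void opt_init (struct opt_state *s, unsigned cmdline) {
  s->initial = cmdline;
  s->current = cmdline;
  s->depth = 0;
}

/* Models "#pragma GCC push-options": save the current set.  */
static void opt_push (struct opt_state *s) {
  assert (s->depth < OPT_STACK_MAX);
  s->stack[s->depth++] = s->current;
}

/* Models "#pragma GCC pop-options": restore the most recently saved set.  */
static void opt_pop (struct opt_state *s) {
  assert (s->depth > 0);
  s->current = s->stack[--s->depth];
}

/* Models "#pragma GCC initial-options": return to the command line set
   while leaving the saved sets on the stack.  */
static void opt_reset (struct opt_state *s) {
  s->current = s->initial;
}
```

In this model a header can push, enable whatever it needs, and pop, leaving the including file's options exactly as they were.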
Stage1: Example using #pragma
Here is an example of how you might use target specific functions using *#pragma*. It uses the common compiler intrinsics include files (and needs pragma because bmmintrin.h and smmintrin.h check for SSE5 and SSE4_1 being defined). The code calculates a minimum of a vector of 32-bit signed integers, using the pcomd and pcmov instructions under SSE5 and the pminsd instruction under SSE4.1.
#pragma GCC push-options
#pragma GCC sse5
#include <bmmintrin.h>

void sse5_min (__m128i *a, __m128i *b, __m128i *c, int n) {
  int i;
  for (i = 0; i < n; i++) {
    __m128i test = _mm_comlt_epi32 (b[i], c[i]);
    a[i] = _mm_cmov_si128 (b[i], c[i], test);
  }
}

#pragma GCC initial-options
#pragma GCC sse4_1
#include <smmintrin.h>

void sse4_1_min (__m128i *a, __m128i *b, __m128i *c, int n) {
  int i;
  for (i = 0; i < n; i++) {
    a[i] = _mm_min_epi32 (b[i], c[i]);
  }
}

#pragma GCC pop-options
void generic_min (__m128i *a, __m128i *b, __m128i *c, int n) {
  int i;
  int n_int = 4 * n;
  int *a_int = (int *) a;
  int *b_int = (int *) b;
  int *c_int = (int *) c;
  for (i = 0; i < n_int; i++) {
    a_int[i] = (b_int[i] < c_int[i]) ? b_int[i] : c_int[i];
  }
}

/* HAVE_SSE5 and HAVE_SSE4_1 are feature tests supplied by the application. */
void do_min (__m128i *a, __m128i *b, __m128i *c, int n) {
  if (HAVE_SSE5) {
    sse5_min (a, b, c, n);
  } else if (HAVE_SSE4_1) {
    sse4_1_min (a, b, c, n);
  } else {
    generic_min (a, b, c, n);
  }
}
Stage1: Example using attribute
Here is an example of how you might use target specific functions using attributes. It uses the GCC intrinsics. The code calculates a minimum of a vector of 32-bit signed integers, using the pcomd and pcmov instructions under SSE5 and the pminsd instruction under SSE4.1.
typedef int __v4si __attribute__ ((__vector_size__ (16), __may_alias__));
void sse5_min (__v4si *, __v4si *, __v4si *, int) __attribute__ ((__sse5__));
void sse4_1_min (__v4si *, __v4si *, __v4si *, int) __attribute__ ((__sse4_1__));
void generic_min (__v4si *, __v4si *, __v4si *, int);

void sse5_min (__v4si *a, __v4si *b, __v4si *c, int n) {
  int i;
  for (i = 0; i < n; i++) {
    __v4si test = __builtin_ia32_pcomltd (b[i], c[i]);
    a[i] = __builtin_ia32_pcmov_v4si (b[i], c[i], test);
  }
}

void sse4_1_min (__v4si *a, __v4si *b, __v4si *c, int n) {
  int i;
  for (i = 0; i < n; i++) {
    a[i] = __builtin_ia32_pminsd (b[i], c[i]);
  }
}

void generic_min (__v4si *a, __v4si *b, __v4si *c, int n) {
  int i;
  int n_int = 4 * n;
  int *a_int = (int *) a;
  int *b_int = (int *) b;
  int *c_int = (int *) c;
  for (i = 0; i < n_int; i++) {
    a_int[i] = (b_int[i] < c_int[i]) ? b_int[i] : c_int[i];
  }
}

/* HAVE_SSE5 and HAVE_SSE4_1 are feature tests supplied by the application. */
void do_min (__v4si *a, __v4si *b, __v4si *c, int n) {
  if (HAVE_SSE5) {
    sse5_min (a, b, c, n);
  } else if (HAVE_SSE4_1) {
    sse4_1_min (a, b, c, n);
  } else {
    generic_min (a, b, c, n);
  }
}
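The attribute names proposed here are not accepted by released compilers, so the examples above cannot be built as written. As a rough runnable approximation of the same idea, the sketch below assumes the `target("sse4.1")` attribute spelling and the `__builtin_cpu_supports` built-in that later GCC releases provide; both stand in for the proposed `sse4_1` attribute and the application-supplied `HAVE_SSE4_1` test. It works on plain `int` arrays, leaves out the SSE5 path, and falls back to the generic routine on non-x86 hosts.

```c
#include <assert.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_TARGET_ATTR 1
/* Assumption: target("sse4.1") plays the role of the proposed
   sse4_1 attribute, so this one function may use SSE4.1.  */
static void sse4_1_min (int *, const int *, const int *, int)
  __attribute__ ((target ("sse4.1")));

static void sse4_1_min (int *a, const int *b, const int *c, int n) {
  int i;
  for (i = 0; i < n; i++)
    a[i] = (b[i] < c[i]) ? b[i] : c[i];
}
#else
#define HAVE_TARGET_ATTR 0
#endif

/* Compiled with the command line options only.  */
static void generic_min (int *a, const int *b, const int *c, int n) {
  int i;
  for (i = 0; i < n; i++)
    a[i] = (b[i] < c[i]) ? b[i] : c[i];
}

/* The application, not the compiler, picks the version to call.  */
void do_min (int *a, const int *b, const int *c, int n) {
  assert (n >= 0);
#if HAVE_TARGET_ATTR
  if (__builtin_cpu_supports ("sse4.1")) {
    sse4_1_min (a, b, c, n);
    return;
  }
#endif
  generic_min (a, b, c, n);
}
```

Both paths compute the same result; only the instructions the compiler is allowed to select for the loop differ.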
Stage1: Work items
This section is an attempt to break down the stage1 work into smaller chunks, with separate deliverables.
Stage1: Create a branch.
A Subversion branch will be created at the FSF to host this project. All work will be done in this branch. All people contributing to this branch must have the appropriate FSF paperwork so that their work can be incorporated into the mainstream GCC. All FSF coding guidelines will be used. Merges from the mainline will occur at least monthly. It will take 1 day to create the branch. Each merge is expected to take 1 day, including any updates needed to the target specific work.
Stage1: Move command line options into a global structure.
Currently, each individual command line option is a separate external variable. This work item will modify the opt*.awk scripts so that all of the options are collected into one global structure. Each option that previously was a global variable will become a field of that structure, and its old name will become a *#define* for the field, so that the rest of the compiler will not need source modifications. I expect this work item to take about 1 week of time. When the 4.4 tree opens up, this work item will be migrated to the mainline.
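A minimal sketch of the structure-plus-macro idea follows. The structure name, the `x_` field prefix, and the `use_options` helper are assumptions made for illustration; the real fields would be generated from the option definitions by the awk scripts.

```c
#include <assert.h>

/* All option variables gathered into one structure.  The layout is
   hypothetical; real fields would be generated from the .opt files.  */
struct hypothetical_options {
  int x_optimize_size;      /* nonzero for -Os */
  int x_flag_unroll_loops;  /* nonzero for -funroll-loops */
  int x_target_flags;       /* -m option bits */
};

struct hypothetical_options global_opts = { 0, 1, 0 };

/* Existing code keeps using the old variable names unchanged.  */
#define optimize_size      global_opts.x_optimize_size
#define flag_unroll_loops  global_opts.x_flag_unroll_loops
#define target_flags       global_opts.x_target_flags

/* Switching every option at once becomes a single structure copy,
   which is what per-function options need.  */
static void use_options (const struct hypothetical_options *opts) {
  assert (opts != 0);
  global_opts = *opts;
}
```

Code that reads or assigns `optimize_size` compiles as before, while saving and restoring the whole option set no longer requires knowing every variable by name.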
Stage1: Add target hook support for changing options
We will add target hooks that allow the backend to be notified when the user issues a *#pragma* or *attribute* that changes the current set of optimization and warning options. In addition, in the ix86 backend, we will add the ix86 support for the various *-msse* type options. I expect this to take 2 weeks of work.
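One way to picture such a hook is a function pointer that the front end calls for each option name it sees in a *#pragma* or attribute. The sketch below is hypothetical throughout: the hook table, the function names, and the bit values are invented for the example and are not GCC interfaces.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical option bits for the -msse family.  */
enum { ISA_SSE = 1, ISA_SSE2 = 2, ISA_SSE3 = 4 };

/* Hypothetical hook table.  The front end calls option_changed for each
   name; the hook returns nonzero if the backend recognized the name and
   updated *isa.  */
struct target_hooks {
  int (*option_changed) (const char *name, unsigned *isa);
};

static int ix86_option_changed (const char *name, unsigned *isa) {
  assert (name != 0 && isa != 0);
  if (strcmp (name, "sse") == 0)
    *isa |= ISA_SSE;
  else if (strcmp (name, "sse2") == 0)
    *isa |= ISA_SSE2 | ISA_SSE;             /* sse2 implies sse */
  else if (strcmp (name, "sse3") == 0)
    *isa |= ISA_SSE3 | ISA_SSE2 | ISA_SSE;  /* sse3 implies sse2, sse */
  else if (strcmp (name, "no_sse3") == 0)
    *isa &= ~(unsigned) ISA_SSE3;
  else
    return 0;                               /* not a name we know */
  return 1;
}

static const struct target_hooks ix86_hooks = { ix86_option_changed };
```

Keeping the name-to-bits mapping in the backend means other ports can supply their own hook without changes to the front end.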
Stage1: Add #pragma support
We will add the necessary *#pragma* support to add function specific optimizations, calling the appropriate target hooks where needed, and pushing/popping the options as needed. I expect this to take 3 weeks of work.
Stage1: Add attribute support
Once #pragma support is added, the same work will be done for attributes. I would expect this to take 1 week of time, since the #pragma support will have ironed out the bugs.
Stage1: Add support in the tree/RTL structure for remembering the options used
We will add support in the tree and RTL structures for remembering what the current options are. This work should interface with the LTO team so that these options can be saved and used as part of the LTO work. I would estimate that it will take 4 weeks of investigation and negotiation with the other groups to come up with a workable design. The design should be general enough so that in the future, if desired, we can have if {...} blocks that use different compilation options than the enclosing function.
Stage1: Teach the inliner about target specific functions
We will teach the inliner not to inline functions compiled with target specific optimizations into a general function. However, a function that has target specific optimizations should be able to inline normal functions, or functions compiled with the same set of target specific optimizations. I estimate that this should take 2 weeks of time.
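The rule above can be written as a small predicate. This is a sketch under stated assumptions: `fn_info` and `can_inline_p` are hypothetical names, and a function's options are reduced to a single number where 0 means "compiled with the command line options".

```c
#include <assert.h>

/* Hypothetical per-function record: 0 means the function was compiled
   with the command line options, any other value identifies a
   particular set of target specific options.  */
struct fn_info {
  unsigned target_opts;
};

/* Returns nonzero if CALLEE may be inlined into CALLER.  */
static int can_inline_p (const struct fn_info *caller,
                         const struct fn_info *callee) {
  assert (caller != 0 && callee != 0);
  if (callee->target_opts == 0)
    return 1;  /* a normal function can be inlined into any caller */
  /* A target specific callee needs a caller with the same options.  */
  return caller->target_opts == callee->target_opts;
}
```

An escape valve for forcing the inline would be an extra condition in front of the final comparison.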
Stage1: Convert ix86 intrinsics to know about target specific optimizations
We will rewrite the ix86 intrinsic code so that all intrinsics are added to the symbol table at compiler startup, but when an intrinsic is invoked, it will check whether the current compilation options allow it to be generated. I estimate this will take 2 weeks of time.
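The "register everything, check at use" scheme might look like the following sketch. The table layout, the feature bit values, and the `builtin_allowed` function are assumptions for illustration; only the three builtin names are taken from the examples in this document.

```c
#include <assert.h>
#include <string.h>

enum { ISA_SSE4_1 = 1, ISA_SSE5 = 2 };  /* hypothetical feature bits */

struct builtin_desc {
  const char *name;
  unsigned required_isa;  /* features the builtin needs */
};

/* Every builtin is registered at startup, whatever the command line says.  */
static const struct builtin_desc builtins[] = {
  { "__builtin_ia32_pminsd", ISA_SSE4_1 },
  { "__builtin_ia32_pcomltd", ISA_SSE5 },
  { "__builtin_ia32_pcmov_v4si", ISA_SSE5 },
};

/* Called when a builtin is used inside a function compiled with
   CURRENT_ISA.  Returns 1 if allowed, 0 if disallowed, -1 if unknown.  */
static int builtin_allowed (const char *name, unsigned current_isa) {
  size_t i;
  assert (name != 0);
  for (i = 0; i < sizeof builtins / sizeof builtins[0]; i++)
    if (strcmp (builtins[i].name, name) == 0)
      return (builtins[i].required_isa & ~current_isa) == 0;
  return -1;
}
```

The lookup always succeeds for a known builtin, so a disallowed use can be reported as "not enabled for this function" rather than as an undeclared identifier.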
Stage1: Ix86 preprocessor macro support
The ix86 backend will define/undefine the appropriate processor specific macros (like __SSE__) based on the current function optimization options. It is anticipated that the ix86 backend will do this in the target hook created above, and there may be some modifications to the preprocessor. I estimate that this will take 2 weeks of time.
Stage1: Merge into mainline
Assuming all of this works, it will be merged into the mainline in pieces. I anticipate that this may take 4 weeks of effort.
Stage2: Details of compiling a single function multiple times manually
- If this is used all over the place, it can lead to massive code bloat.
- Ideally the compiler should determine if two or more clone functions generate the same code, but at present, this is not part of the goals of this project.
- Users that use this option really need to test their code on multiple platforms to ensure that the compiler generates the correct code for each target.
- Functions that take variable arguments will not be allowed to be cloned, since the function that dispatches to the clones needs to pass all of the arguments to the clone functions.
- The backend should determine what are the appropriate clone targets, while the user should just indicate that a function should be cloned. This allows for new clone targets to be added automatically without modifying the code.
To cut down on code bloat, the ix86 backend should not generate clones for each different machine, but instead compile code for feature bits (i.e., whether a machine has the SSE3, SSSE3, SSE4.1, or SSE5 instruction sets), and not a specific machine.
- For 32-bit ix86 targets, it is important not to have too many clones, given the limited address space of user applications. I would expect the following clones to be provided:
  - generic, use 387 floating point stack
  - -msse2
- For 64-bit ix86 targets, I would expect the following clones to be provided:
  - generic (implies -msse2)
  - -msse3
  - -msse4.1
  - -msse5
In generating the clone functions, the compiler will generate one function that dispatches to each of the clones based on feature tests. A function that runs as a static constructor will be responsible for doing the CPUID instruction(s) to determine what feature bits are supported.
- It is highly desirable that the debugger know about all clones, so that if you put a breakpoint in a cloned function, it puts the same breakpoint at the same line in each cloned function.
Stage2: Example
If you have a function declared as a clone, such as:
void my_min (int *, int *, int *, int) __attribute__((__clone__));
void my_min (int *a, int *b, int *c, int n) {
  int i;
  for (i = 0; i < n; i++) {
    a[i] = (b[i] < c[i]) ? b[i] : c[i];
  }
}
The compiler would logically generate code that would be equivalent to:
static void __do_cpuid (void) __attribute__ ((__constructor__));
static void my_min__clone_generic (int *, int *, int *, int);
static void my_min__clone_sse5 (int *, int *, int *, int) __attribute__((__sse5__));
static void my_min__clone_sse4_1 (int *, int *, int *, int) __attribute__((__sse4_1__));
static void (*my_min__clone_ptr)(int *, int *, int *, int) = my_min__clone_generic;

static void __do_cpuid (void) {
  int have_sse5;
  int have_sse4_1;
  /* code to initialize have_sse5 and have_sse4_1 via CPUID. */
  /* Update all clone pointers generated in this module */
  if (have_sse5) {
    my_min__clone_ptr = my_min__clone_sse5;
  } else if (have_sse4_1) {
    my_min__clone_ptr = my_min__clone_sse4_1;
  } else {
    my_min__clone_ptr = my_min__clone_generic;
  }
}

/* The public entry point dispatches through the clone pointer. */
void my_min (int *a, int *b, int *c, int n) {
  (* my_min__clone_ptr) (a, b, c, n);
}

static void my_min__clone_generic (int *a, int *b, int *c, int n) {
  int i;
  for (i = 0; i < n; i++) {
    a[i] = (b[i] < c[i]) ? b[i] : c[i];
  }
}

/* compile with -msse5 as per the attribute in the declaration. */
static void my_min__clone_sse5 (int *a, int *b, int *c, int n) {
  int i;
  for (i = 0; i < n; i++) {
    a[i] = (b[i] < c[i]) ? b[i] : c[i];
  }
}

/* compile with -msse4.1 as per the attribute in the declaration. */
static void my_min__clone_sse4_1 (int *a, int *b, int *c, int n) {
  int i;
  for (i = 0; i < n; i++) {
    a[i] = (b[i] < c[i]) ? b[i] : c[i];
  }
}
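The CPUID step that the example leaves as a comment could be filled in along the following lines for SSE4.1. This is a sketch, assuming a GCC-compatible compiler that ships the `<cpuid.h>` header with its `__get_cpuid` helper; the names `do_cpuid`, `cpuid_done`, and `cpu_has_sse4_1` are invented for the example, and on non-x86 hosts the feature simply reads as absent.

```c
#include <assert.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* compiler-provided wrapper around CPUID */
#define CAN_USE_CPUID 1
#else
#define CAN_USE_CPUID 0
#endif

/* CPUID leaf 1 reports SSE4.1 support in bit 19 of ECX.  */
#define CPUID_ECX_SSE4_1 (1u << 19)

static int cpuid_done;   /* set once the constructor has run */
static int have_sse4_1;  /* stays 0 on non-x86 hosts */

static void do_cpuid (void) __attribute__ ((__constructor__));

/* Runs before main, like the static constructor in the example.  */
static void do_cpuid (void) {
#if CAN_USE_CPUID
  unsigned int eax, ebx, ecx, edx;
  /* __get_cpuid returns 0 if the requested leaf is not supported.  */
  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    have_sse4_1 = (ecx & CPUID_ECX_SSE4_1) != 0;
#endif
  cpuid_done = 1;
}

/* Feature test a dispatcher could use to choose a clone.  */
static int cpu_has_sse4_1 (void) {
  assert (cpuid_done);
  return have_sse4_1;
}
```

Doing the test once in a constructor keeps the per-call cost of the dispatcher down to an indirect call through the clone pointer.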
Stage3: Compile functions with multiple different options automatically
Stage3: Objective of compiling a single function multiple times automatically
Once we have the ability to clone functions, the compiler should be able to use profile guided feedback to determine which functions are hotspot functions and automatically add the clone attribute.
It may be useful to add a -fhotspot=func1,func2,... switch as well.
The compiler should only do this cloning automatically when the user asks for it with a switch such as -fclone. Otherwise, if it is done via -O3, it will be incumbent on every user to test their code on multiple machines.
Branch
- The svn development branch is svn://gcc.gnu.org/svn/gcc/branches/function-specific-branch
- The svn tag branch is svn://gcc.gnu.org/svn/gcc/tag/function-specific-branch
- The branch was created from the trunk, revision 130896.
gcc/ChangeLog-function is where ChangeLog entries for this branch should go.