[PATCH] Fix PR 47272 to restore Altivec vec_ld/vec_st

Michael Meissner meissner@linux.vnet.ibm.com
Mon Jan 24 22:34:00 GMT 2011


This patch fixes bug PR target/47272, but I'm sending it out to a wider
audience to solicit feedback from other developers on how to resolve a
sticky situation with PowerPC code generation.

For those of you who don't know the Power architecture, particularly
the VSX extensions: the first major vector extension was the Altivec
(VMX) vector support.  The VSX vector support adds more floating point
vector registers, and overlaps with the Altivec support.  In
particular, the vector loads and stores of Altivec ignore the bottom 4
bits of the address, while the vector loads and stores of the VSX
instruction set do not, and will do unaligned loads and stores.
Obviously, if the address is completely aligned, an Altivec and a VSX
memory instruction behave the same.  If on the other hand you have an
unaligned address, you will get different bytes loaded/stored.
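
To make the difference concrete, here is a minimal sketch (not part of
the patch; the buffer and function name are made up) of what the two
instruction forms do with a misaligned pointer:

#include <altivec.h>

unsigned char buf[32] __attribute__ ((aligned (16)));

vector unsigned char
load_off_by_one (void)
{
  /* Under the Altivec semantics, vec_ld becomes LVX: the low 4 bits
     of the address are ignored, so this returns buf[0..15].  Under
     the GCC 4.5 VSX semantics, it becomes LXVW4X: a true unaligned
     load that returns buf[1..16].  */
  return vec_ld (0, &buf[1]);
}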

The PowerPC compiler has a full set of overloaded vector intrinsic
builtin functions, including builtins for doing loads and stores.  When
I added the VSX support, I changed the compiler to do VSX loads/stores
if the user used -mvsx or -mcpu=power7, including changing the builtin
load/store functions to use the VSX instructions.  However, as I said,
you get different results for unaligned addresses.
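
For reference, the overloaded interface looks like this (a sketch using
the standard AltiVec overloads, which resolve on the pointer type; the
function names are made up):

#include <altivec.h>

vector float
load_floats (const float *p)
{
  return vec_ld (0, p);         /* vector float overload */
}

void
store_ints (vector signed int v, int *p)
{
  vec_st (v, 0, p);             /* vector signed int overload */
}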

Richard Henderson's change to libcpp/lex.c on August 21st, 2010 added
code to use the Altivec instruction set, when the compiler supports it,
to speed up the preprocessor:

2010-08-21  Richard Henderson  <rth@redhat.com>
	    Andi Kleen <ak@linux.intel.com>
	    David S. Miller  <davem@davemloft.net>

	* configure.ac (AC_C_BIGENDIAN, AC_TYPE_UINTPTR_T): New tests.
	(ssize_t): Check via AC_TYPE_SSIZE_T instead of AC_CHECK_TYPE.
	(ptrdiff_t): Check via AC_CHECK_TYPE.
	* config.in, configure: Rebuild.
	* system.h: Include stdint.h, if available.
	* lex.c (WORDS_BIGENDIAN): Provide default.
	(acc_char_mask_misalign, acc_char_replicate, acc_char_cmp,
	acc_char_index, search_line_acc_char, repl_chars, search_line_mmx,
	search_line_sse2, search_line_sse42, init_vectorized_lexer,
	search_line_fast): New.
	(_cpp_clean_line): Use search_line_fast.  Restructure the fast
	loop to make it clear when we're leaving the loop.  Stay in the
	fast loop for non-trigraph '?'.

Recently we started to look at building internal versions of the GCC
4.6 compiler with --with-cpu=power7, and it exposed the difference
between the two loads.

So after some debate within IBM, we've come to the conclusion that I
should not have changed the semantics of __builtin_vec_ld and
__builtin_vec_st, and that we should go back to using the Altivec form
for these instructions.  However, in doing so, anybody who has written
new code explicitly for power7 since GCC 4.5 came out might now be
surprised.  Unfortunately the bug exists in GCC 4.5 as well as in the
Red Hat RHEL6 and SUSE SLES 11 SP1 compilers.

I realize that we are in stage 4 of the release process, but if we are going to
change the builtins back to the 4.4 semantics, we should do it as soon as
possible.

David suggested I poll the release managers and other interested
parties on which path we should take (make the builtins adhere to the
4.4 semantics, or keep the current situation).

I'm enclosing patches to make the load/store builtins go back to the
Altivec semantics, and to add vector double support to them.  In
addition, I added patches for libcpp/lex.c so that it will work with
4.5 compilers as well as 4.4 and future 4.6 compilers.  Whether or not
we decide to change the builtin semantics back, I feel the lex.c patch
should go in.
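
For illustration, here is what the new vector double overloads look
like in use (a sketch, assuming -mcpu=power7 so the vector double type
is available; the function name is made up):

#include <altivec.h>

void
negate_pair (double *p)
{
  /* With this patch, vec_ld and vec_st always emit LVX and STVX, so
     P must be 16-byte aligned for these to touch the bytes at P.  */
  vector double v = vec_ld (0, p);
  v = vec_sub ((vector double) { 0.0, 0.0 }, v);
  vec_st (v, 0, p);
}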

Right now, I have not added an #ifdef or -m switch to toggle back to
the 4.5 behavior.  I can do this if desired (it is probably a good idea
to allow code written for 4.5 to continue to be used).  I don't know
how many people write directly to the Altivec semantics.

I should mention that Darwin users, and people using the host processor
in the PS3 who might have written Altivec-specific code, will not be
affected, since those machines do not have the VSX instruction set.  It
is only the newer machines shipped by IBM that currently have the
problem.

Sorry about all this.

-- 
Michael Meissner, IBM
5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
meissner@linux.vnet.ibm.com	fax +1 (978) 399-6899
-------------- next part --------------
[gcc]
2011-01-24  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/47272
	* doc/extend.texi (PowerPC AltiVec/VSX Built-in Functions):
	Document using vector double with the load/store builtins, and
	that the load/store builtins always use Altivec instructions.

	* config/rs6000/vector.md (vector_altivec_load_<mode>): New insns
	to use altivec memory instructions, even on VSX.
	(vector_altivec_store_<mode>): Ditto.

	* config/rs6000/rs6000-protos.h (rs6000_address_for_altivec): New
	function.

	* config/rs6000/rs6000-c.c (altivec_overloaded_builtins): Add
	V2DF, V2DI support to load/store overloaded builtins.

	* config/rs6000/rs6000-builtin.def (ALTIVEC_BUILTIN_*): Add
	altivec load/store builtins for V2DF/V2DI types.

	* config/rs6000/rs6000.c (altivec_expand_ld_builtin): Add V2DF,
	V2DI support, use vector_altivec_load/vector_altivec_store
	builtins.
	(altivec_expand_st_builtin): Ditto.
	(altivec_expand_builtin): Update altivec lvx/stvx builtin name.
	(altivec_init_builtins): Add support for V2DF/V2DI altivec
	load/store builtins.
	(rs6000_address_for_altivec): Ensure memory address is appropriate
	for Altivec.

	* config/rs6000/altivec.md (UNSPEC_LVX): New UNSPEC.
	(altivec_lvx_<mode>): Make altivec_lvx use a mode iterator.
	(altivec_stvx_<mode>): Make altivec_stvx use a mode iterator.

[libcpp]
2011-01-24  Peter Bergner  <bergner@vnet.ibm.com>
	    Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/47272
	* lex.c (search_line_fast): Work with compilers that generate
	either LXVW4X or LVX for vec_ld.

Index: gcc/doc/extend.texi
===================================================================
--- gcc/doc/extend.texi	(revision 169112)
+++ gcc/doc/extend.texi	(working copy)
@@ -12354,6 +12354,12 @@ vector bool long vec_cmplt (vector doubl
 vector float vec_div (vector float, vector float);
 vector double vec_div (vector double, vector double);
 vector double vec_floor (vector double);
+vector double vec_ld (int, const vector double *);
+vector double vec_ld (int, const double *);
+vector double vec_ldl (int, const vector double *);
+vector double vec_ldl (int, const double *);
+vector unsigned char vec_lvsl (int, const volatile double *);
+vector unsigned char vec_lvsr (int, const volatile double *);
 vector double vec_madd (vector double, vector double, vector double);
 vector double vec_max (vector double, vector double);
 vector double vec_min (vector double, vector double);
@@ -12382,6 +12388,8 @@ vector double vec_sel (vector double, ve
 vector double vec_sub (vector double, vector double);
 vector float vec_sqrt (vector float);
 vector double vec_sqrt (vector double);
+void vec_st (vector double, int, vector double *);
+void vec_st (vector double, int, double *);
 vector double vec_trunc (vector double);
 vector double vec_xor (vector double, vector double);
 vector double vec_xor (vector double, vector bool long);
@@ -12412,6 +12420,10 @@ int vec_any_nlt (vector double, vector d
 int vec_any_numeric (vector double);
 @end smallexample
 
+Note that the @samp{vec_ld} and @samp{vec_st} builtins will always
+generate the Altivec @samp{LVX} and @samp{STVX} instructions even
+if the VSX instruction set is available.
+
 GCC provides a few other builtins on Powerpc to access certain instructions:
 @smallexample
 float __builtin_recipdivf (float, float);
Index: libcpp/lex.c
===================================================================
--- libcpp/lex.c	(revision 169112)
+++ libcpp/lex.c	(working copy)
@@ -547,6 +547,11 @@ search_line_fast (const uchar *s, const 
   const vc zero = { 0 };
 
   vc data, mask, t;
+  const uchar *unaligned_s = s;
+
+  /* While altivec loads mask addresses, we still need to align S so
+     that the offset we compute at the end is correct.  */
+  s = (const uchar *)((uintptr_t)s & -16);
 
   /* Altivec loads automatically mask addresses with -16.  This lets us
      issue the first load as early as possible.  */
@@ -555,15 +560,20 @@ search_line_fast (const uchar *s, const 
   /* Discard bytes before the beginning of the buffer.  Do this by
      beginning with all ones and shifting in zeros according to the
      mis-alignment.  The LVSR instruction pulls the exact shift we
-     want from the address.  */
-  mask = __builtin_vec_lvsr(0, s);
+     want from the address.
+
+     Originally, we used s in the lvsr and did the alignment afterwards,
+     which works on a system that supports just the Altivec instruction
+     set, using the LVX instruction.  With the introduction of VSX in
+     GCC 4.5, the load became LXVW4X.  LVX ignores the bottom 4 bits of
+     the address, and LXVW4X does not.  While GCC 4.6 will revert
+     vec_ld/vec_st to produce only Altivec instructions, the possibility
+     exists that the stage1 compiler was built with a compiler that
+     generated LXVW4X.  This code will work on either system.  */
+  mask = __builtin_vec_lvsr(0, unaligned_s);
   mask = __builtin_vec_perm(zero, ones, mask);
   data &= mask;
 
-  /* While altivec loads mask addresses, we still need to align S so
-     that the offset we compute at the end is correct.  */
-  s = (const uchar *)((uintptr_t)s & -16);
-
   /* Main loop processing 16 bytes at a time.  */
   goto start;
   do
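
For completeness, here is the idiom the lex.c patch uses, as a
standalone sketch (illustrative only; the function name is made up).
Aligning the pointer by hand makes the first load fetch the same 16
bytes whether the compiler emits LVX or LXVW4X, while lvsr on the
original pointer still produces the mask for the real misalignment:

#include <altivec.h>
#include <stdint.h>

vector unsigned char
first_masked_load (const unsigned char *s)
{
  const unsigned char *unaligned_s = s;
  const vector unsigned char zero = { 0 };
  const vector unsigned char ones = vec_splat_u8 (-1);
  vector unsigned char data, mask;

  /* Align S explicitly, so the load below is fully aligned and thus
     behaves identically as LVX or LXVW4X.  */
  s = (const unsigned char *) ((uintptr_t) s & -16);
  data = vec_ld (0, s);

  /* Use the ORIGINAL pointer for lvsr, so the permute mask reflects
     the real misalignment and zeroes the bytes before the buffer.  */
  mask = vec_lvsr (0, unaligned_s);
  mask = vec_perm (zero, ones, mask);
  return vec_and (data, mask);
}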

