This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
[PATCH], PR 71977/70568/78823: Improve PowerPC code that uses SFmode in unions

From: Michael Meissner <meissner at linux dot vnet dot ibm dot com>
To: gcc-patches at gcc dot gnu dot org, Segher Boessenkool <segher at kernel dot crashing dot org>, David Edelsohn <dje dot gcc at gmail dot com>, Bill Schmidt <wschmidt at linux dot vnet dot ibm dot com>
Date: Thu, 29 Dec 2016 19:24:06 -0500
Subject: [PATCH], PR 71977/70568/78823: Improve PowerPC code that uses SFmode in unions
Authentication-results: sourceware.org; auth=none
This is both a fix to a regression (since GCC 4.9), a code improvement for
GLIBC, and fixes potential bugs with the recent changes to allow small integers
(32-bit integer, SImode in particular) in floating point and vector registers.

The core of the problem is when a SFmode (32-bit binary floating point) value
is a scalar in the floating point and vector registers, it is stored internally
as a 64-bit binary floating point value.  This means that if you could look at
the value using the SUBREG mechanism you might see the wrong value.  Before the
recent changes to add small integer support went in, it was less of an issue,
since the only integer type allowed in floating point and vector registers was
64-bit integers (DImode).

The regression comes in in how SFmode values are moved between general purpose
and floating point/vector registers.  Up through the power7, the way you moved
SFmode values from one register set to another was store and then load.  Doing
back to back store and load to the same location could cause serious
performance problems on recent power systems.  On the power7 and power8, we
would put a special NOP that forces the two instructions to be in different
dispatch groups, which helps somewhat.

When the power8 (ISA 2.07) came out, it had direct move instructions and
convert between scalar double precision and single precision.  I added the
appropriate secondary_reload support so that if the register allocator wanted
to move a SFmode value between register banks, it would create a temporary and
do the appropriate instructions to move the value.  This worked in the GCC 4.8
time frame.

Some time in the 4.9 time frame, this broke and the register allocator would
more often generate store and load instead of the direct move sequences.
However, simple test cases continued to use the direct move instructions.  In
checking on power9 (ISA 3.0) code, it more likely uses the store/load than
direct move.

On power8 spec runs we have seen the effect of these store/load sequences on
some benchmarks from the next generation of the Spec suite.

The optimization that the GLIBC implementers have requested (PR 71977) was to
speed up code sequences that they use in implementing the single precision math
library.  They often times need to extract/modify bits in floating point values
(for example, setting exponents or mantissas, etc.).

For example from e_powf.c, you see code like this after macro expansion:

	typedef union
	{
	  float value;
	  u_int32_t word;
	} ieee_float_shape_type;

	float t1;
	int32_t is;
	/* ... */
	do
	  {
	    ieee_float_shape_type gf_u;
	    gf_u.value = (t1);
	    (is) = gf_u.word;
	  }
	while (0);
	do
	  {
	    ieee_float_shape_type sf_u;
	    sf_u.word = (is&0xfffff000);
	    (t1) = sf_u.value;
	  }
	while (0);

Originally, I just wrote a peephole2 to catch the above code, and it worked in
small test cases on the power8.  But it didn't work on larger programs or on
the power9.  I also wanted to fix the performance issue that we've seen.

I also became convinced that for GCC 7, it was a ticking time bomb where
eventually somebody would write code that intermixed SImode and SFmode, and it
would get the wrong value.

The main part of the patch is to not let the compiler generate:

	(set (reg:SI)
	     (subreg:SF (reg:SI)))

or

	(set (reg:SI)
	     (subreg:SI (reg:SF)))

Most of the register predicates eventually call gpc_reg_operand, and it was
simple to put the check in there, and to other predicates that did not call
gpc_reg_operand. 

I created new insns to do the move between formats that allocated the temporary
needed with match_scratch.

There were places that then needed to not have the check (the movsi/movsf
expanders themselves, and the insn spliters for format conversion insns), and I
added a predicate for that.

I have built the patches on a little endian power8, a big endian power8 (64-bit
only), and a big endian power7 (both 32-bit and 64-bit).  There were no
regression failures.

In addition, I built spec 2006, with the fixes and without, and did a quick run
comparing the results (1 run).  I am re-running the spec results, with the code
merged to today's trunk, and with 3 runs to isolate the occasional benchmark
that goes off in the weeds.

Of the 29 benchmarks in Spec 2006 CPU, 6 benchmarks had changes in the
instructions generated (perlbench, gromacs, cactusADM, namd, povray, wrf).

In the single run I did, there were no regressions, and 2 or 3 benchmarks
improved:

	namd		6%
	tonto		3%
	libquantum	6%

However, single runs of libquantum have varied as much as 10%, so without
seeing more runs, I will skip it.  Namd was one of the benchmarks that saw
changes in code generation, but tonto did not have changes of code.  I suspect
having the separate converter unspec insn, allowed the scheduler to move things
in between the move and the use.

So, in summary, can I check these changes into the GCC 7 trunk?  Given it does
fix a long standing regression in GCC 6 that hurts performance, did you want me
to develop the changes for a GCC 6 backport as well?  I realize it is skating
on thin ice whether it is a feature or a fix.

[gcc]
2016-12-29  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/71977
	PR target/70568
	PR target/78823
	* config/rs6000/predicates.md (sf_subreg_operand): New predicate
	to return true if the operand contains a SUBREG mixing SImode and
	SFmode on 64-bit VSX systems with direct move.  This can be a
	problem in that we have to know whether the SFmode value should be
	represented in the 32-bit memory format or the 64-bit scalar
	format used within the floating point and vector registers.
	(altivec_register_operand): Do not return true if the operand
	contains a SUBREG mixing SImode and SFmode.
	(vsx_register_operand): Likewise.
	(vsx_reg_sfsubreg_ok): New predicate.  Like vsx_register_operand,
	but do not check if there is a SUBREG mixing SImode and SFmode on
	64-bit VSX systems with direct move.
	(vfloat_operand): Do not return true if the operand contains a
	SUBREG mixing SImode and SFmode.
	(vint_operand): Likewise.
	(vlogical_operand): Likewise.
	(gpc_reg_operand): Likewise.
	(int_reg_operand): Likewise.
	* config/rs6000/rs6000.c (valid_sf_si_move): New function to
	determine if a MOVSI or MOVSF operation contains SUBREGs that mix
	SImode and SFmode.
	(rs6000_emit_move): If we have a MOVSI or MOVSF operation that
	contains SUBREGs that mix SImode and SFmode, call special insns,
	that can allocate the necessary temporary registers to convert
	between SFmode and SImode within the registers.
	* config/rs6000/vsx.md (SFBOOL_*): Add peephole2 to recognize when
	we are converting a SFmode to a SImode, moving the result to a GPR
	register, doing a single AND/IOR/XOR operation, and then moving it
	back to a vector register.  Change the insns recognized to move
	the integer value to the vector register and do the operation
	there.  This code occurs quite a bit in the GLIBC math library in
	float math functions.
	(peephole2 to speed up GLIB math functions): Likewise.
	* config/rs6000/rs6000-protos.h (valid_sf_si_move): Add
	declaration.
	* config/rs6000/rs6000.h (TARGET_NO_SF_SUBREG): New internal
	target macros to say whether we need to avoid SUBREGs mixing
	SImode and SFmode.
	(TARGET_ALLOW_SF_SUBREG): Likewise.
	* config/rs6000/rs6000.md (UNSPEC_SF_FROM_SI): Ne unspecs.
	(UNSPEC_SI_FROM_SF): Likewise.
	(iorxor): Change spacing.
	(and_ior_xor): New iterator for AND, IOR, and XOR.
	(movsi_from_sf): New insn to handle where we are moving SFmode
	values to SImode registers and we need to convert the value to the
	memory format from the format used within the register.
	(movdi_from_sf_zero_ext): Optimize zero extending movsi_from_sf.
	(mov<mode>_hardfloat, FMOVE32 iterator): Don't allow moving
	SUBREGs mixing SImode and SFmode on 64-bit VSX systems with direct
	move.
	(movsf_from_si): New insn to handle where we are moving SImode
	values to SFmode registers and we need to convert the value to the
	the format used within the floating point and vector registers to
	the 32-bit memory format.
	(fma<mode>4): Change register_operand to gpc_reg_operand to
	prevent SUBREGs mixing SImode and SFmode.
	(fms<mode>4): Likewise.
	(fnma<mode>4): Likewise.
	(fnms<mode>4): Likewise.
	(nfma<mode>4): Likewise.
	(nfms<mode>4): Likewise.

[gcc/testsuite]
2016-12-29  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/71977
	PR target/70568
	PR target/78823
	* gcc.target/powerpc/pr71977-1.c: New tests to check whether on
	64-bit VSX systems with direct move, whether we optimize common
	code sequences in the GLIBC math library for float math functions.
	* gcc.target/powerpc/pr71977-2.c: Likewise.

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meissner@linux.vnet.ibm.com, phone: +1 (978) 899-4797
Attachment: pr71977.patch04b
Description: Text document
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]