Disable accumulate-outgoing-args for Generic and Bulldozers

Fri Jan 24 22:40:00 GMT 2014

> On Wed, Jan 01, 2014 at 03:30:04PM +0100, Jan Hubicka wrote:
> > 	* config/i386/x86-tune.def: Disable X86_TUNE_ACCUMULATE_OUTGOING_ARGS
> > 	for generic and recent AMD chips
> > Index: config/i386/x86-tune.def
> > ===================================================================
> > --- config/i386/x86-tune.def	(revision 206233)
> > +++ config/i386/x86-tune.def	(working copy)
> > @@ -143,7 +143,7 @@ DEF_TUNE (X86_TUNE_REASSOC_FP_TO_PARALLE
> >     regression on mgrid due to IRA limitation leading to unecessary
> >     use of the frame pointer in 32bit mode.  */
> >  DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
> > -	  m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_AMD_MULTIPLE | m_GENERIC)
> > +	  m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_ATHLON_K8)
> >  
> >  /* X86_TUNE_PROLOGUE_USING_MOVE: Do not use push/pop in prologues that are
> >     considered on critical path.  */
> 
> Are you sure this is a good idea even for 32-bit code (i.e. shouldn't we
> have separate tunables for 32-bit and 64-bit code)?
> I admit I haven't performed trunk bootstraps/regtests for 3 days, am doing
> x86_64 and i686 bootstraps/regtests concurrently and it is yes,rtl checking,
> but am quite surprised that compared to 3 days ago the bootstrap time of
> i686-linux (all,obj-c++,go) went up from about 70 minutes or so to 140 minutes today,
> while the x86_64-linux (all,obj-c++,go,ada) remained basically the same
> around 2 hours.  This is on quad socket Quad-Core AMD Opteron(tm) Processor 8354,
> perhaps it is just extremely undesirable there.

I ran SPEC benchmarks and measured compile times and code size on them.  I
tested only a 64-bit cross.

I did see a minor slowdown, but definitely not of the degree you observe.  My
main concern was the overall binary size increase with EH tables enabled, as
described in the original mail.  Given that accumulation was already disabled
for cores that now have about the same characteristics as AMD chips (both have
a stack engine), it seemed we should go the same way on both targets.  The 4%
reduction in actual text segment size also seemed very nice.
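
For illustration only, here is a minimal sketch of the kind of call sequence
the tunable affects.  It is a made-up testcase, not taken from SPEC or from
the patch, and the names sum6 and call_twice are hypothetical.  With
accumulation the outgoing argument area is reserved once in the prologue and
the arguments are stored into it; without it they are pushed before each
call.  Comparing the assembly from the two commands in the comment should
show the difference, assuming a 32-bit multilib is installed:

```c
/* Hypothetical testcase; function names are illustrative only.
   Compare the generated assembly, e.g.:
     gcc -m32 -O2 -S -maccumulate-outgoing-args t.c
     gcc -m32 -O2 -S -mno-accumulate-outgoing-args t.c  */

/* Kept out of line so the calls below really pass arguments.  */
__attribute__ ((noinline)) int
sum6 (int a, int b, int c, int d, int e, int f)
{
  return a + b + c + d + e + f;
}

/* Two calls with six arguments each; in 32-bit code all of them
   are passed on the stack.  */
int
call_twice (int x)
{
  return sum6 (x, 1, 2, 3, 4, 5) + sum6 (5, 4, 3, 2, 1, x);
}
```

Running size on the two object files is a quick way to see the text size
effect for a single file, though one small testcase says nothing about the
overall numbers.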

Perhaps we degenerate on some testcase, or we get a lot slower on -O0 codegen?
We may just enable accumulation by default at -O0, since it results in faster
compile times.

Also, perhaps it is a debug info output problem, since my SPEC runs were
without -g.

I am now using an AMD Opteron(TM) Processor 6272, which should behave pretty
much the same.  I will try to get some data.

Honza