PREFERRED_STACK_BOUNDARY code tweaks

Jan Hubicka hubicka@atrey.karlin.mff.cuni.cz
Sat Jan 22 05:17:00 GMT 2000


Hi
This patch adds code to keep track of preferred stack boundary for each
function. It initialized by STACK_BOUNDARY and is increased by function calls
and alloca code. Previous approach (avoid end of frame alignment using leaf
function test) didn't work for alloca calls in the leaf functions, that
returned missaligned memory addresses. This new approach may generate
worse code in cases where the function calls are eliminated, but allow
some other optimizations.

The patch also attempts to fix problem in reload with stack_alignment_needed
optimization.  It forces frame alignment always to BIGGEST_ALIGNMENT, that
causes stack_align_needed to be always at least BIGGEST_ALIGNMENT. Because
BIGGEST_ALIGNMENT == PREFERRED_STACK_BOUNDARY now, the optimization is not
effective anymore.

      /* Round size of stack frame to BIGGEST_ALIGNMENT.  This must be done
	 here because the stack size may be a part of the offset computation
	 for register elimination, and there might have been new stack slots
	 created in the last iteration of this loop.   */
      assign_stack_local (BLKmode, 0, 0);

I've modified it to align only to current stack_alignment_needed value.
It is not 100% clear to me, that the alignment is still needed. Reload don't
do the alignment itself anymore (it probably did, because
PREFERRED_STACK_BOUNDARY is defined there). And I am not sure what this code
was shooting for.
Rounding stack frame always to 16 byte alignment is quite wasting of stack
space.

Last change I've implemented is simple optimization of
PREFERRED_STACK_BOUNDARY. The calculated stack boundaries are now stored to
DECL_ALIGN of the function and used by call code to avoid unnecesary
alignments. This optimization seems to save slighly more than 1% of code size
in XaoS.  I am quite nervous about overusing DECL_ALIGN for such purpose. If
you have better idea how to implement this, please let me know.
If this part turns out to be problematic, I will break the patch into
two halves. One adding the cfun->preferred_stack_alignment infrastructure
and second with the DECL_ALIGN part.

Overall I can measure following speedups (I am using tree patched by my
frame pointer elimination patches):
(first result is relative to my frame_elimination patches, second relative
to current CVS mainline, third relative to -mpreferred-stack-boundary=2
compilation)

Bell Labs Benchmark B1: Ackerman's function (recursion) (nongpl/bell-labs/b1.c)
---% ---% ---%    110% 142%  76%    113% 122%  80%    107% 130%  96% 
---% ---% ---%    122% 149%  72%     99% 126%  67%    109% 151%  95% 
Hanoi (tests/hanoi.c)
101% 101% 102%    100% 101%  97%    100% 102%  99%    100% 107% 100% 
101% 101% 102%    100% 101%  98%    100%  99% 100%    100% 106%  99% 
Quicksort (tests/qsort.c)
109% 101%  99%    100% 103%  98%    100% 102%  98%    101% 104%  97% 
107% 100%  98%    100% 103%  97%    101% 104%  99%    102% 105%  98% 
Bresenham line drawing algorithm (tests/bresenham.c)
100% 101% 100%    101% 102% 100%    100% 101% 100%    101% 102% 100% 
100% 101% 100%    101% 102% 100%    100% 102% 100%    101% 102% 100% 
Bell Labs Benchmark B7: (integer statistics) (nongpl/bell-labs/b7.c)
---% ---% ---%    100% 104%  96%    101% 105% 150%     67% 103%  97% 
---% ---% ---%    100% 104%  96%    101% 105%  95%    101% 106% 135% 
Unsorted tests
Bell Labs B2 (tree creation) (nongpl/bell-labs/b2.c):       103% 114%  96% 
Bell Labs B3 (qsort for strings) (nongpl/bell-labs/b3.c):   102% 107%  98% 
Bell Labs B4 (tty driver fragt) (nongpl/bell-labs/b4.c):    106% 120%  96% 
Bell Labs B6 (symbol tbl insert) (nongpl/bell-labs/b5.c):   104% 107% 100% 
Bell Labs B6 (release buffers) (nongpl/bell-labs/b6.c):     107% 114%  94% 
Dhrystone (nongpl/dhry/dhry_1.c nongpl/dhry/dhry_2.c):      102% 104%  98% 
Palette approximation (tests/pal.c):                        100% 114% 100% 
XaoS internal loop (tests/xaos.c):                          100% 103% 101% 
Bzip2 block sorting loop (tests/bzip2.c):                   101% 101% 103% 
Slalom benchmark (nongpl/slalom.c):                         100% 102% 101% 

So as you can see, except for the pathological case of Ackerman's function
test, the cost of preferred_stack_boundary is now less than 6% in all tests.

Sat Jan 19 18:12:15 CET 2000  Jan Hubicka  <hubicka@freesoft.cz>
	* calls.c (compute_argument_block_size): New argument
	preferred_stack_boundary.
	(expand_call): Obtain preferred_stack_boundary required, update
	cfun->preferred_stack_boundary, update call of
	compute_argument_block_size
	(emit_library_call): Increate cfun->preferred_stack_boundary
	to PREFERRED_STACK_BOUNDARY
	(emit_library_call_value): Likewise.
	* explow.c (allocate_dynamic_stack_spave): Likewise.
	* reload1.c (reload): Align stack frame to
	cfun->stack_alignment_needed only.
	* toplev.c (rest_of_compilation): Set up DECL_ALIGN once reload
	is finished.
	* tree.h (tree_decl): Update comment.
	* function.c (prepare_function_start): Set
	cfun->preferred_stack_boundary
	* function.h (struct function): Add preferred_stack_boundary field.
	* integrate.c (expand_inline_function): Update
	cfun->preferred_stack_boundary and cfun->stack_alignment_needed.
	(copy_rtx_and_substitute): Align frame to stack_alignment_needed only.
	* print_tree.c (print_node): Print preferred stack boundary for decl.
	* tree.c (PREFERRED_STACK_BOUNDARY): Provide default value.
	(make_node): Initialize frame alignment to PREFERRED_STACK_BOUNDARY.
	
*** calls.c.noa	Wed Jan 19 08:32:53 2000
--- calls.c	Wed Jan 19 18:40:33 2000
*************** static void precompute_arguments 		PARAM
*** 148,154 ****
  							 struct arg_data *,
  							 struct args_size *));
  static int compute_argument_block_size		PARAMS ((int, 
! 							 struct args_size *));
  static void initialize_argument_information	PARAMS ((int,
  							 struct arg_data *,
  							 struct args_size *,
--- 148,155 ----
  							 struct arg_data *,
  							 struct args_size *));
  static int compute_argument_block_size		PARAMS ((int, 
! 							 struct args_size *,
! 							 int));
  static void initialize_argument_information	PARAMS ((int,
  							 struct arg_data *,
  							 struct args_size *,
*************** initialize_argument_information (num_act
*** 1159,1167 ****
     for arguments passed in registers.  */
  
  static int
! compute_argument_block_size (reg_parm_stack_space, args_size)
       int reg_parm_stack_space;
       struct args_size *args_size;
  {
    int unadjusted_args_size = args_size->constant;
  
--- 1160,1170 ----
     for arguments passed in registers.  */
  
  static int
! compute_argument_block_size (reg_parm_stack_space, args_size,
!     			     preferred_stack_boundary)
       int reg_parm_stack_space;
       struct args_size *args_size;
+      int preferred_stack_boundary ATTRIBUTE_UNUSED;
  {
    int unadjusted_args_size = args_size->constant;
  
*************** compute_argument_block_size (reg_parm_st
*** 1175,1182 ****
        args_size->constant = 0;
  
  #ifdef PREFERRED_STACK_BOUNDARY
!       if (PREFERRED_STACK_BOUNDARY != BITS_PER_UNIT)
! 	args_size->var = round_up (args_size->var, STACK_BYTES);
  #endif
  
        if (reg_parm_stack_space > 0)
--- 1178,1186 ----
        args_size->constant = 0;
  
  #ifdef PREFERRED_STACK_BOUNDARY
!       preferred_stack_boundary /= BITS_PER_UNIT;
!       if (preferred_stack_boundary != BITS_PER_UNIT)
! 	args_size->var = round_up (args_size->var, preferred_stack_boundary);
  #endif
  
        if (reg_parm_stack_space > 0)
*************** compute_argument_block_size (reg_parm_st
*** 1197,1206 ****
    else
      {
  #ifdef PREFERRED_STACK_BOUNDARY
        args_size->constant = (((args_size->constant
  			       + pending_stack_adjust
! 			       + STACK_BYTES - 1)
! 			      / STACK_BYTES * STACK_BYTES)
  			     - pending_stack_adjust);
  #endif
  
--- 1201,1212 ----
    else
      {
  #ifdef PREFERRED_STACK_BOUNDARY
+       preferred_stack_boundary /= BITS_PER_UNIT;
        args_size->constant = (((args_size->constant
  			       + pending_stack_adjust
! 			       + preferred_stack_boundary - 1)
! 			      / preferred_stack_boundary
! 			      * preferred_stack_boundary)
  			     - pending_stack_adjust);
  #endif
  
*************** expand_call (exp, target, ignore)
*** 1672,1677 ****
--- 1678,1688 ----
    rtx call_fusage = 0;
    register tree p;
    register int i;
+ #ifdef PREFERRED_STACK_BOUNDARY
+   int preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
+ #else
+   int preferred_stack_boundary = 0;
+ #endif
  
    /* The value of the function call can be put in a hard register.  But
       if -fcheck-memory-usage, code which invokes functions (and thus
*************** expand_call (exp, target, ignore)
*** 1691,1696 ****
--- 1702,1712 ----
  	fndecl = 0;
        else
  	{
+ 	  if (!DECL_SAVED_INSNS (fndecl)
+ 	      && !DECL_BUILT_IN (fndecl)
+ 	      && DECL_ALIGN (fndecl))
+ 	    preferred_stack_boundary = DECL_ALIGN (fndecl);
+ 
  	  if (!flag_no_inline
  	      && fndecl != current_function_decl
  	      && DECL_INLINE (fndecl)
*************** expand_call (exp, target, ignore)
*** 1901,1906 ****
--- 1917,1929 ----
    if (fndecl && DECL_NAME (fndecl))
      name = IDENTIFIER_POINTER (DECL_NAME (fndecl));
  
+   /* Ensure current function's preferred stack boundary is at least
+      what we need.  We don't have to increase alignment for recursive
+      functions.  */
+   if (cfun->preferred_stack_boundary < preferred_stack_boundary
+       && fndecl != current_function_decl)
+     cfun->preferred_stack_boundary = preferred_stack_boundary;
+ 
    /* See if this is a call to a function that can return more than once
       or a call to longjmp or malloc.  */
    special_function_p (fndecl, &returns_twice, &is_longjmp,
*************** expand_call (exp, target, ignore)
*** 2030,2036 ****
       and constant sizes must be combined, the size may have to be rounded,
       and there may be a minimum required size.  */
    unadjusted_args_size
!     = compute_argument_block_size (reg_parm_stack_space, &args_size);
  
    /* Now make final decision about preallocating stack space.  */
    must_preallocate = finalize_must_preallocate (must_preallocate,
--- 2053,2060 ----
       and constant sizes must be combined, the size may have to be rounded,
       and there may be a minimum required size.  */
    unadjusted_args_size
!     = compute_argument_block_size (reg_parm_stack_space, &args_size,
! 				   preferred_stack_boundary);
  
    /* Now make final decision about preallocating stack space.  */
    must_preallocate = finalize_must_preallocate (must_preallocate,
*************** emit_library_call VPARAMS((rtx orgfun, i
*** 2671,2676 ****
--- 2695,2705 ----
  
    push_temp_slots ();
  
+   /* Ensure current function's preferred stack boundary is at least
+      what we need.  */
+   if (cfun->preferred_stack_boundary < PREFERRED_STACK_BOUNDARY)
+     cfun->preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
+ 
    for (count = 0; count < nargs; count++)
      {
        rtx val = va_arg (p, rtx);
*************** emit_library_call_value VPARAMS((rtx org
*** 3159,3164 ****
--- 3188,3198 ----
  
    is_const = no_queue;
    fun = orgfun;
+ 
+   /* Ensure current function's preferred stack boundary is at least
+      what we need.  */
+   if (cfun->preferred_stack_boundary < PREFERRED_STACK_BOUNDARY)
+     cfun->preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
  
    /* If this kind of value comes back in memory,
       decide where in memory it should come back.  */
*** explow.c.noa	Wed Jan 19 08:58:43 2000
--- explow.c	Wed Jan 19 09:18:33 2000
*************** allocate_dynamic_stack_space (size, targ
*** 1178,1183 ****
--- 1178,1190 ----
    if (GET_MODE (size) != VOIDmode && GET_MODE (size) != Pmode)
      size = convert_to_mode (Pmode, size, 1);
  
+   /* We can't attempt to minimize alignment necessary, because we don't
+      know the final value of preferred_stack_boundary yet while executing
+      this code.  */
+ #ifdef PREFERRED_STACK_BOUNDARY
+   cfun->preferred_stack_boundary = PREFERRED_STACK_BOUNDARY;
+ #endif
+ 
    /* We will need to ensure that the address we return is aligned to
       BIGGEST_ALIGNMENT.  If STACK_DYNAMIC_OFFSET is defined, we don't
       always know its final value at this point in the compilation (it 
*** function.c.noa	Wed Jan 19 08:27:01 2000
--- function.c	Wed Jan 19 10:00:08 2000
*************** prepare_function_start ()
*** 5727,5732 ****
--- 5727,5733 ----
    cfun->original_arg_vector = 0;  
  
    cfun->stack_alignment_needed = 0;
+   cfun->preferred_stack_boundary = STACK_BOUNDARY;
  
    /* Set if a call to setjmp is seen.  */
    current_function_calls_setjmp = 0;
*** function.h.noa	Wed Jan 19 07:53:36 2000
--- function.h	Wed Jan 19 07:55:57 2000
*************** struct function
*** 456,461 ****
--- 456,463 ----
    struct machine_function *machine;
    /* The largest alignment of slot allocated on the stack.  */
    int stack_alignment_needed;
+   /* Preferred alignment of the end of stack frame.  */
+   int preferred_stack_boundary;
  
    /* Language-specific code can use this to store whatever it likes.  */
    struct language_function *language;
*** integrate.c.noa	Wed Jan 19 08:53:26 2000
--- integrate.c	Wed Jan 19 09:49:39 2000
*************** expand_inline_function (fndecl, parms, t
*** 602,607 ****
--- 602,613 ----
  
    nargs = list_length (DECL_ARGUMENTS (fndecl));
  
+   if (cfun->preferred_stack_boundary < inl_f->preferred_stack_boundary)
+     cfun->preferred_stack_boundary = inl_f->preferred_stack_boundary;
+ 
+   if (cfun->stack_alignment_needed < inl_f->stack_alignment_needed)
+     cfun->stack_alignment_needed = inl_f->stack_alignment_needed;
+ 
    /* Check that the parms type match and that sufficient arguments were
       passed.  Since the appropriate conversions or default promotions have
       already been applied, the machine modes should match exactly.  */
*************** copy_rtx_and_substitute (orig, map, for_
*** 1619,1631 ****
  	    {
  	      rtx loc, seq;
  	      int size = get_func_frame_size (DECL_SAVED_INSNS (map->fndecl));
  
  #ifdef FRAME_GROWS_DOWNWARD
  	      /* In this case, virtual_stack_vars_rtx points to one byte
  		 higher than the top of the frame area.  So make sure we
  		 allocate a big enough chunk to keep the frame pointer
  		 aligned like a real one.  */
! 	      size = CEIL_ROUND (size, BIGGEST_ALIGNMENT / BITS_PER_UNIT);
  #endif
  	      start_sequence ();
  	      loc = assign_stack_temp (BLKmode, size, 1);
--- 1627,1643 ----
  	    {
  	      rtx loc, seq;
  	      int size = get_func_frame_size (DECL_SAVED_INSNS (map->fndecl));
+ 	      int alignment
+ 		= (DECL_SAVED_INSNS (map->fndecl)->stack_alignment_needed
+ 		   / BITS_PER_UNIT);
  
  #ifdef FRAME_GROWS_DOWNWARD
  	      /* In this case, virtual_stack_vars_rtx points to one byte
  		 higher than the top of the frame area.  So make sure we
  		 allocate a big enough chunk to keep the frame pointer
  		 aligned like a real one.  */
! 	      if (alignment)
! 		size = CEIL_ROUND (size, alignment);
  #endif
  	      start_sequence ();
  	      loc = assign_stack_temp (BLKmode, size, 1);
*** print-tree.c.noa	Wed Jan 19 10:25:44 2000
--- print-tree.c	Wed Jan 19 10:26:56 2000
*************** print_node (file, prefix, node, indent)
*** 403,408 ****
--- 403,410 ----
  	fprintf (file, " frame_size %d", DECL_FRAME_SIZE (node));
        else if (DECL_BUILT_IN (node))
  	fprintf (file, " built-in code %d", DECL_FUNCTION_CODE (node));
+       else
+ 	fprintf (file, " preffered_stack_boundary %d", DECL_ALIGN (node));
        if (TREE_CODE (node) == FIELD_DECL)
  	print_node (file, "bitpos", DECL_FIELD_BITPOS (node), indent + 4);
        if (DECL_POINTER_ALIAS_SET_KNOWN_P (node))
*** reload1.c.noa	Wed Jan 19 09:55:32 2000
--- reload1.c	Wed Jan 19 18:29:20 2000
*************** reload (first, global, dumpfile)
*** 841,851 ****
  
        HOST_WIDE_INT starting_frame_size;
  
!       /* Round size of stack frame to BIGGEST_ALIGNMENT.  This must be done
  	 here because the stack size may be a part of the offset computation
  	 for register elimination, and there might have been new stack slots
  	 created in the last iteration of this loop.   */
!       assign_stack_local (BLKmode, 0, 0);
  
        starting_frame_size = get_frame_size ();
  
--- 843,854 ----
  
        HOST_WIDE_INT starting_frame_size;
  
!       /* Round size of stack frame to stack_alignment_needed.  This must be done
  	 here because the stack size may be a part of the offset computation
  	 for register elimination, and there might have been new stack slots
  	 created in the last iteration of this loop.   */
!       if (cfun->stack_alignment_needed)
!         assign_stack_local (BLKmode, 0, cfun->stack_alignment_needed);
  
        starting_frame_size = get_frame_size ();
  
*** toplev.c.noa	Wed Jan 19 09:02:11 2000
--- toplev.c	Wed Jan 19 09:04:02 2000
*************** rest_of_compilation (decl)
*** 3410,3415 ****
--- 3410,3420 ----
    if (rebuild_label_notes_after_reload)
      TIMEVAR (jump_time, rebuild_jump_labels (insns));
  
+   /* Calculate stack boundary alignment preffered while calling the function.  */
+   if (!DECL_SAVED_INSNS (current_function_decl))
+     DECL_ALIGN (current_function_decl) = MAX (cfun->stack_alignment_needed,
+ 					      cfun->preferred_stack_boundary);
+ 
    /* On some machines, the prologue and epilogue code, or parts thereof,
       can be represented as RTL.  Doing so lets us schedule insns between
       it and the rest of the code and also allows delayed branch
*** tree.h.noa	Wed Jan 19 07:59:18 2000
--- tree.h	Wed Jan 19 07:59:22 2000
*************** struct tree_decl
*** 1395,1402 ****
    unsigned malloc_flag : 1;
    unsigned no_limit_stack : 1;
  
!   /* For a FUNCTION_DECL, if inline, this is the size of frame needed.
!      If built-in, this is the code for which built-in function.
       For other kinds of decls, this is DECL_ALIGN.  */
    union {
      int i;
--- 1395,1403 ----
    unsigned malloc_flag : 1;
    unsigned no_limit_stack : 1;
  
!   /* For a FUNCTION_DECL, if inline, this is the size of frame needed,
!      otherwise if known, this is the frame preferred stack boundary
!      alignment.  If built-in, this is the code for which built-in function.
       For other kinds of decls, this is DECL_ALIGN.  */
    union {
      int i;
*** tree.c.noa	Wed Jan 19 18:54:19 2000
--- tree.c	Wed Jan 19 09:14:28 2000
*************** Boston, MA 02111-1307, USA.  */
*** 43,48 ****
--- 43,52 ----
  #include "toplev.h"
  #include "ggc.h"
  
+ #if !defined PREFERRED_STACK_BOUNDARY && defined STACK_BOUNDARY
+ #define PREFERRED_STACK_BOUNDARY STACK_BOUNDARY
+ #endif
+ 
  #define obstack_chunk_alloc xmalloc
  #define obstack_chunk_free free
  /* obstack.[ch] explicitly declined to prototype this. */
*************** make_node (code)
*** 1060,1065 ****
--- 1064,1073 ----
      case 'd':
        if (code != FUNCTION_DECL)
  	DECL_ALIGN (t) = 1;
+ #ifdef PREFERRED_STACK_BOUDNARY
+       else
+ 	DECL_ALIGN (t) = PREFERRED_STACK_BOUNDARY;
+ #endif
        DECL_IN_SYSTEM_HEADER (t)
  	= in_system_header && (obstack == &permanent_obstack);
        DECL_SOURCE_LINE (t) = lineno;


More information about the Gcc-patches mailing list