This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: RFC: Faster for_each_rtx-like iterators


Trevor Saunders <tsaunders@mozilla.com> writes:
> On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote:
>> I noticed for_each_rtx showing up in profiles and thought I'd have a go
>> at using worklist-based iterators instead.  So far I have three:
>> 
>>   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
>>   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
>>   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
>> 
>> with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
>> 
>> I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because
>> most walks really don't modify the structure.  I think we should
>> encourage const_rtxes to be used wherever possible.  E.g. it might
>> make it easier to have non-GC storage for temporary rtxes in future.
>> 
>> I've locally replaced all for_each_rtx calls in the generic code with
>> these iterators and they make things reproducibly faster.  The speed-up
>> on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
>> but maybe that's enough to justify the churn.
>
> seems pretty nice, and it seems like it'll make code a little more
> readable too :)
>
>> Implementation-wise, the main observation is that most subrtxes are part
>> of a single contiguous sequence of "e" fields.  E.g. when compiling an
>> oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
>> subrtxes of 7,636,542 rtxes.  Of those:
>> 
>> (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
>> (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
>>                       no "E" fields, and
>> (C)    43,532 (00.6%) are more complicated.
>> 
>> (A) is really a special case of (B) in which the block has zero length.
>> Those are the only two cases that really need to be handled inline.
>> The implementation does this by having a mapping from an rtx code to the
>> bounds of its "e" sequence, in the form of a start index and count.
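
A minimal sketch of how such a per-code entry might be derived from an rtx
format string (`rtx_bounds` and `compute_bounds` are invented names for
illustration; the real table is built from GCC's format strings at startup):

```cpp
#include <climits>

/* Hypothetical sketch: the bounds of the single contiguous block of
   "e" fields for one rtx code, derived from its format string.  */
struct rtx_bounds { unsigned char start; unsigned char count; };

/* Return the start index and count of the contiguous run of 'e'
   characters in FORMAT.  A count of UCHAR_MAX marks the complicated
   case (C): any 'E' vector field, or a second separate run of 'e's.  */
static rtx_bounds
compute_bounds (const char *format)
{
  rtx_bounds b = { 0, 0 };
  bool in_run = false, after_run = false;
  for (unsigned char i = 0; format[i]; ++i)
    if (format[i] == 'E' || (format[i] == 'e' && after_run))
      {
	rtx_bounds complicated = { 0, UCHAR_MAX };
	return complicated;
      }
    else if (format[i] == 'e')
      {
	if (!in_run)
	  {
	    b.start = i;
	    in_run = true;
	  }
	b.count++;
      }
    else if (in_run)
      {
	in_run = false;
	after_run = true;
      }
  return b;
}
```

E.g. a format like "see" gives start 1, count 2, while anything containing
an 'E' maps to the UCHAR_MAX sentinel.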
>> 
>> Out of (C), the vast majority (43,509) are PARALLELs.  However, as you'd
>> probably expect, bloating the inline code with that case made things
>> slower rather than faster.
>> 
>> The vast majority (in fact all in the combine.ii run above) of iterations
>> can be done with a 16-element stack worklist.  We obviously still need a
>> heap fallback for the pathological cases though.
>> 
>> I spent a bit of time trying different iterator implementations and
>> seeing which produced the best code.  Specific results from that were:
>> 
>> - The storage used for the worklist is separate from the iterator,
>>   in order to avoid capturing iterator fields.
>> 
>> - Although the natural type of the storage would be auto_vec <..., 16>,
>>   that produced some overhead compared with a separate stack array and heap
>>   vector pointer.  With the heap vector pointer, the only overhead is an
>>   assignment in the constructor and an "if (x) release (x)"-style sequence
>>   in the destructor.  I think the extra complication over auto_vec is worth
>>   it because in this case the heap version is so very rarely needed.
>
> hm, where does the overhead come from exactly?  it seems like if it's
> faster to use vec<T, va_heap, vl_embed> *foo; we should fix something
> about vectors, since this isn't the only place it could matter.  does
> it matter if you use vec<T, va_heap, vl_embed> * or vec<T>?  the second
> is basically just a wrapper around the former, so I'd expect it has no
> effect.  I'm not saying you're doing the wrong thing here, but if we
> can make generic vectors faster we probably should ;) or is the issue
> the __builtin_expect()s you can add?

Part of the problem is that by having an array in the vec itself,
the other fields effectively have their address taken too.
So m_alloc, m_num and m_using_auto_storage need to be set up
and maintained on the stack, even though we're almost sure that they
will never be used.
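
To illustrate (a sketch with invented names, not GCC's actual types):
keeping the stack buffer as a plain array next to a lazily allocated heap
vector means none of that bookkeeping exists on the common path:

```cpp
#include <cstddef>
#include <vector>

/* Hypothetical sketch of the storage layout: a plain stack array plus
   a heap vector pointer that is only allocated if the array overflows.
   Unlike an auto_vec-style type, there is no capacity/length/"using
   auto storage" state to set up and maintain in memory; the only
   overhead is one assignment in the constructor and the
   "if (x) release (x)"-style sequence in the destructor.  */
template <typename T, size_t N>
struct worklist_storage
{
  T stack[N];			/* covers the common, shallow walks */
  std::vector<T> *heap;		/* pathological fallback, rarely used */

  worklist_storage () : heap (nullptr) {}
  ~worklist_storage () { delete heap; }	/* delete of null is a no-op */
};
```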

>> - The maximum number of fields in (B)-type rtxes is 3.  We get better
>>   code by making that explicit rather than having a general loop.
>> 
>> - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single
>>   check to test for that and for cases where the stack worklist is
>>   too small.
>
>  can we use uint8_t?

We don't really use that in GCC yet.  I don't mind setting a precedent
though :-)
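
Numerically, the single-check trick looks like this (a sketch; the
function name is invented, and LOCAL_ELEMS stands for the 16-element
stack worklist):

```cpp
#include <climits>

const unsigned int LOCAL_ELEMS = 16;	/* stack worklist size, < UCHAR_MAX */

/* Hypothetical sketch: because complicated (C) codes are given an "e"
   count of UCHAR_MAX, and LOCAL_ELEMS is well below UCHAR_MAX, one
   comparison detects both "this code needs the out-of-line walk" and
   "the stack worklist is too small for this push".  */
static bool
needs_slow_path (unsigned int num_queued, unsigned char count)
{
  unsigned int end = num_queued + count;
  return end > LOCAL_ELEMS;
}
```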

>> To give an example:
>> 
>> /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE
>>    whose UID is greater than the int uid that D points to.  */
>> 
>> static int
>> refs_newer_value_cb (rtx *x, void *d)
>> {
>>   if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d)
>>     return 1;
>> 
>>   return 0;
>> }
>> 
>> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>    that of V.  */
>> 
>> static bool
>> refs_newer_value_p (rtx expr, rtx v)
>> {
>>   int minuid = CSELIB_VAL_PTR (v)->uid;
>> 
>>   return for_each_rtx (&expr, refs_newer_value_cb, &minuid);
>> }
>> 
>> becomes:
>> 
>> /* Return TRUE if EXPR refers to a VALUE whose uid is greater than
>>    that of V.  */
>> 
>> static bool
>> refs_newer_value_p (const_rtx expr, rtx v)
>> {
>>   int minuid = CSELIB_VAL_PTR (v)->uid;
>>   subrtx_iterator::array_type array;
>
> some reason to not hide it in the macro?

I wanted to make the macro a true for loop, with no extra baggage.
AFAIK there's no way of declaring both the array and the iterator
in the first part of the for loop without making them part of the
same object (and thus capturing the iterator fields again).
A "do ... while (0)" wrapper might be too surprising.

Also, keeping the array separate allows you to reuse it between
iterations, so you only do the allocation and free once.  E.g.:

  subrtx_iterator::array_type array;
  for (rtx insn = ...; insn; insn = NEXT_INSN (insn))
    if (INSN_P (insn))
      FOR_EACH_SUBRTX (...)

>> +template <typename T>
>> +inline generic_subrtx_iterator <T>::~generic_subrtx_iterator ()
>> +{
>> +}
>
>  What's wrong with just letting the compiler generate that for you?

Bah, thanks for catching that.  It was left over from an earlier
experiment in which the destructor did do some cleanup of the array.

>> @@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE];
>>  static unsigned int
>>  num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1];
>>  
>> +/* Store X into index I of ARRAY.  ARRAY is known to have at least I
>> +   elements.  Return the new base of ARRAY.  */
>> +
>> +template <typename T>
>> +typename T::value_type *
>> +generic_subrtx_iterator <T>::add_single_to_queue (array_type &array,
>> +						  value_type *base,
>> +						  size_t i, value_type x)
>> +{
>> +  if (base == array.stack)
>> +    {
>> +      if (i < LOCAL_ELEMS)
>> +	{
>> +	  base[i] = x;
>> +	  return base;
>> +	}
>
> it seems like this first little bit might be worth inlining, but I guess
> you tested and it wasn't.

Yeah.  The cases where an "e" or entire "E" fit within the stack buffer
are already handled inline.  So in practice we only get here if we know
we have blown or are about to blow the stack buffer.  The case it
handles is where part of an "E" field fits within the buffer and part of
it doesn't.

One option would have been to force the heap buffer to be used when
entering this function.  But that would then make the simple check:

		  if (m_end > LOCAL_ELEMS)
		    m_base = m_array.heap->address ();

unsafe.  It seemed better to make add_single_to_queue general and
handle both stack and heap pushes.
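
For completeness, a sketch of how the rest of that function can work
(invented names, with std::vector standing in for GCC's vec; it assumes
pushes arrive sequentially, i.e. I equals the current element count):

```cpp
#include <cstddef>
#include <vector>

const size_t LOCAL_ELEMS = 16;	/* size of the stack worklist */

/* Hypothetical sketch: store X at index I, spilling from the stack
   buffer to a newly allocated heap vector the first time the buffer
   overflows.  Returns the (possibly new) base of the queue, so the
   caller's base pointer stays valid across the spill.  */
template <typename T>
T *
add_single_to_queue (T (&stack)[LOCAL_ELEMS], std::vector<T> *&heap,
		     T *base, size_t i, T x)
{
  if (base == stack)
    {
      if (i < LOCAL_ELEMS)
	{
	  base[i] = x;
	  return base;
	}
      /* Stack buffer exhausted: copy its contents to the heap.  */
      heap = new std::vector<T> (stack, stack + LOCAL_ELEMS);
    }
  heap->push_back (x);
  return heap->data ();
}
```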

Thanks,
Richard

