This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
On Wed, May 07, 2014 at 09:52:49PM +0100, Richard Sandiford wrote: > I noticed for_each_rtx showing up in profiles and thought I'd have a go > at using worklist-based iterators instead. So far I have three: > > FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx > FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx > FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx * > > with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement. > > I made FOR_EACH_SUBRTX the "default" (unsuffixed) version because > most walks really don't modify the structure. I think we should > encourage const_rtxes to be used whereever possible. E.g. it might > make it easier to have non-GC storage for temporary rtxes in future. > > I've locally replaced all for_each_rtx calls in the generic code with > these iterators and they make things reproducably faster. The speed-up > on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%, > but maybe that's enough to justify the churn. seems pretty nice, and it seems like it'll make code a little more readable too :) > Implementation-wise, the main observation is that most subrtxes are part > of a single contiguous sequence of "e" fields. E.g. when compiling an > oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the > subrtxes of 7,636,542 rtxes. Of those: > > (A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields, > (B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and > no "E" fields, and > (C) 43,532 (00.6%) are more complicated. > > (A) is really a special case of (B) in which the block has zero length. > Those are the only two cases that really need to be handled inline. > The implementation does this by having a mapping from an rtx code to the > bounds of its "e" sequence, in the form of a start index and count. > > Out of (C), the vast majority (43,509) are PARALLELs. However, as you'd > probably expect, bloating the inline code with that case made things > slower rather than faster. > > The vast majority (in fact all in the combine.ii run above) of iterations > can be done with a 16-element stack worklist. We obviously still need a > heap fallback for the pathological cases though. > > I spent a bit of time trying different iterator implementations and > seeing which produced the best code. Specific results from that were: > > - The storage used for the worklist is separate from the iterator, > in order to avoid capturing iterator fields. > > - Although the natural type of the storage would be auto_vec <..., 16>, > that produced some overhead compared with a separate stack array and heap > vector pointer. With the heap vector pointer, the only overhead is an > assignment in the constructor and an "if (x) release (x)"-style sequence > in the destructor. I think the extra complication over auto_vec is worth > it because in this case the heap version is so very rarely needed. hm, where does the overhead come from exactly? it seems like if its faster to use vec<T, va_heap, vl_embedd> *foo; we should fix something about vectors since this isn't the only place it could matter. does it matter if you use vec<T, va_heap, vl_embedd> * or vec<T> ? the second is basically just a wrapper around the former I'd expect has no effect. I'm not saying you're doing the wrong thing here, but if we can make generic vectors faster we probably should ;) or is the issue the __builtin_expect()s you can add? > - Several existing for_each_rtx callbacks have something like: > > if (GET_CODE (x) == CONST) > return -1; > > or: > > if (CONSTANT_P (x)) > return -1; > > to avoid walking subrtxes of constants. That can be done without > extra code checks and branches by having a separate code->bound > mapping in which all constants are treated as leaf rtxes. This usage > should be common enough to outweigh the cache penalty of two arrays. > > The choice between iterating over constants or not is given in the > final parameter of the FOR_EACH_* iterator. less repitition \O/ > - The maximum number of fields in (B)-type rtxes is 3. We get better > code by making that explicit rather than having a general loop. > > - (C) codes map to an "e" count of UCHAR_MAX, so we can use a single > check to test for that and for cases where the stack worklist is > too small. can we use uint8_t? > To give an example: > > /* Callback for for_each_rtx, that returns 1 upon encountering a VALUE > whose UID is greater than the int uid that D points to. */ > > static int > refs_newer_value_cb (rtx *x, void *d) > { > if (GET_CODE (*x) == VALUE && CSELIB_VAL_PTR (*x)->uid > *(int *)d) > return 1; > > return 0; > } > > /* Return TRUE if EXPR refers to a VALUE whose uid is greater than > that of V. */ > > static bool > refs_newer_value_p (rtx expr, rtx v) > { > int minuid = CSELIB_VAL_PTR (v)->uid; > > return for_each_rtx (&expr, refs_newer_value_cb, &minuid); > } > > becomes: > > /* Return TRUE if EXPR refers to a VALUE whose uid is greater than > that of V. */ > > static bool > refs_newer_value_p (const_rtx expr, rtx v) > { > int minuid = CSELIB_VAL_PTR (v)->uid; > subrtx_iterator::array_type array; some reason to not hide it in the macro? > FOR_EACH_SUBRTX (iter, array, expr, NONCONST) > if (GET_CODE (*iter) == VALUE && CSELIB_VAL_PTR (*iter)->uid > minuid) > return true; > return false; > } > > The iterator also allows subrtxes of a specific rtx to be skipped; > this is the equivalent of returning -1 from a for_each_rtx callback. > It also allows the current rtx to be replaced in the worklist by > another. E.g.: > > static void > mark_constants_in_pattern (rtx insn) > { > subrtx_iterator::array_type array; > FOR_EACH_SUBRTX (iter, array, PATTERN (insn), ALL) > { > const_rtx x = *iter; > if (GET_CODE (x) == SYMBOL_REF) > { > if (CONSTANT_POOL_ADDRESS_P (x)) > { > struct constant_descriptor_rtx *desc = SYMBOL_REF_CONSTANT (x); > if (desc->mark == 0) > { > desc->mark = 1; > iter.substitute (desc->constant); > } > } > else if (TREE_CONSTANT_POOL_ADDRESS_P (x)) > { > tree decl = SYMBOL_REF_DECL (x); > if (!TREE_ASM_WRITTEN (DECL_INITIAL (decl))) > { > n_deferred_constants--; > output_constant_def_contents (CONST_CAST_RTX (x)); > } > } > } > } > } > > where the iter.substitute avoids a recursive walk. (The CONST_CAST_RTX > is "temporary".) > > Does this look like it's worth pursuing? If so, is it OK to start > submitting now or should it wait until 4.9.1 is out? If it is OK, > I plan to convert the ports too and get rid of for_each_rtx. > > Thanks, > Richard > > > diff --git a/gcc/rtl-iter.h b/gcc/rtl-iter.h > new file mode 100644 > index 0000000..1bc171d > --- /dev/null > +++ b/gcc/rtl-iter.h > @@ -0,0 +1,293 @@ > +/* RTL iterators > + Copyright (C) 2014 Free Software Foundation, Inc. > + > +This file is part of GCC. > + > +GCC is free software; you can redistribute it and/or modify it under > +the terms of the GNU General Public License as published by the Free > +Software Foundation; either version 3, or (at your option) any later > +version. > + > +GCC is distributed in the hope that it will be useful, but WITHOUT ANY > +WARRANTY; without even the implied warranty of MERCHANTABILITY or > +FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License > +for more details. > + > +You should have received a copy of the GNU General Public License > +along with GCC; see the file COPYING3. If not see > +<http://www.gnu.org/licenses/>. */ > + > +/* This structure describes the subrtxes of an rtx as follows: > + > + - if the rtx has no subrtxes, START and COUNT are both 0. > + > + - if all the subrtxes of an rtx are stored in a contiguous block > + of XEXPs ("e"s), START is the index of the first XEXP and COUNT > + is the number of them. > + > + - otherwise START is arbitrary and COUNT is UCHAR_MAX. */ > +struct rtx_subrtx_bound_info { > + unsigned char start; > + unsigned char count; > +}; > +extern rtx_subrtx_bound_info rtx_all_subrtx_bounds[]; > +extern rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[]; > + > +/* Return true if CODE has no subrtxes. */ > + > +static inline bool > +leaf_code_p (enum rtx_code code) > +{ > + return rtx_all_subrtx_bounds[code].count == 0; > +} > + > +/* Used to iterate over subrtxes of an rtx. T abstracts the type of > + access. */ > +template <typename T> > +class generic_subrtx_iterator > +{ > + static const size_t LOCAL_ELEMS = 16; > + typedef typename T::value_type value_type; > + typedef typename T::rtx_type rtx_type; > + typedef typename T::rtunion_type rtunion_type; > + > +public: > + struct array_type > + { > + array_type (); > + ~array_type (); > + value_type stack[LOCAL_ELEMS]; > + vec <value_type, va_heap, vl_embed> *heap; > + }; > + generic_subrtx_iterator (array_type &, value_type, > + const rtx_subrtx_bound_info *); > + ~generic_subrtx_iterator (); > + > + value_type operator * () const; > + bool at_end () const; > + void next (); > + void skip_subrtxes (); > + void substitute (value_type); > + > +private: > + /* The bounds to use for iterating over subrtxes. */ > + const rtx_subrtx_bound_info *m_bounds; > + > + /* The storage used for the worklist. */ > + array_type &m_array; > + > + /* The current rtx. */ > + value_type m_current; > + > + /* The base of the current worklist. */ > + value_type *m_base; > + > + /* The number of subrtxes in M_BASE. */ > + size_t m_end; > + > + /* The following booleans shouldn't end up in registers or memory > + but just direct control flow. */ > + > + /* True if the iteration is over. */ > + bool m_done; > + > + /* True if we should skip the subrtxes of M_CURRENT. */ > + bool m_skip; > + > + /* True if M_CURRENT has been replaced with a different rtx. */ > + bool m_substitute; > + > + static void free_array (array_type &); > + static size_t add_subrtxes_to_queue (array_type &, value_type *, size_t, > + rtx_type); > + static value_type *add_single_to_queue (array_type &, value_type *, size_t, > + value_type); > +}; > + > +template <typename T> > +inline generic_subrtx_iterator <T>::array_type::array_type () : heap (0) {} > + > +template <typename T> > +inline generic_subrtx_iterator <T>::array_type::~array_type () > +{ > + if (__builtin_expect (heap != 0, false)) > + free_array (*this); > +} > + > +/* Iterate over X and its subrtxes, in arbitrary order. Use ARRAY to > + store the worklist. We use an external array in order to avoid > + capturing the fields of this structure when taking the address of > + the array. Use BOUNDS to find the bounds of simple "e"-string codes. */ > + > +template <typename T> > +inline generic_subrtx_iterator <T>:: > +generic_subrtx_iterator (array_type &array, value_type x, > + const rtx_subrtx_bound_info *bounds) > + : m_bounds (bounds), > + m_array (array), > + m_current (x), > + m_base (m_array.stack), > + m_end (0), > + m_done (false), > + m_skip (false), > + m_substitute (false) > +{ > +} > + > +template <typename T> > +inline generic_subrtx_iterator <T>::~generic_subrtx_iterator () > +{ > +} What's wrong with just letting the compiler generate that for you? > +/* Return the current subrtx. */ > + > +template <typename T> > +inline typename T::value_type > +generic_subrtx_iterator <T>::operator * () const > +{ > + return m_current; > +} > + > +/* Return true if the iteration has finished. */ > + > +template <typename T> > +inline bool > +generic_subrtx_iterator <T>::at_end () const > +{ > + return m_done; > +} > + > +/* Move on to the next subrtx. */ > + > +template <typename T> > +inline void > +generic_subrtx_iterator <T>::next () > +{ > + if (m_substitute) > + { > + m_substitute = false; > + m_skip = false; > + return; > + } > + if (!m_skip) > + { > + /* Add the subrtxes of M_CURRENT. */ > + rtx_type x = T::get_rtx (m_current); > + if (__builtin_expect (x != 0, true)) > + { > + enum rtx_code code = GET_CODE (x); > + ssize_t count = m_bounds[code].count; > + if (count > 0) > + { > + /* Handle the simple case of a single "e" block that is known > + to fit into the current array. */ > + if (__builtin_expect (m_end + count <= LOCAL_ELEMS + 1, true)) > + { > + /* Set M_CURRENT to the first subrtx and queue the rest. */ > + ssize_t start = m_bounds[code].start; > + rtunion_type *src = &x->u.fld[start]; > + if (__builtin_expect (count > 2, false)) > + m_base[m_end++] = T::get_value (src[2].rt_rtx); > + if (count > 1) > + m_base[m_end++] = T::get_value (src[1].rt_rtx); > + m_current = T::get_value (src[0].rt_rtx); > + return; > + } > + /* Handle cases which aren't simple "e" sequences or where > + the sequence might overrun M_BASE. */ > + count = add_subrtxes_to_queue (m_array, m_base, m_end, x); > + if (count > 0) > + { > + m_end += count; > + if (m_end > LOCAL_ELEMS) > + m_base = m_array.heap->address (); > + m_current = m_base[--m_end]; > + return; > + } > + } > + } > + } > + else > + m_skip = false; > + if (m_end == 0) > + m_done = true; > + else > + m_current = m_base[--m_end]; > +} > + > +/* Skip the subrtxes of the current rtx. */ > + > +template <typename T> > +inline void > +generic_subrtx_iterator <T>::skip_subrtxes () > +{ > + m_skip = true; > +} > + > +/* Ignore the subrtxes of the current rtx and look at X instead. */ > + > +template <typename T> > +inline void > +generic_subrtx_iterator <T>::substitute (value_type x) > +{ > + m_substitute = true; > + m_current = x; > +} > + > +/* Iterators for const_rtx. */ > +struct const_rtx_accessor > +{ > + typedef const_rtx value_type; > + typedef const_rtx rtx_type; > + typedef const rtunion rtunion_type; > + static rtx_type get_rtx (value_type x) { return x; } > + static value_type get_value (rtx_type x) { return x; } > +}; > +typedef generic_subrtx_iterator <const_rtx_accessor> subrtx_iterator; > + > +/* Iterators for non-constant rtx. */ > +struct rtx_var_accessor > +{ > + typedef rtx value_type; > + typedef rtx rtx_type; > + typedef rtunion rtunion_type; > + static rtx_type get_rtx (value_type x) { return x; } > + static value_type get_value (rtx_type x) { return x; } > +}; > +typedef generic_subrtx_iterator <rtx_var_accessor> subrtx_var_iterator; > + > +/* Iterators for rtx *. */ > +struct rtx_ptr_accessor > +{ > + typedef rtx *value_type; > + typedef rtx rtx_type; > + typedef rtunion rtunion_type; > + static rtx_type get_rtx (value_type ptr) { return *ptr; } > + static value_type get_value (rtx_type &x) { return &x; } > +}; > +typedef generic_subrtx_iterator <rtx_ptr_accessor> subrtx_ptr_iterator; > + > +#define ALL_BOUNDS rtx_all_subrtx_bounds > +#define NONCONST_BOUNDS rtx_nonconst_subrtx_bounds > + > +/* Use ITER to iterate over const_rtx X and its recursive subrtxes, > + using subrtx_iterator::array ARRAY as the storage for the worklist. > + ARRAY can be reused for multiple consecutive iterations but shouldn't > + be shared by two concurrent iterations. TYPE is ALL if all subrtxes > + are of interest or NONCONST if it is safe to ignore subrtxes of > + constants. */ > +#define FOR_EACH_SUBRTX(ITER, ARRAY, X, TYPE) \ > + for (subrtx_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \ > + ITER.next ()) > + > +/* Like FOR_EACH_SUBRTX, but iterate over subrtxes of an rtx X. */ > +#define FOR_EACH_SUBRTX_VAR(ITER, ARRAY, X, TYPE) \ > + for (subrtx_var_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \ > + ITER.next ()) > + > +/* Like FOR_EACH_SUBRTX, but iterate over subrtx pointers of rtx pointer X. > + For example, if X is &PATTERN (insn) and the pattern is a SET, iterate > + over &PATTERN (insn), &SET_DEST (PATTERN (insn)), etc. */ > +#define FOR_EACH_SUBRTX_PTR(ITER, ARRAY, X, TYPE) \ > + for (subrtx_ptr_iterator ITER (ARRAY, X, TYPE##_BOUNDS); !ITER.at_end (); \ > + ITER.next ()) > diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c > index f3471b1..df52788 100644 > --- a/gcc/rtlanal.c > +++ b/gcc/rtlanal.c > @@ -37,6 +37,7 @@ along with GCC; see the file COPYING3. If not see > #include "tree.h" > #include "emit-rtl.h" /* FIXME: Can go away once crtl is moved to rtl.h. */ > #include "addresses.h" > +#include "rtl-iter.h" > > /* Forward declarations */ > static void set_of_1 (rtx, const_rtx, void *); > @@ -62,6 +63,9 @@ static unsigned int num_sign_bit_copies1 (const_rtx, enum machine_mode, const_rt > -1 if a code has no such operand. */ > static int non_rtx_starting_operands[NUM_RTX_CODE]; > > +rtx_subrtx_bound_info rtx_all_subrtx_bounds[NUM_RTX_CODE]; > +rtx_subrtx_bound_info rtx_nonconst_subrtx_bounds[NUM_RTX_CODE]; > + > /* Truncation narrows the mode from SOURCE mode to DESTINATION mode. > If TARGET_MODE_REP_EXTENDED (DESTINATION, DESTINATION_REP) is > SIGN_EXTEND then while narrowing we also have to enforce the > @@ -78,6 +82,93 @@ static int non_rtx_starting_operands[NUM_RTX_CODE]; > static unsigned int > num_sign_bit_copies_in_rep[MAX_MODE_INT + 1][MAX_MODE_INT + 1]; > > +/* Store X into index I of ARRAY. ARRAY is known to have at least I > + elements. Return the new base of ARRAY. */ > + > +template <typename T> > +typename T::value_type * > +generic_subrtx_iterator <T>::add_single_to_queue (array_type &array, > + value_type *base, > + size_t i, value_type x) > +{ > + if (base == array.stack) > + { > + if (i < LOCAL_ELEMS) > + { > + base[i] = x; > + return base; > + } it seems like this first little bit might be worth inlining, but I guess you tested and it wasn't. > + gcc_checking_assert (i == LOCAL_ELEMS); > + vec_safe_grow (array.heap, i + 1); > + base = array.heap->address (); > + memcpy (base, array.stack, sizeof (array.stack)); that threw me for a sec until I realized it was just manually doing what auto_vec does in this case. > + base[LOCAL_ELEMS] = x; > + return base; > + } > + unsigned int length = array.heap->length (); > + if (length > i) > + { > + gcc_checking_assert (base == array.heap->address ()); > + base[i] = x; > + return base; > + } > + else > + { > + gcc_checking_assert (i == length); > + vec_safe_push (array.heap, x); > + return array.heap->address (); this chunk seems redundant with what vec_safe_push does do you actually need to pass in |i| at all? > + } > +} > + > +/* Add the subrtxes of X to worklist ARRAY, starting at END. Return the > + number of elements added to the worklist. */ > + > +template <typename T> > +size_t > +generic_subrtx_iterator <T>::add_subrtxes_to_queue (array_type &array, > + value_type *base, > + size_t end, rtx_type x) > +{ > + const char *format = GET_RTX_FORMAT (GET_CODE (x)); > + size_t orig_end = end; > + for (int i = 0; format[i]; ++i) > + if (format[i] == 'e') > + { > + value_type subx = T::get_value (x->u.fld[i].rt_rtx); > + if (__builtin_expect (end < LOCAL_ELEMS, true)) > + base[end++] = subx; > + else > + base = add_single_to_queue (array, base, end++, subx); > + } > + else if (format[i] == 'E') > + { > + int length = GET_NUM_ELEM (x->u.fld[i].rt_rtvec); > + rtx *vec = x->u.fld[i].rt_rtvec->elem; > + if (__builtin_expect (end + length <= LOCAL_ELEMS, true)) > + for (int j = 0; j < length; j++) > + base[end++] = T::get_value (vec[j]); > + else > + for (int j = 0; j < length; j++) > + base = add_single_to_queue (array, base, end++, > + T::get_value (vec[j])); > + } > + return end - orig_end; > +} > + > +template <typename T> > +void > +generic_subrtx_iterator <T>::free_array (array_type &array) > +{ > + vec_free (array.heap); > +} > + > +template <typename T> > +const size_t generic_subrtx_iterator <T>::LOCAL_ELEMS; > + > +template class generic_subrtx_iterator <const_rtx_accessor>; > +template class generic_subrtx_iterator <rtx_var_accessor>; > +template class generic_subrtx_iterator <rtx_ptr_accessor>; > + > /* Return 1 if the value of X is unstable > (would be different at a different point in the program). > The frame pointer, arg pointer, etc. are considered stable > @@ -5324,8 +5415,42 @@ truncated_to_mode (enum machine_mode mode, const_rtx x) > return false; > } > > +/* Return true if RTX code CODE has a single sequence of zero or more > + "e" operands and no rtvec operands. Initialize its rtx_all_subrtx_bounds > + entry in that case. */ > + > +static bool > +setup_reg_subrtx_bounds (unsigned int code) > +{ > + const char *format = GET_RTX_FORMAT ((enum rtx_code) code); > + unsigned int i = 0; > + for (; format[i] != 'e'; ++i) > + { > + if (!format[i]) > + /* No subrtxes. Leave start and count as 0. */ > + return true; > + if (format[i] == 'E' || format[i] == 'V') > + return false; > + } > + > + /* Record the sequence of 'e's. */ > + rtx_all_subrtx_bounds[code].start = i; > + do > + ++i; > + while (format[i] == 'e'); > + rtx_all_subrtx_bounds[code].count = i - rtx_all_subrtx_bounds[code].start; > + /* rtl-iter.h relies on this. */ > + gcc_checking_assert (rtx_all_subrtx_bounds[code].count <= 3); > + > + for (; format[i]; ++i) > + if (format[i] == 'E' || format[i] == 'V' || format[i] == 'e') > + return false; > + > + return true; > +} > + > /* Initialize non_rtx_starting_operands, which is used to speed up > - for_each_rtx. */ > + for_each_rtx, and rtx_all_subrtx_bounds. */ > void > init_rtlanal (void) > { > @@ -5335,6 +5460,10 @@ init_rtlanal (void) > const char *format = GET_RTX_FORMAT (i); > const char *first = strpbrk (format, "eEV"); > non_rtx_starting_operands[i] = first ? first - format : -1; > + if (!setup_reg_subrtx_bounds (i)) > + rtx_all_subrtx_bounds[i].count = UCHAR_MAX; > + if (GET_RTX_CLASS (i) != RTX_CONST_OBJ) > + rtx_nonconst_subrtx_bounds[i] = rtx_all_subrtx_bounds[i]; > } > > init_num_sign_bit_copies_in_rep ();
Attachment:
signature.asc
Description: Digital signature
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |