This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: RFC: Faster for_each_rtx-like iterators


On 05/07/14 14:52, Richard Sandiford wrote:
I noticed for_each_rtx showing up in profiles and thought I'd have a go
at using worklist-based iterators instead.  So far I have three:

   FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
   FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
   FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *

with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
Yea, for_each_rtx is a bit of a workhorse, so I'm not surprised it's showing up on some profiles.



I've locally replaced all for_each_rtx calls in the generic code with
these iterators and they make things reproducably faster.  The speed-up
on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
but maybe that's enough to justify the churn.
I'd consider that enough to justify the churn, especially if it's an improvement we're seeing consistently rather than just something that's helping pathological cases.



Implementation-wise, the main observation is that most subrtxes are part
of a single contiguous sequence of "e" fields.  E.g. when compiling an
oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
subrtxes of 7,636,542 rtxes.  Of those:
Yea, makes sense. One could possibly carry this a step further and realize that we could specialize the walkers. ie, rather than indexing the rtx_code to get the rtx format, then iterate over that to find the 'e' fields, if we have an PLUS, then damn it, we ought to know how to walk pretty directly.


(A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
(B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
                       no "E" fields, and
(C)    43,532 (00.6%) are more complicated.

(A) is really a special case of (B) in which the block has zero length.
Those are the only two cases that really need to be handled inline.
The implementation does this by having a mapping from an rtx code to the
bounds of its "e" sequence, in the form of a start index and count.
Right. So my thought above us just extending this a bit and realizing that we don't need an index/count for most RTXs. That may (or may not) be worth the extra pain. Your call whether or not to investigate that further.



Does this look like it's worth pursuing?  If so, is it OK to start
submitting now or should it wait until 4.9.1 is out?  If it is OK,
I plan to convert the ports too and get rid of for_each_rtx.
I think it's definitely worth pursuing. I suspect the release managers would definitely want to wait for 4.9.1 :-)

jeff


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]