This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: RFC: Faster for_each_rtx-like iterators
- From: Jeff Law <law at redhat dot com>
- To: gcc-patches at gcc dot gnu dot org, rdsandiford at googlemail dot com
- Date: Fri, 09 May 2014 00:18:45 -0600
- Subject: Re: RFC: Faster for_each_rtx-like iterators
- Authentication-results: sourceware.org; auth=none
- References: <87k39x48gu dot fsf at talisman dot default>
On 05/07/14 14:52, Richard Sandiford wrote:
I noticed for_each_rtx showing up in profiles and thought I'd have a go
at using worklist-based iterators instead. So far I have three:
FOR_EACH_SUBRTX: iterates over const_rtx subrtxes of a const_rtx
FOR_EACH_SUBRTX_VAR: iterates over rtx subrtxes of an rtx
FOR_EACH_SUBRTX_PTR: iterates over subrtx pointers of an rtx *
with FOR_EACH_SUBRTX_PTR being the direct for_each_rtx replacement.
Yea, for_each_rtx is a bit of a workhorse, so I'm not surprised it's
showing up on some profiles.
I've locally replaced all for_each_rtx calls in the generic code with
these iterators and they make things reproducably faster. The speed-up
on full --enable-checking=release ./cc1 and ./cc1plus times is only about 1%,
but maybe that's enough to justify the churn.
I'd consider that enough to justify the churn, especially if it's an
improvement we're seeing consistently rather than just something that's
helping pathological cases.
Implementation-wise, the main observation is that most subrtxes are part
of a single contiguous sequence of "e" fields. E.g. when compiling an
oldish combine.ii on x86_64-linux-gnu with -O2, we iterate over the
subrtxes of 7,636,542 rtxes. Of those:
Yea, makes sense. One could possibly carry this a step further and
realize that we could specialize the walkers. ie, rather than indexing
the rtx_code to get the rtx format, then iterate over that to find the
'e' fields, if we have an PLUS, then damn it, we ought to know how to
walk pretty directly.
(A) 4,459,135 (58.4%) are leaf rtxes with no "e" or "E" fields,
(B) 3,133,875 (41.0%) are rtxes with a single block of "e" fields and
no "E" fields, and
(C) 43,532 (00.6%) are more complicated.
(A) is really a special case of (B) in which the block has zero length.
Those are the only two cases that really need to be handled inline.
The implementation does this by having a mapping from an rtx code to the
bounds of its "e" sequence, in the form of a start index and count.
Right. So my thought above us just extending this a bit and realizing
that we don't need an index/count for most RTXs. That may (or may not)
be worth the extra pain. Your call whether or not to investigate that
further.
Does this look like it's worth pursuing? If so, is it OK to start
submitting now or should it wait until 4.9.1 is out? If it is OK,
I plan to convert the ports too and get rid of for_each_rtx.
I think it's definitely worth pursuing. I suspect the release managers
would definitely want to wait for 4.9.1 :-)
jeff