User account creation filtered due to spam.

Bug 28364 - poor optimization choices when iterating over a std::string (probably not c++-specific)
Summary: poor optimization choices when iterating over a std::string (probably not c++...
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.1.2
: P3 normal
Target Milestone: ---
Assignee: Zdenek Dvorak
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2006-07-12 22:32 UTC by Zack Weinberg
Modified: 2006-07-26 19:38 UTC (History)
4 users (show)

See Also:
Host: i486-linux-gnu
Target: i486-linux-gnu
Build: i486-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2006-07-14 14:12:12


Attachments
assembly output (bad on top, hand-corrected below) (244 bytes, text/plain)
2006-07-12 22:33 UTC, Zack Weinberg
Details
C test case (with interesting implications) (255 bytes, text/plain)
2006-07-12 23:19 UTC, Zack Weinberg
Details
A patch for loop header selection. (3.72 KB, patch)
2006-07-18 00:45 UTC, Zdenek Dvorak
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zack Weinberg 2006-07-12 22:32:56 UTC
Consider the following test program:

#include <string>
bool has_bad_chars(std::string const & path)
{
  for (std::string::const_iterator c = path.begin(); c != path.end(); c++)
    {
      unsigned char x = static_cast<unsigned char>(*c);
      if (x <= 0x1f || x == 0x5c || x == 0x7f)
        return true;
    }
  return false;
}

At -O2, GCC 4.1 chooses to duplicate the entire body of the loop for no good reason; the code is not rendered more straight-line by this, and in fact it executes strictly more instructions even for a single-character string.  I'll attach an assembly file showing what it did (Z13has_bad_charsRKSs) and what it should have done (_Z14has_bad_chars2RKSs).  The bad transformation is done by the .t45.ch pass, which acronym does not mean anything to me.
Comment 1 Zack Weinberg 2006-07-12 22:33:38 UTC
Created attachment 11874 [details]
assembly output (bad on top, hand-corrected below)
Comment 2 Andrew Pinski 2006-07-12 22:41:59 UTC
Loop-Copy header is doing it Which means there is a confusion at what is the real header of the loop here.
Comment 3 Zack Weinberg 2006-07-12 22:42:28 UTC
I should mention that the exact command line flags were -O2 -fomit-frame-pointer -march=pentium4, and that I hand-tweaked the label numbers for ease of reading.

Also, -fno-tree-ch does suppress this bad optimization, but in exchange we get mildly worse code from the loop optimizer proper - it uses [reg+reg] indexing and a 0..n count instead of [reg] indexing and a base..limit count.  The code is pretty short so I'll just paste it here (meaningless labels removed):

_Z17has_bad_chars_newRKSs:
        pushl   %ebx
        movl    8(%esp), %eax
        movl    (%eax), %eax
        xorl    %ecx, %ecx
        movl    -12(%eax), %ebx
.L2:
        cmpl    %ecx, %ebx
        je      .L10
        movzbl  (%ecx,%eax), %edx
        cmpb    $31, %dl
        jbe     .L4
        cmpb    $92, %dl
        je      .L4
        addl    $1, %ecx
        cmpb    $127, %dl
        jne     .L2
.L4:
        movl    $1, %eax
        popl    %ebx
        .p2align 4,,2
        ret
.L10:
        xorl    %eax, %eax
        popl    %ebx
        .p2align 4,,2
        ret

Looking at the code, I see that the entire purpose of tree-ch is to duplicate loop bodies in this fashion, and the justification given is that it "increases effectiveness of code motion and reduces the need for loop preconditioning", which I take to cover the above degradation in addressing mode choice.  I'm not an optimizer expert, but surely there is a way to get the best of both worlds here...?
Comment 4 Zack Weinberg 2006-07-12 22:48:30 UTC
I remembered that I had a build of 4.2 from last week lying around.  It generates even worse code - still with the duplication of most of the loop, plus a bunch of unnecessary register fiddling and bad addressing mode choice.
Comment 5 Andrew Pinski 2006-07-12 22:52:25 UTC
For me on PPC-darwin, it generates pretty good code at just -O2 even though there is a duplicated "header".  The loop is pretty good at scheduling the code too:
L9:
	lbz r0,0(r3)
	cmplwi cr7,r0,31
	extsb r0,r0
	cmpwi cr1,r0,127
	cmpwi cr6,r0,92
	ble- cr7,L4
	beq- cr6,L4
	beq- cr1,L4
L8:
	addi r3,r3,1
	bdnz L9

Though the branches throw off everything (though that is a different issue), for the Cell really cror should be used (I think).
Comment 6 Zdenek Dvorak 2006-07-12 23:13:41 UTC
Loop header copying is OK; the result is the one I would expect, it certainly does not make the code worse (unless you are optimizing for code size), and in many cases may make it better.

I will have a look at the addressing mode choices.
Comment 7 Zack Weinberg 2006-07-12 23:19:05 UTC
Created attachment 11875 [details]
C test case (with interesting implications)

I've found a plain C test case.  In the process, I've found that the way libstdc++ <string> is coded interacts interestingly with the optimizer.

In the attached file, has_bad_chars_bad is a literal translation to C of the code seen by the optimizers after inlining for the original C++ test case.  Yes, libstdc++ <string> does the moral equivalent of ((struct rep*)path)[-1].len.  This function compiles to the same bad code as my original test case.

has_bad_chars_good, on the other hand, is how I naively thought <string> worked on the first read-through.  That one compiles to code which looks optimal to me.  I suspect some optimizer or other is not smart enough to see through this particular construct ... it would be good to make it do so, since we want libstdc++ <string> to generate good code.
Comment 8 Zack Weinberg 2006-07-12 23:21:24 UTC
Zdenek: I don't see how you can say that what we get is optimal code "unless optimizing for size".  The code generated *will* be slower than the alternative.
Comment 9 Zdenek Dvorak 2006-07-12 23:30:23 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> Zdenek: I don't see how you can say that what we get is optimal code "unless
> optimizing for size".  The code generated *will* be slower than the
> alternative.

why?  Exactly the same number of instructions is executed.
Comment 10 Zack Weinberg 2006-07-12 23:33:21 UTC
I-cache.  Also, more iterations before the branch predictors figure out what's going on.
Comment 11 Zdenek Dvorak 2006-07-12 23:39:39 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> I-cache.

this only matters if this increase in code size will make the code go
out of instruction cache.  It definitely is possible to artificially
construct programs where it matters, but I haven't seen one yet (ch
increases the total code size by less than 1% on all the te

> Also, more iterations before the branch predictors figure out what's
> going on.

But also possibly more consistent behavior with respect to branch
prediction, in case the loop is often exited in the first iteration.

In general, it is not possible to determine in ch whether loop header
copying will be profitable or not.  Undoing the loop header copying
by some later pass might be doable, although I am not quite sure how
much profitable.
Comment 12 Zack Weinberg 2006-07-13 03:09:02 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> > I-cache.
> this only matters if this increase in code size will make the code go
> out of instruction cache.

The real program that this is taken from is a large C++ application
which is guaranteed to go out of cache - it's got slightly less than
four megabytes of .text - the actual goal is to make sure all of its
inner inner inner loops do stay in cache.  And this is one of 'em.

> > Also, more iterations before the branch predictors figure out what's
> > going on.
> But also possibly more consistent behavior with respect to branch
> prediction, in case the loop is often exited in the first iteration.

Again, in real life I know a priori that the function is rarely, if
ever, called with a zero-length string.

-----

I went through the tree dumps for my week-old 4.2.0 for the test
program with a fine comb.  They are quite instructive.  If tree-ch
were doing what it says on the label -- copying the loop header --
everything would be fine; dom1 cleans up the two copies of the header
later.  However, ch isn't just copying the loop header; it is also
copying the *entire body of the loop*, which nothing can fix. I call
that a clear bug.

zw
Comment 13 Andrew Pinski 2006-07-13 03:37:51 UTC
Hmm, for some reason I don't like the idea of using std::string in the inner loop :).  Even the C testcase does not seem like a good inner loop in general anyways as there is no caculation going on here.  To me these look like loops which are run for testing only, looking for bad characters to see if a problem had happened.
Comment 14 Zack Weinberg 2006-07-13 03:40:34 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

It's a validation routine, yes, which is run over every pathname the
program is working on, and there can be hundreds or thousands of
those.

And why the heck shouldn't I be able to use std::string in inner
loops?  I sure don't want to be using bare char*...
Comment 15 Andrew Pinski 2006-07-13 03:45:11 UTC
One more comment, a loop with an early exit is whole issue and that is the reason why all of the code in the loop is considered the header. There are a couple of loop headers in this case, one for each exit which makes it harder to deal with in general.  

What you did not mention is which how would this loop exit normally, via the return 1 or all the way through the loop.  There is no enough information from GCC's point of view to figure that out without profiling (for this case).  GCC is assuming that the loop exits in the first if statement which seems reasoniable.  Maybe you should try with profiling information and see what GCC does for this testcase.
Comment 16 Andrew Pinski 2006-07-13 04:01:17 UTC
If this is really a program's inner most loop, then the program is messed up as there is no caculation going on here at all.  
What type of program is this? Do you cache the result of this function?  Maybe chaching the results will show other bottle necks.  Maybe even instead of doing this loop, find another loop which loops over the text and also process it at the same time.  These are normal optimization tricks which usuaully cannot be done by the compiler.
Comment 17 Zack Weinberg 2006-07-13 04:23:01 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> One more comment, a loop with an early exit is whole issue and that is the
> reason why all of the code in the loop is considered the header. There are a
> couple of loop headers in this case, one for each exit which makes it harder to
> deal with in general.

I didn't know that, and it is not obvious from the optimizer dumps.  Thanks for
explaining.

> What you did not mention is which how would this loop exit normally, via the
> return 1 or all the way through the loop.  There is no enough information from
> GCC's point of view to figure that out without profiling (for this case).  GCC
> is assuming that the loop exits in the first if statement which seems
> reasoniable. Maybe you should try with profiling information and see what GCC
> does for this testcase.

<flamebait> Feedback-directed optimization is only good for making
compilers look better on benchmarks.  It's useless in real life. </>

I can, in fact, get good code out of gcc 4.1 by beating it over the
head with __builtin_expect, but I don't think I should have to do
that.  I think my suggested version is better code no matter whether
or not the loop exits early.

4.2 still makes what I consider to be bad addressing mode choices
after that change, but Zdenek did say he would look at that.  It also
puts the "return 1" exit block in the middle of the loop in spite of
being told that all three conditions leading to that are unlikely.

struct rep
{
  unsigned long len;
  unsigned long alloc;
  unsigned long dummy;
};

struct data
{
  char * ptr;
};

struct string
{
  struct rep R;
  struct data D;
};

int
has_bad_chars(struct data *path)
{
  char *c;
  for (c = path->ptr;
       __builtin_expect(c < path->ptr + ((struct rep *)path)[-1].len, 1);
       c++)
    {
      unsigned char x = (unsigned char)(*c);
      if (__builtin_expect(x <= 0x1f || x == 0x5c || x == 0x7f, 0))
        return 1;
    }
  return 0;
}
Comment 18 Zdenek Dvorak 2006-07-13 07:58:17 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> > > I-cache.
> > this only matters if this increase in code size will make the code go
> > out of instruction cache.
> 
> The real program that this is taken from is a large C++ application
> which is guaranteed to go out of cache - it's got slightly less than
> four megabytes of .text - the actual goal is to make sure all of its
> inner inner inner loops do stay in cache.  And this is one of 'em.
> 
> > > Also, more iterations before the branch predictors figure out what's
> > > going on.
> > But also possibly more consistent behavior with respect to branch
> > prediction, in case the loop is often exited in the first iteration.
> 
> Again, in real life I know a priori that the function is rarely, if
> ever, called with a zero-length string.
> 
> -----
> 
> I went through the tree dumps for my week-old 4.2.0 for the test
> program with a fine comb.  They are quite instructive.  If tree-ch
> were doing what it says on the label -- copying the loop header --
> everything would be fine; dom1 cleans up the two copies of the header
> later.  However, ch isn't just copying the loop header; it is also
> copying the *entire body of the loop*, which nothing can fix. I call
> that a clear bug.

how do you define a loop header?
Comment 19 Zdenek Dvorak 2006-07-13 08:01:04 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> > > I-cache.
> > this only matters if this increase in code size will make the code go
> > out of instruction cache.
> 
> The real program that this is taken from is a large C++ application
> which is guaranteed to go out of cache - it's got slightly less than
> four megabytes of .text - the actual goal is to make sure all of its
> inner inner inner loops do stay in cache.  And this is one of 'em.

on your real program, how much performance do you gain by hand-rewriting
the assembler to look the way you like?  Just to make sure there really
is a problem.
Comment 20 Zack Weinberg 2006-07-13 08:25:33 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> > However, ch isn't just copying the loop header; it is also
> > copying the *entire body of the loop*, which nothing can fix. I call
> > that a clear bug.
>
> how do you define a loop header?

I was under the impression it was just the one basic block called out
in the .ch dump, e.g.

;; Loop 1
;;  header 6, latch 5
;;  depth 1, level 1, outer 0

-- basic block 6 happens to contain just the code from the syntactic
loop condition.  Andrew informs me that this is wrong, and that in
this case the header is the entire loop, but I will come back at that
with 'ch should never be duplicating the entire loop; if the header is
the entire loop, it should do something more sensible, like duplicate
just the first basic block or something.'
Comment 21 Zack Weinberg 2006-07-13 08:28:13 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> on your real program, how much performance do you gain by hand-rewriting
> the assembler to look the way you like?  Just to make sure there really
> is a problem.

I'm a little annoyed by the suggestion that this wouldn't be a real
problem if I couldn't measure a performance difference.

Depending on workload, other activity on the same machine, and phase
of moon, this loop is between .1% and 1% of runtime, and my tweaks
make it go about a third faster.
Comment 22 Richard Biener 2006-07-13 08:28:30 UTC
For practical purposes (determining the loop runs at least once) it needs to
duplicate the exit condition.  Which happens to be difficult here, as there are
multiple loop exits.
Comment 23 Zdenek Dvorak 2006-07-13 09:00:41 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> > on your real program, how much performance do you gain by hand-rewriting
> > the assembler to look the way you like?  Just to make sure there really
> > is a problem.
> 
> I'm a little annoyed by the suggestion that this wouldn't be a real
> problem if I couldn't measure a performance difference.

sorry, but it indeed would not be.  I have seen so many examples of code
that looks bad at the first sight and preforms just fine that unless
there is something obviously wrong (which is not the case here, IMO), I
am somewhat more careful before I spend my time on fixing this type of
"bugs".

> Depending on workload, other activity on the same machine, and phase
> of moon, this loop is between .1% and 1% of runtime, and my tweaks
> make it go about a third faster.
Comment 24 Zdenek Dvorak 2006-07-13 09:03:59 UTC
Subject: Re:  poor optimization choices when iterating over a std::string (probably not c++-specific)

> > > However, ch isn't just copying the loop header; it is also
> > > copying the *entire body of the loop*, which nothing can fix. I call
> > > that a clear bug.
> >
> > how do you define a loop header?
> 
> I was under the impression it was just the one basic block called out
> in the .ch dump, e.g.
> 
> ;; Loop 1
> ;;  header 6, latch 5
> ;;  depth 1, level 1, outer 0
> 
> -- basic block 6 happens to contain just the code from the syntactic
> loop condition.  Andrew informs me that this is wrong, and that in
> this case the header is the entire loop, but I will come back at that
> with 'ch should never be duplicating the entire loop; if the header is
> the entire loop, it should do something more sensible, like duplicate
> just the first basic block or something.'

currently, we stop once the copied region is too large.  This means that
on "normal" loops that have a body that does something, we won't copy
whole loop.  Of course, any heuristics will have cases when it won't
perform ideally.
Comment 25 Zdenek Dvorak 2006-07-18 00:45:48 UTC
Created attachment 11906 [details]
A patch for loop header selection.

I tried improving the heuristics for the selection of the loop header, however without success.  In ch, copying all the exits simply looks like a good idea, since it makes the loop header contain the most important looking part of the loop, and makes the latch of the loop empty.  I append the patch I made, in case someone wanted to play with it.
Comment 26 Zdenek Dvorak 2006-07-25 15:20:48 UTC
A patch for the "return in the middle of the loop" problem:
http://gcc.gnu.org/ml/gcc-patches/2006-07/msg00893.html
(to be commited once mainline is open).
Comment 27 Zdenek Dvorak 2006-07-26 19:38:10 UTC
Patch for the wrong choice of induction variable:
http://gcc.gnu.org/ml/gcc-patches/2006-07/msg01125.html
Comment 28 Zdenek Dvorak 2006-08-16 21:14:19 UTC
Subject: Bug 28364

Author: rakdver
Date: Wed Aug 16 21:14:11 2006
New Revision: 116189

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=116189
Log:
	PR tree-optimization/28364
	* tree-ssa-loop-ivopts.c (aff_combination_to_tree): Handle zero
	correctly.
	(fold_affine_expr): New function.
	(may_eliminate_iv): Use fold_affine_expr.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-loop-ivopts.c