Bug 33761 - tree-copyrename and tree-dominators pessimize gzip SPEC score
Summary: tree-copyrename and tree-dominators pessimize gzip SPEC score
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.3.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-10-13 11:26 UTC by Uroš Bizjak
Modified: 2013-12-29 11:34 UTC
CC: 5 users

See Also:
Host:
Target: i686-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2008-02-03 13:39:42


Attachments
address accumulation patch (978 bytes, patch)
2008-02-02 16:22 UTC, Jan Hubicka
Patch to predict_paths_leading_to (1.07 KB, patch)
2008-02-06 16:44 UTC, Jan Hubicka
Complete continue heuristic patch (4.09 KB, patch)
2008-02-06 16:56 UTC, Jan Hubicka
Annotated profile (23.14 KB, text/plain)
2008-02-07 12:30 UTC, Jan Hubicka

Description Uroš Bizjak 2007-10-13 11:26:54 UTC
The measurements were actually done on gzip-1.2.4 sources on core2-d with:

a) gcc -mtune=generic -m32 -O2
b) gcc -mtune=generic -m32 -O3

The test file was created as a tar archive of the current SVN trunk repository, which currently amounts to 865M uncompressed.

profile of a)

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 54.63     14.76    14.76 102254750     0.00     0.00  longest_match
 18.47     19.75     4.99        1     4.99    27.02  deflate
 10.25     22.52     2.77    27389     0.00     0.00  fill_window
  6.81     24.36     1.84    27390     0.00     0.00  updcrc
  3.15     25.21     0.85     5901     0.00     0.00  compress_block
  2.85     25.98     0.77 203123663     0.00     0.00  send_bits
  2.66     26.70     0.72 89123566     0.00     0.00  ct_tally
  0.67     26.88     0.18  3378994     0.00     0.00  pqdownheap
  0.22     26.94     0.06    17709     0.00     0.00  build_tree
  0.15     26.98     0.04    11802     0.00     0.00  send_tree
  0.07     27.00     0.02  1367732     0.00     0.00  bi_reverse
  0.07     27.02     0.02    17710     0.00     0.00  gen_codes
  0.00     27.02     0.00    27390     0.00     0.00  file_read

profile of b)

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 86.86     29.35    29.35        1    29.35    33.79  deflate
  5.27     31.13     1.78    27390     0.00     0.00  updcrc
  2.69     32.04     0.91     5901     0.00     0.00  compress_block
  2.55     32.90     0.86 89123566     0.00     0.00  ct_tally
  2.04     33.59     0.69 203123663     0.00     0.00  send_bits
  0.44     33.74     0.15    17709     0.00     0.00  build_tree
  0.06     33.76     0.02  1367732     0.00     0.00  bi_reverse
  0.06     33.78     0.02     5903     0.00     0.00  flush_block
  0.03     33.79     0.01    11802     0.00     0.00  send_tree
  0.00     33.79     0.00    27390     0.00     0.00  file_read
  0.00     33.79     0.00     9237     0.00     0.00  flush_outbuf
  0.00     33.79     0.00        2     0.00     0.00  basename
  0.00     33.79     0.00        2     0.00     0.00  copy_block
  0.00     33.79     0.00        1     0.00     0.00  add_envopt

As can be seen from the profiles, longest_match was inlined into deflate. After adding __attribute__((noinline)) to the longest_match prototype, we obtain:

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 55.80     13.86    13.86 102254750     0.00     0.00  longest_match
 27.62     20.72     6.86        1     6.86    24.84  deflate
  7.09     22.48     1.76    27390     0.00     0.00  updcrc
  3.74     23.41     0.93     5901     0.00     0.00  compress_block
  2.62     24.06     0.65 89123566     0.00     0.00  ct_tally
  2.42     24.66     0.60 203123663     0.00     0.00  send_bits
  0.56     24.80     0.14    17709     0.00     0.00  build_tree
  0.08     24.82     0.02  1367732     0.00     0.00  bi_reverse
  0.08     24.84     0.02    11802     0.00     0.00  send_tree
  0.00     24.84     0.00    27390     0.00     0.00  file_read
  0.00     24.84     0.00     9237     0.00     0.00  flush_outbuf
  0.00     24.84     0.00     5903     0.00     0.00  flush_block
  0.00     24.84     0.00        2     0.00     0.00  basename
  0.00     24.84     0.00        2     0.00     0.00  copy_block

i.e. an improvement of about 26.5%. I speculate that inlining increases register pressure on a SMALL_REGISTER_CLASS target, as the problem is not nearly as noticeable on x86_64.
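
For reference, the change is just a declaration-level attribute; a minimal sketch (the actual declaration in gzip-1.2.4 uses old-style prototypes and may differ slightly):

/* Sketch only - the real gzip-1.2.4 declaration may differ.
   IPos is gzip's typedef for an unsigned match position.  */
static int longest_match (IPos cur_match) __attribute__ ((noinline));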

The results of the 32-bit run are at [1] (valid from 13 Oct) and the results of the 64-bit run at [2].

[1] http://vmakarov.fedorapeople.org/spec/spec2000.toolbox_32/gcc/individual-run-ratio.html
[2] http://vmakarov.fedorapeople.org/spec/spec2000.toolbox/gcc/individual-run-ratio.html
Comment 1 Richard Biener 2007-10-13 12:31:15 UTC
I suppose that alias partitioning makes the difference instead.  This is not
really a fault of the inliner but of our dumb memory optimizers.
Comment 2 Uroš Bizjak 2007-12-10 10:14:39 UTC
According to Issue 2 from http://gcc.gnu.org/ml/gcc/2007-11/msg00753.html, I think that this bug qualifies as a 4.3 regression.
Comment 3 Richard Biener 2007-12-10 10:52:20 UTC
I don't think this qualifies as a 4.3 regression -
http://www.suse.de/~gcctest/SPEC/CINT/sb-haydn-head-64-32o-32bit/index.html
shows that while there were jumps, the numbers close to the 4.2 release are
actually quite similar to what we have now.  So, unless somebody produces
numbers with 4.2 or earlier, this is not a 'regression', but a missed-optimization only.
Comment 4 Uroš Bizjak 2007-12-10 12:31:03 UTC
(In reply to comment #3)
> I don't think this qualifies as a 4.3 regression -

Fair enough. It looks like this problem is specific to Core2.
Comment 5 Uroš Bizjak 2007-12-10 17:12:10 UTC
(In reply to comment #4)

> Fair enough. It looks like this problem is specific to Core2.

Here are timings with 'gcc version 4.3.0 20071201 (experimental) [trunk revision 130554] (GCC)' on

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         X6800  @ 2.93GHz
stepping        : 5
cpu MHz         : 2933.389
cache size      : 4096 KB

-mtune=generic -m32 -O3: 40.763s   [*]
-mtune=generic -m32 -O2: 32.170s
-mtune=core2 -m32 -O3  : 36.850s
-mtune=core2 -m32 -O2  : 32.170s

-mtune=generic -m64 -O3: 28.550s
-mtune=generic -m64 -O2: 28.682s
-mtune=core2 -m64 -O3  : 28.670s
-mtune=core2 -m64 -O2  : 28.714s

With __attribute__((noinline)) to longest_match():

-mtune=generic -m32 -O3: 30.658s
-mtune=generic -m32 -O2: 32.154s
-mtune=core2 -m32 -O3  : 30.690s
-mtune=core2 -m32 -O2  : 32.247s

And with FC6 system compiler 'gcc version 4.1.1 20061011 (Red Hat 4.1.1-30)':

-mtune=generic -m32 -O3: 30.154s   [**]
-mtune=generic -m32 -O2: 30.275s

Comparing [*] to [**], it _is_ a regression, at least on Core2.
Comment 6 rguenther@suse.de 2007-12-10 17:13:53 UTC
Subject: Re: non-optimal inlining heuristics pessimizes gzip SPEC score at -O3

On Mon, 10 Dec 2007, ubizjak at gmail dot com wrote:

> [full quote of comment #5 snipped]
>
> Comparing [*] to [**], it _is_ a regression, at least on Core2.

FSF GCC 4.1 does not have -mtune=generic.

Richard.
Comment 7 Uroš Bizjak 2007-12-10 17:26:11 UTC
(In reply to comment #6)

> FSF GCC 4.1 does not have -mtune=generic.

OK, OK. Now with 'gcc version 4.1.3 20070716 (prerelease)':

-m32 -O2: 29.306s
-m32 -O3: 29.582s

I don't have 4.2 here.
Comment 8 Uroš Bizjak 2007-12-11 06:00:15 UTC
Regression at least for 4.3.
Comment 9 Steven Bosscher 2007-12-11 06:09:04 UTC
One of those "regressions" where GCC actually made progress overall.  This should be low priority for GCC 4.3.
Comment 10 Uroš Bizjak 2007-12-11 06:16:56 UTC
(In reply to comment #9)
> One of those "regressions" where GCC actually made progress overall.  This
> should be low priority for GCC 4.3.

Probably it is too early in the morning here, but I can't see any trace of overall progress when the execution time of the testcase in _this_ PR goes from 30s to 40s.

Comment 11 Jan Hubicka 2008-01-16 16:46:28 UTC
Last time I looked into it, it was code alignment affected by inlining in the string matching loop (longest_match).  This code is very atypical: the internal loop comparing strings is hand unrolled, but it almost never rolls, since the compressed strings tend to be all different.  GCC mispredicts this, moving some stuff out of the loop, and bb-reorder aligns the code in a way that the default path not taking the loop jumps pretty far, hurting the decode bandwidth of K8, especially because the jumps are hard to predict.

I don't see anything direct in the code the heuristics could use to realize that the loop is not rolling, except for special-casing the particular benchmark.

FDO scores of gzip are not doing that badly, but there is still a gap relative to ICC (even an archaic version of it, running 32-bit compared to 64-bit GCC).
http://www.suse.de/~gcctest/SPEC-britten/CINT/sandbox-britten-FDO/index.html
It would be nice to convince the gzip/zlibc/bzip2 people to use profiling by default in the build process - those packages are ideal targets.

But since Core is not as sensitive to code alignment and number of jumps as K8, perhaps there are extra problems demonstrated by this.
Comment 12 Jan Hubicka 2008-02-02 16:22:34 UTC
Created attachment 15079
address accumulation patch

While working on PR17863 I wrote the attached patch to make fwprop combine code like:

a=base;
*a=something;
a++;
*a=something;
a++;
*a=something;
...

into

*base=something
a=base+1
*a=something
a=base+2
*a=something
....

I dropped it to vangelis and the nightly tester shows a gzip improvement of 815->880; gzip's internal loop is hand unrolled into a form similar to the one shown above.
(The tester peaked in Jul 2005 with scores somewhat above 900.)  Since gzip results tend to be unstable, it would be nice to know how this reproduces on other targets/setups.
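
For illustration, a minimal compilable C sketch of this kind of hand-unrolled pointer-increment code (a made-up example, not the actual gzip source):

/* Made-up example of a hand-unrolled store sequence.  Without the fwprop
   change each store waits on the incremented pointer; with it the stores
   can address base+0, base+1, ... directly.  */
void
fill4 (unsigned char *base, unsigned char v)
{
  unsigned char *p = base;
  *p = v; p++;
  *p = v; p++;
  *p = v; p++;
  *p = v;
}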

Honza
Comment 13 Jan Hubicka 2008-02-03 13:39:42 UTC
Tonight's runs on haydn with the patch in show a regression on gzip: 950->901 in 32-bit. FDO 64-bit runs are not affected.

This is the same score as we had in December; we improved a bit since then, but not enough to match the score we used to have.
Looks like the codegen of the string compare loop is very unstable here.
Uros, would it be possible to give it a try on Core?  That would help to figure out whether it is a code layout problem of K8.

Honza
Comment 14 Uroš Bizjak 2008-02-03 17:35:16 UTC
(In reply to comment #13)

> Uros, would it be possible to give it a try on Core?  That would help to
> figure out whether it is a code layout problem of K8.

Hm, the patch doesn't seem to help:

-m32 -O2: 32.434
-m32 -O2 (patched): 32.586

-m32 -O3: 40.723
-m32 -O3 (patched): 41.059
Comment 15 Jan Hubicka 2008-02-05 13:36:33 UTC
Thanks, this looks comparable to the K8 scores, except that -O3 is not actually that much worse there.  So it looks like there is more than just a random effect of code layout involved; I will look into the produced assembly some more.
Comment 17 Jan Hubicka 2008-02-06 13:28:20 UTC
One problem is the following:
  do {
        ;
        match = window + cur_match;
        if (match[best_len] != scan_end ||
            match[best_len-1] != scan_end1 ||
            *match != *scan ||
            *++match != scan[1]) continue;
        scan += 2, match++;
        do {
        } while (*++scan == *++match && *++scan == *++match &&
                 *++scan == *++match && *++scan == *++match &&
                 *++scan == *++match && *++scan == *++match &&
                 *++scan == *++match && *++scan == *++match &&
                 scan < strend);

....

The internal loop is the string comparison thingy, but the branch prediction logic completely misses it: the continue statement looks like it is forming 4 nested loops, so the predictor concludes that those form the internal loop.

We used to have a prediction heuristic guessing that a continue statement is not used to form a loop.  This was killed when gimplification was introduced.  Perhaps we should bring it back, since this is a reasonably common scenario.
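
A minimal made-up C sketch of the shape in question (not the gzip source): the disjunction guarding the continue gimplifies into several conditional jumps back towards the loop latch, which the prediction code can mistake for nested loops.

/* Made-up reproducer: each || clause becomes a separate conditional
   branch to the continue point, so the single loop looks like several
   nested "loops" to the predictor.  */
int
count_matches (const unsigned char *a, const unsigned char *b, int n)
{
  int hits = 0;
  for (int i = 0; i + 3 < n; i++)
    {
      if (a[i] != b[0] || a[i + 1] != b[1]
          || a[i + 2] != b[2] || a[i + 3] != b[3])
        continue;
      hits++;
    }
  return hits;
}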

Looking at longest_match in the non-unrolled version, the "loops" formed by the continue statement have frequencies 298, 961, 2139, 3100, 6900, 1000, so every loop is predicted to iterate about twice.

The real outer loop now gets frequency 92, i.e. small enough to be predicted as cold.  The string comparison loop now gets frequency 344, predicted to iterate 3 times (quite realistically).  But because the frequency is so small, we end up allocating one of the two pointers in memory:
.L9:
        leal    1(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  1(%ecx), %eax
        cmpb    1(%edx), %al
        jne     .L8
        leal    2(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  2(%ecx), %eax
        cmpb    2(%edx), %al
        jne     .L8
        leal    3(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  3(%ecx), %eax
        cmpb    3(%edx), %al
        jne     .L8
        leal    4(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  4(%ecx), %eax 
        cmpb    4(%edx), %al
        jne     .L8
        leal    5(%ecx), %eax
        movl    %eax, -16(%ebp)
        movzbl  5(%ecx), %eax
        cmpb    5(%edx), %al
        jne     .L8

This happens in the offline copy of longest_match.  The inlined copy gets this detail right, but the frequencies of the deflate function are all crazy, naturally.

I guess I should revive the patch for language scope branch predictors.
Honza
Comment 18 Jan Hubicka 2008-02-06 16:44:23 UTC
Created attachment 15107
Patch to predict_paths_leading_to

Hi,
I've revived the continue heuristic patch.  By itself it does not help because of a bug in predict_paths_leading_to.

The code looks as follows:

if (test1)
   goto continue_block;
if (test2)
   goto continue_block;
if (test3)
   goto continue_block;
if (test4)
   goto continue_block;
goto real_loop_body;
continue_block:
   goto loop_header;

We call predict_paths_leading_to on continue_block and expect that continue_block will end up predicted as not very likely.

What the function does is find the dominator of continue_block, which is the if(test1) block, and predict the edge from that first block.  This is however not enough, as all the other paths remain likely.

It seems to me that we need to walk the whole set of BBs post-dominated by the BB and mark all edges forming the edge cut defined by this set.

I am testing the attached patch.  It makes the function linear (so we are overall quadratic) for a very deep post-dominator tree.  If this turns out to be a problem, I think we can just cut the computation off after some specified number of BBs has been walked.
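
A rough self-contained sketch of that walk (hypothetical toy data structures, not the actual GCC API or the attached patch): collect the blocks post-dominated by BB and mark every edge entering that set from outside, i.e. the edge cut.

/* Hypothetical sketch only - toy CFG types, not GCC internals.  */
struct edge { struct block *src, *dst; int predicted_unlikely; };
struct block { struct block *ipostdom; int n_preds; struct edge **preds; };

/* Return nonzero if BB post-dominates B (walk the post-dominator chain).  */
static int
postdominated_by (struct block *b, struct block *bb)
{
  for (; b; b = b->ipostdom)
    if (b == bb)
      return 1;
  return 0;
}

/* Mark every edge that enters the set of blocks post-dominated by BB
   from a block outside that set; any path reaching BB must cross one
   of these edges.  */
static void
mark_edge_cut_into (struct block *bb, struct block **blocks, int n_blocks)
{
  for (int i = 0; i < n_blocks; i++)
    if (postdominated_by (blocks[i], bb))
      for (int j = 0; j < blocks[i]->n_preds; j++)
        {
          struct edge *e = blocks[i]->preds[j];
          if (!postdominated_by (e->src, bb))
            e->predicted_unlikely = 1;
        }
}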

Zdenek, does this seem sane?

With this change and the continue prediction patch I get a more or less sane prediction for the longest_match function.  The profile is still quite unrealistic, but I am testing whether it makes a noticeable difference.

Honza
Comment 19 Jan Hubicka 2008-02-06 16:56:36 UTC
Created attachment 15108
Complete continue heuristic patch

Hi,
this is the complete patch.  With it we produce a profile sane enough that the internal loops are not marked cold.  I will probably benchmark it tomorrow (I want to wait for the FP changes to show up separately).

It fixes the offline copy of longest_match, so we no longer have one of the IV variables on the stack:
.L15:
        movzbl  2(%edx), %eax
        leal    2(%edx), %esi
        cmpb    2(%ecx), %al
        jne     .L8
        movzbl  3(%edx), %eax
        leal    3(%edx), %esi
        cmpb    3(%ecx), %al
        jne     .L8 
        movzbl  4(%edx), %eax
        leal    4(%edx), %esi
        cmpb    4(%ecx), %al
        jne     .L8
        movzbl  5(%edx), %eax
        leal    5(%edx), %esi
        cmpb    5(%ecx), %al
        jne     .L8
        movzbl  6(%edx), %eax
        leal    6(%edx), %esi
        cmpb    6(%ecx), %al
        jne     .L8
        movzbl  7(%edx), %eax
        leal    7(%edx), %esi
        cmpb    7(%ecx), %al
        jne     .L8
        leal    8(%ecx), %eax
        movl    %eax, %ecx
        movzbl  8(%edx), %eax
        cmpb    (%ecx), %al
        leal    8(%edx), %ebx
        movl    %ebx, %esi
        jne     .L8
        cmpl    %ebx, -20(%ebp)
        jbe     .L8
        movl    %ebx, %edx
        movzbl  1(%edx), %eax
        leal    1(%edx), %esi
        cmpb    1(%ecx), %al
        je      .L15

Ironically this can further widen the gap between -O2 and -O3, since the inlined copy in deflate was always allocated reasonably.
The deflate codegen changes quite a lot, and because the function body is big I will wait for benchmarks before trying to analyze further.
Comment 20 Uroš Bizjak 2008-02-06 18:42:48 UTC
Whoa, adding -fomit-frame-pointer brings us from

(gcc -O3 -m32)
user    0m41.031s

to

(gcc -O3 -m32 -fomit-frame-pointer)
user    0m30.006s

Since -fomit-frame-pointer adds another free register, it looks like, since inlining increases register pressure, some unlucky heavily-used variable gets allocated to a stack slot.
Comment 21 Uroš Bizjak 2008-02-06 19:10:31 UTC
(In reply to comment #20)

> Since -fomit-frame-pointer adds another free register, it looks like, since
> inlining increases register pressure, some unlucky heavily-used variable gets
> allocated to a stack slot.


It is "best_len" (and probably some others, too):

[uros@localhost gzip-1.2.4]$ grep best_len fp.s
        movl    %edx, -68(%ebp) #, best_len
        movl    -68(%ebp), %edx # best_len, best_len.494
        movl    %edx, -68(%ebp) # best_len.494, best_len
        movl    -68(%ebp), %edx # best_len,
        movl    -68(%ebp), %edx # best_len,
        movl    -68(%ebp), %edx # best_len, best_len.494
        cmpl    %esi, %edx      # lookahead, best_len.494
        movl    %edx, -108(%ebp)        # best_len.494, match_length
        movl    -68(%ebp), %edx # best_len, best_len.494
        movl    %edx, -88(%ebp) # prev_length.28, best_len
        movl    -88(%ebp), %edx # best_len, best_len.457
        movl    %edx, -88(%ebp) # best_len.457, best_len
        movl    -88(%ebp), %eax # best_len,
        movl    -88(%ebp), %edx # best_len,
        movl    -88(%ebp), %edx # best_len, best_len.457
        cmpl    %esi, %edx      # lookahead, best_len.457
        movl    %edx, -40(%ebp) # best_len.457, match_length.404
        movl    -88(%ebp), %edx # best_len, best_len.457
        leal    (%ecx,%eax), %edx       #, best_len.457
        cmpl    %edx, -88(%ebp) # best_len.457, best_len
        cmpl    -96(%ebp), %edx # nice_match.34, best_len.457
        leal    (%ecx,%eax), %edx       #, best_len.494
        cmpl    %edx, -68(%ebp) # best_len.494, best_len
        cmpl    -76(%ebp), %edx # nice_match.34, best_len.494

[uros@localhost gzip-1.2.4]$ grep best_len no-fp.s
        movl    %edx, 76(%esp)  #, best_len
        movl    76(%esp), %edx  # best_len,
        movl    76(%esp), %edx  # best_len, best_len.494
        movl    %edx, 76(%esp)  # best_len.494, best_len
        movl    76(%esp), %eax  # best_len,
        movl    76(%esp), %edx  # best_len, best_len.494
        movl    %edx, %ebp      # best_len.494, match_length
        movl    76(%esp), %edx  # best_len, best_len.494
        movl    %edx, %ebp      # prev_length.28, best_len
        movl    %ebp, %edx      # best_len, best_len.457
        movl    %edx, %ebp      # best_len.457, best_len
        movl    %ebp, %edx      # best_len, best_len.457
        cmpl    %esi, %edx      # lookahead, best_len.457
        movl    %ebp, %edx      # best_len, best_len.457
        leal    (%ecx,%eax), %edx       #, best_len.494
        cmpl    %edx, 76(%esp)  # best_len.494, best_len
        cmpl    68(%esp), %edx  # nice_match.34, best_len.494
        leal    (%ecx,%eax), %edx       #, best_len.457
        cmpl    %edx, %ebp      # best_len.457, best_len
        cmpl    52(%esp), %edx  # nice_match.34, best_len.457
Comment 22 Jan Hubicka 2008-02-06 19:22:06 UTC
Yes, there are a number of unlucky variables.  However, the real source here seems to be the always-wrong profile guiding regalloc to optimize for cold portions of the function, rather than a real increase in register pressure due to inlining.

In general, the inlining operation itself only decreases register pressure: you don't force function parameters/return value into fixed registers, and you know precisely which registers survive the body, so you don't need to save caller-saved registers when not needed.

The losses from inlining with our regalloc are partly due to callee-saved registers sometimes being more effective, sort of imitating live range splitting.  Increased register pressure is an effect of propagation from the function body to the rest of the program, but it is not that bad either: at least all the inlining-heuristic/RA bugs turned out to be something else.

The high speedup from the forwprop patch in 64-bit mode (and the slowdown in 32-bit) is actually also register allocation related: without the forwprop patch, the internal loop consisting of a sequence of ++ operations ends up with extra copy instructions, while with the patch we produce a normal induction variable.  On 32-bit, however, this results in regalloc putting that variable on the stack, because its live-range heuristics then give it lower priority.

For 32-bit data, the britten 32-bit SPEC tester peaked at 760, while we now get 620 at peak with -fomit-frame-pointer.  A 20% regression on a rather simple, commonly used codebase definitely makes us look stupid... even more so given that ICC 7.x did 820 on the same machine.  The 64-bit tester is approximately 830 versus 740.

Honza
Comment 23 Jan Hubicka 2008-02-07 12:30:51 UTC
Created attachment 15115
Annotated profile

I am attaching a dump with the profile read in.  It shows the hot spots in longest_match at least:

(this is the first conditional of the continue guard)
  # BLOCK 27 freq:10000 count:1346119696
  # PRED: 6 [100.0%]  count:112241556 (fallthru) 25 [99.5%]  count:1233878140 (true,exec)
  # scan_end_13 = PHI <scan_end_106(6), scan_end_14(25)>
  # scan_end1_11 = PHI <scan_end1_93(6), scan_end1_12(25)>
  # best_len_8 = PHI <best_len_25(6), best_len_9(25)>
  # scan_3 = PHI <scan_24(6), scan_6(25)>
  # chain_length_2 = PHI <chain_length_108(6), chain_length_105(25)>
  # cur_match_1 = PHI <cur_match_109(6), cur_match_104(25)>
  match_40 = &window + cur_match_1;
  best_len.31_41 = (unsigned int) best_len_8;
  D.2379_42 = match_40 + best_len.31_41;
  D.2380_43 = *D.2379_42;
  if (D.2380_43 != scan_end_13)
    goto <bb 10>;
  else
    goto <bb 7>;

  # SUCC: 10 [0.1%]  count:33977 (true,exec) 11 [99.9%]  count:48979565 (false,exec)

  # BLOCK 10 freq:9636 count:1297140131
  # PRED: 27 [87.5%]  count:1177665163 (true,exec) 7 [55.2%]  count:93018627 (true,exec) 8 [35.0%]  count:26422364 (true,exec) 9 [0.1%]  count:33977 (true,exec)
  goto <bb 24>;

(this is the continue statement)

  D.2391_102 = cur_match_1 & 32767;
  D.2392_103 = prev[D.2391_102];
  cur_match_104 = (IPos) D.2392_103;
  if (limit_15 >= cur_match_104)
    goto <bb 26>;
  else
    goto <bb 25>;
  # SUCC: 26 [7.7%]  count:104056913 (true,exec) 25 [92.3%]  count:1240391903 (false,exec)

  # BLOCK 25 freq:9215 count:1240391903
  # PRED: 24 [92.3%]  count:1240391903 (false,exec)
  chain_length_105 = chain_length_2 + 0x0ffffffff;
  if (chain_length_105 != 0)
    goto <bb 27>;
  else
    goto <bb 26>;


(this is the end of the outer loop)
Comment 24 Jan Hubicka 2008-02-08 15:11:27 UTC
Hi,
tonight's runs with the continue heuristics again show improvements in the 64-bit scores, but degradation in the 32-bit scores.  Looking into the loop, the real trouble seems to be that the main loop has 6 loop-carried variables:

scan_end, scan_end1, best_len, scan, chain_length, cur_match

plus a few temporaries are needed too.  Obviously we can't fit these into registers on i386.  Making the profile more realistic sometimes helps and sometimes hurts, on a pretty much random basis.

One case where I think register pressure is increased is the fact that the different SSA names of the scan_end and scan_end1 variables are not fully coalesced in out-of-SSA.  This is a result of optimizing:

        if (match[best_len] != scan_end ||
            match[best_len-1] != scan_end1 ||
            *match != *scan ||
            *++match != scan[1]) continue;
       ...later code sometimes modifying scan_end....

into computing match[best_len] into an SSA name of scan_end that is sometimes assigned in the later code on the path not modifying scan_end.  As a result we have two scan_ends live at once.  I wonder if we can avoid this behaviour; although it looks all right in SSA form, avoiding it would save 2 "global" registers: there is no need at all to cache match[best_len]/match[best_len-1] in a register, unless I missed something.  Those two vars are manipulated on the hot paths through the loop.

Now the RA is driven by frequencies (a bit confused by the fact that two of the loop-carried vars are split) and by their "live ranges", which is actually the number of instructions between the first and last occurrence.  Since we are a bit careless with BB ordering, moving some code to the very end of the function, this heuristic is not realistic at all.  It would probably make more sense to replace it by the number of insns the variable is live across, but that is probably ninsn*npseudos to compute.  Another idea would be the degree in the conflict graph, but I am not sure we want to start such experiments in parallel with YARA.

I tested YARA and it does not handle this situation much better. Perhaps Vladimir can help?

Honza
Comment 25 Jan Hubicka 2008-02-08 15:39:24 UTC
-fno-tree-dominator-opts -fno-tree-copyrename solves the coalescing problem (the name is introduced by the second pass, the actual problematic pattern by the first), saving roughly 1s at -O2 and 2s at -O3; -O3 is still worse, however.
The internal loop no longer spills; it just reads the value of scan_end stored in memory.

I will play with it more later and make a simple testcase for this.
Honza
Comment 26 Jan Hubicka 2008-09-06 12:00:06 UTC
IRA seems to fix the remaining problem with the spill in the internal loop on 32-bit nicely, so we produce good gzip scores compared to older GCC versions.
http://gcc.opensuse.org/SPEC-britten/CINT/sandbox-britten-32bit/164_gzip_big.png
and with profile feedback http://gcc.opensuse.org/SPEC-britten/CINT/sandbox-britten-FDO/164_gzip_big.png we get close to ICC scores.

We now output the comparison loop as:
.L98:   
        movzbl  1(%eax), %edx   #,
        leal    1(%eax), %edi   #, scan
        cmpb    1(%ecx), %dl    #,
        jne     .L161   #,
        movzbl  2(%eax), %edx   #,
        leal    2(%eax), %edi   #, scan
        cmpb    2(%ecx), %dl    #,
        jne     .L161   #,
        movzbl  3(%eax), %edx   #,
        leal    3(%eax), %edi   #, scan
        cmpb    3(%ecx), %dl    #,
        jne     .L161   #,
        movzbl  4(%eax), %edx   #,
        leal    4(%eax), %edi   #, scan
        cmpb    4(%ecx), %dl    #,
        jne     .L161   #,
        movzbl  5(%eax), %edx   #,
        leal    5(%eax), %edi   #, scan
        cmpb    5(%ecx), %dl    #,
        jne     .L161   #,
        movzbl  6(%eax), %edx   #,
        leal    6(%eax), %edi   #, scan
        cmpb    6(%ecx), %dl    #,
        jne     .L161   #,
        movzbl  7(%eax), %edx   #,
        leal    7(%eax), %edi   #, scan
        cmpb    7(%ecx), %dl    #,
        jne     .L161   #,

There is still room for improvement, however.

The remaining problem is that we still miss coalescing of scan_end and scan_end1 (so -fno-tree-dominator-opts -fno-tree-copyrename still helps).

Vladimir, perhaps this can be solved in IRA too?

Honza
Comment 27 Jan Hubicka 2008-09-06 12:02:54 UTC
Also, I just noticed that the offline copy of longest_match gets an extra move:
.L15:   
        movzbl  2(%eax), %edi   #, tmp87
        leal    2(%eax), %ecx   #, scan.158
        movl    %edi, %edx      # tmp87,
        cmpb    2(%ebx), %dl    #,
        jne     .L6     #,
        movzbl  3(%eax), %edi   #, tmp88
        leal    3(%eax), %ecx   #, scan.158
        movl    %edi, %edx      # tmp88,
        cmpb    3(%ebx), %dl    #,
        jne     .L6     #,      
        movzbl  4(%eax), %edi   #, tmp89
        leal    4(%eax), %ecx   #, scan.158
        movl    %edi, %edx      # tmp89,
        cmpb    4(%ebx), %dl    #,
        jne     .L6     #,
        movzbl  5(%eax), %edi   #, tmp90
        leal    5(%eax), %ecx   #, scan.158
        movl    %edi, %edx      # tmp90,
        cmpb    5(%ebx), %dl    #,
        jne     .L6     #,
 
while inlined copy is fine:
.L98:   
        movzbl  1(%eax), %edx   #,
        leal    1(%eax), %edi   #, scan
        cmpb    1(%ecx), %dl    #,
        jne     .L161   #,
        movzbl  2(%eax), %edx   #,
        leal    2(%eax), %edi   #, scan
        cmpb    2(%ecx), %dl    #,
        jne     .L161   #,
        movzbl  3(%eax), %edx   #,
        leal    3(%eax), %edi   #, scan
        cmpb    3(%ecx), %dl    #,
        jne     .L161   #,
        movzbl  4(%eax), %edx   #,
        leal    4(%eax), %edi   #, scan
        cmpb    4(%ecx), %dl    #,
        jne     .L161   #,
interesting :)
Comment 28 Uroš Bizjak 2008-09-06 15:26:35 UTC
(In reply to comment #27)
> Also, I just noticed that the offline copy of longest_match gets an extra move:
> .L15:   
>         movzbl  2(%eax), %edi   #, tmp87
>         leal    2(%eax), %ecx   #, scan.158
>         movl    %edi, %edx      # tmp87,
>         cmpb    2(%ebx), %dl    #,
>         jne     .L6     #,
>
> interesting :)

Perhaps due to ineffective regmove (similar to PR 37364)?
Comment 29 Steven Bosscher 2013-12-24 23:17:51 UTC
Comment on attachment 15079
address accumulation patch

Found not to be helpful.
It would need serious updating anyway for the gimple tuples world.
Comment 30 Steven Bosscher 2013-12-24 23:26:05 UTC
Comment on attachment 15108
Complete continue heuristic patch

On trunk since forever:
http://gcc.gnu.org/ml/gcc-patches/2008-02/msg00289.html
Comment 31 Steven Bosscher 2013-12-24 23:28:24 UTC
Is there still a bug here?
Comment 32 Richard Biener 2013-12-29 11:34:49 UTC
Not sure - at least the subject is highly misleading, and the same goes for the component.
Looking at the gzip history on our testers (AMD only, K8 and K10, with only K8 going back to 4.[123]), gzip scores don't vary a lot, but they have bumps up and down.

Now it would be interesting to produce reportable runs for actual FSF releases
as well.

Let's close this bug as fixed.