[PATCH] Introduce 4-stages profiledbootstrap to get a better profile.

Jan Hubicka hubicka@ucw.cz
Mon May 29 15:11:00 GMT 2017


> On 05/25/2017 01:22 PM, Markus Trippelsdorf wrote:
> > On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
> >> Hi.
> >>
> >> As I discussed PGO with Honza and Richi, the current 3-stage bootstrap is not ideal for
> >> the following 2 reasons:
> >>
> >> 1) the stageprofile compiler is trained only on the libraries that are built during stage2
> >> 2) apart from that, as the compiler is also used to build the final compiler, the profile
> >> is being updated during the build. So the stage2 compiler is making different decisions.
> >>
> >> Both problems can be resolved by adding another step in between the current stage2 and stage3
> >> where we train the stage2 compiler by building the compiler with default options.
> >>
> >> I'm going to do some measurements.
> > 
> > I did some measurements on gcc67 (trunk with --enable-checking=release).
> > The apparent speedup is in the noise.
> 
> Hello.
> 
> Thanks for the measurements.
> 
> I can see a difference for GCC 7.1:
> 
> g++-7 tramp3d-v4.ii -O2 && time for i in `seq 1 10` ; do g++-7 tramp3d-v4.ii -O2 ; done
> 
> before: 2m25.133s
> after: real	2m25.133s
> 
> which is 99.09124426480228%. It's probably within the noise level.
> 
> And apparently the file size of the binary is bigger:
> 
> before (using bloaty):
> 
>      VM SIZE                         FILE SIZE
>  --------------                   --------------
>   59.0%  15.1Mi .text              15.1Mi  62.3%
>   21.3%  5.45Mi .rodata            5.45Mi  22.5%
>    6.6%  1.69Mi .eh_frame          1.69Mi   6.9%
>    5.4%  1.38Mi .bss                    0   0.0%
>    3.3%   874Ki .dynstr             874Ki   3.5%
>    1.8%   480Ki .dynsym             480Ki   1.9%
>    1.1%   285Ki .eh_frame_hdr       285Ki   1.1%
>    0.6%   158Ki .gnu.hash           158Ki   0.6%
>    0.5%   144Ki .hash               144Ki   0.6%
>    0.2%  44.4Ki .data              44.4Ki   0.2%
>    0.2%  40.0Ki .gnu.version       40.0Ki   0.2%
>    0.0%  11.1Ki .rela.plt          11.1Ki   0.0%
>    0.0%  7.44Ki .plt               7.44Ki   0.0%
>    0.0%  4.56Ki .data.rel.ro       4.56Ki   0.0%
>    0.0%  3.73Ki .got.plt           3.73Ki   0.0%
>    0.0%      38 [Unmapped]         2.75Ki   0.0%
>    0.0%     624 [ELF Headers]      2.55Ki   0.0%
>    0.0%     848 [Other]            1.13Ki   0.0%
>    0.0%     917 .gcc_except_table     917   0.0%
>    0.0%     608 .dynamic              608   0.0%
>    0.0%      16 [None]                  0   0.0%
>  100.0%  25.7Mi TOTAL              24.3Mi 100.0%
> 
> after:
> 
>      VM SIZE                     FILE SIZE
>  --------------               --------------
>   58.3%  14.6Mi .text          14.6Mi  54.2%
>   21.6%  5.41Mi .rodata        5.41Mi  20.1%
>    0.0%       0 .strtab        2.13Mi   7.9%
>    6.7%  1.67Mi .eh_frame      1.67Mi   6.2%
>    5.5%  1.38Mi .bss                0   0.0%
>    0.0%       0 .symtab        1.11Mi   4.1%
>    3.4%   876Ki .dynstr         876Ki   3.2%
>    1.9%   480Ki .dynsym         480Ki   1.7%
>    1.1%   280Ki .eh_frame_hdr   280Ki   1.0%
>    0.6%   158Ki .gnu.hash       158Ki   0.6%
>    0.6%   144Ki .hash           144Ki   0.5%
>    0.2%  44.4Ki .data          44.4Ki   0.2%
>    0.2%  40.1Ki .gnu.version   40.1Ki   0.1%
>    0.0%  11.1Ki .rela.plt      11.1Ki   0.0%
>    0.0%  7.44Ki .plt           7.44Ki   0.0%
>    0.0%  4.56Ki .data.rel.ro   4.56Ki   0.0%
>    0.0%  3.73Ki .got.plt       3.73Ki   0.0%
>    0.0%      58 [Unmapped]     3.11Ki   0.0%
>    0.0%     624 [ELF Headers]  2.61Ki   0.0%
>    0.0%  2.32Ki [Other]        2.60Ki   0.0%
>    0.0%      16 [None]              0   0.0%
>  100.0%  25.1Mi TOTAL          26.9Mi 100.0%
> 
> As I discussed with Honza, we still have a problem in GCC: with the current working sets,
> get_hot_bb_threshold () is very close to the number of runs, which is effectively 1 for a single
> run. That's a mistake and it should be fixed.

Yep, with LTO+PGO bootstrap I think we also hit the problem that the PGO inliner was never
seriously tuned (we basically use the very first badness metric I introduced and we never
experimented with the parameters). The reason is that hot/cold partitioning, even when it
is very coarse, works reasonably well for the per-file compilation model.  With LTO we
are facing very many inline decisions and there is probably a lot of low-hanging fruit.

GCC is currently in transition to the new profile counter code.  I will push out the initial
patch retiring gcov_type soon (once I finish updating it to the current tree - it is very
annoying) and that will let us track hotness more conservatively and fix the old
problem that a count becomes unrealistically low because of broken profile updates and thus
becomes cold.  This should make it possible to increase the threshold and start
re-tuning (hopefully this or next week).
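
To make the threshold problem concrete, here is a rough sketch of how a
working-set based hot threshold can collapse to almost the number of runs.
It is a simplified model and not GCC's actual code: the function names, the
999 permille cutoff and the counter values are assumptions chosen only for
illustration.

```python
def hot_bb_threshold(counters, permille=999):
    """Return the smallest counter in the working set that covers
    `permille`/1000 of all executions (simplified, hypothetical model)."""
    total = sum(counters)
    covered = 0
    # Walk counters from hottest to coldest until the cutoff is covered.
    for c in sorted(counters, reverse=True):
        covered += c
        if covered * 1000 >= total * permille:
            return c
    return 0


def maybe_hot(count, runs, threshold):
    """A block executed at most once per training run is never hot;
    otherwise it is hot when it reaches the threshold."""
    if count <= runs:
        return False
    return count >= threshold


# Made-up profile from a single training run: a few very hot blocks
# and a long tail of blocks that executed only twice.
counters = [1000] * 5 + [2] * 10000
runs = 1

threshold = hot_bb_threshold(counters)
print("threshold:", threshold)
print("block with count 2 is hot:", maybe_hot(2, runs, threshold))
```

With a long, flat tail like this the cutoff is only reached deep inside the
tail, so the threshold ends up just above the run count and nearly every
block that executed more than once is treated as hot.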

Honza
> 
> Martin


