[patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

Thu Oct 4 15:36:00 GMT 2012

On 10/04/2012 03:24 AM, Steven Bosscher wrote:
> On Thu, Oct 4, 2012 at 8:57 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Thu, Oct 4, 2012 at 5:30 AM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>>> I was going to look at this code too but I was interesting in generation of
>>> less points and live ranges.  It is strange that in my profiles,
>>> remove_some_program_points_and_update_live_ranges takes 0.6% of compiler
>>> time on these huge tests.   So I was not interesting to speed up the
>>> function and may be therefore you have no visible change in compilation
>>> time.
>> Right. The compression algorithm doesn't care much about the initial
>> number of program points, only about the number of live ranges before
>> and after compression. I had expected a bigger effect on the number of
>> live ranges before compression.
>>
>> 0.6% sounds really very different from my timings. How much time does
>> create_start_finish_chains take for you?
>>
>>
>>> I don't object the idea of the patch.  I need some time to look at it (the
>>> different results on a function is a bit scary for me) and check simulator
>>> times on other tests.
>> Understood.
> BTW, it would be great if you can also look at this additional patch hunk:
>
> @@ -994,8 +1044,8 @@ lra_create_live_ranges (bool all_p)
>     curr_point = 0;
>     point_freq_vec = VEC_alloc (int, heap, get_max_uid () * 2);
>     lra_point_freq = VEC_address (int, point_freq_vec);
> -  FOR_EACH_BB (bb)
> -    process_bb_lives (bb);
> +  FOR_EACH_BB_REVERSE (bb)
> +    process_bb_lives (bb, curr_point);
>     lra_live_max_point = curr_point;
>     create_start_finish_chains ();
>     if (lra_dump_file != NULL)
>
> I think this should result in more live ranges being merged. Here's
> why I think so, based on my far worse understanding of this code than
> yours, so forgive me if I'm Completely Wrong :-)
No, you are not wrong.
Two days ago, I worked on patch which contains the same code.  The patch 
actually takes EBB into account to decrease # calls of mark_pseudo_live 
at the beginning of process_bb_lives and mark_pseudo_dead at the 
function end and for that I needed FOR_EACH_BB_REVERSE.  The patch was 
half baked (it did not checked hard regs live changes at the end of BB 
to set up right hard reg conflicts for pseudos) but it gave an idea how 
much I can get from this.  It is not bad but not what I expected.  So I 
stopped work on this.  But we still should work on these ideas as they 
improve LRA speed in small steps (many small steps will create a visible 
effect).

We can really solve scalability problem only by using simpler but still 
good enough algorithms (too simple algorithms result in big code size 
and actually even in worse compilation times).  I've been working on it 
and I'll send a patch soon.
> process_bb_lives walks insns in the basic block from last to first, so
> say you have a basic block chain 1->2->3, and each block has 4 insns,
> then AFAIU the program points in block 1 will be [4,3,2,1], in block 2
> it will be [8,7,6,5], and in block 3 it will be [12,11,10,9]. Say a
> reg is used in block 3 at point 11, and set in block at point 3. Then
> this reg will have a live range chain [3-1],[8-5],[12-11].
>
> If you visit the basic blocks in reverse order, the program points
> will be: 1:[12,11,10,9], 2:[8,7,6,5], 3:[4,3,2,1]. Now the same reg
> will be set at point 11 and used at point 3, and the live range chain
> will be just [11-3].
>
>