[RFC] [P2] [PR tree-optimization/33562] Lowering more complex assignments.
Richard Biener
richard.guenther@gmail.com
Wed Feb 17 10:48:00 GMT 2016
On Wed, Feb 17, 2016 at 8:30 AM, Jeff Law <law@redhat.com> wrote:
> On 02/14/2016 11:38 AM, Richard Biener wrote:
>>
>> On February 14, 2016 5:35:13 PM GMT+01:00, Jeff Law <law@redhat.com>
>> wrote:
>>>
>>> On 02/12/2016 10:21 AM, Jeff Law wrote:
>>>>>
>>>>> But really we simply need a better DSE algorithm.
>>>>
>>>> So the way to fix DSE is to keep the existing algorithm and track the
>>>> hunks of Complex/aggregates that have been set a second time.
>>>>
>>>> My first thought was to implement this as an inverted bitmap, i.e., set
>>>> it to 1 for every byte in the complex/aggregate that is set by the first
>>>> store.  Clear bits for each byte in subsequent stores to the pieces.  If
>>>> the bitmap reaches an empty state, then the initial store is dead.
>>>>
>>>> Adjusting *ref could work too, but would probably be painful if the
>>>> subsequent stores don't happen in a convenient order.
>>>
>>> So that was scary easy. We should have done this a long time ago.
>>>
>>> Essentially I call ao_get_ref_base to get the offset/size/max_size
>>> filled in for the first statement. Those are used to initialize the
>>> live bytes bitfield, as long as max_size != -1.
>>>
>>> Then when we have a possible killing statement, we use the ao in a
>>> similar manner to determine which bytes to clear (taking care that the
>>> base is the same between the two references and that in the killing
>>> statement the size/max_size are the same).
>>>
>>> When all the live bytes are zero then we've killed the original
>>> statement.
>>>
>>> It's ~20 lines of code.
>>>
>>> I need to pull together some additional tests, but it looks likely
>>> we'll
>>> be able to wrap this up easily for gcc-6.
>>
>>
>> BTW, we had something like this before, but it had both correctness and,
>> more importantly, scalability issues.
>
> Just a couple more tidbits.
>
> I instrumented a bootstrap -- the improved DSE finds ~20k additional DSE
> opportunities during a GCC bootstrap that could not be found by the current
> DSE. Yes, 20k additional statements deleted by tree DSE. Yow!
Well, DCE can also do quite some DSE, and it runs after DSE - did those 20k
additional DSE opportunities affect the overall end result?
> Of those additional opportunities, none requires bit-level tracking.  So we
> can just punt when the size/offset is not byte sized/aligned.
Yep.  I expect us to eventually lower all that bit-precision stuff.
> Of those additional opportunities > 99% are for sizes of 64 bytes or
> smaller. Thus we can pack those into 1 or 2 bitmap elements, depending on
> the starting offset. So the bitmap side will be efficient with no real
> searching if we choose our PARAM value wisely.
So then please use a uint64_t or even uint32_t mask.  Which means
a fixed-size SBITMAP (32 bits) if you like to use the bitmap interface.
> The outliers are, well, strange. There were cases where we found new DSE
> opportunities for objects of size 2k bytes or larger. There weren't many of
> these, but I was surprised at the size. Most likely it's a clobber or mem*
> thing that's participating in DSE. But I haven't looked closely at those
> cases yet.
I suspect it's memset followed by actually initializing all elements. We have
quite some of those I think.
> There's a ton of statements that are clobbering zero-sized objects. My code
> can determine when those clobbers are redundant (with some later clobber),
> but I haven't looked closely to see if that's actually a good thing to do or
> not.
>
> Anyway, I still don't see anything which makes me think this can't wrap-up
> in the immediate future.
Can you share your work-in-progress patch?
Thanks,
Richard.
> jeff