wide-int more performance fixes for wide multiplication.

Richard Biener rguenther@suse.de
Mon Dec 16 18:39:00 GMT 2013


Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
>On 12/16/2013 09:37 AM, Richard Biener wrote:
>> Kenneth Zadeck <zadeck@naturalbridge.com> wrote:
>>> On 12/16/2013 06:19 AM, Richard Biener wrote:
>>>> On 12/15/13 7:48 PM, Kenneth Zadeck wrote:
>>>>> On 12/15/2013 11:40 AM, Richard Sandiford wrote:
>>>>>> Kenneth Zadeck <zadeck@naturalbridge.com> writes:
>>>>>>> it is certainly true that in order to do an unbounded set of
>>>>>>> operations, you would have to check on every operation.  so my
>>>>>>> suggestion that we should remove the checking from the infinite
>>>>>>> precision would not support this.  but the reality is that there
>>>>>>> are currently no places in the compiler that do this.
>>>>>>>
>>>>>>> Currently all of the uses of widest-int are one or two operations,
>>>>>>> and the style of code writing is that you do these and then you
>>>>>>> deal with the overflow at the time that you convert the widest-int
>>>>>>> to a tree.  I think that it is important to maintain the style of
>>>>>>> programming where, for a small finite number of computations, you
>>>>>>> do not need to check until you convert back.
>>>>>>>
>>>>>>> The problem with making the buffer size so tight is that we do not
>>>>>>> have adequate reserves to allow this style for any supportable
>>>>>>> type.  I personally think that 2x + some small n is what we need
>>>>>>> to have.
>>>>>>>
>>>>>>> i am not as familiar with how this is used (or to be used when
>>>>>>> all of the offset math is converted to use wide-int), but there
>>>>>>> appear to be two uses of multiply.  one is the "harmless" mult by
>>>>>>> 3 and the other is where people are trying to compute the size of
>>>>>>> arrays.  These last operations do need to be checked for overflow.
>>>>>>> The question here is do you want to check those operations for
>>>>>>> overflow individually or do you want to check when you convert
>>>>>>> out.  Again, i think 2x + some small number is what we might want
>>>>>>> to consider.
>>>>>> It's a fair question, but personally I think checking for overflow
>>>>>> on the operation is much more robust.  Checking on conversion
>>>>>> doesn't allow you to stop thinking about overflow, it just changes
>>>>>> the way you think about it: rather than handling explicit overflow
>>>>>> flags, you have to remember to ask "is the range of the unconverted
>>>>>> result within the range of widest_int", which I bet is something
>>>>>> that would be easily forgotten once widest_int & co. are part of
>>>>>> the furniture.
>>>>>>
>>>>>> E.g. the SPARC operation (picked only because I remember it):
>>>>>>
>>>>>>         for (i = 0; i < VECTOR_CST_NELTS (arg0); ++i)
>>>>>>           {
>>>>>>             tree e0 = VECTOR_CST_ELT (arg0, i);
>>>>>>             tree e1 = VECTOR_CST_ELT (arg1, i);
>>>>>>
>>>>>>             bool neg1_ovf, neg2_ovf, add1_ovf, add2_ovf;
>>>>>>
>>>>>>             tmp = wi::neg (e1, &neg1_ovf);
>>>>>>             tmp = wi::add (e0, tmp, SIGNED, &add1_ovf);
>>>>>>             if (wi::neg_p (tmp))
>>>>>>               tmp = wi::neg (tmp, &neg2_ovf);
>>>>>>             else
>>>>>>               neg2_ovf = false;
>>>>>>             result = wi::add (result, tmp, SIGNED, &add2_ovf);
>>>>>>             overflow |= neg1_ovf | neg2_ovf | add1_ovf | add2_ovf;
>>>>>>           }
>>>>>>
>>>>>>         gcc_assert (!overflow);
>>>>>>
>>>>>>         return wide_int_to_tree (rtype, result);
>>>>>>
>>>>>> seems pretty natural.  If instead it was modelled as a widest_int
>>>>>> chain without overflow then it would be less obviously correct.
>>>>>>
>>>>>> Thanks,
>>>>>> Richard
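(For contrast, a rough sketch of the "check only at conversion" style being
argued against here -- illustrative only, not from any GCC source; it assumes
wi::to_widest, wi::abs and wi::fits_to_tree_p behave as their names suggest:)

      widest_int result = 0;
      for (unsigned int i = 0; i < VECTOR_CST_NELTS (arg0); ++i)
        {
          /* Work in widest_int; no per-operation overflow flags.  */
          widest_int e0 = wi::to_widest (VECTOR_CST_ELT (arg0, i));
          widest_int e1 = wi::to_widest (VECTOR_CST_ELT (arg1, i));
          result += wi::abs (e0 - e1);
        }

      /* The single overflow question is deferred to this point: does the
         accumulated value still fit the result type?  */
      gcc_assert (wi::fits_to_tree_p (result, rtype));
      return wide_int_to_tree (rtype, result);

Whether the intermediate sums can be assumed to fit widest_int is exactly the
buffer-size question debated below.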
>>>>> Let us for the sake of argument assume that this was common code
>>>>> rather than code in a particular port, because code in a particular
>>>>> port can know more about the environment than common code is
>>>>> allowed to.
>>>>>
>>>>> My main point is that this code is in wide-int not widest-int
>>>>> because at this level the writer of this code actually wants to
>>>>> model what the target wants to do.  So doing the adds in precision
>>>>> and testing overflow is perfectly fine at every step.  But this loop
>>>>> CANNOT be written in a style where you test the overflow at the end,
>>>>> because if this is common code you cannot make any assumptions about
>>>>> the largest mode on the machine.  If the buffer was 2x + n in size,
>>>>> then it would be reasonably safe to assume that the number of
>>>>> elements in the vector could be represented in an integer and so you
>>>>> could wait till the end.
>>>>>
>>>>> I think that my point (and i feel a little uncomfortable putting
>>>>> words in richi's mouth, but i believe that this was his point early
>>>>> on) was that he thinks of the widest int as an infinite precision
>>>>> representation.  he was the one who was pushing for the entire rep
>>>>> to be done with a large internal (or perhaps unbounded) rep because
>>>>> he felt that it was more natural not to have to think about
>>>>> overflow.  He wanted you to be able to chain a mult and a divide and
>>>>> not see the product get truncated before the divide was done.  The
>>>>> rep that we have now really sucks with respect to this because
>>>>> widest int truncates if you are close to the largest precision on
>>>>> the machine and does not if you are small with respect to that.
>>>>>
>>>>> My other point is that while you think that the example above is
>>>>> nice, the experience with double-int is contrary to this.  people
>>>>> will use (and test) the normal modes and anyone trying to use large
>>>>> modes will die a terrible death of a thousand cuts.
>>>> Well - the cases that matter in practice are
>>>>
>>>> 1) the things we have offset_int for - code that does bit vs. byte
>>>> quantity calculations on addresses or address offsets.  It used
>>>> either HWI before (and probably still does, and thus is buggy) or
>>>> double-int.  The usual pattern was/is to do host_integerp (t, 0)
>>>> and then TREE_LOW_CST (t) * BITS_PER_UNIT (oops) or blindly assume
>>>> that doing things in double-int works (which it does in practice).
>>>>
>>>> 2) passes that want to know whether a single operation overflows
>>>>
>>>> the multiple-operation, check-overflow-after-the-fact style is seldom
>>>> used - where it is, it is mainly from the frontends which use trees
>>>> and thus get a sticky TREE_OVERFLOW.  Yes, infinite precision
>>>> would make this work as well, and yes, originally I thought of
>>>> basing all of wide-int on an internally infinite precision
>>>> implementation (and luckily we are close enough that I may end
>>>> up fixing the implementation detail to work that way ...).
>>>> With the infinite precision internal rep you'd have explicit
>>>> truncations and sign-/zero-extensions at the right point and
>>>> failing to do that before conversion to tree/RTX could have been
>>>> easily turned into ICEs saying we overflowed and nobody cared.
>>>>
>>>> Well.  Let's see how the thing we have now works out.
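(A rough illustration of point 1 in the message above, the bit vs. byte
offset pattern; byte_off is a made-up tree operand, host_integerp and
tree_low_cst are the pre-wide-int accessors named above, and the offset_int
variant is only a sketch of how the wide-int branch is meant to be used:)

      /* Pre-wide-int style: the byte offset fits a HOST_WIDE_INT, but the
         conversion to bits is done in HWI arithmetic and can silently wrap
         for very large offsets.  */
      if (host_integerp (byte_off, 0))
        {
          HOST_WIDE_INT bit_off = tree_low_cst (byte_off, 0) * BITS_PER_UNIT;
          /* ... use bit_off ... */
        }

      /* offset_int style: the type already carries the extra bits needed
         for byte->bit scaling of any address offset, so the multiply
         itself cannot wrap.  */
      offset_int bit_off = wi::to_offset (byte_off) * BITS_PER_UNIT;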
>>> richi,
>>>
>>> i think that you miss the point.  right now on the x86-64, the max
>>> size is 128 + 64 bits.  On my private platform it is 256 + 64 bits.
>>>
>>> if, at the tree level, i do a widest-int multiply of two TImode
>>> values that are large with respect to their precision, i get
>>> different answers on the two targets.  95% of the existing trunk bugs
>>> that we fixed were because people thought that double-int was big
>>> enough so that they did not have to care.  But what you and richard
>>> are saying is that you do need to care (even though none of the code
>>> currently does care), but only for rarely executed cases.  With
>>> widest-int and a larger buffer we can make that expectation true in
>>> all of the places where you get in, do some finite work and get out.
>>> In my opinion, being able to do this is the only reason for
>>> widest-int, and if this does not work reliably, then i would be
>>> inclined to convert everything to the explicit precision where the
>>> definition of what is done is rock solid.  Because to me, if the rep
>>> is not completely predictable, then it is not worth using.
>> At the moment we do not have the infinite precision code.  We have
>> offset_int, which effectively has a precision, and widest int, which
>> is indeed somewhat useless.  We have fixed wide ints where you specify
>> your desired "infinite" precision, which is better.
>>
>> Note that we have shifted the double-int host dependence to a
>> widest-int target dependence.  Which is equally bad.
>>
>> Richard.
> yes, my point is that it does not need to be this bad; we can make the
> buffer large enough to hide this, at least for the kind of bounded
> computation that we have in the compiler.  Having written most of this
> code, we tend to get in, do a couple of things and then get out.  If
> we up the buffer size, that will always work reliably.
>
> I realize that it will not be perfect, in that if you do enough
> computation, even of bounded size, you would still hit the limit.  But
> right now every multiply (except the ones by 3 in the offset-int code)
> is a source of failure.  We can do a lot better than that by upping the
> buffer size.
>
> you and i have slightly different experiences here.  i did not do a
> lot of the offset-int code and so i am not as familiar with the
> potential for bugs here.  But I did a significant amount of the
> widest-int code, and for that, 2x + 2 bits makes all the potential
> problems go away.
>
> my guess is that the non-3x multiplies in offset-int most likely need
> to be checked.
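(For reference, the back-of-the-envelope arithmetic behind the "2x + 2 bits"
figure above; this bound is not stated elsewhere in the thread:)

    |a| <= 2^(N-1) and |b| <= 2^(N-1)
      => |a * b| <= 2^(2N-2), which fits in 2N signed bits
    summing k such products:
      |sum| <= k * 2^(2N-2), i.e. roughly ceil(log2 k) extra bits

so one multiply followed by a handful of additions stays within 2N + 2 bits,
matching the 2x-plus-a-small-n proposal.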

As long as we have persistent widest ints we cannot make them large.  Btw,
most arithmetic is from user code and thus subject to language constraints,
including overflow.  It's only when the compiler introduces artificial ops
that possible issues start.

Richard.

>>> Kenny
>>>
>>>
>>>
>>>> Richard.
>>>>
>>



