This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: patch to fix constant math - 4th patch - the wide-int class - patch ping for the next stage 1


On 04/19/2013 09:31 AM, Richard Biener wrote:
+   number of elements of the vector that are in use.  When LEN *
+   HOST_BITS_PER_WIDE_INT < the precision, the value has been
+   compressed.  The values of the elements of the vector greater than
+   LEN - 1. are all equal to the highest order bit of LEN.

equal to the highest order bit of element LEN - 1. ?
Fixed, you are correct.

I have gone through the entire wide-int patch to clean this up. The bottom line is that if the precision is not a multiple of the size of a HWI, then everything above that precision is assumed to be identical to the sign bit.
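To make that concrete, here is a small illustrative sketch of the encoding rules (not the patch's code; the helper names are made up, and only HOST_WIDE_INT from hwint.h is assumed):

/* Drop leading limbs that merely repeat the sign of the limb below
   them; they can be reconstructed on demand.  */
static unsigned int
compressed_len (const HOST_WIDE_INT *val, unsigned int len)
{
  while (len > 1
         && val[len - 1] == (val[len - 2] < 0 ? (HOST_WIDE_INT) -1 : 0))
    len--;
  return len;
}

/* Reading element I at or above LEN yields the sign extension of
   element LEN - 1, which is also what fills the bits above the
   precision inside the top limb.  */
static HOST_WIDE_INT
elt (const HOST_WIDE_INT *val, unsigned int len, unsigned int i)
{
  if (i < len)
    return val[i];
  return val[len - 1] < 0 ? (HOST_WIDE_INT) -1 : 0;
}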

Especially _not_ equal to the precision - 1 bit of the value, correct?
I do not understand your question here, because in the case talked about above, the bit at precision - 1 would not have been explicitly represented.

Anyway, I went through this top part carefully and made many things clearer.
+   The representation does not contain any information inherant about
+   signedness of the represented value, so it can be used to represent
+   both signed and unsigned numbers.   For operations where the results
+   depend on signedness (division, comparisons), the signedness must
+   be specified separately.  For operations where the signness
+   matters, one of the operands to the operation specifies either
+   wide_int::SIGNED or wide_int::UNSIGNED.

The last sentence is somehow duplicated.
Fixed.

+   The numbers are stored as sign entended numbers as a means of
+   compression.  Leading HOST_WIDE_INTS that contain strings of either
+   -1 or 0 are removed as long as they can be reconstructed from the
+   top bit that is being represented.

I'd put this paragraph before the one that talks about signedness, next
to the one that already talks about encoding.
Done.
+   All constructors for wide_int take either a precision, an enum
+   machine_mode or tree_type.  */

That's probably no longer true (I'll now check).
Yes, you are correct.

+class wide_int {
+  /* Internal representation.  */
+
+  /* VAL is set to a size that is capable of computing a full
+     multiplication on the largest mode that is represented on the
+     target.  The full multiplication is use by tree-vrp.  tree-vpn
+     currently does a 2x largest mode by 2x largest mode yielding a 4x
+     largest mode result.  If operations are added that require larger
+     buffers, then VAL needs to be changed.  */
+  HOST_WIDE_INT val[WIDE_INT_MAX_ELTS];
+  unsigned short len;
+  unsigned int precision;

I wonder if there is a technical reason to stick to HOST_WIDE_INTs?
I'd say for efficiency HOST_WIDEST_FAST_INT would be more appropriate
(to get a 32bit value on 32bit x86 for example).  I of course see that
conversion to/from HOST_WIDE_INT is an important operation
that would get slightly more complicated.

Maybe just quickly checking the code generated on 32bit x86 for
HOST_WIDE_INT vs. HOST_WIDEST_FAST_INT tells us whether
it's worth considering (it would be bad if each add/multiply would
end up calling to libgcc for example - I know that doesn't happen
for x86, but maybe it would happen for an arm hosted gcc
targeting x86_64?)
This is an interesting point. My guess is that it is unlikely to be worth the work. Consider add: most machines have add with carry, and well-written 32-bit ports would have used an add-with-carry sequence rather than making the libcall. If I rewrite wide-int in terms of HOST_WIDEST_FAST_INT, then I have to write some messy code to compute the carry, which is unlikely to translate into the proper carry instructions. Not to mention the cost overhead of converting to and from HOST_WIDEST_FAST_INT, given that GCC is written almost entirely in terms of HWIs.

I thought about the idea of just converting the mul and div functions. This would be easy because I already reblock them into HOST_WIDE_HALF_INTs to do the math; I could just do a different reblocking. However, I think it is unlikely that doing this would ever show up in anyone's performance counts. Either way you do the same number of multiply instructions; it is just the subroutine wrapper that could possibly go away.
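For reference, the reblocking mentioned above is the usual schoolbook split into half-width digits; a minimal standalone sketch of a single HWI-by-HWI step (illustrative only, not the patch's routine):

/* Multiply two HWIs via half-width digits, producing the full
   double-width product without a wider type or a libgcc call.  */
static void
mul_full (unsigned HOST_WIDE_INT a, unsigned HOST_WIDE_INT b,
          unsigned HOST_WIDE_INT *lo, unsigned HOST_WIDE_INT *hi)
{
  const int half = HOST_BITS_PER_WIDE_INT / 2;
  const unsigned HOST_WIDE_INT mask
    = ((unsigned HOST_WIDE_INT) 1 << half) - 1;
  unsigned HOST_WIDE_INT a0 = a & mask, a1 = a >> half;
  unsigned HOST_WIDE_INT b0 = b & mask, b1 = b >> half;
  unsigned HOST_WIDE_INT p00 = a0 * b0, p01 = a0 * b1;
  unsigned HOST_WIDE_INT p10 = a1 * b0, p11 = a1 * b1;

  /* Sum of the three terms that land in the second digit; it cannot
     overflow an HWI.  */
  unsigned HOST_WIDE_INT mid = (p00 >> half) + (p01 & mask) + (p10 & mask);
  *lo = (p00 & mask) | (mid << half);
  *hi = p11 + (p01 >> half) + (p10 >> half) + (mid >> half);
}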

+  enum ShiftOp {
+    NONE,
+    /* There are two uses for the wide-int shifting functions.  The
+       first use is as an emulation of the target hardware.  The
+       second use is as service routines for other optimizations.  The
+       first case needs to be identified by passing TRUNC as the value
+       of ShiftOp so that shift amount is properly handled according to the
+       SHIFT_COUNT_TRUNCATED flag.  For the second case, the shift
+       amount is always truncated by the bytesize of the mode of
+       THIS.  */
+    TRUNC
+  };

double-int simply honors SHIFT_COUNT_TRUNCATED.  Why differ
from that (and thus change behavior in existing code - not sure if you
do that with introducing wide-int)?
I believe that GCC is supposed to be a little schizophrenic here, at least according to the documentation: when it is doing its own math it is defined to use SHIFT_COUNT_TRUNCATED, but when it is doing math for the program it is supposed to honor what the port does. The idea is that you pass in TRUNC when you want the shift op to do what the port does. Again, this goes back to the insistence that the optimizations just do what the machine would do, only earlier.
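Roughly, the intent is something like the following sketch (illustrative only; the real patch works on modes, and SHIFT_COUNT_TRUNCATED is a target macro, passed here as a plain flag):

/* How a shift routine might reduce its count.  */
static unsigned int
reduce_shift_count (unsigned int cnt, unsigned int precision,
                    bool trunc_op, bool shift_count_truncated)
{
  if (trunc_op)
    /* Emulating the target: reduce the count only if the port's
       shifter does.  */
    return shift_count_truncated ? cnt % precision : cnt;
  /* Internal service routine: always keep the count in range so the
     result is well defined.  */
  return cnt % precision;
}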

+  enum SignOp {
+    /* Many of the math functions produce different results depending
+       on if they are SIGNED or UNSIGNED.  In general, there are two
+       different functions, whose names are prefixed with an 'S" and
+       or an 'U'.  However, for some math functions there is also a
+       routine that does not have the prefix and takes an SignOp
+       parameter of SIGNED or UNSIGNED.  */
+    SIGNED,
+    UNSIGNED
+  };

You seem to insist on that.  It should propagate to the various parts
of the compiler that have settled for the 'uns' integer argument.
Having one piece behave different is just weird.  I suppose I will
find code like

     wi.ext (prec, uns ? UNSIGNED : SIGNED)

in your conversion?
This has been resolved on another thread.
+  static wide_int from_shwi (HOST_WIDE_INT op0,
+                            unsigned int precision, bool *overflow = 0);

+  if (precision < HOST_BITS_PER_WIDE_INT)
+    {
+      HOST_WIDE_INT t = sext_hwi (op0, precision);
+      if (t != op0 && overflow)
+       *overflow = true;
+      op0 = t;
+    }

Hm.  I'd say input values should be properly extended already (we certainly
expect that from RTL or tree constants).  So we might want to assert the
above instead of tracking it via *overflow.

+  static wide_int from_array (const HOST_WIDE_INT* op0,
+                             unsigned int len,
+                             unsigned int precision, bool need_canon = true);
+  inline static wide_int from_array (const HOST_WIDE_INT* op0,
+                                    unsigned int len,
+                                    enum machine_mode mode);
+  inline static wide_int from_array (const HOST_WIDE_INT* op0,
+                                    unsigned int len,
+                                    const_tree type);

I still don't like the overloads precision vs. machine_mode vs. tree type.
It's much more explanative what you specify here if you write

   from_array (&x, len, GET_MODE_PRECISION (mode))

instead of

   from_array (&x, len, mode)

and it's not that much more typing either.

+  static wide_int from_double_int (double_int,
+                                  unsigned int precision);
+  inline static wide_int from_double_int (double_int, enum machine_mode);

this one would lack the tree type overload (I of course think it has an
excessive overload for machine_mode).

+  inline HOST_WIDE_INT to_shwi (unsigned int prec = 0) const;
+  inline unsigned HOST_WIDE_INT to_uhwi (unsigned int prec = 0) const;

this is

    wi.sext (prec);
    wi.to_shwi ();

which is less odd than handling prec == 0 specially.

+  static wide_int max_value (unsigned int prec,
+                            unsigned int precision, SignOp sgn);

two precisions - ugh ;)  In some places having to have a precision is ugly :/
This is really necessary, unfortunately, though perhaps I can rearrange the operands so that one of them can be defaulted. The problem is that in VRP you want to know the max value of some small precision, but you want that value to be expressed as a number in a larger precision. Outside of tree-vrp, the rest of the compiler only uses one precision.
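As a worked example of why both precisions are needed (a standalone sketch with a hypothetical helper, not the wide_int API): VRP may want the largest value of an 8-bit type, 127 or 255, but held in a wide_int whose precision is, say, 32 bits.

/* The largest value representable in SMALL_PREC bits, to be given
   the wider precision by the caller.  Assumes SMALL_PREC is less
   than HOST_BITS_PER_WIDE_INT.  */
static HOST_WIDE_INT
max_of_prec (unsigned int small_prec, bool uns)
{
  unsigned HOST_WIDE_INT one = 1;
  if (uns)
    return (one << small_prec) - 1;       /* e.g. 255 for 8 bits */
  return (one << (small_prec - 1)) - 1;   /* e.g. 127 for 8 bits */
}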

+  inline static wide_int minus_one (unsigned int prec);
+  inline static wide_int zero (unsigned int prec);
+  inline static wide_int one (unsigned int prec);
+  inline static wide_int two (unsigned int prec);

here as well.  I see two (1) properly truncates the value to zero ;)
It just comes to my mind that these could have "arbitrary" precision.
"arbitrary" == MAX_SIZE * HOST_BITS_PER_WIDE_INT.  Which
would get us back to mixed precision operations ...
These are now pretty rarely used, since constants almost always appear as the second operand of a binary operation, and the first operand then conveys the precision.

+  /* Printing functions.  */
+
+  void print_dec (char *buf, SignOp sgn) const;
+  void print_dec (FILE *file, SignOp sgn) const;
+  void print_decs (char *buf) const;
+  void print_decs (FILE *file) const;
+  void print_decu (char *buf) const;
+  void print_decu (FILE *file) const;
+  void print_hex (char *buf) const;
+  void print_hex (FILE *file) const;

making those non-member functions would allow getting rid of stdio.h
from wide-int.h (similarly, making the tree / rtx / mode argument methods
either standalone or making them templates and externalizing the
implementation to a separate template class would get rid of the
rtl / tree dependency)

+  inline bool neg_p () const;

how can you test that?  wide-int doesn't have sign information.

+bool
+wide_int::neg_p () const
+{
+  return sign_mask () != 0;

not exactly efficient either ...

+HOST_WIDE_INT
+wide_int::sign_mask () const
+{
+  int i = len - 1;
+  if (precision < HOST_BITS_PER_WIDE_INT)
+    return ((val[0] << (HOST_BITS_PER_WIDE_INT - precision))
+           >> (HOST_BITS_PER_WIDE_INT - 1));
+
+  /* VRP appears to be badly broken and this is a very ugly fix.  */
+  if (i >= 0)
+    return val[i] >> (HOST_BITS_PER_WIDE_INT - 1);
+
+  gcc_unreachable ();
+#if 0
+  return val[len - 1] >> (HOST_BITS_PER_WIDE_INT - 1);
+#endif
+}

I can't see what's the "fix".  It seems to be equivalent to the commented out
code.  Apart from not handling len == 0 for which you ICE then.

Back to neg_p ... it computes wrong information.
from_uhwi (-1, HOST_BITS_PER_WIDE_INT).neg_p () returns true.

I suppose you want to rename it to msb () (implementing that efficiently
and using it from sign_mask, too)

+  template <typename T>
+    inline bool gt_p (T c, SignOp sgn) const;
+  template <typename T>
+    inline bool gts_p (T c) const;
+  template <typename T>
+    inline bool gtu_p (T c) const;

it's bad that we can't use the sign information we have available in almost
all cases ... (where precision is not an exact multiple of
HOST_BITS_PER_WIDE_INT
and len == precision / HOST_BITS_PER_WIDE_INT).  It isn't hard to encode
a sign - you just have to possibly waste a word of zeroes for positive
values where at the moment precision is an exact multiple of
HOST_BITS_PER_WIDE_INT and len == precision / HOST_BITS_PER_WIDE_INT.
Which of course means that the encoding can be one word larger than
maximally required by 'precision'.

+  wide_int force_to_size (unsigned int precision, SignOp sgn) const;
+  inline wide_int force_to_size (enum machine_mode mode, SignOp sgn) const;
+  inline wide_int force_to_size (const_tree type) const;
+  inline wide_int force_to_size (const_tree type, SignOp sgn) const;
+
+  inline wide_int sforce_to_size (enum machine_mode mode) const;
+  inline wide_int sforce_to_size (const_tree type) const;
+  inline wide_int zforce_to_size (enum machine_mode mode) const;
+  inline wide_int zforce_to_size (const_tree type) const;

too many overloads again

Looking at

+template <typename T>
+wide_int
+wide_int::operator & (T c) const
+{
+  wide_int result;
+  HOST_WIDE_INT ws[WIDE_INT_MAX_ELTS];
+  const HOST_WIDE_INT *s;
+  int cl;
+
+  s = to_shwi2 (ws, &cl, c);
+

I see

+  template <typename T>
+  static inline const HOST_WIDE_INT* to_shwi2 (HOST_WIDE_INT *s, int *l, T x)
+  {
+    s[0] = x;
+    if (~(T)0 < (T)0
...

+  static inline const HOST_WIDE_INT* to_shwi2 (HOST_WIDE_INT *s
ATTRIBUTE_UNUSED,
+                                       int *l, const wide_int &y)
+  {
+    *l = y.len;
...

I think you want to use template specialization here, not overloading.  Thus,

   template <>
   static inline const HOST_WIDE_INT* to_shwi2 (HOST_WIDE_INT *s, int
*l, const wide_int &y)

and adjust the HWI case to look like

   template <typename T>
   static inline const HOST_WIDE_INT* to_shwi2 (HOST_WIDE_INT *s, int
*l, const T& x)

you want to move that ~(T)0 < (T)0 check up to template substitution level
to allow SFINAE to disable the template if that's not computable.
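A standalone sketch of that shape, at namespace scope (illustrative only: the helper name is made up, the SFINAE part is omitted, and it assumes the wide_int members are accessible):

/* One primary template handles host integer types ...  */
template <typename T>
const HOST_WIDE_INT *
to_limbs (HOST_WIDE_INT *buf, int *len, const T &x)
{
  buf[0] = x;
  *len = 1;
  /* An unsigned value as wide as a HWI with its top bit set needs an
     extra zero limb to stay non-negative in the encoding.  */
  if (!(~(T) 0 < (T) 0)
      && sizeof (T) >= sizeof (HOST_WIDE_INT)
      && buf[0] < 0)
    {
      buf[1] = 0;
      *len = 2;
    }
  return buf;
}

/* ... and a full specialization, not an overload, handles wide_int.  */
template <>
const HOST_WIDE_INT *
to_limbs<wide_int> (HOST_WIDE_INT *, int *len, const wide_int &y)
{
  *len = y.len;
  return y.val;
}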

+    s[0] = x;
+    if (~(T)0 < (T)0
+       || sizeof (T) < sizeof (HOST_WIDE_INT))
+      {
+       *l = 1;
+      }
+    else
+      {
+       s[1] = 0;
+       *l = 2;

that's only required if the MSB is set, otherwise it's not properly compressed

+      }
+    return s;

hmm, looking at from_uhwi I see that you are adding the extra zero words
as I suggested ... which means you _do_ have reliable sign information
available (well, not for the case where len would end up bigger than MAX_LEN).
So - can we simplify some things with that?  I can think of the ordered
comparisons for example.  Also the encoding description should be
clarified that there _is_ sign information available (and that len can be
bigger than precision / HOST_BITS_PER_WIDE_INT).
I did all of this because it makes some things easier and other things easier to talk about. In particular, it allows the binary operators to work properly with constants (thanks for the trick).

I did not do this because I have any interest in changing the compiler so that its math is signed infinite precision; I strongly do not want to go there. If you want to take my patches later and experiment with this, fine. I do not want to be the person who drives people to LLVM because they all of a sudden start getting weird (but conforming to the C and C++ specs) results out of the optimizer.

In particular, others in the community consider it non-conforming GIMPLE to look at a value beyond the precision. See the posting by iant about my question on 11/05/2012.




Returning to to_shwi2 (odd name, btw):

+       /* Copy the result because it does not fit in two HWIs.  */
+       s[0] = TREE_INT_CST_LOW (tcst);
+       s[1] = TREE_INT_CST_HIGH (tcst);
+       s[2] = 0;
+       *l = 3;
+       return s;

ah, this is why you need the extra argument ;)  Btw, what happens
when a proper encoding does not fit in MAX_LEN words?  That is,
we'd need an extra zero word?
This is a hack that will die when the tree cst has a variable-length array. When that happens, this code becomes an unconditional pointer copy.

+  /* There should logically be an overload for rtl here, but it cannot
+     be here because of circular include issues.  It is in rtl.h.  */
+  static inline const HOST_WIDE_INT* to_shwi2
+    (HOST_WIDE_INT *s ATTRIBUTE_UNUSED, int *l, rtx rcst);

in rtl.c I suppose.  Btw, this is why I'm suggesting to defer implementation
to a template - you can implement that outside of wide-int.h even without
declaring the specialization here.

Is there an accessible branch with the wide-int work?  I may be tempted
to propose some changes in form of patches.  If not then I think the
work is in a state that is suitable to put on a merge-branch.
I submitted all of the RTL work in the other patch, along with the first of the tree patches. What I have is badly rotted, so I was going to wait until I had a stable API to clean it up and post everything.

+wide_int
+wide_int::add (const wide_int &op1, SignOp sgn, bool *overflow) const
+{
...
+  /* Uncompress the rest.  */
+  for (i = len; i < op1.len; i++)
+    {
+      o0 = val[i];
+      o1 = mask1;
+      x = o0 + o1 + carry;
+      result.val[i] = x;
+      old_carry = carry;
+      carry = x < o0;
+    }
+  for (i = len; i < op1.len; i++)
+    {
+      o0 = mask0;
+      o1 = op1.val[i];
+      x = o0 + o1 + carry;
+      result.val[i] = x;
+      old_carry = carry;
+      carry = x < o0;
+    }

Cut & paste error here I think.  At least one should run up to this->len, no?
No, if len is 3, the top one you can look at is 0.
And the first loop should run to min (len, op1.len) only.  How much
testing coverage does the code have for "interesting" values?
A standalone testing program will probably reveal such issues (due to
the rich-ness of the current API covering everything will be hard :/)

Looking at

+HOST_WIDE_INT
+wide_int::sign_mask () const
+{
+  int i = len - 1;
+  if (precision < HOST_BITS_PER_WIDE_INT)
+    return ((val[0] << (HOST_BITS_PER_WIDE_INT - precision))
+           >> (HOST_BITS_PER_WIDE_INT - 1));
+
+  /* VRP appears to be badly broken and this is a very ugly fix.  */
+  if (i >= 0)
+    return val[i] >> (HOST_BITS_PER_WIDE_INT - 1);
+
+  gcc_unreachable ();
+#if 0
+  return val[len - 1] >> (HOST_BITS_PER_WIDE_INT - 1);
+#endif
+}

again - this looks like it does overly complicated work.  Our encoding
of values guarantees that the following is correct:

HOST_WIDE_INT
wide_int::sign_mask () const
{
   if (val[len - 1] < 0)
     return -1;
   else
     return 0;
}

I suppose quite some code can be simplified that calls sign_mask
currently (thanks to the changed encoding).

back to ::add ():

+  small_prec = precision & (HOST_BITS_PER_WIDE_INT - 1);
+  if (small_prec == 0)
+    {
+      if (sgn == wide_int::SIGNED)
+       {
+         if (((!(o0 ^ o1)) & (x ^ o0)) >> (HOST_BITS_PER_WIDE_INT - 1))
+           *overflow = true;

shouldn't this share code with the non-overflow add_large and simply
do overflow detection conditional on overflow being non-NULL?
Because the add_large code seems to be correct with respect to
the issues noted above ...
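For reference, the usual top-word check for signed addition overflow, as a tiny standalone sketch (the standard idiom uses the bitwise complement of the XOR):

/* X = A + B overflows as a signed addition iff A and B have the same
   sign and X has the opposite sign.  */
static bool
signed_add_overflows (HOST_WIDE_INT a, HOST_WIDE_INT b, HOST_WIDE_INT x)
{
  return ((~(a ^ b)) & (x ^ a)) < 0;
}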

+ ex:
+  result.canonize ();
+  return result;
+}

canonizing is not strictly necessary - in fact it is unlikely to do
something useful most of the time.  I'd say we should only
canonize before building a tree or RTL with that representation.
Note that canonize is wrong, too:

+wide_int::canonize ()
+{
+  int small_prec = precision & (HOST_BITS_PER_WIDE_INT - 1);
+  int blocks_needed = BLOCKS_NEEDED (precision);
+  HOST_WIDE_INT top;
+  int i;
+
+  if (len > blocks_needed)
+    len = blocks_needed;

that will set a UHWI -1 to have len == 1, ignoring the extra needed
zero word.  Thus, either BLOCKS_NEEDED is broken or its
uses need to be audited.

+
+  /* Clean up the top bits for any mode that is not a multiple of a HWI.  */
+  if (len == blocks_needed && small_prec)
+    val[len - 1] = sext_hwi (val[len - 1], small_prec);

That's not necessary.  By construction all arithmetic operations will
preserve proper extension.  And you unconditionally sign-extend
small precision "large" positive constants here which is even wrong
(-1U, precision == 32, HWI == 64, original rep { 0x00..00fffff }, broken
{ 0xfff...fff } ).

+  if (len == 1)
+    return;

in fact for len == 1 we should do nothing in this function from the start.
Callers that also want to properly sign or zero extend the result value
should do so using ext (), not abuse another implementation inside
canonize () (which should only optimize the encoding).

Ok, I'll stop reviewing here.  It's a lot better than before - thanks for
doing the changes.

I'd still like you to cut down the interface somewhat and _test_ what
is remaining.  Put the code on a branch off trunk so I and others can
produce patches against it.

Thanks,
Richard.

