Bug 53100

Summary: Optimize __int128 with range information
Product: gcc
Component: tree-optimization
Status: NEW
Severity: enhancement
Priority: P3
Version: 4.8.0
Target Milestone: ---
Keywords: missed-optimization
Reporter: Marc Glisse <marc.glisse>
Assignee: Not yet assigned to anyone <unassigned>
Last reconfirmed: 2023-07-21 00:00:00

Description Marc Glisse 2012-04-24 07:07:36 UTC
(not sure about the "component" field)

In the following program, on x86_64, the first version generates two imulq instructions, while the second generates four imulq and two mulq. It would be convenient if I could just write the whole code with __int128 and let the compiler do the optimization by tracking the range of the numbers.

int f(int a,int b,int c,int d,int e,int f){
#if 0
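  // long version: each product is a single widening multiplication
  // (two imulq in total)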
  long x=a;
  long y=b;
  long z=c;
  long t=d;
  long u=e;
  long v=f;
  return (z-x)*(__int128)(v-y) < (u-x)*(__int128)(t-y);
#else
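  // __int128 version: full 128-bit products (4 imulq and 2 mulq)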
  __int128 x=a;
  __int128 y=b;
  __int128 z=c;
  __int128 t=d;
  __int128 u=e;
  __int128 v=f;
  return (z-x)*(v-y) < (u-x)*(t-y);
#endif
}
Comment 1 Marc Glisse 2012-04-29 08:05:59 UTC
(In reply to comment #0)
> It would be convenient if I
> could just write the whole code with __int128 and let the compiler do the
> optimization by tracking the range of numbers.

The transformation from an __int128 to a pair of longs happens extremely late (optabs.c), so we can't count on tree-vrp to notice that one of them is always zero (actually it is either 0 or -1, being a sign extension, which would make this hard). On the other hand, tree-vrp does have the information that the differences are in [-4294967295, 4294967295], which comfortably fits in a type half the size of __int128. A possible strategy would be to have tree-vrp mark variables that fit in a type half their size (only for TImode?), try to preserve that information along the way, and finally use it in expand_doubleword_mult. But that seems to imply storing the information in an rtx, and rtx seems a bit too densely packed to add this.
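To spell out where that bound comes from (a standalone check in plain C11, assuming the usual x86_64 model of 32-bit int and 64-bit long):

#include <limits.h>

/* The differences subtract two sign-extended ints, so they lie in
   [INT_MIN - INT_MAX, INT_MAX - INT_MIN] = [-4294967295, 4294967295]:
   a 33-bit signed range, far inside what a 64-bit long can hold.  */
_Static_assert((long)INT_MAX - (long)INT_MIN == 4294967295L, "33-bit range");
_Static_assert((long)INT_MIN - (long)INT_MAX == -4294967295L, "33-bit range");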

Better ideas?
Comment 2 Marc Glisse 2012-04-29 08:42:36 UTC
(In reply to comment #1)
> On the other hand, tree-vrp does have the information that the
> differences are in [-4294967295, 4294967295], which comfortably fits in a type
> half the size of __int128. It seems a possible strategy would be to have
> tree-vrp mark variables that fit in a type half their size (only for TImode?),
> try and preserve that information along the way, and finally use it in
> expand_doubleword_mult.

Another possibility would be, when the range analysis detects this situation, to have it introduce a double cast: (__int128)(long)var. In the example here, that would give:

((__int128)(long)((__int128)c-(__int128)a))*((__int128)(long)((__int128)f-(__int128)b))

and existing optimizations already handle:

(long)((__int128)c-(__int128)a) as (long)c-(long)a

and

(__int128)mylong1*(__int128)mylong2 as a widening multiplication.

But then we'd have to be careful not to introduce too many such casts, not to introduce them too late, and not to introduce them just before an optimization that removes them. We would also have to find the appropriate half-sized type to cast to, and possibly do this only for modes not handled natively.
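Written out by hand, the transformed test case would look like this (a sketch; f2 is just an illustrative name, and this is not compiler output — as the next comment notes, hand-written double casts do not always survive early folding):

/* The (__int128)(long) double casts encode the range fact that each
   difference fits in a long, so the existing narrowing and
   widening-multiplication patterns can apply to each product.  */
int f2(int a,int b,int c,int d,int e,int f){
  __int128 x=a, y=b, z=c, t=d, u=e, v=f;
  return (__int128)(long)(z-x) * (__int128)(long)(v-y)
       < (__int128)(long)(u-x) * (__int128)(long)(t-y);
}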
Comment 3 Marc Glisse 2012-05-01 12:47:03 UTC
(In reply to comment #2)
> and not to introduce them just before an optimization that removes them.

Usually, writing (long)num1*(__int128)(long)num2 does the right thing. In the example here, I tried replacing the plain __int128 multiplications with:

#include <limits.h> /* LONG_MAX, LONG_MIN */

// does x fit in a long?
inline bool g1(__int128 x){
  //return(x<=LONG_MAX)&&(x>=LONG_MIN);
  //on 2 lines because of PR30318, unless you apply the patch I posted there
  bool b1 = x<=LONG_MAX;
  bool b2 = x>=LONG_MIN;
  return b1&&b2;
}
// multiply, using a single widening multiplication when the compiler
// can prove both operands fit in a long
inline __int128 mul(__int128 a,__int128 b){
  bool B=g1(a)&&g1(b);
  if(__builtin_constant_p(B)&&B)
    return (long)a*(__int128)(long)b;
  return a*b;
}

__builtin_constant_p does detect that we are in the right case; however, because of bad timing between the various optimizations, the double cast (__int128)(long)(u-x) is simplified to just (u-x) before it gets a chance to help. I need to replace the subtraction instead of (or in addition to) the multiplication:

// subtract, narrowing to long when the compiler can prove the
// operands and the result all fit in a long
inline __int128 sub(__int128 a,__int128 b){
  bool B=g1(a)&&g1(b)&&g1(a-b);
  if(__builtin_constant_p(B)&&B)
    return (long)a-(long)b;
  return a-b;
}

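The test case then reads (a sketch using the helpers above; f3 is just an illustrative name):

int f3(int a,int b,int c,int d,int e,int f){
  __int128 x=a, y=b, z=c, t=d, u=e, v=f;
  // sub() narrows each difference to long, so mul() can prove its
  // operands fit and emit one widening multiplication per product
  return mul(sub(z,x),sub(v,y)) < mul(sub(u,x),sub(t,y));
}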
But it would fit better inside the compiler than as a fragile use of __builtin_constant_p.
Comment 4 Andrew Pinski 2021-08-06 00:33:27 UTC
So for the long version (the #if 0 branch, when enabled) we get:
  x_13 = (long int) a_12(D);
  y_15 = (long int) b_14(D);
  z_17 = (long int) c_16(D);
  t_19 = (long int) d_18(D);
  u_21 = (long int) e_20(D);
  v_23 = (long int) f_22(D);
  _1 = z_17 - x_13;
  _3 = v_23 - y_15;
  _5 = _1 w* _3;

Notice the w*: a WIDEN_MULT_EXPR, i.e. a single widening multiplication.

While with the original __int128 version we get:
  x_9 = (__int128) a_8(D);
  y_11 = (__int128) b_10(D);
  z_13 = (__int128) c_12(D);
  t_15 = (__int128) d_14(D);
  u_17 = (__int128) e_16(D);
  v_19 = (__int128) f_18(D);
  _1 = z_13 - x_9;
  _2 = v_19 - y_11;
  _3 = _1 * _2;

If we had simplified/shortened:
  ((__int128) c_12(D)) - ((__int128) a_8(D))
to
  (__int128)(long)(((unsigned long) c_12(D)) - ((unsigned long) a_8(D)))
we might have optimized this.
There might be another bug about introducing PLUS_EXPRs with wrapping semantics rather than undefined overflow, which would help with the casting issue.
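In source form, the proposed fold corresponds to something like this sketch (narrow_sub is a hypothetical name):

/* Do the subtraction in unsigned long, where wraparound is well
   defined, then sign-extend the 64-bit result.  For int inputs the
   true difference fits in 64 bits, so this is value-equivalent to
   the full __int128 subtraction.  */
static __int128 narrow_sub(int c, int a) {
  return (__int128)(long)((unsigned long)c - (unsigned long)a);
}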
Comment 5 Andrew Pinski 2023-07-21 23:32:57 UTC
/* Narrow (T)@0 + (T)@1 to a cheaper intermediate type when one exists.  */
(simplify
 (plus (convert @0) (convert @1))
 (if (INTEGRAL_TYPE_P (type)
      && types_match (TREE_TYPE (@0), TREE_TYPE (@1))
      && element_precision (TREE_TYPE (@0)) < element_precision (type))
  (with { tree ntype = find_next_best_type (TREE_TYPE (@0)); }
   (if (ntype
        && !types_match (ntype, type)
        && element_precision (ntype) < element_precision (type))
    (convert (plus (convert:ntype @0) (convert:ntype @1)))
   )
  )
 )
)

Where find_next_best_type checks int, long, long long types ...
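For illustration, it could look something like the sketch below (hypothetical code: TYPE_PRECISION, TYPE_SIGN, unsigned_type_for, ARRAY_SIZE and the *_type_node globals are real GCC internals, but this function does not exist in the tree):

/* Hypothetical helper: return the smallest standard integer type
   (int, long, long long) wide enough to hold TYPE, with matching
   signedness, or NULL_TREE if even long long is too narrow.  */
static tree
find_next_best_type (tree type)
{
  tree candidates[] = { integer_type_node, long_integer_type_node,
                        long_long_integer_type_node };
  for (unsigned i = 0; i < ARRAY_SIZE (candidates); i++)
    if (TYPE_PRECISION (candidates[i]) >= TYPE_PRECISION (type))
      return TYPE_SIGN (type) == UNSIGNED
             ? unsigned_type_for (candidates[i])
             : candidates[i];
  return NULL_TREE;
}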