Bug 53100 - Optimize __int128 with range information
Summary: Optimize __int128 with range information
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.8.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-04-24 07:07 UTC by Marc Glisse
Modified: 2012-05-01 12:47 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Description Marc Glisse 2012-04-24 07:07:36 UTC
(not sure about the "component" field)

In the following program, on x86_64, the first version generates two imulq instructions, while the second generates four imulq and two mulq. It would be convenient if I could just write the whole code with __int128 and let the compiler do the optimization by tracking the range of the numbers involved.

int f(int a,int b,int c,int d,int e,int f){
#if 0
  long x=a;
  long y=b;
  long z=c;
  long t=d;
  long u=e;
  long v=f;
  return (z-x)*(__int128)(v-y) < (u-x)*(__int128)(t-y);
#else
  __int128 x=a;
  __int128 y=b;
  __int128 z=c;
  __int128 t=d;
  __int128 u=e;
  __int128 v=f;
  return (z-x)*(v-y) < (u-x)*(t-y);
#endif
}
Comment 1 Marc Glisse 2012-04-29 08:05:59 UTC
(In reply to comment #0)
> It would be convenient if I
> could just write the whole code with __int128 and let the compiler do the
> optimization by tracking the range of numbers.

The transformation from an __int128 to a pair of long happens extremely late (optabs.c), so we can't count on tree-vrp to notice that one of them is always zero (and actually it is either 0 or -1, as a sign extension, which would make this hard). On the other hand, tree-vrp does have the information that the differences are in [-4294967295, 4294967295], which comfortably fits in a type half the size of __int128. It seems a possible strategy would be to have tree-vrp mark variables that fit in a type half their size (only for TImode?), try and preserve that information along the way, and finally use it in expand_doubleword_mult. But that seems to imply storing the information in an rtx, and rtx seems a bit too densely packed to add this.

Better ideas?
Comment 2 Marc Glisse 2012-04-29 08:42:36 UTC
(In reply to comment #1)
> On the other hand, tree-vrp does have the information that the
> differences are in [-4294967295, 4294967295], which comfortably fits in a type
> half the size of __int128. It seems a possible strategy would be to have
> tree-vrp mark variables that fit in a type half their size (only for TImode?),
> try and preserve that information along the way, and finally use it in
> expand_doubleword_mult.

Another possibility would be, when the range analysis detects this situation, to have it introduce a double cast: (__int128)(long)var. In the example here, it would give:

((__int128)(long)((__int128)c-(__int128)a))*((__int128)(long)((__int128)f-(__int128)b))

and existing optimizations already handle:

(long)((__int128)c-(__int128)a) as (long)c-(long)a

and

(__int128)mylong1*(__int128)mylong2 as a widening multiplication.

But then we'd have to be careful not to introduce too many such casts, not to introduce them too late, and not to introduce them just before an optimization that removes them. We would also need to find the appropriate half-sized type to cast to, and possibly do this only for modes not handled natively.
Comment 3 Marc Glisse 2012-05-01 12:47:03 UTC
(In reply to comment #2)
> and not to introduce them just before an optimization that removes them.

Usually, doing (long)num1*(__int128)(long)num2 does the right thing. In the example here, I tried replacing the plain __int128 multiplications with:

#include <limits.h> // for LONG_MAX and LONG_MIN

inline bool g1(__int128 x){
  //return (x<=LONG_MAX)&&(x>=LONG_MIN);
  //on 2 lines because of PR30318, unless you apply the patch I posted there
  bool b1 = x<=LONG_MAX;
  bool b2 = x>=LONG_MIN;
  return b1&&b2;
}
inline __int128 mul(__int128 a,__int128 b){
  bool B=g1(a)&&g1(b);
  if(__builtin_constant_p(B)&&B)
    return (long)a*(__int128)(long)b;
  return a*b;
}

__builtin_constant_p does detect that we are in the right case; however, because of bad timing between the various optimization passes, the double cast (__int128)(long)(u-x) is simplified to just (u-x) before it gets a chance to help. I need to replace the subtraction instead of (or in addition to) the multiplication:

inline __int128 sub(__int128 a,__int128 b){
  bool B=g1(a)&&g1(b)&& g1(a-b);
  if(__builtin_constant_p(B)&&B)
    return (long)a-(long)b;
  return a-b;
}

But this transformation would fit better inside the compiler than in user code as a fragile use of __builtin_constant_p.