Currently, gcc calls the (long and slow) division routines for 128-bit integers twice when both the quotient and the remainder are needed. For the other integer types, this is optimized. Take this test case:

$ cat digsum.c
#include <stdio.h>
#include <x86intrin.h>

typedef __uint128_t myint;

unsigned long digsum1 (myint x)
{
  unsigned long ret;

  if (x == 0)
    return 0;

  ret = 0;
  while (x > 0)
    {
      ret = ret + x % 10;
      x = x / 10;
    }
  return ret;
}

unsigned long digsum2 (myint x)
{
  unsigned long ret;
  myint tmp;

  if (x == 0)
    return 0;

  ret = 0;
  while (x > 0)
    {
      tmp = x / 10;
      ret = ret + (x - tmp * 10);
      x = tmp;
    }
  return ret;
}

#define NUM 1000000

int main()
{
  myint x;
  unsigned long sum;
  long int t1, t2;
  __uint128_t from, to;

  from = 1;
  from = (from << 93) - NUM/2;
  to = from + NUM;

  sum = 0;
  t1 = __rdtsc();
  for (x=from; x<to; x++)
    sum = sum + digsum1(x);
  t2 = __rdtsc();
  printf ("digsum1:\nsum = %lu\n", sum);
  printf ("cycles per sum = %.2f\n\n", (double) (t2-t1)/NUM);

  sum = 0;
  t1 = __rdtsc();
  for (x=from; x<to; x++)
    sum = sum + digsum2(x);
  t2 = __rdtsc();
  printf ("digsum2:\nsum = %lu\n", sum);
  printf ("cycles per sum = %.2f\n", (double) (t2-t1)/NUM);
  return 0;
}

"As is", this gives on my machine

$ gcc -O3 digsum.c
$ ./a.out
digsum1:
sum = 113493792
cycles per sum = 2021.68

digsum2:
sum = 113493792
cycles per sum = 1025.47

(similar timings if a signed type is used). This also affects Fortran I/O.
And here is a version which uses two 64-bit numbers for the calculation of the sum of decimal digits, as a benchmark for the division and modulo:

unsigned long digsum3 (myint x)
{
  unsigned long ret;
  __uint64_t high, low;
  const unsigned long int rem_high[10] = {0,6,2,8,4,0,6,2,8,4};
  const unsigned long int foo_high[10] =
    {0x0000000000000000, 0x1999999999999999, 0x3333333333333333,
     0x4CCCCCCCCCCCCCCC, 0x6666666666666666, 0x8000000000000000,
     0x9999999999999999, 0xB333333333333333, 0xCCCCCCCCCCCCCCCC,
     0xE666666666666666 };

  high = x >> 64;
  low = x;
  ret = 0;
  while (low > 0 || high > 0)
    {
      unsigned long r_high, r_low, r_sum, r_carry;
      r_high = high % 10;
      r_carry = rem_high[r_high];
      high = high / 10;
      r_low = low % 10;
      low = low / 10;
      low = low + foo_high[r_high];
      r_sum = r_low + r_carry;
      if (r_sum >= 10)
        {
          r_sum = r_sum - 10;
          low ++;
        }
      ret = ret + r_sum;
    }
  return ret;
}

It is _much_ faster, taking around 250 to 260 cycles per calculation, a speedup of a factor of around 8 versus the original code.
We have the divmod discovery in tree-ssa-math-opts.c, and it will handle this case if the division/modulo is by the same variable. But for constants it punts:

  /* Disable the transform if either is a constant, since division-by-constant
     may have specialized expansion.  */
  if (CONSTANT_CLASS_P (op1) || CONSTANT_CLASS_P (op2))
    return false;

So, perhaps we want to let it through if CONSTANT_CLASS_P (op2) (but not op1) and some further conditions are met, e.g. that op2 is not a power of two, and either the type's precision is > HOST_BITS_PER_WIDE_INT (expand_divmod punts on that, as choose_multiplier etc. work on HOST_WIDE_INTs), or choose_multiplier computed for the constant divisor shows that the pre or post shift needs BITS_PER_WORD or more bits. Or optimize into DIVMOD always, even for constant op2, and when trying to expand the DIVMOD ifn with constant op2, see if the target can expand division by that constant using just shifts and +/-; if so, emit it as that division + subtraction, otherwise throw away the division expansion and emit a divmod.
Created attachment 49307 [details]
gcc11-pr97282.patch

Untested fix.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:bf510679bb3f9bfd6019666065016bb26a5b5466

commit r11-3671-gbf510679bb3f9bfd6019666065016bb26a5b5466
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Oct 6 10:32:22 2020 +0200

    divmod: Match and expand DIVMOD even in some cases of constant divisor [PR97282]

    As written in the comment, tree-ssa-math-opts.c wouldn't create a DIVMOD
    ifn call for division + modulo by constant for the fear that during
    expansion we could generate better code for those cases.  If the divisor
    is a power of two, that is certainly the case always, but otherwise
    expand_divmod can punt in many cases, e.g. if the division type's
    precision is above HOST_BITS_PER_WIDE_INT, we don't even call
    choose_multiplier, because it works on HOST_WIDE_INTs (true, something
    we should fix eventually now that we have wide_ints), or if pre/post
    shift is larger than BITS_PER_WORD.

    So, the following patch recognizes DIVMOD with constant last argument
    even when it is unclear if expand_divmod will be able to optimize it,
    and then during DIVMOD expansion if the divisor is constant attempts to
    expand it as division + modulo and if they actually don't contain any
    libcalls or division/modulo, they are kept as is, otherwise that
    sequence is thrown away and divmod optab or libcall is used.

    2020-10-06  Jakub Jelinek  <jakub@redhat.com>

            PR rtl-optimization/97282
            * tree-ssa-math-opts.c (divmod_candidate_p): Don't return false
            for constant op2 if it is not a power of two and the type has
            precision larger than HOST_BITS_PER_WIDE_INT or BITS_PER_WORD.
            * internal-fn.c (contains_call_div_mod): New function.
            (expand_DIVMOD): If last argument is a constant, try to expand
            it as TRUNC_DIV_EXPR followed by TRUNC_MOD_EXPR, but if the
            sequence contains any calls or {,U}{DIV,MOD} rtxes, throw it
            away and use divmod optab or divmod libfunc.

            * gcc.target/i386/pr97282.c: New test.
All fixed for GCC 11 by the patches for PR 97459.