Currently, gcc calls the (long and slow) division routines for 128-bit integers twice when both the quotient and the remainder are needed. For the other integer types, this is optimized. Take this test case:

$ cat digsum.c
#include <stdio.h>
#include <x86intrin.h>

typedef __uint128_t myint;

unsigned long digsum1 (myint x)
{
  unsigned long ret;

  if (x == 0)
    return 0;

  ret = 0;
  while (x > 0)
    {
      ret = ret + x % 10;
      x = x / 10;
    }
  return ret;
}

unsigned long digsum2 (myint x)
{
  unsigned long ret;
  myint tmp;

  if (x == 0)
    return 0;

  ret = 0;
  while (x > 0)
    {
      tmp = x / 10;
      ret = ret + (x - tmp * 10);
      x = tmp;
    }
  return ret;
}

#define NUM 1000000

int main()
{
  myint x;
  unsigned long sum;
  long int t1, t2;
  __uint128_t from, to;

  from = 1;
  from = (from << 93) - NUM/2;
  to = from + NUM;

  sum = 0;
  t1 = __rdtsc();
  for (x=from; x<to; x++)
    sum = sum + digsum1(x);
  t2 = __rdtsc();
  printf ("digsum1:\nsum = %lu\n", sum);
  printf ("cycles per sum = %.2f\n\n", (double) (t2-t1)/NUM);

  sum = 0;
  t1 = __rdtsc();
  for (x=from; x<to; x++)
    sum = sum + digsum2(x);
  t2 = __rdtsc();
  printf ("digsum2:\nsum = %lu\n", sum);
  printf ("cycles per sum = %.2f\n", (double) (t2-t1)/NUM);
  return 0;
}

"As is", this gives on my machine

$ gcc -O3 digsum.c
$ ./a.out
digsum1:
sum = 113493792
cycles per sum = 2021.68

digsum2:
sum = 113493792
cycles per sum = 1025.47

(similar timings if a signed type is used). This also affects Fortran I/O.
And here is a version which uses two 64-bit numbers for the calculation of the sum of decimal digits, as a benchmark for the division and modulo:

unsigned long digsum3 (myint x)
{
  unsigned long ret;
  __uint64_t high, low;
  const unsigned long int rem_high[10] = {0,6,2,8,4,0,6,2,8,4};
  const unsigned long int foo_high[10] =
    {0x0000000000000000, 0x1999999999999999, 0x3333333333333333,
     0x4CCCCCCCCCCCCCCC, 0x6666666666666666, 0x8000000000000000,
     0x9999999999999999, 0xB333333333333333, 0xCCCCCCCCCCCCCCCC,
     0xE666666666666666 };

  high = x >> 64;
  low = x;
  ret = 0;
  while (low > 0 || high > 0)
    {
      unsigned long r_high, r_low, r_sum, r_carry;
      r_high = high % 10;
      r_carry = rem_high[r_high];
      high = high / 10;
      r_low = low % 10;
      low = low / 10;
      low = low + foo_high[r_high];
      r_sum = r_low + r_carry;
      if (r_sum >= 10)
        {
          r_sum = r_sum - 10;
          low ++;
        }
      ret = ret + r_sum;
    }
  return ret;
}

It is _much_ faster, taking around 250 to 260 cycles per calculation, a speedup of a factor of around 8 versus the original code.
We have the divmod discovery in tree-ssa-math-opts.c, and it will handle this case if the division/modulo is by the same variable. But for constants it punts:

  /* Disable the transform if either is a constant, since division-by-constant
     may have specialized expansion.  */
  if (CONSTANT_CLASS_P (op1) || CONSTANT_CLASS_P (op2))
    return false;

So, perhaps we want to let it through if CONSTANT_CLASS_P (op2) (but not op1) and some further conditions are met, e.g. that op2 is not a power of two, and either the type's precision is > HOST_BITS_PER_WIDE_INT (expand_divmod punts on that, as choose_multiplier etc. work on HOST_WIDE_INTs), or choose_multiplier computed for the constant divisor shows that the pre or post shift needs BITS_PER_WORD or more bits. Or optimize into DIVMOD always, even for constant op2, and when trying to expand the DIVMOD ifn with constant op2, see if the target can expand division by that constant using just shifts and +/-; if so, emit it as that division + subtraction, otherwise throw away the division expansion and emit a divmod.
Created attachment 49307 [details]
gcc11-pr97282.patch

Untested fix.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:bf510679bb3f9bfd6019666065016bb26a5b5466

commit r11-3671-gbf510679bb3f9bfd6019666065016bb26a5b5466
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Oct 6 10:32:22 2020 +0200

    divmod: Match and expand DIVMOD even in some cases of constant divisor [PR97282]

    As written in the comment, tree-ssa-math-opts.c wouldn't create a DIVMOD
    ifn call for division + modulo by constant for the fear that during
    expansion we could generate better code for those cases.  If the divisor
    is a power of two, that is certainly the case always, but otherwise
    expand_divmod can punt in many cases, e.g. if the division type's
    precision is above HOST_BITS_PER_WIDE_INT, we don't even call
    choose_multiplier, because it works on HOST_WIDE_INTs (true, something
    we should fix eventually now that we have wide_ints), or if pre/post
    shift is larger than BITS_PER_WORD.

    So, the following patch recognizes DIVMOD with constant last argument
    even when it is unclear if expand_divmod will be able to optimize it,
    and then during DIVMOD expansion if the divisor is constant attempts to
    expand it as division + modulo and if they actually don't contain any
    libcalls or division/modulo, they are kept as is, otherwise that
    sequence is thrown away and divmod optab or libcall is used.

    2020-10-06  Jakub Jelinek  <jakub@redhat.com>

            PR rtl-optimization/97282
            * tree-ssa-math-opts.c (divmod_candidate_p): Don't return false
            for constant op2 if it is not a power of two and the type has
            precision larger than HOST_BITS_PER_WIDE_INT or BITS_PER_WORD.
            * internal-fn.c (contains_call_div_mod): New function.
            (expand_DIVMOD): If last argument is a constant, try to expand
            it as TRUNC_DIV_EXPR followed by TRUNC_MOD_EXPR, but if the
            sequence contains any calls or {,U}{DIV,MOD} rtxes, throw it
            away and use divmod optab or divmod libfunc.

            * gcc.target/i386/pr97282.c: New test.
All fixed for GCC 11 by the patches for PR 97459.