Bug 111209 - GCC fails to recognize the adc pattern its documentation describes
Summary: GCC fails to recognize the adc pattern its documentation describes
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 14.0
Importance: P3 normal
Target Milestone: ---
Assignee: Jakub Jelinek
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-28 06:47 UTC by cqwrteur
Modified: 2023-08-29 08:47 UTC
CC: 2 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-08-28 00:00:00


Attachments
bug example (272 bytes, text/plain)
2023-08-28 06:47 UTC, cqwrteur
gcc14-pr111209.patch (2.41 KB, patch)
2023-08-28 18:58 UTC, Jakub Jelinek

Description cqwrteur 2023-08-28 06:47:25 UTC
Created attachment 55805 [details]
bug example

template<typename T>
inline constexpr T addc(T a,T b,bool carryin,bool& carryout) noexcept
{
	T s;
	auto c1 = __builtin_add_overflow(a, b, __builtin_addressof(s));
	auto c2 = __builtin_add_overflow(s, carryin, __builtin_addressof(s));
	carryout = c1 | c2;
	return s;
}

void test_addc(unsigned long long* a,unsigned long long* b,unsigned long long* r)
{
    bool carry{};
    r[0]=addc(a[0],b[0],carry,carry);
    r[1]=addc(a[1],b[1],carry,carry);
}

I copied the example from the GCC documentation:

https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html

But GCC fails to recognize the pattern, even though the code matches what the documentation describes.
gcc:
https://godbolt.org/z/Whaaahn41

clang:
https://godbolt.org/z/Ma6rvaYd6

Looks like a bug here.
Comment 1 Jakub Jelinek 2023-08-28 18:56:20 UTC
Just use __int128 addition if all you want is double-word addition (or long long for 32-bit arches)?
Comment 2 Jakub Jelinek 2023-08-28 18:58:58 UTC
Created attachment 55809 [details]
gcc14-pr111209.patch

Anyway, here is a patch that makes it match, but it is getting ugly to avoid making it match prematurely and break other matching.
Comment 3 cqwrteur 2023-08-28 19:02:23 UTC
(In reply to Jakub Jelinek from comment #1)
> Just use __int128 addition if all you want is double-word addition (or long
> long for 32-bit arches)?

Well, I've presented this merely as an illustrative example. The length can actually be arbitrary. I've directly taken the code from the GCC documentation, but it doesn't appear to perform as the document asserts.

"
Built-in Function: unsigned int __builtin_addc (unsigned int a, unsigned int b, unsigned int carry_in, unsigned int *carry_out)
Built-in Function: unsigned long int __builtin_addcl (unsigned long int a, unsigned long int b, unsigned int carry_in, unsigned long int *carry_out)
Built-in Function: unsigned long long int __builtin_addcll (unsigned long long int a, unsigned long long int b, unsigned long long int carry_in, unsigned long long int *carry_out)
These built-in functions are equivalent to:

  ({ __typeof__ (a) s; \
      __typeof__ (a) c1 = __builtin_add_overflow (a, b, &s); \
      __typeof__ (a) c2 = __builtin_add_overflow (s, carry_in, &s); \
      *(carry_out) = c1 | c2; \
      s; })
i.e. they add 3 unsigned values, set what the last argument points to to 1 if any of the two additions overflowed (otherwise 0) and return the sum of those 3 unsigned values. Note, while all the first 3 arguments can have arbitrary values, better code will be emitted if one of them (preferrably the third one) has only values 0 or 1 (i.e. carry-in).
"

Additionally, it's advisable to steer clear of __uint128_t in certain situations: it is not available with the Microsoft compiler or on 32-bit targets, and the compiler does not always optimize it well.
Comment 4 Jakub Jelinek 2023-08-28 19:04:31 UTC
(In reply to cqwrteur from comment #3)
> (In reply to Jakub Jelinek from comment #1)
> > Just use __int128 addition if all you want is double-word addition (or long
> > long for 32-bit arches)?
> 
> Well, I've presented this merely as an illustrative example. The length can
> actually be arbitrary.

No, it was working with all the other lengths.
Comment 5 cqwrteur 2023-08-28 19:08:36 UTC
(In reply to Jakub Jelinek from comment #4)
> (In reply to cqwrteur from comment #3)
> > (In reply to Jakub Jelinek from comment #1)
> > > Just use __int128 addition if all you want is double-word addition (or long
> > > long for 32-bit arches)?
> > 
> > Well, I've presented this merely as an illustrative example. The length can
> > actually be arbitrary.
> 
> No, it was working with all the other lengths.

This might come across as unusual. I frequently engage in manipulations involving the carry flag.

like this implementation of 128-bit division (for 32-bit machines and the Microsoft compiler):


	auto shift = static_cast<unsigned>(::std::countl_zero(divisorhigh) - ::std::countl_zero(dividendhigh));

	divisorhigh = ::fast_io::intrinsics::shiftleft(divisorlow,divisorhigh,shift);
	divisorlow <<= shift;

	quotientlow = 0;
	bool carry;
	do
	{
		carry=0;
		dividendlow=intrinsics::subc(dividendlow,divisorlow,carry,carry);
		dividendhigh=intrinsics::subc(dividendhigh,divisorhigh,carry,carry);
		constexpr T zero{};
		T mask{zero-carry};
		T templow{divisorlow&mask},temphigh{divisorhigh&mask};
		carry=!carry;
		quotientlow=intrinsics::addc(quotientlow,quotientlow,carry,carry);
		carry=0;
		dividendlow=intrinsics::addc(dividendlow,templow,carry,carry);
		dividendhigh=intrinsics::addc(dividendhigh,temphigh,carry,carry);
		divisorlow = intrinsics::shiftright(divisorlow,divisorhigh,1u);
		divisorhigh >>= 1u;
	}
	while(shift--);
	return {quotientlow,0,dividendlow,dividendhigh};
Comment 6 GCC Commits 2023-08-29 08:47:27 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:a7aec76a74dd38524be325343158d3049b6ab3ac

commit r14-3541-ga7aec76a74dd38524be325343158d3049b6ab3ac
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Aug 29 10:46:01 2023 +0200

    tree-ssa-math-opts: Improve uaddc/usubc pattern matching [PR111209]
    
    The uaddc/usubc usual matching is of the .{ADD,SUB}_OVERFLOW pair in the
    middle, which adds/subtracts carry-in (from lower limbs) and computes
    carry-out (to higher limbs).  Before optimizations (unless user writes
    it intentionally that way already), all the steps look the same, but
    optimizations simplify the handling of the least significant limb
    (one which adds/subtracts 0 carry-in) to just a single
    .{ADD,SUB}_OVERFLOW and the handling of the most significant limb
    if the computed carry-out is ignored to normal addition/subtraction
    of multiple operands.
    Now, match_uaddc_usubc has code to turn that least significant
    .{ADD,SUB}_OVERFLOW call into .U{ADD,SUB}C call with 0 carry-in if
    a more significant limb above it is matched into .U{ADD,SUB}C; this
    isn't necessary for functionality, as .ADD_OVERFLOW (x, y) is
    functionally equal to .UADDC (x, y, 0) (provided the types of operands
    are the same and result is complex type with that type element), and
    it also has code to match the most significant limb with ignored carry-out
    (in that case one pattern match turns both the penultimate limb pair of
    .{ADD,SUB}_OVERFLOW into .U{ADD,SUB}C and the addition/subtraction
    of the 4 values (2 carries) into another .U{ADD,SUB}C.
    
    As the following patch shows, what we weren't handling is the case when
    one uses either the __builtin_{add,sub}c builtins or hand written forms
    thereof (either __builtin_*_overflow or even that written by hand) for
    just 2 limbs, where the least significant has 0 carry-in and the most
    significant ignores carry-out.  The following patch matches that, e.g.
      _16 = .ADD_OVERFLOW (_1, _2);
      _17 = REALPART_EXPR <_16>;
      _18 = IMAGPART_EXPR <_16>;
      _15 = _3 + _4;
      _12 = _15 + _18;
    into
      _16 = .UADDC (_1, _2, 0);
      _17 = REALPART_EXPR <_16>;
      _18 = IMAGPART_EXPR <_16>;
      _19 = .UADDC (_3, _4, _18);
      _12 = IMAGPART_EXPR <_19>;
    so that we can emit better code.
    
    As the 2 later comments show, we must do that carefully, because the
    pass walks the IL from first to last stmt in a bb and we must avoid
    pattern matching this way something that should be matched on a later
    instruction differently.
    
    2023-08-29  Jakub Jelinek  <jakub@redhat.com>
    
            PR middle-end/79173
            PR middle-end/111209
            * tree-ssa-math-opts.cc (match_uaddc_usubc): Match also
            just 2 limb uaddc/usubc with 0 carry-in on lower limb and ignored
            carry-out on higher limb.  Don't match it though if it could be
            matched later on 4 argument addition/subtraction.
    
            * gcc.target/i386/pr79173-12.c: New test.