Bug 79173

Summary: add-with-carry and subtract-with-borrow support (x86_64 and others)
Product: gcc
Reporter: Vincent Lefèvre <vincent-gcc>
Component: middle-end
Assignee: Jakub Jelinek <jakub>
Status: ASSIGNED
Severity: enhancement
CC: already5chosen, chfast, crazylht, dimhen, dje, fw, gasper.azman, hjl.tools, jakub, krebbel, law, rsandifo, rth, segher, slash.tmp, tkoenig, zimmerma+gcc
Priority: P3
Keywords: missed-optimization
Version: 7.0
Target Milestone: ---
See Also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97387
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110104
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112430
Target: x86_64-*-*
Last reconfirmed: 2021-05-28 00:00:00
Bug Depends on: 80491
Attachments: gcc14-pr79173-wip.patch
             gcc14-pr79173.patch

Description Vincent Lefèvre 2017-01-21 00:51:53 UTC
There should be a way to support full add-with-carry and subtract-with-borrow by generating adc / sbb instructions on x86_64 (and similar instructions on other targets).

GCC could add builtins, such as __builtin_addc* and __builtin_subc* (two arguments, carry in, carry out, and the result), similar to Clang:
http://clang.llvm.org/docs/LanguageExtensions.html#multiprecision-arithmetic-builtins
as suggested in PR 60206 comment 3.
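For illustration, a chained use of such builtins for a 3-limb addition might look like the following sketch (using Clang's documented __builtin_addcl, not available in GCC when this report was filed; the function name add3 is just illustrative):

typedef unsigned long T;

void add3 (T *p, const T *u, const T *v)
{
  T c0, c1, c2;  /* per-limb carry out; stays in [0, 1] since the chain starts at 0 */

  p[0] = __builtin_addcl (u[0], v[0], 0, &c0);
  p[1] = __builtin_addcl (u[1], v[1], c0, &c1);
  p[2] = __builtin_addcl (u[2], v[2], c1, &c2);  /* final carry ignored */
}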

Detection of special constructs in standard C/... code would be useful too. Here are some examples from https://gcc.gnu.org/ml/gcc-help/2017-01/msg00067.html for subtraction:

typedef unsigned long T;

void sub1 (T *p, T u0, T u1, T u2, T v0, T v1, T v2)
{
  T t1;
  int b0, b1;

  p[0] = u0 - v0;
  b0 = u0 < v0;
  t1 = u1 - v1;
  b1 = u1 < v1;
  p[1] = t1 - b0;
  b1 |= p[1] > t1;
  p[2] = u2 - v2 - b1;
}

void sub2 (T *p, T u0, T u1, T u2, T v0, T v1, T v2)
{
  int b0, b1;

  p[0] = u0 - v0;
  b0 = u0 < v0;
  p[1] = u1 - v1 - b0;
  b1 = u1 < v1 || (u1 == v1 && b0 != 0);
  p[2] = u2 - v2 - b1;
}

In the second example, the b1 line could also be replaced by:

  b1 = u1 < v1 + b0 || v1 + b0 < v1;

For the subtractions, optimal code would contain 1 sub and 2 sbb's.
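Equivalently, the borrow chain can be spelled with __builtin_sub_overflow, two calls per limb (just one for the least significant limb); a sketch of that form, with an illustrative name sub3:

void sub3 (T *p, T u0, T u1, T u2, T v0, T v1, T v2)
{
  T r1;
  int b0, b1a, b1b;

  b0 = __builtin_sub_overflow (u0, v0, &p[0]);
  b1a = __builtin_sub_overflow (u1, v1, &r1);
  b1b = __builtin_sub_overflow (r1, (T) b0, &p[1]);
  /* At most one of b1a/b1b can be set, so | keeps the borrow in [0, 1].  */
  p[2] = u2 - v2 - (T) (b1a | b1b);
}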
Comment 1 Marc Glisse 2017-01-21 08:41:23 UTC
We could start with the simpler:

void f (unsigned *__restrict__ r, unsigned *__restrict__ s,
        unsigned a, unsigned b, unsigned c, unsigned d)
{
  *r = a + b;
  *s = c + d + (*r < a);
}

After combine, we have:

(insn 34 12 20 2 (set (reg:SI 93 [ _15+4 ])
        (ltu:SI (reg:CCC 17 flags)
            (const_int 0 [0]))) 608 {*setcc_si_1_movzbl}
     (expr_list:REG_DEAD (reg:CCC 17 flags)
        (nil)))

(insn 21 20 22 2 (parallel [
            (set (reg:SI 102)
                (plus:SI (reg:SI 37 r8 [ c ])
                    (reg:SI 38 r9 [ d ])))
            (clobber (reg:CC 17 flags))
        ]) "a.c":3 213 {*addsi_1}
     (expr_list:REG_DEAD (reg:SI 38 r9 [ d ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_DEAD (reg:SI 37 r8 [ c ])
                (nil)))))

(insn 25 24 26 2 (parallel [
            (set (reg:SI 105)
                (plus:SI (reg:SI 102)
                    (reg:SI 93 [ _15+4 ])))
            (clobber (reg:CC 17 flags))
        ]) "a.c":3 213 {*addsi_1}
     (expr_list:REG_DEAD (reg:SI 93 [ _15+4 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (expr_list:REG_DEAD (reg:SI 102)
                (nil)))))

The combine dump says "Trying 21, 34 -> 25:" but the next line is blank and it moves on to trying something else.

If I use parentheses *s=c+(d+(*r<a)); and disable reassoc, I get:

Trying 23, 24 -> 25:
Successfully matched this instruction:
(parallel [
        (set (reg:SI 105)
            (plus:SI (plus:SI (ltu:SI (reg:CCC 17 flags)
                        (const_int 0 [0]))
                    (reg:SI 37 r8 [ c ]))
                (reg:SI 38 r9 [ d ])))
        (clobber (reg:CC 17 flags))
    ])
Instruction not appropriate for target.

I don't know where that target restriction is coming from, but at least we seem to be getting somewhere.

If I remove c and keep *s=d+(*r<a), I get

Failed to match this instruction:
(parallel [
        (set (reg:SI 103)
            (plus:SI (ltu:SI (reg:CCC 17 flags)
                    (const_int 0 [0]))
                (reg:SI 38 r9 [ d ])))
        (clobber (reg:CC 17 flags))
    ])
Failed to match this instruction:
(set (reg:SI 103)
    (plus:SI (ltu:SI (reg:CCC 17 flags)
            (const_int 0 [0]))
        (reg:SI 38 r9 [ d ])))

We would probably need a special pattern for this case, virtually adding 0.
Comment 2 Marc Glisse 2017-01-21 10:36:03 UTC
(In reply to Marc Glisse from comment #1)
> Trying 23, 24 -> 25:
> Successfully matched this instruction:
> (parallel [
>         (set (reg:SI 105)
>             (plus:SI (plus:SI (ltu:SI (reg:CCC 17 flags)
>                         (const_int 0 [0]))
>                     (reg:SI 37 r8 [ c ]))
>                 (reg:SI 38 r9 [ d ])))
>         (clobber (reg:CC 17 flags))
>     ])
> Instruction not appropriate for target.

I didn't notice immediately, but apparently ix86_legitimate_combined_insn was happy to let combine propagate the hard registers r8 and r9 into the simple addition *addsi_1, which has (match_operand:SWI48 2 "x86_64_general_operand" "rme,re,0,le"), but then it won't accept them in the addition with carry. My guess is that letting the hard registers into the additions is premature and should be delayed until after combine. If I manually reject those in gdb, we do produce addl+adcl.
Comment 3 Richard Biener 2017-01-23 08:49:09 UTC
Confirmed. The bug should possibly be split (Clang compatible builtins, target specific detection of patterns, middle-end detection of patterns).  Note that a suitable middle-end representation would not have the address argument but would likely return two values (via the usual _Complex int trick).
Comment 4 Vincent Lefèvre 2017-01-25 14:12:00 UTC
Also, make sure that the optimization is still done when a variable is a constant or replaced by a constant (with Clang, the optimization is no longer done in such a case).
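For example (a hypothetical variant of sub2 above, with v0 replaced by the constant 123), the optimization should still apply and still yield 1 sub and 2 sbb's:

void sub2c (T *p, T u0, T u1, T u2, T v1, T v2)
{
  int b0, b1;

  p[0] = u0 - 123;  /* v0 constant-folded to 123 */
  b0 = u0 < 123;
  p[1] = u1 - v1 - b0;
  b1 = u1 < v1 || (u1 == v1 && b0 != 0);
  p[2] = u2 - v2 - b1;
}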
Comment 5 Michael_S 2020-11-15 21:21:39 UTC
(In reply to Vincent Lefèvre from comment #0)
> There should be a way to support full add-with-carry and
> subtract-with-borrow by generating adc / sbb instructions on x86_64 (and
> similar instructions on other targets).
> 
> GCC could add builtins, such as __builtin_addc* and __builtin_subc* (two
> arguments, carry in, carry out, and the result), similar to Clang:
> http://clang.llvm.org/docs/LanguageExtensions.html#multiprecision-arithmetic-
> builtins
> as suggested in PR 60206 comment 3.
> 
> [...]

I agree with regard to "other targets", first of all aarch64, but the x86_64 variant of gcc already provides the requested functionality in the form of the _subborrow_u64() and _addcarry_u64() intrinsic functions.
The problem here is not lack of functionality, but a very poor implementation (mentioned many times on bugzilla with minimal effect).
In that regard gcc is more than a decade behind MSVC and ~4 years behind clang/llvm. Surprisingly, icc is also quite bad.
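For reference, a 3-limb borrow chain written with those intrinsics would look like this sketch (they are declared in <x86intrin.h>; the name sub3_u64 is illustrative):

#include <x86intrin.h>

void sub3_u64 (unsigned long long *p, const unsigned long long *u,
               const unsigned long long *v)
{
  unsigned char b;

  b = _subborrow_u64 (0, u[0], v[0], &p[0]);
  b = _subborrow_u64 (b, u[1], v[1], &p[1]);
  (void) _subborrow_u64 (b, u[2], v[2], &p[2]);  /* final borrow ignored */
}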
Comment 6 Michael_S 2020-11-15 21:24:13 UTC
(In reply to Marc Glisse from comment #1)
> We could start with the simpler:
> 
> void f (unsigned *__restrict__ r, unsigned *__restrict__ s,
>         unsigned a, unsigned b, unsigned c, unsigned d)
> {
>   *r = a + b;
>   *s = c + d + (*r < a);
> }
> 

That works for dual-precision addition, but not for triple or more.

Comment 7 Jakub Jelinek 2020-11-15 21:32:57 UTC
(In reply to Michael_S from comment #5)
> I agree with regard to "other targets", first of all, aarch64, but x86_64
> variant of gcc already provides requested functionality in for of
> _subborrow_u64 () and _addcarry_u64() intrinsic functions.
> The problem here is not lack of functionality, but very poor implementation
> (mentioned many times on bugzilla with minimal effect).
> In that regard gcc is more than decade behind MSVC and ~4 years behind
> clang/llvm. Surprisingly, icc is also quite bad.

Are you sure you have tested gcc trunk?
There have been fixes for this a month ago as part of PR97387 fixes.
Comment 8 Michael_S 2020-11-15 22:30:27 UTC
(In reply to Jakub Jelinek from comment #7)
> (In reply to Michael_S from comment #5)
> > I agree with regard to "other targets", first of all, aarch64, but x86_64
> > variant of gcc already provides requested functionality in for of
> > _subborrow_u64 () and _addcarry_u64() intrinsic functions.
> > The problem here is not lack of functionality, but very poor implementation
> > (mentioned many times on bugzilla with minimal effect).
> > In that regard gcc is more than decade behind MSVC and ~4 years behind
> > clang/llvm. Surprisingly, icc is also quite bad.
> 
> Are you sure you have tested gcc trunk?
> There have been fixes for this a month ago as part of PR97387 fixes.

I am sure I did not :-)
And I most likely am not going to test trunk, sorry. I'd rather wait for a release.
Essentially I am posting here not because I deeply care about the topic right now (I did care a year or so ago, but have lost interest since then, see https://www.realworldtech.com/forum/?threadid=188061&curpostid=188061), but because I stumbled on it while waiting for a response to PR 97832.

So, sorry for the intervention.
Comment 9 Michael_S 2020-11-15 22:45:39 UTC
Despite what I wrote above, I did take a look at trunk on godbolt with the same old code from a year ago, because it was so easy. And indeed trunk looks A LOT better.
But until it's released I won't know whether it's actually up to the speed of MSVC and clang.
Comment 10 Thomas Koenig 2021-05-28 15:34:59 UTC
Just had a look at trunk.

It currently produces

adc:
        leaq    800(%rsi), %rcx
        xorl    %edx, %edx
.L2:
        movq    (%rdi), %rax
        addb    $-1, %dl
        adcq    (%rsi), %rax
        setc    %dl
        addq    $8, %rsi
        movq    %rax, (%rdi)
        addq    $8, %rdi
        cmpq    %rcx, %rsi
        jne     .L2
        ret

Clang does

adc:                                    # @adc
        movl    $4, %eax
        xorl    %ecx, %ecx
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movq    -32(%rsi,%rax,8), %rdx
        addb    $-1, %cl
        adcq    %rdx, -32(%rdi,%rax,8)
        movq    -24(%rsi,%rax,8), %rcx
        adcq    %rcx, -24(%rdi,%rax,8)
        movq    -16(%rsi,%rax,8), %rcx
        adcq    %rcx, -16(%rdi,%rax,8)
        movq    -8(%rsi,%rax,8), %rcx
        adcq    %rcx, -8(%rdi,%rax,8)
        movq    (%rsi,%rax,8), %rcx
        adcq    %rcx, (%rdi,%rax,8)
        setb    %cl
        addq    $5, %rax
        cmpq    $104, %rax
        jne     .LBB0_1
        retq

so it actually unrolls the loop and does the ideal sequence of
add with carry.
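(The source being compiled is not quoted above; judging from the loop bounds — 800 bytes, i.e. 100 limbs — it is presumably something of this shape, a plausible reconstruction using the _addcarry_u64 intrinsic:)

#include <x86intrin.h>

void adc (unsigned long long *dst, const unsigned long long *src)
{
  unsigned char c = 0;  /* carry propagated across all 100 limbs */
  for (int i = 0; i < 100; i++)
    c = _addcarry_u64 (c, dst[i], src[i], &dst[i]);
}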
Comment 11 Andrew Pinski 2021-09-09 06:48:18 UTC
x86 has _addcarry_u64 which gets it mostly (see PR 97387).

The clang-compatible __builtin_*_overflow builtins are there, but not the __builtin_addc* builtins.

GCC does do a decent job of optimizing the original code now too.
Comment 12 Vincent Lefèvre 2021-09-09 08:54:53 UTC
(In reply to Andrew Pinski from comment #11)
> x86 has _addcarry_u64 which gets it mostly (see PR 97387).
> 
> The clang builtins __builtin_*_overflow are there but not the __builtin_add*
> builtins.
> 
> GCC does do a decent job of optimizing the original code now too.

By "original code", do you mean the code with _addcarry_u64 (I haven't tested)? Otherwise, I don't see any optimization at all on the code I posted in Comment 0.

One issue is that _addcarry_u64 / x86intrin.h are not documented, so the conditions of their use in portable code are not clear. I suppose that it is designed to be usable as a target-independent compiler builtin.
Comment 13 Andrew Pinski 2021-09-09 09:05:55 UTC
(In reply to Vincent Lefèvre from comment #12) 
> One issue is that _addcarry_u64 / x86intrin.h are not documented, so the
> conditions of its use in portable code are not clear. I suppose that it is
> designed to be used in a target-independent compiler builtin.

None of the x86 intrinsics are documented except on the Intel intrinsics guide page:
e.g.:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_addcarry_u64
Comment 14 Jakub Jelinek 2023-06-05 07:17:53 UTC
Unfortunately, the clang __builtin_addc* and __builtin_subc* builtins are badly designed.
Instead of adding or subtracting a 1-bit carry, where the result is guaranteed to have 1-bit carry as well, they are:
unsigned __builtin_addc (unsigned x, unsigned y, unsigned carry_in, unsigned *carry_out)
{
  unsigned r;
  unsigned c1 = __builtin_add_overflow (x, y, &r);
  unsigned c2 = __builtin_add_overflow (r, carry_in, &r);
  *carry_out = c1 + c2;
  return r; 
}

unsigned __builtin_subc (unsigned x, unsigned y, unsigned carry_in, unsigned *carry_out)
{
  unsigned r;
  unsigned c1 = __builtin_sub_overflow (x, y, &r);
  unsigned c2 = __builtin_sub_overflow (r, carry_in, &r);
  *carry_out = c1 + c2;
  return r; 
}
So, instead of doing [0, 0xffffffff] + [0, 0xffffffff] + [0, 1] resulting in [0, 0xffffffff] plus a [0, 1] carry, they actually do
[0, 0xffffffff] + [0, 0xffffffff] + [0, 0xffffffff] resulting in [0, 0xffffffff] plus a [0, 2] carry.  So much for good "design".

So, I am not really sure it is worth implementing those builtins; one can use __builtin_add_overflow/__builtin_sub_overflow instead. All we need is to pattern detect when they are chained, so that the chain starts with a 0 carry in; the carry outs are then guaranteed to be in [0, 1], and further pairs of .ADD_OVERFLOW/.SUB_OVERFLOW can again count on a [0, 1]
carry in and produce a [0, 1] carry out.  And pattern detect that into some new IFN which will try to produce efficient code for these.
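A sketch of the kind of chain such pattern detection would recognize, written directly with __builtin_add_overflow (one call for the least significant limb, a pair per further limb; the name addc3 is illustrative):

void addc3 (unsigned long *p, const unsigned long *u, const unsigned long *v)
{
  unsigned long r;
  unsigned c1, c2, c3;

  c1 = __builtin_add_overflow (u[0], v[0], &p[0]);
  c2 = __builtin_add_overflow (u[1], v[1], &r);
  c3 = __builtin_add_overflow (r, (unsigned long) c1, &p[1]);
  /* c2 and c3 cannot both be set, so the carry out stays in [0, 1].  */
  p[2] = u[2] + v[2] + (c2 | c3);
}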
Comment 15 Paweł Bylica 2023-06-05 13:01:45 UTC
For what it's worth, clang's __builtin_addc is implemented in the frontend only, as a pair of __builtin_add_overflow calls. The commit from 11 years ago does not explain why they were added. https://github.com/llvm/llvm-project/commit/54398015bf8cbdc3af54dda74807d6f3c8436164

Producing a chain of ADC instructions out of __builtin_add_overflow patterns was done quite recently (~1 year ago), and this work is not fully finished yet.

On the other hand, Go recently added addc-like "builtins" in https://pkg.go.dev/math/bits. And they are a real pleasure to use in multi-precision arithmetic.
Comment 16 Jakub Jelinek 2023-06-06 12:01:05 UTC
Created attachment 55271 [details]
gcc14-pr79173-wip.patch

Untested WIP (with a backend implementation for x86 only so far).
Still need to add/tweak some peephole2s to make the pr79173-{1,2,3,4}.c tests clean
on both x86-64 and i?86, and then, as can be seen in pr79173-5.c, do some
tweaks so that the already pattern recognized forms of
__builtin_{add,sub}_overflow are also matched, not just the builtins used directly.
C23 is standardizing __builtin_{add,sub,mul}_overflow under the
ckd_{add,sub,mul} names, so pattern recognizing it that way is definitely
desirable.
Oh, and maybe incrementally check what happens if one of the addends or subtrahends is an immediate.
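For completeness, the C23 spelling of such a chain uses <stdckdint.h> (a sketch, assuming a C23-capable compiler; semantics identical to the __builtin_add_overflow form, the name addc2 is illustrative):

#include <stdckdint.h>

void addc2 (unsigned long *p, const unsigned long *u, const unsigned long *v)
{
  unsigned long r;
  bool c1, c2, c3;

  c1 = ckd_add (&p[0], u[0], v[0]);
  c2 = ckd_add (&r, u[1], v[1]);
  c3 = ckd_add (&p[1], r, (unsigned long) c1);
  (void) (c2 || c3);  /* carry out of the top limb, ignored here */
}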
Comment 17 Jakub Jelinek 2023-06-06 16:12:43 UTC
Created attachment 55274 [details]
gcc14-pr79173.patch

Full patch I'm going to test.
Unfortunately, I haven't been successful at getting the subc stuff working when using neither the __builtin_sub_overflow builtins nor the _subborrow_u* intrinsics; only addc seems to work in that case. It seems reassoc rewrites things so much that we are no longer able to even pattern match __builtin_sub_overflow in that case.

So, maybe even adding the ugly clang builtins as a canned way to express it canonically would be useful; the pattern matching can't handle an infinite number of different ways to write the same thing.
Comment 18 GCC Commits 2023-06-15 07:07:17 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:b6ca11407d4f5d16ccfb580ea2d3d9aa08d7cd11

commit r14-1835-gb6ca11407d4f5d16ccfb580ea2d3d9aa08d7cd11
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Jun 15 09:05:01 2023 +0200

    i386: Add peephole2 patterns to improve add with carry or subtract with borrow with memory destination [PR79173]
    
    This patch adds various peephole2s which help to recognize add with
    carry or subtract with borrow with memory destination.
    
    2023-06-14  Jakub Jelinek  <jakub@redhat.com>
    
            PR middle-end/79173
            * config/i386/i386.md (*sub<mode>_3, @add<mode>3_carry,
            addcarry<mode>, @sub<mode>3_carry, *add<mode>3_cc_overflow_1): Add
            define_peephole2 TARGET_READ_MODIFY_WRITE/-Os patterns to prefer
            using memory destination in these patterns.
Comment 19 GCC Commits 2023-06-15 07:10:27 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:ec52d228d6db7f77188ad099a8c0ff65dead3241

commit r14-1836-gec52d228d6db7f77188ad099a8c0ff65dead3241
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Jun 15 09:08:37 2023 +0200

    i386: Add peephole2 patterns to improve subtract with borrow with memory destination [PR79173]
    
    This patch adds subborrow<mode> alternative so that it can have memory
    destination and adds various peephole2s which help to match it.
    
    2023-06-15  Jakub Jelinek  <jakub@redhat.com>
    
            PR middle-end/79173
            * config/i386/i386.md (subborrow<mode>): Add alternative with
            memory destination and add for it define_peephole2
            TARGET_READ_MODIFY_WRITE/-Os patterns to prefer using memory
            destination in these patterns.
Comment 20 GCC Commits 2023-06-15 07:16:11 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:43a3252c42af12ad90082e4088ea58eecd0bf582

commit r14-1837-g43a3252c42af12ad90082e4088ea58eecd0bf582
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Thu Jun 15 09:12:40 2023 +0200

    middle-end, i386: Pattern recognize add/subtract with carry [PR79173]
    
    The following patch introduces {add,sub}c5_optab and pattern recognizes
    various forms of add with carry and subtract with carry/borrow, see
    pr79173-{1,2,3,4,5,6}.c tests on what is matched.
    Primarily forms with 2 __builtin_add_overflow or __builtin_sub_overflow
    calls per limb (with just one for the least significant one), for
    add with carry even when it is hand written in C (for subtraction
    reassoc seems to change it too much so that the pattern recognition
    doesn't work).  __builtin_{add,sub}_overflow are standardized in C23
    under ckd_{add,sub} names, so it isn't any longer a GNU only extension.
    
    Note, clang has for these (IMHO badly designed)
    __builtin_{add,sub}c{b,s,,l,ll} builtins which don't add/subtract just
    a single bit of carry, but basically add 3 unsigned values or
    subtract 2 unsigned values from one, and result in carry out of 0, 1, or 2
    because of that.  If we wanted to introduce those for clang compatibility,
    we could and lower them early to just two __builtin_{add,sub}_overflow
    calls and let the pattern matching in this patch recognize it later.
    
    I've added expanders for this on ix86 and in addition to that
    added various peephole2s (in preparation patches for this patch) to make
    sure we get nice (and small) code for the common cases.  I think there are
    other PRs which request that e.g. for the _{addcarry,subborrow}_u{32,64}
    intrinsics, which the patch also improves.
    
    Would be nice if support for these optabs was added to many other targets,
    arm/aarch64 and powerpc* certainly have such instructions, I'd expect
    in fact that most targets do.
    
    The _BitInt support I'm working on will also need this to emit reasonable
    code.
    
    2023-06-15  Jakub Jelinek  <jakub@redhat.com>
    
            PR middle-end/79173
            * internal-fn.def (UADDC, USUBC): New internal functions.
            * internal-fn.cc (expand_UADDC, expand_USUBC): New functions.
            (commutative_ternary_fn_p): Return true also for IFN_UADDC.
            * optabs.def (uaddc5_optab, usubc5_optab): New optabs.
            * tree-ssa-math-opts.cc (uaddc_cast, uaddc_ne0, uaddc_is_cplxpart,
            match_uaddc_usubc): New functions.
            (math_opts_dom_walker::after_dom_children): Call match_uaddc_usubc
            for PLUS_EXPR, MINUS_EXPR, BIT_IOR_EXPR and BIT_XOR_EXPR unless
            other optimizations have been successful for those.
            * gimple-fold.cc (gimple_fold_call): Handle IFN_UADDC and IFN_USUBC.
            * fold-const-call.cc (fold_const_call): Likewise.
            * gimple-range-fold.cc (adjust_imagpart_expr): Likewise.
            * tree-ssa-dce.cc (eliminate_unnecessary_stmts): Likewise.
            * doc/md.texi (uaddc<mode>5, usubc<mode>5): Document new named
            patterns.
            * config/i386/i386.md (uaddc<mode>5, usubc<mode>5): New
            define_expand patterns.
            (*setcc_qi_addqi3_cconly_overflow_1_<mode>, *setccc): Split
            into NOTE_INSN_DELETED note rather than nop instruction.
            (*setcc_qi_negqi_ccc_1_<mode>, *setcc_qi_negqi_ccc_2_<mode>):
            Likewise.
    
            * gcc.target/i386/pr79173-1.c: New test.
            * gcc.target/i386/pr79173-2.c: New test.
            * gcc.target/i386/pr79173-3.c: New test.
            * gcc.target/i386/pr79173-4.c: New test.
            * gcc.target/i386/pr79173-5.c: New test.
            * gcc.target/i386/pr79173-6.c: New test.
            * gcc.target/i386/pr79173-7.c: New test.
            * gcc.target/i386/pr79173-8.c: New test.
            * gcc.target/i386/pr79173-9.c: New test.
            * gcc.target/i386/pr79173-10.c: New test.
Comment 21 Jakub Jelinek 2023-06-15 07:49:11 UTC
CCing some maintainers: could you please consider adding these uaddc<mode>5/usubc<mode>5 expanders to the rs6000, aarch64, arm, s390 and riscv targets if those targets have appropriate instructions (add with carry and subtract with carry/borrow), copying the above gcc.target/i386/ testcases to the other gcc.target subdirectories, and verifying that you get optimal code for those sequences?
The _BitInt addition/subtraction code will also use these new internal functions if possible, so implementing it will also help get usable code for _BitInt later on.
Comment 22 GCC Commits 2023-06-16 17:49:02 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:2b4e0415ad664cdb3ce87d1f7eee5ca26911a05b

commit r14-1896-g2b4e0415ad664cdb3ce87d1f7eee5ca26911a05b
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Fri Jun 16 19:47:28 2023 +0200

    builtins: Add support for clang compatible __builtin_{add,sub}c{,l,ll} [PR79173]
    
    While the design of these builtins in clang is questionable,
    rather than being say
    unsigned __builtin_addc (unsigned, unsigned, bool, bool *)
    so that it is clear they add two [0, 0xffffffff] range numbers
    plus one [0, 1] range carry in and give [0, 0xffffffff] range
    return plus [0, 1] range carry out, they actually instead
    add 3 [0, 0xffffffff] values together but the carry out
    isn't then the expected [0, 2] value because
    0xffffffffULL + 0xffffffff + 0xffffffff is 0x2fffffffd,
    but just [0, 1] whether there was any overflow at all.
    
    It is something used in the wild and shorter to write than the
    corresponding
     #define __builtin_addc(a,b,carry_in,carry_out) \
      ({ unsigned _s; \
         unsigned _c1 = __builtin_uadd_overflow (a, b, &_s); \
         unsigned _c2 = __builtin_uadd_overflow (_s, carry_in, &_s); \
         *(carry_out) = (_c1 | _c2); \
         _s; })
    and so a canned builtin for something people could often use.
    It isn't that hard to maintain on the GCC side, as we just lower
    it to two .ADD_OVERFLOW calls early, and the already committed
    pattern recognition code can then make .UADDC/.USUBC calls out of
    that if the carry in is in [0, 1] range and the corresponding
    optab is supported by the target.
    
    2023-06-16  Jakub Jelinek  <jakub@redhat.com>
    
            PR middle-end/79173
            * builtin-types.def (BT_FN_UINT_UINT_UINT_UINT_UINTPTR,
            BT_FN_ULONG_ULONG_ULONG_ULONG_ULONGPTR,
            BT_FN_ULONGLONG_ULONGLONG_ULONGLONG_ULONGLONG_ULONGLONGPTR): New
            types.
            * builtins.def (BUILT_IN_ADDC, BUILT_IN_ADDCL, BUILT_IN_ADDCLL,
            BUILT_IN_SUBC, BUILT_IN_SUBCL, BUILT_IN_SUBCLL): New builtins.
            * builtins.cc (fold_builtin_addc_subc): New function.
            (fold_builtin_varargs): Handle BUILT_IN_{ADD,SUB}C{,L,LL}.
            * doc/extend.texi (__builtin_addc, __builtin_subc): Document.
    
            * gcc.target/i386/pr79173-11.c: New test.
            * gcc.dg/builtin-addc-1.c: New test.
Comment 23 Jeffrey A. Law 2023-06-17 15:37:52 UTC
risc-v doesn't have any special instructions to implement add-with-carry or subtract-with-borrow.  Depending on who you talk to, it's either a feature or a mis-design.
Comment 24 Jakub Jelinek 2023-06-17 16:34:46 UTC
Sorry, in that case nothing needs to be done for riscv.  I'm sure aarch64 and arm have them (e.g. adcs), I think powerpc has some, but e.g. PR43892 is still open, and I'm sure s390 has them too (alc*, slb*).
Comment 25 GCC Commits 2023-06-20 18:18:24 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:f8f68c4ca622a24c2e8cf2b5f2f9fdcd47a7b369

commit r14-2001-gf8f68c4ca622a24c2e8cf2b5f2f9fdcd47a7b369
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Jun 20 20:17:41 2023 +0200

    tree-ssa-math-opts: Small uaddc/usubc pattern matching improvement [PR79173]
    
    In the following testcase we fail to pattern recognize the least significant
    .UADDC call.  The reason is that arg3 in that case is
      _3 = .ADD_OVERFLOW (...);
      _2 = __imag__ _3;
      _1 = _2 != 0;
      arg3 = (unsigned long) _1;
    and while before the changes arg3 has a single use in some .ADD_OVERFLOW
    later on, we add a .UADDC call next to it (and gsi_remove/gsi_replace only
    what is strictly necessary and leave quite a few dead stmts around which
    next DCE cleans up) and so it all of sudden isn't used just once, but twice
    (.ADD_OVERFLOW and .UADDC) and so uaddc_cast fails.  While we could tweak
    uaddc_cast and not require has_single_use in these uses, there is also
    no vrp that would figure out that because __imag__ _3 is in [0, 1] range,
    it can just use arg3 = __imag__ _3; and drop the comparison and cast.
    
    We already search if either arg2 or arg3 is ultimately set from __imag__
    of .{{ADD,SUB}_OVERFLOW,U{ADD,SUB}C} call, so the following patch just
    remembers the lhs of __imag__ from that case and uses it later.
    
    2023-06-20  Jakub Jelinek  <jakub@redhat.com>
    
            PR middle-end/79173
            * tree-ssa-math-opts.cc (match_uaddc_usubc): Remember lhs of
            IMAGPART_EXPR of arg2/arg3 and use that as arg3 if it has the right
            type.
    
            * g++.target/i386/pr79173-1.C: New test.
Comment 26 Marc Glisse 2023-06-24 09:42:29 UTC
(In reply to CVS Commits from comment #22)
>     While the design of these builtins in clang is questionable,
>     rather than being say
>     unsigned __builtin_addc (unsigned, unsigned, bool, bool *)
>     so that it is clear they add two [0, 0xffffffff] range numbers
>     plus one [0, 1] range carry in and give [0, 0xffffffff] range
>     return plus [0, 1] range carry out, they actually instead
>     add 3 [0, 0xffffffff] values together but the carry out
>     isn't then the expected [0, 2] value because
>     0xffffffffULL + 0xffffffff + 0xffffffff is 0x2fffffffd,
>     but just [0, 1] whether there was any overflow at all.

That is very strange. I always thought that the original intent was for __builtin_addc to assume that its third argument was in [0, 1] and generate a single addc instruction on hardware that has it, and the type only ended up being the same as the others for convenience (also C used not to have a bool type). The final overflow never being 2 confirms this.

It may be worth discussing with clang developers if they would be willing to document such a [0, 1] restriction, and maybe have ubsan check it.
Comment 27 Jakub Jelinek 2023-06-24 09:51:06 UTC
Given that the builtins have existed for 10 years already, I think changing this for them is too late, though they don't seem to take backwards compatibility as seriously.
They don't document the [0, 1] restriction, and the behavior implemented in GCC is what I saw when trying it.
Note, in many cases it isn't that big a deal, because if carry_in is in [0, 1] range and the compiler can see that from VRP, it can still optimize it.  And given that carry_out is always in [0, 1] range, in chained cases the first additions will at worst be less optimized, but the chained ones will already be better.
Comment 28 Vincent Lefèvre 2023-06-24 12:09:06 UTC
(In reply to Jakub Jelinek from comment #27)
> Given that the builtins exist for 10 years already, I think changing it for
> them is too late, though they don't seem to take backwards compatibility as
> seriously.
> They don't document the [0, 1] restriction and the behavior implemented in
> GCC is what I saw when trying it.

Their documentation at https://clang.llvm.org/docs/LanguageExtensions.html is currently just

  unsigned sum = __builtin_addc(x, y, carryin, &carryout);

But a carry for a 2-ary addition is always 0 or 1, so the [0, 1] restriction is implicit (by the language that is used).

And in their example, the carries are always 0 or 1.

> Note, in many cases it isn't that big deal, because if carry_in is in [0, 1]
> range and compiler can see it from VRP, it can still optimize it.  And given
> that carry_out is always in [0, 1] range, for chained cases worst case the
> first additions will be less optimized but the chained will be already
> better.

What do you mean by "the first additions will be less optimized"? (If you don't know anything about the initial carryin and the arguments, you can't optimize at all, AFAIK.)
Comment 29 Jakub Jelinek 2023-06-24 12:32:30 UTC
(In reply to Vincent Lefèvre from comment #28)
> (In reply to Jakub Jelinek from comment #27)
> > Given that the builtins exist for 10 years already, I think changing it for
> > them is too late, though they don't seem to take backwards compatibility as
> > seriously.
> > They don't document the [0, 1] restriction and the behavior implemented in
> > GCC is what I saw when trying it.
> 
> Their documentation at https://clang.llvm.org/docs/LanguageExtensions.html
> is currently just
> 
>   unsigned sum = __builtin_addc(x, y, carryin, &carryout);
> 
> But a carry for a 2-ary addition is always 0 or 1, so the [0, 1] restriction
> is implicit (by the language that is used).

That is something that would need to be said explicitly, that it is undefined behavior if it is some other value.  Like we document that e.g. __builtin_clz is undefined behavior on 0 input.

> What do you mean by "the first additions will be less optimized"? (If you
> don't know anything about the initial carryin and the arguments, you can't
> optimize at all, AFAIK.)

I mean that if the compiler can't see it is in [0, 1], it will need to use 2 additions and OR the 2 carry bits together.  But, because the ORed carry bits are in [0, 1] range, all the higher limbs could be done using addc.

If you try clang trunk with -O2
unsigned int foo (unsigned x, unsigned y, unsigned carry_in, unsigned *carry_out)
{
  return __builtin_addc (x, y, carry_in, carry_out);
}

unsigned int bar (unsigned x, unsigned y, unsigned carry_in, unsigned *carry_out)
{
  if (carry_in > 1)
    __builtin_unreachable ();
  return __builtin_addc (x, y, carry_in, carry_out);
}

it shows exactly those 2 additions, rather than trying to optimize it, in both cases.
GCC trunk emits something comparable for the first case, and for the second (bar) you get
addb    $-1, %dl; adcl    %esi, %eax
for the main work.
Comment 30 Vincent Lefèvre 2023-06-24 13:10:40 UTC
(In reply to Jakub Jelinek from comment #29)
> (In reply to Vincent Lefèvre from comment #28)
> > What do you mean by "the first additions will be less optimized"? (If you
> > don't know anything about the initial carryin and the arguments, you can't
> > optimize at all, AFAIK.)
> 
> I mean that if the compiler can't see it is in [0, 1], it will need to use 2
> additions and or the 2 carry bits together.  But, because the ored carry
> bits are in [0, 1] range, all the higher limbs could be done using addc.

If the compiler can't see that carryin is in [0, 1], then it must not "or" the carry bits; it needs to add them, as carryout may be 2. So each part of the whole chain would need 2 __builtin_add_overflow calls and an addition of the carry bits. However, if the compiler can detect that at some point the arguments cannot both be 0xffffffff at the same time (while carryin is in [0, 2]), then an optimization is possible for the rest of the chain.
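For concreteness, a worked instance of that carryout = 2 case (a sketch with 32-bit limbs and an out-of-range carryin):

void demo (void)
{
  unsigned x = 0xffffffff, y = 0xffffffff, carry_in = 2, r, carry_out;
  unsigned c1 = __builtin_add_overflow (x, y, &r);        /* r = 0xfffffffe, c1 = 1 */
  unsigned c2 = __builtin_add_overflow (r, carry_in, &r); /* r = 0, c2 = 1 */
  carry_out = c1 + c2;  /* 2; an OR would wrongly give 1 */
  (void) carry_out;
}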
Comment 31 Jakub Jelinek 2023-06-24 13:14:35 UTC
(In reply to Vincent Lefèvre from comment #30)
> (In reply to Jakub Jelinek from comment #29)
> > (In reply to Vincent Lefèvre from comment #28)
> > > What do you mean by "the first additions will be less optimized"? (If you
> > > don't know anything about the initial carryin and the arguments, you can't
> > > optimize at all, AFAIK.)
> > 
> > I mean that if the compiler can't see it is in [0, 1], it will need to use 2
> > additions and or the 2 carry bits together.  But, because the ored carry
> > bits are in [0, 1] range, all the higher limbs could be done using addc.
> 
> If the compiler can't see that carryin is in [0, 1], then it must not "or"
> the carry bits; it needs to add them, as carryout may be 2.

That is not how the clang builtin works, which is why I've implemented the | and documented it that way, as it is a compatibility builtin.
Comment 32 Vincent Lefèvre 2023-06-25 00:22:24 UTC
(In reply to Jakub Jelinek from comment #31)
> (In reply to Vincent Lefèvre from comment #30)
> > (In reply to Jakub Jelinek from comment #29)
> > > I mean that if the compiler can't see it is in [0, 1], it will need
> > > to use 2 additions and or the 2 carry bits together.  But, because
> > > the ored carry bits are in [0, 1] range, all the higher limbs could
> > > be done using addc.
> > 
> > If the compiler can't see that carryin is in [0, 1], then it must not "or"
> > the carry bits; it needs to add them, as carryout may be 2.
> 
> That is not how the clang builtin works, which is why I've implemented the |
> and documented it that way, as it is a compatibility builtin.

I'm confused. In Comment 14, you said that

  *carry_out = c1 + c2;

was used. This is an addition, not an OR.
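For what it's worth, the two coincide whenever carry_in is in [0, 1]: if the first addition overflows, its truncated result is at most 0xfffffffe, so adding a carry of at most 1 cannot overflow again; hence at most one of c1 and c2 is ever set, and c1 + c2 == c1 | c2. They differ only for out-of-range carry_in values such as the one in comment 30.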
Comment 33 GCC Commits 2023-08-29 08:47:27 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:a7aec76a74dd38524be325343158d3049b6ab3ac

commit r14-3541-ga7aec76a74dd38524be325343158d3049b6ab3ac
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Tue Aug 29 10:46:01 2023 +0200

    tree-ssa-math-opts: Improve uaddc/usubc pattern matching [PR111209]
    
    The uaddc/usubc usual matching is of the .{ADD,SUB}_OVERFLOW pair in the
    middle, which adds/subtracts carry-in (from lower limbs) and computes
    carry-out (to higher limbs).  Before optimizations (unless user writes
    it intentionally that way already), all the steps look the same, but
    optimizations simplify the handling of the least significant limb
    (one which adds/subtracts 0 carry-in) to just a single
    .{ADD,SUB}_OVERFLOW and the handling of the most significant limb
    if the computed carry-out is ignored to normal addition/subtraction
    of multiple operands.
    Now, match_uaddc_usubc has code to turn that least significant
    .{ADD,SUB}_OVERFLOW call into .U{ADD,SUB}C call with 0 carry-in if
    a more significant limb above it is matched into .U{ADD,SUB}C; this
    isn't necessary for functionality, as .ADD_OVERFLOW (x, y) is
    functionally equal to .UADDC (x, y, 0) (provided the types of operands
    are the same and result is complex type with that type element), and
    it also has code to match the most significant limb with ignored carry-out
    (in that case one pattern match turns both the penultimate limb pair of
    .{ADD,SUB}_OVERFLOW into .U{ADD,SUB}C and the addition/subtraction
    of the 4 values (2 carries) into another .U{ADD,SUB}C).
    
    As the following patch shows, what we weren't handling is the case when
    one uses either the __builtin_{add,sub}c builtins or hand written forms
    thereof (either __builtin_*_overflow or even that written by hand) for
    just 2 limbs, where the least significant has 0 carry-in and the most
    significant ignores carry-out.  The following patch matches that, e.g.
      _16 = .ADD_OVERFLOW (_1, _2);
      _17 = REALPART_EXPR <_16>;
      _18 = IMAGPART_EXPR <_16>;
      _15 = _3 + _4;
      _12 = _15 + _18;
    into
      _16 = .UADDC (_1, _2, 0);
      _17 = REALPART_EXPR <_16>;
      _18 = IMAGPART_EXPR <_16>;
      _19 = .UADDC (_3, _4, _18);
      _12 = IMAGPART_EXPR <_19>;
    so that we can emit better code.
    
    As the 2 later comments show, we must do that carefully, because the
    pass walks the IL from first to last stmt in a bb and we must avoid
    pattern matching this way something that should be matched on a later
    instruction differently.
    
    2023-08-29  Jakub Jelinek  <jakub@redhat.com>
    
            PR middle-end/79173
            PR middle-end/111209
            * tree-ssa-math-opts.cc (match_uaddc_usubc): Match also
            just 2 limb uaddc/usubc with 0 carry-in on lower limb and ignored
            carry-out on higher limb.  Don't match it though if it could be
            matched later on 4 argument addition/subtraction.
    
            * gcc.target/i386/pr79173-12.c: New test.